aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Get rid of SELinux 'personality-8' errors.Luden2016-04-301-1/+4
|
* Input: add infrastructure for selecting clockid for event time stampsJohn Stultz2016-03-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | As noted by Arve and others, since wall time can jump backwards, it is difficult to use for input because one cannot determine if one event occurred before another or for how long a key was pressed. However, the timestamp field is part of the kernel ABI, and cannot be changed without possibly breaking existing users. This patch adds a new IOCTL that allows a clockid to be set in the evdev_client struct that will specify which time base to use for event timestamps (ie: CLOCK_MONOTONIC instead of CLOCK_REALTIME). For now we only support CLOCK_MONOTONIC and CLOCK_REALTIME, but in the future we could support other clockids if appropriate. The default remains CLOCK_REALTIME, so we don't change the ABI. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Daniel Kurtz <djkurtz@google.com> Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Conflicts: include/linux/input.h
* Fix security issues reported by the android.security CTSStevan Marinkovic2016-03-111-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | package Contains following upstream changes: a134f083e79fb4c3d0a925691e732c56911b4326 ipv4: Missing sk_nulls_node_init() in ping_unhash(). 8176cced706b5e5d15887584150764894e94e02f perf: Treat attr.config as u64 in perf_swevent_init() e9c243a5a6de0be8e584c604d353412584b592f8 futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid... 6f7b0a2a5c0fb03be7c25bd1745baa50582348ef futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi() Fixes CTS tests: android.security.cts.NativeCodeTest#testFutex android.security.cts.NativeCodeTest#testPerfEvent android.security.cts.NativeCodeTest#testPingPongRoot Change-Id: Ib9d389c875935e9eb9611be4fc11911383f627fc
* proc: fix build broken by proc inode per namespace patchJin Qian2016-03-111-0/+1
| | | | Change-Id: I119e4f31584b4a7ab9d6825499947d59c1293f1b
* proc: Usable inode numbers for the namespace file descriptors.Eric W. Biederman2016-03-115-1/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | Assign a unique proc inode to each namespace, and use that inode number to ensure we only allocate at most one proc inode for every namespace in proc. A single proc inode per namespace allows userspace to test to see if two processes are in the same namespace. This has been a long requested feature and only blocked because a naive implementation would put the id in a global space and would ultimately require having a namespace for the names of namespaces, making migration and certain virtualization tricks impossible. We still don't have per superblock inode numbers for proc, which appears necessary for application unaware checkpoint/restart and migrations (if the application is using namespace file descriptors) but that is now allowd by the design if it becomes important. I have preallocated the ipc and uts initial proc inode numbers so their structures can be statically initialized. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> (cherry picked from commit 98f842e675f96ffac96e6c50315790912b2812be)
* vfs: Add a user namespace reference from struct mnt_namespaceEric W. Biederman2016-03-111-1/+1
| | | | | | | | This will allow for support for unprivileged mounts in a new user namespace. Acked-by: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> (cherry picked from commit 771b1371686e0a63e938ada28de020b9a0040f55)
* VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errorsDavid Howells2016-03-111-5/+5
| | | | | | | | | | | | | | | | | | | copy_tree() can theoretically fail in a case other than ENOMEM, but always returns NULL which is interpreted by callers as -ENOMEM. Change it to return an explicit error. Also change clone_mnt() for consistency and because union mounts will add new error cases. Thanks to Andreas Gruenbacher <agruen@suse.de> for a bug fix. [AV: folded braino fix by Dan Carpenter] Original-author: Valerie Aurora <vaurora@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Valerie Aurora <valerie.aurora@gmail.com> Cc: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit be34d1a3bc4b6f357a49acb55ae870c81337e4f0)
* brlocks/lglocks: turn into functionsAndi Kleen2016-03-112-1/+90
| | | | | | | | | | | | | | | | | | | | | | | lglocks and brlocks are currently generated with some complicated macros in lglock.h. But there's no reason to not just use common utility functions and put all the data into a common data structure. Since there are at least two users it makes sense to share this code in a library. This is also easier maintainable than a macro forest. This will also make it later possible to dynamically allocate lglocks and also use them in modules (this would both still need some additional, but now straightforward, code) [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit eea62f831b8030b0eeea8314eed73b6132d1de26)
* mm: fix prctl_set_vma_anon_nameColin Cross2016-03-111-1/+1
| | | | | | | | | | | prctl_set_vma_anon_name could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Change-Id: Ie32d8ddb0fd547efbeedd6528acdab5ca5b308b4 Reported-by: Jed Davis <jld@mozilla.com> Signed-off-by: Colin Cross <ccross@android.com>
* mm: add a field to store names for private anonymous memoryColin Cross2016-03-111-0/+145
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Conflicts: include/linux/prctl.h kernel/sys.c
* security: remove the security_netlink_recv hook as it is equivalent to capable()Eric Paris2016-03-111-2/+2
| | | | | | | | | | Once upon a time netlink was not sync and we had to get the effective capabilities from the skb that was being received. Today we instead get the capabilities from the current task. This has rendered the entire purpose of the hook moot as it is now functionally equivalent to the capable() call. Signed-off-by: Eric Paris <eparis@redhat.com>
* cgroup: fix exit() vs rmdir() raceLi Zefan2016-03-111-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 71b5707e119653039e6e95213f00479668c79b75 upstream. In cgroup_exit() put_css_set_taskexit() is called without any lock, which might lead to accessing a freed cgroup: thread1 thread2 --------------------------------------------- exit() cgroup_exit() put_css_set_taskexit() atomic_dec(cgrp->count); rmdir(); /* not safe !! */ check_for_release(cgrp); rcu_read_lock() can be used to make sure the cgroup is alive. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Conflicts: kernel/cgroup.c
* cgroup: Fix use after free of cgrp (cgrp->css_sets)Hans de Goede2016-03-111-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Running a 3.4 kernel + Fedora-18 (systemd) userland on my Allwinner A10 (arm cortex a8), I'm seeing repeated, reproducable list_del list corruption errors when build with CONFIG_DEBUG_LIST, and the backtrace always shows free_css_set_work as the function making the problematic list_del call. I've tracked this doen to a use after free of the cgrp struct, specifically of the cgrp->css_sets list_head, which gets cleared by free_css_set_work. Since free_css_set_work runs form a workqueue, it is possible for it to not be done with clearing the list when the cgrp gets free-ed. To avoid this the code adding the links increases cgrp->count, and the freeing code running from the workqueue decreases cgrp->count *after* doing list_del, and then if the count goes to 0 calls cgroup_wakeup_rmdir_waiter(). However cgroup_rmdir() is missing a check for cgrp->count != 0, causing it to still continue with the rmdir (which leads to the free-ing of the cgrp), before free_css_set_work is done. Sometimes the free-ed memory is re-used before free_css_set_work gets around to unlinking link->cgrp_link_list, triggering the list_del list corruption messages. This patch fixes this by properly checking for cgrp->count != 0 and waiting for the cgroup_rmdir_waitq in that case. Change-Id: I9dbc02a0a75d5dffa1b65d67456e00139dea57c3 Signed-off-by: Hans de Goede <hdegoede@redhat.com>
* cgroup: Take css_set_lock from cgroup_css_sets_empty()Hans de Goede2016-03-111-5/+9
| | | | | | | | | As indicated in the comment above cgroup_css_sets_empty it needs the css_set_lock. But neither of the 2 call points have it, so rather then fixing the callers just take the lock inside cgroup_css_sets_empty(). Signed-off-by: Hans de Goede <hdegoede@redhat.com> Change-Id: If7aea71824f6d0e3f2cc6c1ce236c3ae6be2037b
* Power: Add guard condition for maximum wakeup reasonsRuchi Kandoi2016-03-111-0/+7
| | | | | | | Ensure the array for the wakeup reason IRQs does not overflow. Change-Id: Iddc57a3aeb1888f39d4e7b004164611803a4d37c Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* prctl: adds the capable(CAP_SYS_NICE) check to PR_SET_TIMERSLACK_PID.Ruchi Kandoi2016-03-111-0/+3
| | | | | | | | | Adds a capable() check to make sure that arbitary apps do not change the timer slack for other apps. Bug: 15000427 Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209 Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* prctl: adds PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary ↵Ruchi Kandoi2016-03-111-0/+19
| | | | | | | | | | | | | | | | | | | | thread. Second argument is similar to PR_SET_TIMERSLACK, if non-zero then the slack is set to that value otherwise sets it to the default for the thread. Takes PID of the thread as the third argument. This allows power/performance management software to set timer slack for other threads according to its policy for the thread (such as when the thread is designated foreground vs. background activity) Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com> Conflicts: include/linux/prctl.h kernel/sys.c
* Power: Changes the permission to read only for sysfs fileRuchi Kandoi2016-03-111-3/+2
| | | | | | | /sys/kernel/wakeup_reasons/last_resume_reason Change-Id: If25e8e416ee9726996518b58b6551a61dc1591e3 Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* POWER: fix compile warnings in log_wakeup_reasonRuchi Kandoi2016-03-111-3/+4
| | | | | | | | Change I81addaf420f1338255c5d0638b0d244a99d777d1 introduced compile warnings, fix these. Change-Id: I05482a5335599ab96c0a088a7d175c8d4cf1cf69 Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* Power: add an API to log wakeup reasonsRuchi Kandoi2016-03-112-0/+134
| | | | | | | | Add API log_wakeup_reason() and expose it to userspace via sysfs path /sys/kernel/wakeup_reasons/last_resume_reason Change-Id: I81addaf420f1338255c5d0638b0d244a99d777d1 Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* add extra free kbytes tunableRik van Riel2016-03-111-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | Add a userspace visible knob to tell the VM to keep an extra amount of memory free, by increasing the gap between each zone's min and low watermarks. This is useful for realtime applications that call system calls and have a bound on the number of allocations that happen in any short time period. In this application, extra_free_kbytes would be left at an amount equal to or larger than than the maximum number of allocations that happen in any burst. It may also be useful to reduce the memory use of virtual machines (temporarily?), in a way that does not cause memory fragmentation like ballooning does. [ccross] Revived for use on old kernels where no other solution exists. The tunable will be removed on kernels that do better at avoiding direct reclaim. Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e Signed-off-by: Rik van Riel<riel@redhat.com> Signed-off-by: Colin Cross <ccross@android.com>
* oom: remove oom_disable_countDavid Rientjes2016-03-112-11/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}()Devin Kim2016-03-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These 2 syncronize_rcu()s make attaching a task to a cgroup quite slow, and it can't be ignored in some situations. A real case from Colin Cross: Android uses cgroups heavily to manage thread priorities, putting threads in a background group with reduced cpu.shares when they are not visible to the user, and in a foreground group when they are. Some RPCs from foreground threads to background threads will temporarily move the background thread into the foreground group for the duration of the RPC. This results in many calls to cgroup_attach_task. In cgroup_attach_task() it's task->cgroups that is protected by RCU, and put_css_set() calls kfree_rcu() to free it. If we remove this synchronize_rcu(), there can be threads in RCU-read sections accessing their old cgroup via current->cgroups with concurrent rmdir operation, but this is safe. # time for ((i=0; i<50; i++)) { echo $$ > /mnt/sub/tasks; echo $$ > /mnt/tasks; } real 0m2.524s user 0m0.008s sys 0m0.004s With this patch: real 0m0.004s user 0m0.004s sys 0m0.000s tj: These synchronize_rcu()s are utterly confused. synchornize_rcu() necessarily has to come between two operations to guarantee that the changes made by the former operation are visible to all rcu readers before proceeding to the latter operation. Here, synchornize_rcu() are at the end of attach operations with nothing beyond it. Its only effect would be delaying completion of write(2) to sysfs tasks/procs files until all rcu readers see the change, which doesn't mean anything. cherry-picked from: https://android.googlesource.com/kernel/common/+/5d65bc0ca1bceb73204dab943922ba3c83276a8c Bug: 17709419 Change-Id: I98dacd6c13da27cb3496fe4a24a24084e46bdd9c Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Colin Cross <ccross@google.com> Signed-off-by: Devin Kim <dojip.kim@lge.com>
* Merge remote-tracking branch 'linux-stable/linux-3.0.y' into ↵Ziyan2015-10-2554-578/+1203
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | p-android-omap-3.0-dev-espresso Conflicts: Makefile arch/arm/include/asm/hardware/cache-l2x0.h arch/arm/kernel/smp.c arch/arm/mach-omap2/board-4430sdp.c arch/arm/mach-omap2/board-omap4panda.c arch/arm/mach-omap2/opp.c arch/ia64/include/asm/futex.h drivers/bluetooth/ath3k.c drivers/bluetooth/btusb.c drivers/firmware/efivars.c drivers/gpu/drm/i915/intel_lvds.c drivers/gpu/drm/radeon/radeon_atombios.c drivers/gpu/drm/radeon/radeon_irq_kms.c drivers/hwmon/fam15h_power.c drivers/mfd/twl6030-irq.c drivers/mmc/core/sdio.c drivers/net/tun.c drivers/net/usb/ipheth.c drivers/net/usb/usbnet.c drivers/usb/core/hub.c drivers/usb/host/xhci-mem.c drivers/usb/host/xhci.h drivers/usb/musb/omap2430.c drivers/usb/serial/ftdi_sio.c drivers/usb/serial/ftdi_sio_ids.h drivers/usb/serial/option.c drivers/usb/serial/qcserial.c drivers/usb/serial/ti_usb_3410_5052.c drivers/usb/serial/ti_usb_3410_5052.h drivers/video/omap2/dss/hdmi.c fs/splice.c include/asm-generic/pgtable.h include/net/sch_generic.h kernel/cgroup.c kernel/futex.c kernel/time/timekeeping.c net/ipv4/route.c net/ipv4/syncookies.c net/ipv4/tcp_ipv4.c net/wireless/util.c security/commoncap.c sound/soc/soc-dapm.c
| * splice: fix racy pipe->buffers usesEric Dumazet2013-10-052-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 047fe3605235888f3ebcda0c728cb31937eadfe6 upstream. Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by splice_shrink_spd() called from vmsplice_to_pipe() commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes) added capability to adjust pipe->buffers. Problem is some paths don't hold pipe mutex and assume pipe->buffers doesn't change for their duration. Fix this by adding nr_pages_max field in struct splice_pipe_desc, and use it in place of pipe->buffers where appropriate. splice_shrink_spd() loses its struct pipe_inode_info argument. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tom Herbert <therbert@google.com> Cc: stable <stable@vger.kernel.org> # 2.6.35 Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Fix perf_cgroup_switch for sw-eventsPeter Zijlstra2013-10-011-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 95cf59ea72331d0093010543b8951bb43f262cac upstream. Jiri reported that he could trigger the WARN_ON_ONCE() in perf_cgroup_switch() using sw-events. This is because sw-events share a cpuctx with multiple PMUs. Use the ->unique_pmu pointer to limit the pmu iteration to unique cpuctx instances. Reported-and-Tested-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-so7wi2zf3jjzrwcutm2mkz0j@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmuPeter Zijlstra2013-10-011-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3f1f33206c16c7b3839d71372bc2ac3f305aa802 upstream. Stephane thought the perf_cpu_context::active_pmu name confusing and suggested using 'unique_pmu' instead. This pointer is a pointer to a 'random' pmu sharing the cpuctx instance, therefore limiting a for_each_pmu loop to those where cpuctx->unique_pmu matches the pmu we get a loop over unique cpuctx instances. Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-kxyjqpfj2fn9gt7kwu5ag9ks@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * cgroup: fail if monitored file and event_control are in different cgroupLi Zefan2013-10-011-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f169007b2773f285e098cb84c74aac0154d65ff7 upstream. If we pass fd of memory.usage_in_bytes of cgroup A to cgroup.event_control of cgroup B, then we won't get memory usage notification from A but B! What's worse, if A and B are in different mount hierarchy, we'll end up accessing NULL pointer! Disallow this kind of invalid usage. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Weng Meiling <wengmeiling.weng@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * futex: Take hugepages into account when generating futex_keyZhang Yi2013-08-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 13d60f4b6ab5b702dc8d2ee20999f98a93728aec upstream. The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn> Reviewed-by: 'Mel Gorman' <mgorman@suse.de> Acked-by: 'Darren Hart' <dvhart@linux.intel.com> Cc: 'Peter Zijlstra' <peterz@infradead.org> Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Fix fields of struct trace_iterator that are zeroed by mistakeAndrew Vagin2013-08-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ed5467da0e369e65b247b99eb6403cb79172bcda upstream. tracing_read_pipe zeros all fields bellow "seq". The declaration contains a comment about that, but it doesn't help. The first field is "snapshot", it's true when current open file is snapshot. Looks obvious, that it should not be zeroed. The second field is "started". It was converted from cpumask_t to cpumask_var_t (v2.6.28-4983-g4462344), in other words it was converted from cpumask to pointer on cpumask. Currently the reference on "started" memory is lost after the first read from tracing_read_pipe and a proper object will never be freed. The "started" is never dereferenced for trace_pipe, because trace_pipe can't have the TRACE_FILE_ANNOTATE options. Link: http://lkml.kernel.org/r/1375463803-3085183-1-git-send-email-avagin@openvz.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Use css_tryget() to avoid propping up css refcountSalman Qazi2013-08-111-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9c5da09d266ca9b32eb16cf940f8161d949c2fe5 upstream. An rmdir pushes css's ref count to zero. However, if the associated directory is open at the time, the dentry ref count is non-zero. If the fd for this directory is then passed into perf_event_open, it does a css_get(). This bounces the ref count back up from zero. This is a problem by itself. But what makes it turn into a crash is the fact that we end up doing an extra dput, since we perform a dput when css_put sees the ref count go down to zero. css_tryget() does not fall into that trap. So, we use that instead. Reproduction test-case for the bug: #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <linux/unistd.h> #include <linux/perf_event.h> #include <string.h> #include <errno.h> #include <stdio.h> #define PERF_FLAG_PID_CGROUP (1U << 2) int perf_event_open(struct perf_event_attr *hw_event_uptr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu, group_fd, flags); } /* * Directly poke at the perf_event bug, since it's proving hard to repro * depending on where in the kernel tree. what moved? */ int main(int argc, char **argv) { int fd; struct perf_event_attr attr; memset(&attr, 0, sizeof(attr)); attr.exclude_kernel = 1; attr.size = sizeof(attr); mkdir("/dev/cgroup/perf_event/blah", 0777); fd = open("/dev/cgroup/perf_event/blah", O_RDONLY); perror("open"); rmdir("/dev/cgroup/perf_event/blah"); sleep(2); perf_event_open(&attr, fd, 0, -1, PERF_FLAG_PID_CGROUP); perror("perf_event_open"); close(fd); return 0; } Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Fix event group context moveJiri Olsa2013-08-111-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0231bb5336758426b44ccd798ccd3c5419c95d58 upstream. When we have group with mixed events (hw/sw) we want to end up with group leader being in hw context. So if group leader is initialy sw event, we move all the events under hw context. The move is done for each event by removing it from its context and adding it back into proper one. As a part of the removal the event is automatically disabled, which is not what we want at this stage of creating groups. The fix is to initialize event state after removal from sw context. This fix resulted from the following discussion: http://thread.gmane.org/gmane.linux.kernel.perf.user/1144 Reported-by: Andreas Hollmann <hollmann@in.tum.de> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vince@deater.net> Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * sched: Fix the broken sched_rr_get_interval()Zhu Yanhai2013-08-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a59f4e079d19464eebb9b06513a1d4f55fdae5ba upstream. The caller of sched_sliced() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Fix irqs-off tag display in syscall tracingzhangwei(Jovi)2013-08-041-0/+3
| | | | | | | | | | | | | | | | | | | | commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream Initialization of variable irq_flags and pc was missed when backport 11034ae9c to linux-3.0.y and linux-3.4.y, my fault. Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * hrtimers: Move SMP function call to thread contextThomas Gleixner2013-07-281-15/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5ec2481b7b47a4005bb446d176e5d0257400c77d upstream. smp_call_function_* must not be called from softirq context. But clock_was_set() which calls on_each_cpu() is called from softirq context to implement a delayed clock_was_set() for the timer interrupt handler. Though that almost never gets invoked. A recent change in the resume code uses the softirq based delayed clock_was_set to support Xens resume mechanism. linux-next contains a new warning which warns if smp_call_function_* is called from softirq context which gets triggered by that Xen change. Fix this by moving the delayed clock_was_set() call to a work context. Reported-and-tested-by: Artem Savkov <artem.savkov@gmail.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com>, Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: John Stultz <john.stultz@linaro.org> Cc: xen-devel@lists.xen.org Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Fix irqs-off tag display in syscall tracingzhangwei(Jovi)2013-07-281-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream. All syscall tracing irqs-off tags are wrong, the syscall enter entry doesn't disable irqs. [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 13/13 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-513 [000] d... 56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-513 [000] d... 56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) sendmail-771 [000] d... 56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6) The reason is syscall tracing doesn't record irq_flags into buffer. The proper display is: [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event [root@jovi tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 14/14 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | irqbalance-514 [001] .... 46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6) irqbalance-514 [001] .... 46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6) <...>-920 [001] .... 47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0) Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com Cc: stable@vger.kernel.org # 2.6.35 Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Fix perf_lock_task_context() vs RCUPeter Zijlstra2013-07-281-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 058ebd0eba3aff16b144eabf4510ed9510e1416e upstream. Jiri managed to trigger this warning: [] ====================================================== [] [ INFO: possible circular locking dependency detected ] [] 3.10.0+ #228 Tainted: G W [] ------------------------------------------------------- [] p/6613 is trying to acquire lock: [] (rcu_node_0){..-...}, at: [<ffffffff810ca797>] rcu_read_unlock_special+0xa7/0x250 [] [] but task is already holding lock: [] (&ctx->lock){-.-...}, at: [<ffffffff810f2879>] perf_lock_task_context+0xd9/0x2c0 [] [] which lock already depends on the new lock. [] [] the existing dependency chain (in reverse order) is: [] [] -> #4 (&ctx->lock){-.-...}: [] -> #3 (&rq->lock){-.-.-.}: [] -> #2 (&p->pi_lock){-.-.-.}: [] -> #1 (&rnp->nocb_gp_wq[1]){......}: [] -> #0 (rcu_node_0){..-...}: Paul was quick to explain that due to preemptible RCU we cannot call rcu_read_unlock() while holding scheduler (or nested) locks when part of the read side critical section was preemptible. Therefore solve it by making the entire RCU read side non-preemptible. Also pull out the retry from under the non-preempt to play nice with RT. Reported-by: Jiri Olsa <jolsa@redhat.com> Helped-out-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Remove WARN_ON_ONCE() check in __perf_event_enable() for valid scenarioJiri Olsa2013-07-281-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 06f417968beac6e6b614e17b37d347aa6a6b1d30 upstream. The '!ctx->is_active' check has a valid scenario, so there's no need for the warning. The reason is that there's a time window between the 'ctx->is_active' check in the perf_event_enable() function and the __perf_event_enable() function having: - IRQs on - ctx->lock unlocked where the task could be killed and 'ctx' deactivated by perf_event_exit_task(), ending up with the warning below. So remove the WARN_ON_ONCE() check and add comments to explain it all. This addresses the following warning reported by Vince Weaver: [ 324.983534] ------------[ cut here ]------------ [ 324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190() [ 324.984420] Modules linked in: [ 324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246 [ 324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010 [ 324.984420] 0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00 [ 324.984420] ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860 [ 324.984420] 0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a [ 324.984420] Call Trace: [ 324.984420] <IRQ> [<ffffffff8160ea0b>] dump_stack+0x19/0x1b [ 324.984420] [<ffffffff81080ff0>] warn_slowpath_common+0x70/0xa0 [ 324.984420] [<ffffffff8108103a>] warn_slowpath_null+0x1a/0x20 [ 324.984420] [<ffffffff81134437>] __perf_event_enable+0x187/0x190 [ 324.984420] [<ffffffff81130030>] remote_function+0x40/0x50 [ 324.984420] [<ffffffff810e51de>] generic_smp_call_function_single_interrupt+0xbe/0x130 [ 324.984420] [<ffffffff81066a47>] smp_call_function_single_interrupt+0x27/0x40 [ 324.984420] [<ffffffff8161fd2f>] call_function_single_interrupt+0x6f/0x80 [ 324.984420] <EOI> [<ffffffff816161a1>] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 324.984420] [<ffffffff8113799d>] perf_event_exit_task+0x14d/0x210 [ 324.984420] [<ffffffff810acd04>] ? switch_task_namespaces+0x24/0x60 [ 324.984420] [<ffffffff81086946>] do_exit+0x2b6/0xa40 [ 324.984420] [<ffffffff8161615c>] ? _raw_spin_unlock_irq+0x2c/0x30 [ 324.984420] [<ffffffff81087279>] do_group_exit+0x49/0xc0 [ 324.984420] [<ffffffff81096854>] get_signal_to_deliver+0x254/0x620 [ 324.984420] [<ffffffff81043057>] do_signal+0x57/0x5a0 [ 324.984420] [<ffffffff8161a164>] ? __do_page_fault+0x2a4/0x4e0 [ 324.984420] [<ffffffff8161665c>] ? retint_restore_args+0xe/0xe [ 324.984420] [<ffffffff816166cd>] ? retint_signal+0x11/0x84 [ 324.984420] [<ffffffff81043605>] do_notify_resume+0x65/0x80 [ 324.984420] [<ffffffff81616702>] retint_signal+0x46/0x84 [ 324.984420] ---[ end trace 442ec2f04db3771a ]--- Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf: Clone child context from parent context pmuJiri Olsa2013-07-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 734df5ab549ca44f40de0f07af1c8803856dfb18 upstream. Currently when the child context for inherited events is created, it's based on the pmu object of the first event of the parent context. This is wrong for the following scenario: - HW context having HW and SW event - HW event got removed (closed) - SW event stays in HW context as the only event and its pmu is used to clone the child context The issue starts when the cpu context object is touched based on the pmu context object (__get_cpu_context). In this case the HW context will work with SW cpu context ending up with following WARN below. Fixing this by using parent context pmu object to clone from child context. Addresses the following warning reported by Vince Weaver: [ 2716.472065] ------------[ cut here ]------------ [ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x) [ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn [ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2 [ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2 [ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18 [ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad [ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550 [ 2716.476035] Call Trace: [ 2716.476035] [<ffffffff8102e215>] ? warn_slowpath_common+0x5b/0x70 [ 2716.476035] [<ffffffff810ab2bd>] ? task_ctx_sched_out+0x3c/0x5f [ 2716.476035] [<ffffffff810af02a>] ? perf_event_exit_task+0xbf/0x194 [ 2716.476035] [<ffffffff81032a37>] ? do_exit+0x3e7/0x90c [ 2716.476035] [<ffffffff810cd5ab>] ? __do_fault+0x359/0x394 [ 2716.476035] [<ffffffff81032fe6>] ? do_group_exit+0x66/0x98 [ 2716.476035] [<ffffffff8103dbcd>] ? get_signal_to_deliver+0x479/0x4ad [ 2716.476035] [<ffffffff810ac05c>] ? __perf_event_task_sched_out+0x230/0x2d1 [ 2716.476035] [<ffffffff8100205d>] ? do_signal+0x3c/0x432 [ 2716.476035] [<ffffffff810abbf9>] ? ctx_sched_in+0x43/0x141 [ 2716.476035] [<ffffffff810ac2ca>] ? perf_event_context_sched_in+0x7a/0x90 [ 2716.476035] [<ffffffff810ac311>] ? __perf_event_task_sched_in+0x31/0x118 [ 2716.476035] [<ffffffff81050dd9>] ? mmdrop+0xd/0x1c [ 2716.476035] [<ffffffff81051a39>] ? finish_task_switch+0x7d/0xa6 [ 2716.476035] [<ffffffff81002473>] ? do_notify_resume+0x20/0x5d [ 2716.476035] [<ffffffff813654f5>] ? retint_signal+0x3d/0x78 [ 2716.476035] ---[ end trace 827178d8a5966c3d ]--- Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Use current_uid() for critical time tracingSteven Rostedt (Red Hat)2013-07-281-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f17a5194859a82afe4164e938b92035b86c55794 upstream. The irqsoff tracer records the max time that interrupts are disabled. There are hooks in the assembly code that calls back into the tracer when interrupts are disabled or enabled. When they are enabled, the tracer checks if the amount of time they were disabled is larger than the previous recorded max interrupts off time. If it is, it creates a snapshot of the currently running trace to store where the last largest interrupts off time was held and how it happened. During testing, this RCU lockdep dump appeared: [ 1257.829021] =============================== [ 1257.829021] [ INFO: suspicious RCU usage. ] [ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G W [ 1257.829021] ------------------------------- [ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle! [ 1257.829021] [ 1257.829021] other info that might help us debug this: [ 1257.829021] [ 1257.829021] [ 1257.829021] RCU used illegally from idle CPU! [ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0 [ 1257.829021] RCU used illegally from extended quiescent state! [ 1257.829021] 2 locks held by trace-cmd/4831: [ 1257.829021] #0: (max_trace_lock){......}, at: [<ffffffff810e2b77>] stop_critical_timing+0x1a3/0x209 [ 1257.829021] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff810dae5a>] __update_max_tr+0x88/0x1ee [ 1257.829021] [ 1257.829021] stack backtrace: [ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G W 3.10.0-rc1-test+ #171 [ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 [ 1257.829021] 0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8 [ 1257.829021] ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003 [ 1257.829021] ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a [ 1257.829021] Call Trace: [ 1257.829021] [<ffffffff8153dd2b>] dump_stack+0x19/0x1b [ 1257.829021] [<ffffffff81092a00>] lockdep_rcu_suspicious+0x109/0x112 [ 1257.829021] [<ffffffff810daebf>] __update_max_tr+0xed/0x1ee [ 1257.829021] [<ffffffff810dae5a>] ? __update_max_tr+0x88/0x1ee [ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107 [ 1257.829021] [<ffffffff810dbf85>] update_max_tr_single+0x11d/0x12d [ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107 [ 1257.829021] [<ffffffff810e2b15>] stop_critical_timing+0x141/0x209 [ 1257.829021] [<ffffffff8109569a>] ? trace_hardirqs_on+0xd/0xf [ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107 [ 1257.829021] [<ffffffff810e3057>] time_hardirqs_on+0x2a/0x2f [ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107 [ 1257.829021] [<ffffffff8109550c>] trace_hardirqs_on_caller+0x16/0x197 [ 1257.829021] [<ffffffff8109569a>] trace_hardirqs_on+0xd/0xf [ 1257.829021] [<ffffffff811002b9>] user_enter+0xfd/0x107 [ 1257.829021] [<ffffffff810029b4>] do_notify_resume+0x92/0x97 [ 1257.829021] [<ffffffff8154bdca>] int_signal+0x12/0x17 What happened was entering into the user code, the interrupts were enabled and a max interrupts off was recorded. The trace buffer was saved along with various information about the task: comm, pid, uid, priority, etc. The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock() to retrieve the data, and this happened to happen where RCU is blind (user_enter). As only the preempt and irqs off tracers can have this happen, and they both only have the tsk == current, if tsk == current, use current_uid() instead of task_uid(), as current_uid() does not use RCU as only current can change its uid. This fixes the RCU suspicious splat. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tick: Prevent uncontrolled switch to oneshot modeThomas Gleixner2013-07-281-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1f73a9806bdd07a5106409bbcab3884078bd34fe upstream. When the system switches from periodic to oneshot mode, the broadcast logic causes a possibility that a CPU which has not yet switched to oneshot mode puts its own clock event device into oneshot mode without updating the state and the timer handler. CPU0 CPU1 per cpu tickdev is in periodic mode and switched to broadcast Switch to oneshot mode tick_broadcast_switch_to_oneshot() cpumask_copy(tick_oneshot_broacast_mask, tick_broadcast_mask); broadcast device mode = oneshot Timer interrupt irq_enter() tick_check_oneshot_broadcast() dev->set_mode(ONESHOT); tick_handle_periodic() if (dev->mode == ONESHOT) dev->next_event += period; FAIL. We fail, because dev->next_event contains KTIME_MAX, if the device was in periodic mode before the uncontrolled switch to oneshot happened. We must copy the broadcast bits over to the oneshot mask, because otherwise a CPU which relies on the broadcast would not been woken up anymore after the broadcast device switched to oneshot mode. So we need to verify in tick_check_oneshot_broadcast() whether the CPU has already switched to oneshot mode. If not, leave the device untouched and let the CPU switch controlled into oneshot mode. This is a long standing bug, which was never noticed, because the main user of the broadcast x86 cannot run into that scenario, AFAICT. The nonarchitected timer mess of ARM creates a gazillion of differently broken abominations which trigger the shortcomings of that broadcast code, which better had never been necessary in the first place. Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Cc: John Stultz <john.stultz@linaro.org>, Cc: Mark Rutland <mark.rutland@arm.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * timer: Fix jiffies wrap behavior of round_jiffies_common()Bart Van Assche2013-07-211-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9e04d3804d3ac97d8c03a41d78d0f0674b5d01e1 upstream. Direct compare of jiffies related values does not work in the wrap around case. Replace it with time_is_after_jiffies(). Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/519BC066.5080600@acm.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * genirq: Fix can_request_irq() for IRQs without an actionBen Hutchings2013-07-211-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2779db8d37d4b542d9ca2575f5f178dbeaca6c86 upstream. Commit 02725e7471b8 ('genirq: Use irq_get/put functions'), inadvertently changed can_request_irq() to return 0 for IRQs that have no action. This causes pcibios_lookup_irq() to select only IRQs that already have an action with IRQF_SHARED set, or to fail if there are none. Change can_request_irq() to return 1 for IRQs that have no action (if the first two conditions are met). Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> Tested-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> (against 3.2) Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: 709647@bugs.debian.org Link: http://bugs.debian.org/709647 Link: http://lkml.kernel.org/r/1372383630.23847.40.camel@deadeye.wl.decadent.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()Oleg Nesterov2013-07-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c790b0ad23f427c7522ffed264706238c57c007e upstream. fetch_bp_busy_slots() and toggle_bp_slot() use for_each_online_cpu(), this is obviously wrong wrt cpu_up() or cpu_down(), we can over/under account the per-cpu numbers. For example: # echo 0 >> /sys/devices/system/cpu/cpu1/online # perf record -e mem:0x10 -p 1 & # echo 1 >> /sys/devices/system/cpu/cpu1/online # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a & # taskset -p 0x2 1 triggers the same WARN_ONCE("Can't find any breakpoint slot") in arch_install_hw_breakpoint(). Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE sectionSteven Rostedt2013-06-131-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7f49ef69db6bbf756c0abca7e9b65b32e999eec8 upstream. As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the ftrace_pid_fops is defined when DYNAMIC_FTRACE is not. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Namhyung Kim <namhyung@kernel.org> [ lizf: adjust context ] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Fix possible NULL pointer dereferencesNamhyung Kim2013-06-131-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6a76f8c0ab19f215af2a3442870eeb5f0e81998d upstream. Currently set_ftrace_pid and set_graph_function files use seq_lseek for their fops. However seq_open() is called only for FMODE_READ in the fops->open() so that if an user tries to seek one of those file when she open it for writing, it sees NULL seq_file and then panic. It can be easily reproduced with following command: $ cd /sys/kernel/debug/tracing $ echo 1234 | sudo tee -a set_ftrace_pid In this example, GNU coreutils' tee opens the file with fopen(, "a") and then the fopen() internally calls lseek(). Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [ lizf: adjust context ] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * usermodehelper: check subprocess_info->path != NULLOleg Nesterov2013-05-191-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 264b83c07a84223f0efd0d1db9ccc66d6f88288f upstream. argv_split(empty_or_all_spaces) happily succeeds, it simply returns argc == 0 and argv[0] == NULL. Change call_usermodehelper_exec() to check sub_info->path != NULL to avoid the crash. This is the minimal fix, todo: - perhaps we should change argv_split() to return NULL or change the callers. - kill or justify ->path[0] check - narrow the scope of helper_lock() Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-By: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tick: Cleanup NOHZ per cpu data on cpu downThomas Gleixner2013-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4b0c0f294f60abcdd20994a8341a95c8ac5eeb96 upstream. Prarit reported a crash on CPU offline/online. The reason is that on CPU down the NOHZ related per cpu data of the dead cpu is not cleaned up. If at cpu online an interrupt happens before the per cpu tick device is registered the irq_enter() check potentially sees stale data and dereferences a NULL pointer. Cleanup the data after the cpu is dead. Reported-by: Prarit Bhargava <prarit@redhat.com> Cc: Mike Galbraith <bitbucket@online.de> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARETirupathi Reddy2013-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 42a5cf46cd56f46267d2a9fcf2655f4078cd3042 upstream. An inactive timer's base can refer to a offline cpu's base. In the current code, cpu_base's lock is blindly reinitialized each time a CPU is brought up. If a CPU is brought online during the period that another thread is trying to modify an inactive timer on that CPU with holding its timer base lock, then the lock will be reinitialized under its feet. This leads to following SPIN_BUG(). <0> BUG: spinlock already unlocked on CPU#3, kworker/u:3/1466 <0> lock: 0xe3ebe000, .magic: dead4ead, .owner: kworker/u:3/1466, .owner_cpu: 1 <4> [<c0013dc4>] (unwind_backtrace+0x0/0x11c) from [<c026e794>] (do_raw_spin_unlock+0x40/0xcc) <4> [<c026e794>] (do_raw_spin_unlock+0x40/0xcc) from [<c076c160>] (_raw_spin_unlock+0x8/0x30) <4> [<c076c160>] (_raw_spin_unlock+0x8/0x30) from [<c009b858>] (mod_timer+0x294/0x310) <4> [<c009b858>] (mod_timer+0x294/0x310) from [<c00a5e04>] (queue_delayed_work_on+0x104/0x120) <4> [<c00a5e04>] (queue_delayed_work_on+0x104/0x120) from [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c) <4> [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c) from [<c04d8780>] (sdhci_disable+0x40/0x48) <4> [<c04d8780>] (sdhci_disable+0x40/0x48) from [<c04bf300>] (mmc_release_host+0x4c/0xb0) <4> [<c04bf300>] (mmc_release_host+0x4c/0xb0) from [<c04c7aac>] (mmc_sd_detect+0x90/0xfc) <4> [<c04c7aac>] (mmc_sd_detect+0x90/0xfc) from [<c04c2504>] (mmc_rescan+0x7c/0x2c4) <4> [<c04c2504>] (mmc_rescan+0x7c/0x2c4) from [<c00a6a7c>] (process_one_work+0x27c/0x484) <4> [<c00a6a7c>] (process_one_work+0x27c/0x484) from [<c00a6e94>] (worker_thread+0x210/0x3b0) <4> [<c00a6e94>] (worker_thread+0x210/0x3b0) from [<c00aad9c>] (kthread+0x80/0x8c) <4> [<c00aad9c>] (kthread+0x80/0x8c) from [<c000ea80>] (kernel_thread_exit+0x0/0x8) As an example, this particular crash occurred when CPU #3 is executing mod_timer() on an inactive timer whose base is refered to offlined CPU #2. The code locked the timer_base corresponding to CPU #2. Before it could proceed, CPU #2 came online and reinitialized the spinlock corresponding to its base. Thus now CPU #3 held a lock which was reinitialized. When CPU #3 finally ended up unlocking the old cpu_base corresponding to CPU #2, we hit the above SPIN_BUG(). CPU #0 CPU #3 CPU #2 ------ ------- ------- ..... ...... <Offline> mod_timer() lock_timer_base spin_lock_irqsave(&base->lock) cpu_up(2) ..... ...... init_timers_cpu() .... ..... spin_lock_init(&base->lock) ..... spin_unlock_irqrestore(&base->lock) ...... <spin_bug> Allocation of per_cpu timer vector bases is done only once under "tvec_base_done[]" check. In the current code, spinlock_initialization of base->lock isn't under this check. When a CPU is up each time the base lock is reinitialized. Move base spinlock initialization under the check. Signed-off-by: Tirupathi Reddy <tirupath@codeaurora.org> Link: http://lkml.kernel.org/r/1368520142-4136-1-git-send-email-tirupath@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * kernel/audit_tree.c: tree will leak memory when failure occurs in ↵Chen Gang2013-05-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | audit_trim_trees() commit 12b2f117f3bf738c1a00a6f64393f1953a740bd4 upstream. audit_trim_trees() calls get_tree(). If a failure occurs we must call put_tree(). [akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement] Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>