| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 71458cfc782eafe4b27656e078d379a34e472adf upstream.
We're missing include/linux/compiler-gcc5.h which is required now
because gcc branched off to v5 in trunk.
Just copy the relevant bits out of include/linux/compiler-gcc4.h,
no new code is added as of now.
This fixes a build error when using gcc 5.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When calling rproc_get for the first time, the loading of
the remoteproc image will be requested using a non-blocking
request_firmware_no_wait, and the caller can continue before
the actual loading is complete.
The loader later can return an error due to a non-existing
or wrong image and there should be a way to notify about this
to users having a rproc handle.
This functionality is added and is leveraged by rpmsg to
release some resources it had already acquired since requesting
a firmware load.
Change-Id: I1d3523efbcfd613bca74d363084791ceaaaa9989
Signed-off-by: Miguel Vadillo <vadillo@ti.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()
Then it attempts to copy user data into this fresh skb.
If the copy fails, we undo the work and remove the fresh skb.
Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)
Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
This bug was found by Marco Grassi thanks to syzkaller.
Change-Id: I264f97d30d0a623011d9ee811c63fa0e0c2149a2
Fixes: 6859d49475d4 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a6138db815df5ee542d848318e5dae681590fccd upstream.
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.
Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others. This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.
Change-Id: I8ab8bda03a14b9b43e78f1dc6c818bbec048e986
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").
In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better). The
s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
software dirty bits") which made it into v3.9. Earlier kernels will
have to look at the page state itself.
Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.
To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.
Change-Id: Id9bec3722797dff7d0ff0d9f6097c4229e31fd62
Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask;
s/faultin_page/__get_user_page]
Signed-off-by: Willy Tarreau <w@1wt.eu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.
This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.
The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).
Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
Documentation/sysctl/fs.txt
fs/pipe.c
include/linux/sched.h
Change-Id: Ic7c678af18129943e16715fdaa64a97a7f0854be
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(cherry pick from commit 8a5e5e02fc83aaf67053ab53b359af08c6c49aaf)
Poison pointer values should be small enough to find a room in
non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space"
is located starting from 0x0. Given unprivileged users cannot mmap
anything below mmap_min_addr, it should be safe to use poison pointers
lower than mmap_min_addr.
The current poison pointer values of LIST_POISON{1,2} might be too big for
mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu
uses only 0x10000). There is little point to use such a big value given
the "poison pointer space" below 1 MB is not yet exhausted. Changing it
to a smaller value solves the problem for small mmap_min_addr setups.
The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6
Signed-off-by: Vasily Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 26429468
Bug: 26186802
Bug: 26429519
Change-Id: Ic51614f6cc98e416282f19af96b9d116eff7c08b
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When kernel.perf_event_open is set to 3 (or greater), disallow all
access to performance events by users without CAP_SYS_ADMIN.
Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT that
makes this value the default.
This is based on a similar feature in grsecurity
(CONFIG_GRKERNSEC_PERF_HARDEN). This version doesn't include making
the variable read-only. It also allows enabling further restriction
at run-time regardless of whether the default is changed.
https://lkml.org/lkml/2016/1/11/587
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Bug: 29054680
Change-Id: Iff5bff4fc1042e85866df9faa01bce8d04335ab8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Per RFC 6724, section 4, "Candidate Source Addresses":
It is RECOMMENDED that the candidate source addresses be the set
of unicast addresses assigned to the interface that will be used
to send to the destination (the "outgoing" interface).
Add a sysctl to enable this behaviour.
Signed-off-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Simplified back-port of net-next 3985e8a3611a93bb36789f65db862e5700aab65e]
Bug: 19470192
Bug: 21832279
Bug: 22464419
Change-Id: Icd96382f814a6f3ea53f05beb98c266b1929c5a3
|
|
|
|
|
| |
Change-Id: Ie8d4dea92c9bc9c5efe69886db05f93b3487f3a4
Signed-off-by: Kyle Repinski <repinski23@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(cherry picked from commit https://lkml.org/lkml/2016/2/4/831)
d07e22597d1d355 ("mm: mmap: add new /proc tunable for mmap_base ASLR")
added the ability to choose from a range of values to use for entropy
count in generating the random offset to the mmap_base address. The
maximum value on this range was set to 32 bits for 64-bit x86 systems, but
this value could be increased further, requiring more than the 32 bits of
randomness provided by get_random_int(), as is already possible for arm64.
Add a new function: get_random_long() which more naturally fits with the
mmap usage of get_random_int() but operates exactly the same as
get_random_int().
Also, fix the shifting constant in mmap_rnd() to be an unsigned long so
that values greater than 31 bits generate an appropriate mask without
overflow. This is especially important on x86, as its shift instruction
uses a 5-bit mask for the shift operand, which meant that any value for
mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base
randomization.
Finally, replace calls to get_random_int() with get_random_long() where
appropriate.
Bug: 26963541
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Ie7552631b5db86f3482cf15e7dc916d89c1c502b
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Ic74424e07710cd9ccb4a02871a829d14ef0cc4bc
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the device is suspended with the running camera and then resumed,
it's possible that autosuspend counter won't reach zero and the main
condition in rproc_last_busy() won't be satisfied, so rproc won't be
restarted.
Fixed that by tracking the need to restart rproc separately and
independent of pm state.
Also cleaned up reference counting in rproc_put and rproc_get a bit
to make sure that forced offline and restart don't screw up the number
of references + won't result in race conditions.
|
|
|
|
|
|
| |
Added extra CMA debugging logging into FS, compaction, isolation
and migration code. This makes it easier to see which parts of the
kernel are responsible for the most migration failures.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It looks like Linux pages migration code was never designed to be
deterministic or synchronous, there are multiple race conditions
between different parts of the code that make CMA allocation in one
step very likely to fail, especially for large memory ranges that
we need for Ducati.
Therefore, changing the allocation code to perform multiple allocation
attempts. To further increase the chances of the allocation to succeed
and to make things faster, the results of the previous allocation
attempts are kept - that is, all pages that are already isolated
stay isolated, so that retries are only for those pages that failed
isolation or migration in previous steps.
Additionally, there's a small delay between the steps so that the
chances of the other code to free the pages we need are higher.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
alloc_contig_range() performs memory allocation so it also should keep
track on keeping the correct level of memory watermarks. This commit adds
a call to *_slowpath style reclaim to grab enough pages to make sure that
the final collection of contiguous pages from freelists will not starve
the system.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Conflicts:
mm/page_alloc.c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.
This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).
It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory. Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.
To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types when requested.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit changes various functions that change pages and
pageblocks migrate type between MIGRATE_ISOLATE and
MIGRATE_MOVABLE in such a way as to allow to work with
MIGRATE_CMA migrate type.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds the alloc_contig_range() function which tries
to allocate given range of pages. It tries to migrate all
already allocated pages that fall in the range thus freeing them.
Once all pages in the range are freed they are removed from the
buddy system thus allocated for the caller to use.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Conflicts:
include/linux/gfp.h
mm/page_alloc.c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Contiguous Memory Allocator is a set of helper functions for DMA
mapping framework that improves allocations of contiguous memory chunks.
CMA grabs memory on system boot, marks it with MIGRATE_CMA migrate type
and gives back to the system. Kernel is allowed to allocate only movable
pages within CMA's managed memory so that it can be used for example for
page cache when DMA mapping do not use it. On
dma_alloc_from_contiguous() request such pages are migrated out of CMA
area to free required contiguous block and fulfill the request. This
allows to allocate large contiguous chunks of memory at any time
assuming that there is enough free memory available in the system.
This code is heavily based on earlier works by Michal Nazarewicz.
Change-Id: I686c81fddee3197aa53c7668350673ce8fdb02ef
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
|
|
|
|
| |
namespace"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.
A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.
We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.
I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 98f842e675f96ffac96e6c50315790912b2812be)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Generalize the proc inode allocation so that it can be
used without having to having to create a proc_dir_entry.
This will allow namespace file descriptors to remain light
weight entitities but still have the same inode number
when the backing namespace is the same.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 33d6dce607573b5fd7a43168e0d91221b3ca532b)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Add a filesystem flag to mark filesystems that are safe to mount as
an unprivileged user.
- Add a filesystem flag to mark filesystems that don't need MNT_NODEV
when mounted by an unprivileged user.
- Relax the permission checks to allow unprivileged users that have
CAP_SYS_ADMIN permissions in the user namespace referred to by the
current mount namespace to be allowed to mount, unmount, and move
filesystems.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 0c55cfc4166d9a0f38de779bd4d75a90afbe7734)
|
|
|
|
|
|
|
|
| |
This will allow for support for unprivileged mounts in a new user namespace.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 771b1371686e0a63e938ada28de020b9a0040f55)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
setns support for the mount namespace is a little tricky as an
arbitrary decision must be made about what to set fs->root and
fs->pwd to, as there is no expectation of a relationship between
the two mount namespaces. Therefore I arbitrarily find the root
mount point, and follow every mount on top of it to find the top
of the mount stack. Then I set fs->root and fs->pwd to that
location. The topmost root of the mount stack seems like a
reasonable place to be.
Bind mount support for the mount namespace inodes has the
possibility of creating circular dependencies between mount
namespaces. Circular dependencies can result in loops that
prevent mount namespaces from every being freed. I avoid
creating those circular dependencies by adding a sequence number
to the mount namespace and require all bind mounts be of a
younger mount namespace into an older mount namespace.
Add a helper function proc_ns_inode so it is possible to
detect when we are attempting to bind mound a namespace inode.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 8823c079ba7136dc1948d6f6dcb5f8022bde438e)
Conflicts:
fs/namespace.c
|
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 808d4e3cfdcc52b19276175464f6dbca4df13b09)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.
I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit c3b2da314834499f34cba94f7053e55f6d6f92d8)
Conflicts:
fs/inode.c
|
|
|
|
|
|
|
|
|
| |
don't rely on proc_mounts->m being the first field; container_of()
is there for purpose. No need to bother with ->private, while
we are at it - the same container_of will do nicely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 6ce6e24e72233073c8ead9419fc5040d44803dae)
|
|
|
|
|
|
|
|
|
|
| |
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit f7a99c5b7c8bd3d3f533c8b38274e33f3da9096e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.
This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit eea62f831b8030b0eeea8314eed73b6132d1de26)
|
|
|
|
|
|
|
|
|
|
| |
Optimizing the slow paths adds a lot of complexity. If you need to
grab every lock often, you have other problems.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 9dd6fa03ab31bb57cee4623a689d058d222fbe68)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Change-Id: I6ed36e1bcac7a29132fde1667ac0f62dcda69e44
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Stack for a new thread is mapped by userspace code and passed via
sys_clone. This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks. This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.
For a multithreaded program like the following:
#include <pthread.h>
void *thread_main(void *foo)
{
while(1);
}
int main()
{
pthread_t t;
pthread_create(&t, NULL, thread_main, NULL);
pthread_join(t, NULL);
}
proc/PID/maps looks like the following:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but
that is not always a reliable way to find out which vma is a thread
stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same
content.
With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
as the task would see it' and hence, only the vma that that task uses as
stack is marked as [stack]. All other 'stack' vmas are marked as
anonymous memory. /proc/PID/maps acts as a thread group level view,
where all thread stack vmas are marked as [stack:TID] where TID is the
process ID of the task that uses that vma as stack, while the process
stack is marked as [stack].
So /proc/PID/maps will look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Thus marking all vmas that are used as stacks by the threads in the
thread group along with the process stack. The task level maps will
however like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
where only the vma that is being used as a stack by *that* task is
marked as [stack].
Analogous changes have been made to /proc/PID/smaps,
/proc/PID/numa_maps, /proc/PID/task/TID/smaps and
/proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
numa_maps:
[siddhesh@localhost ~ ]$ pgrep a.out
1441
[siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
7fff6273a000 default stack anon=3 dirty=3 N0=3
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
7f8a44492000 default stack anon=2 dirty=2 N0=2
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
7fff6273a000 default stack anon=3 dirty=3 N0=3
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linus found that the gigantic size of the common audit data caused a big
perf hit on something as simple as running stat() in a loop. This patch
requires LSMs to declare the LSM specific portion separately rather than
doing it in a union. Thus each LSM can be responsible for shrinking their
portion and don't have to pay a penalty just because other LSMs have a
bigger space requirement.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
security/selinux/avc.c
|
|
|
|
|
|
|
|
|
|
| |
Once upon a time netlink was not sync and we had to get the effective
capabilities from the skb that was being received. Today we instead get
the capabilities from the current task. This has rendered the entire
purpose of the hook moot as it is now functionally equivalent to the
capable() call.
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The capabilities framework is based around credentials, not necessarily the
current task. Yet we still passed the current task down into LSMs from the
security_capable() LSM hook as if it was a meaningful portion of the security
decision. This patch removes the 'generic' passing of current and instead
forces individual LSMs to use current explicitly if they think it is
appropriate. In our case those LSMs are SELinux and AppArmor.
I believe the AppArmor use of current is incorrect, but that is wholely
unrelated to this patch. This patch does not change what AppArmor does, it
just makes it clear in the AppArmor code that it is doing it.
The SELinux code still uses current in it's audit message, which may also be
wrong and needs further investigation. Again this is NOT a change, it may
have always been wrong, this patch just makes it clear what is happening.
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Get rid of semicolon so that those expressions can be used also
somewhere else than just in an assignment.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
|
|
|
|
|
|
|
|
|
| |
This hashtable implementation is using hlist buckets to provide a simple
hashtable to prevent it from getting reimplemented all over the kernel.
Change-Id: Ie91c0b7a0537b8863d6df1e2771f54d4b731c496
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
[ Merging this now, so that subsystems can start applying Sasha's
patches that use this - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
| |
Add information about ioctl calls to the LSM audit data. Log the
file path and command number.
Bug: 18087110
Change-Id: Idbbd106db6226683cb30022d9e8f6f3b8fab7f84
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When system load is heavy, memory commpaction may encounter
transient failure; add retry mechanism. And add the following
statistic variable "compact_retry_success" in /proc/vmstat
to see how many compaction success due to the retry.
And __zone_watermark_ok() will easily return false when system
is busy, which will make compaction failure as well. Amend some
to save margin cases.
This patch is ported from the patch:
* (CR) mem compaction enhancement
Reviewed-on: http://gerrit.pcs.mot.com/494962
Change-Id: I5ad08364678c7e68993f66e0c0d43d97712a99b6
Signed-off-by: Yi-wei Zhao <gbjc64@motorola.com>
Reviewed-on: http://gerrit.pcs.mot.com/507230
Tested-by: Jira Key <JIRAKEY@motorola.com>
Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.
This patch fixes bug in macro radix_tree_for_each_contig().
If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
radix_tree_next_chunk() switches iterating into next chunk. As result iterating
becomes non-contiguous and breaks vfs "splice" and all its users.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-and-bisected-by: Hans de Bruin <jmdebruin@xmsnet.nl>
Reported-and-bisected-by: Ondrej Zary <linux@rainbow-software.org>
Reported-bisected-and-tested-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lkml.org/lkml/2012/6/5/64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A series of radix tree cleanups, and usage of them in the core pagecache
code.
Micro-benchmark:
lookup 14 slots (typical page-vector size)
in radix-tree there earch <step> slot filled and tagged
before/after - nsec per full scan through tree
* Intel Sandy Bridge i7-2620M 4Mb L3
New code always faster
* AMD Athlon 6000+ 2x1Mb L2, without L3
New code generally faster,
Minor degradation (marked with "*") for huge sparse trees
* i386 on Sandy Bridge
New code faster for common cases: tagged and dense trees.
Some degradations for non-tagged lookup on sparse trees.
Ideally, there might help __ffs() analog for searching first non-zero
long element in array, gcc sometimes cannot optimize this loop corretly.
Numbers:
CPU: Intel Sandy Bridge i7-2620M 4Mb L3
radix-tree with 1024 slots:
tagged lookup
step 1 before 7156 after 3613
step 2 before 5399 after 2696
step 3 before 4779 after 1928
step 4 before 4456 after 1429
step 5 before 4292 after 1213
step 6 before 4183 after 1052
step 7 before 4157 after 951
step 8 before 4016 after 812
step 9 before 3952 after 851
step 10 before 3937 after 732
step 11 before 4023 after 709
step 12 before 3872 after 657
step 13 before 3892 after 633
step 14 before 3720 after 591
step 15 before 3879 after 578
step 16 before 3561 after 513
normal lookup
step 1 before 4266 after 3301
step 2 before 2695 after 2129
step 3 before 2083 after 1712
step 4 before 1801 after 1534
step 5 before 1628 after 1313
step 6 before 1551 after 1263
step 7 before 1475 after 1185
step 8 before 1432 after 1167
step 9 before 1373 after 1092
step 10 before 1339 after 1134
step 11 before 1292 after 1056
step 12 before 1319 after 1030
step 13 before 1276 after 1004
step 14 before 1256 after 987
step 15 before 1228 after 992
step 16 before 1247 after 999
radix-tree with 1024*1024*128 slots:
tagged lookup
step 1 before 1086102841 after 674196409
step 2 before 816839155 after 498138306
step 7 before 599728907 after 240676762
step 15 before 555729253 after 185219677
step 63 before 606637748 after 128585664
step 64 before 608384432 after 102945089
step 65 before 596987114 after 123996019
step 128 before 304459225 after 56783056
step 256 before 158846855 after 31232481
step 512 before 86085652 after 18950595
step 12345 before 6517189 after 1674057
normal lookup
step 1 before 626064869 after 544418266
step 2 before 418809975 after 336321473
step 7 before 242303598 after 207755560
step 15 before 208380563 after 176496355
step 63 before 186854206 after 167283638
step 64 before 176188060 after 170143976
step 65 before 185139608 after 167487116
step 128 before 88181865 after 86913490
step 256 before 45733628 after 45143534
step 512 before 24506038 after 23859036
step 12345 before 2177425 after 2018662
* AMD Athlon 6000+ 2x1Mb L2, without L3
radix-tree with 1024 slots:
tag-lookup
step 1 before 8164 after 5379
step 2 before 5818 after 5581
step 3 before 4959 after 4213
step 4 before 4371 after 3386
step 5 before 4204 after 2997
step 6 before 4950 after 2744
step 7 before 4598 after 2480
step 8 before 4251 after 2288
step 9 before 4262 after 2243
step 10 before 4175 after 2131
step 11 before 3999 after 2024
step 12 before 3979 after 1994
step 13 before 3842 after 1929
step 14 before 3750 after 1810
step 15 before 3735 after 1810
step 16 before 3532 after 1660
normal-lookup
step 1 before 7875 after 5847
step 2 before 4808 after 4071
step 3 before 4073 after 3462
step 4 before 3677 after 3074
step 5 before 4308 after 2978
step 6 before 3911 after 3807
step 7 before 3635 after 3522
step 8 before 3313 after 3202
step 9 before 3280 after 3257
step 10 before 3166 after 3083
step 11 before 3066 after 3026
step 12 before 2985 after 2982
step 13 before 2925 after 2924
step 14 before 2834 after 2808
step 15 before 2805 after 2803
step 16 before 2647 after 2622
radix-tree with 1024*1024*128 slots:
tag-lookup
step 1 before 1288059720 after 951736580
step 2 before 961292300 after 884212140
step 7 before 768905140 after 547267580
step 15 before 771319480 after 456550640
step 63 before 504847640 after 242704304
step 64 before 392484800 after 177920786
step 65 before 491162160 after 246895264
step 128 before 208084064 after 97348392
step 256 before 112401035 after 51408126
step 512 before 75825834 after 29145070
step 12345 before 5603166 after 2847330
normal-lookup
step 1 before 1025677120 after 861375100
step 2 before 647220080 after 572258540
step 7 before 505518960 after 484041813
step 15 before 430483053 after 444815320 *
step 63 before 388113453 after 404250546 *
step 64 before 374154666 after 396027440 *
step 65 before 381423973 after 396704853 *
step 128 before 190078700 after 202619384 *
step 256 before 100886756 after 102829108 *
step 512 before 64074505 after 56158720
step 12345 before 4237289 after 4422299 *
* i686 on Sandy bridge
radix-tree with 1024 slots:
tagged lookup
step 1 before 7990 after 4019
step 2 before 5698 after 2897
step 3 before 5013 after 2475
step 4 before 4630 after 1721
step 5 before 4346 after 1759
step 6 before 4299 after 1556
step 7 before 4098 after 1513
step 8 before 4115 after 1222
step 9 before 3983 after 1390
step 10 before 4077 after 1207
step 11 before 3921 after 1231
step 12 before 3894 after 1116
step 13 before 3840 after 1147
step 14 before 3799 after 1090
step 15 before 3797 after 1059
step 16 before 3783 after 745
normal lookup
step 1 before 5103 after 3499
step 2 before 3299 after 2550
step 3 before 2489 after 2370
step 4 before 2034 after 2302 *
step 5 before 1846 after 2268 *
step 6 before 1752 after 2249 *
step 7 before 1679 after 2164 *
step 8 before 1627 after 2153 *
step 9 before 1542 after 2095 *
step 10 before 1479 after 2109 *
step 11 before 1469 after 2009 *
step 12 before 1445 after 2039 *
step 13 before 1411 after 2013 *
step 14 before 1374 after 2046 *
step 15 before 1340 after 1975 *
step 16 before 1331 after 2000 *
radix-tree with 1024*1024*128 slots:
tagged lookup
step 1 before 1225865377 after 667153553
step 2 before 842427423 after 471533007
step 7 before 609296153 after 276260116
step 15 before 544232060 after 226859105
step 63 before 519209199 after 141343043
step 64 before 588980279 after 141951339
step 65 before 521099710 after 138282060
step 128 before 298476778 after 83390628
step 256 before 149358342 after 43602609
step 512 before 76994713 after 22911077
step 12345 before 5328666 after 1472111
normal lookup
step 1 before 819284564 after 533635310
step 2 before 512421605 after 364956155
step 7 before 271443305 after 305721345 *
step 15 before 223591630 after 273960216 *
step 63 before 190320247 after 217770207 *
step 64 before 178538168 after 267411372 *
step 65 before 186400423 after 215347937 *
step 128 before 88106045 after 140540612 *
step 256 before 44812420 after 70660377 *
step 512 before 24435438 after 36328275 *
step 12345 before 2123924 after 2148062 *
bloat-o-meter delta for this patchset + patchset with related shmem cleanups
bloat-o-meter: x86_64
add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11)
function old new delta
radix_tree_next_chunk - 499 +499
shmem_unuse 428 554 +126
shmem_radix_tree_replace 131 227 +96
find_get_pages_tag 354 419 +65
find_get_pages_contig 345 407 +62
find_get_pages 362 396 +34
__kstrtab_radix_tree_next_chunk - 22 +22
__ksymtab_radix_tree_next_chunk - 16 +16
__kcrctab_radix_tree_next_chunk - 8 +8
radix_tree_gang_lookup_slot 204 203 -1
static.shmem_xattr_set 384 381 -3
radix_tree_gang_lookup_tag_slot 208 191 -17
radix_tree_gang_lookup 231 187 -44
radix_tree_gang_lookup_tag 247 199 -48
shmem_unlock_mapping 278 190 -88
__lookup 217 - -217
__lookup_tag 242 - -242
radix_tree_locate_item 279 - -279
bloat-o-meter: i386
add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200)
function old new delta
radix_tree_next_chunk - 757 +757
shmem_unuse 352 449 +97
find_get_pages_contig 269 322 +53
shmem_radix_tree_replace 113 154 +41
find_get_pages_tag 277 318 +41
dcache_dir_lseek 426 458 +32
__kstrtab_radix_tree_next_chunk - 22 +22
vc_do_resize 968 977 +9
snd_pcm_lib_read1 725 733 +8
__ksymtab_radix_tree_next_chunk - 8 +8
netlbl_cipsov4_list 1120 1127 +7
find_get_pages 293 291 -2
new_slab 467 459 -8
bitfill_unaligned_rev 425 417 -8
radix_tree_gang_lookup_tag_slot 177 146 -31
blk_dump_cmd 267 229 -38
radix_tree_gang_lookup_slot 212 134 -78
shmem_unlock_mapping 221 128 -93
radix_tree_gang_lookup_tag 275 162 -113
radix_tree_gang_lookup 255 126 -129
__lookup 227 - -227
__lookup_tag 271 - -271
radix_tree_locate_item 277 - -277
This patch:
Implement a clean, simple and effective radix-tree iteration routine.
Iterating divided into two phases:
* lookup next chunk in radix-tree leaf node
* iterating through slots in this chunk
Main iterator function radix_tree_next_chunk() returns pointer to first
slot, and stores in the struct radix_tree_iter index of next-to-last slot.
For tagged-iterating it also constuct bitmask of tags for retunted chunk.
All additional logic implemented as static-inline functions and macroses.
Also adds radix_tree_find_next_bit() static-inline variant of
find_next_bit() optimized for small constant size arrays, because
find_next_bit() too heavy for searching in an array with one/two long
elements.
[akpm@linux-foundation.org: rework comments a bit]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
| |
It is not used anymore, remove it
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have already acknowledged that swapoff of a tmpfs file is slower than
it was before conversion to the generic radix_tree: a little slower
there will be acceptable, if the hotter paths are faster.
But it was a shock to find swapoff of a 500MB file 20 times slower on my
laptop, taking 10 minutes; and at that rate it significantly slows down
my testing.
Now, most of that turned out to be overhead from PROVE_LOCKING and
PROVE_RCU: without those it was only 4 times slower than before; and
more realistic tests on other machines don't fare as badly.
I've tried a number of things to improve it, including tagging the swap
entries, then doing lookup by tag: I'd expected that to halve the time,
but in practice it's erratic, and often counter-productive.
The only change I've so far found to make a consistent improvement, is
to short-circuit the way we go back and forth, gang lookup packing
entries into the array supplied, then shmem scanning that array for the
target entry. Scanning in place doubles the speed, so it's now only
twice as slow as before (or three times slower when the PROVEs are on).
So, add radix_tree_locate_item() as an expedient, once-off,
single-caller hack to do the lookup directly in place. #ifdef it on
CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
applicability as save space in other configurations. And, sadly,
Change-Id: I2a1490f9c2728e22523f878bc009bf100226b162
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
peculiar swap vector, instead keeping a file's swap entries in the same
radix tree as its struct page pointers: thus saving memory, and
simplifying its code and locking.
This patch:
The radix_tree is used by several subsystems for different purposes. A
major use is to store the struct page pointers of a file's pagecache for
memory management. But what if mm wanted to store something other than
page pointers there too?
The low bit of a radix_tree entry is already used to denote an indirect
pointer, for internal use, and the unlikely radix_tree_deref_retry()
case.
Define the next bit as denoting an exceptional entry, and supply inline
functions radix_tree_exception() to return non-0 in either unlikely
case, and radix_tree_exceptional_entry() to return non-0 in the second
case.
If a subsystem already uses radix_tree with that bit set, no problem: it
does not affect internal workings at all, but is defined for the
convenience of those storing well-aligned pointers in the radix_tree.
The radix_tree_gang_lookups have an implicit assumption that the caller
can deduce the offset of each entry returned e.g. by the page->index of
a struct page. But that may not be feasible for some kinds of item to
be stored there.
radix_tree_gang_lookup_slot() allow for an optional indices argument,
output array in which to return those offsets. The same could be added
to other radix_tree_gang_lookups, but for now keep it to the only one
for which we need it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From: Pradeep Sawlani <sawlani@amazon.com>
On system like Android where most of the process are forked
from parent w/o execve, KSM can scan same page multiple times
in one scan cycle. There is no advantage in scanning same page
multiple times for merging. During testing with Android, it was
observed around 60% pages are skipped for each scan cycle.
Change-Id: I0cf01802f0b4d61fcab92558beb9e1c660dc9a77
Link: http://lkml.kernel.org/r/CAMrOTPgBtANS_ryRjan0-dTL97U7eRvtf3dCsss=Kn+Uk89fuA@mail.gmail.com
Signed-off-by: Pradeep Sawlani <sawlani@amazon.com>
Signed-off-by: Paul Reioux <reioux@gmail.com>
Conflicts:
include/linux/page-flags.h
Conflicts:
include/linux/page-flags.h
|