From dce59c2faeb130855bd05d025d854ae79d8dbedd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 29 May 2012 15:06:45 -0700 Subject: mm: consider all swapped back pages in used-once logic commit e48982734ea0500d1eba4f9d96195acc5406cad6 upstream. Commit 645747462435 ("vmscan: detect mapped file pages used only once") made mapped pages have another round in inactive list because they might be just short lived and so we could consider them again next time. This heuristic helps to reduce pressure on the active list with a streaming IO worklods. This patch fixes a regression introduced by this commit for heavy shmem based workloads because unlike Anon pages, which are excluded from this heuristic because they are usually long lived, shmem pages are handled as a regular page cache. This doesn't work quite well, unfortunately, if the workload is mostly backed by shmem (in memory database sitting on 80% of memory) with a streaming IO in the background (backup - up to 20% of memory). Anon inactive list is full of (dirty) shmem pages when watermarks are hit. Shmem pages are kept in the inactive list (they are referenced) in the first round and it is hard to reclaim anything else so we reach lower scanning priorities very quickly which leads to an excessive swap out. Let's fix this by excluding all swap backed pages (they tend to be long lived wrt. the regular page cache anyway) from used-once heuristic and rather activate them if they are referenced. The customer's workload is shmem backed database (80% of RAM) and they are measuring transactions/s with an IO in the background (20%). Transactions touch more or less random rows in the table. The transaction rate fell by a factor of 3 (in the worst case) because of commit 64574746. This patch restores the previous numbers. Signed-off-by: Michal Hocko Acked-by: Johannes Weiner Cc: Mel Gorman Cc: Minchan Kim Cc: KAMEZAWA Hiroyuki Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 6072d74..769935d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -665,7 +665,7 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_RECLAIM; if (referenced_ptes) { - if (PageAnon(page)) + if (PageSwapBacked(page)) return PAGEREF_ACTIVATE; /* * All mapped pages start out with page table -- cgit v1.1 From 32ef2126fabc8a984506a1a4e83f3459ba1a2075 Mon Sep 17 00:00:00 2001 From: Jiang Liu Date: Wed, 11 Jul 2012 14:01:52 -0700 Subject: memory hotplug: fix invalid memory access caused by stale kswapd pointer commit d8adde17e5f858427504725218c56aef90e90fc7 upstream. kswapd_stop() is called to destroy the kswapd work thread when all memory of a NUMA node has been offlined. But kswapd_stop() only terminates the work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale pointer will prevent kswapd_run() from creating a new work thread when adding memory to the memory-less NUMA node again. Eventually the stale pointer may cause invalid memory access. An example stack dump as below. It's reproduced with 2.6.32, but latest kernel has the same issue. 
BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] exit_creds+0x12/0x78 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/memory/memory391/state CPU 11 Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285 RIP: 0010:exit_creds+0x12/0x78 RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502 RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000 RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10 R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000 R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001 FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600) Stack: ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1 0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003 Call Trace: __put_task_struct+0x5d/0x97 kthread_stop+0x50/0x58 offline_pages+0x324/0x3da memory_block_change_state+0x179/0x1db store_mem_state+0x9e/0xbb sysfs_write_file+0xd0/0x107 vfs_write+0xad/0x169 sys_write+0x45/0x6e system_call_fastpath+0x16/0x1b Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0 RIP exit_creds+0x12/0x78 RSP CR2: 0000000000000000 [akpm@linux-foundation.org: add pglist_data.kswapd locking comments] Signed-off-by: Xishi Qiu Signed-off-by: Jiang Liu Acked-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro Acked-by: Mel Gorman Acked-by: David Rientjes Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 769935d..1b0ed36 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2952,14 +2952,17 @@ int kswapd_run(int nid) } /* - * Called by memory hotplug when all memory in a node is offlined. + * Called by memory hotplug when all memory in a node is offlined. Caller must + * hold lock_memory_hotplug(). */ void kswapd_stop(int nid) { struct task_struct *kswapd = NODE_DATA(nid)->kswapd; - if (kswapd) + if (kswapd) { kthread_stop(kswapd); + NODE_DATA(nid)->kswapd = NULL; + } } static int __init kswapd_init(void) -- cgit v1.1 From 6d40de834ce9bb2964a70b7f91a98406eceb0399 Mon Sep 17 00:00:00 2001 From: Aaditya Kumar Date: Tue, 17 Jul 2012 15:48:07 -0700 Subject: mm: fix lost kswapd wakeup in kswapd_stop() commit 1c7e7f6c0703d03af6bcd5ccc11fc15d23e5ecbe upstream. 
Offlining memory may block forever, waiting for kswapd() to wake up because kswapd() does not check the event kthread->should_stop before sleeping. The proper pattern, from Documentation/memory-barriers.txt, is: --- waker --- event_indicated = 1; wake_up_process(event_daemon); --- sleeper --- for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (event_indicated) break; schedule(); } set_current_state() may be wrapped by: prepare_to_wait(); In the kswapd() case, event_indicated is kthread->should_stop. === offlining memory (waker) === kswapd_stop() kthread_stop() kthread->should_stop = 1 wake_up_process() wait_for_completion() === kswapd_try_to_sleep (sleeper) === kswapd_try_to_sleep() prepare_to_wait() . . schedule() . . finish_wait() The schedule() needs to be protected by a test of kthread->should_stop, which is wrapped by kthread_should_stop(). Reproducer: Do heavy file I/O in background. Do a memory offline/online in a tight loop Signed-off-by: Aaditya Kumar Acked-by: KOSAKI Motohiro Reviewed-by: Minchan Kim Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1b0ed36..130fa32 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2695,7 +2695,10 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) * them before going back to sleep. */ set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold); - schedule(); + + if (!kthread_should_stop()) + schedule(); + set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold); } else { if (remaining) -- cgit v1.1 From 33c17eafdeefb08fbb6ded946abcf024f76c9615 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 14 Sep 2011 16:21:52 -0700 Subject: mm: vmscan: fix force-scanning small targets without swap commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream. Stable note: Not tracked in Bugzilla. This patch augments an earlier commit that avoids scanning priority being artificially raised. The older fix was particularly important for small memcgs to avoid calling wait_iff_congested() unnecessarily. Without swap, anonymous pages are not scanned. As such, they should not count when considering force-scanning a small target if there is no swap. Otherwise, targets are not force-scanned even when their effective scan number is zero and the other conditions--kswapd/memcg--apply. This fixes 246e87a93934 ("memcg: fix get_scan_count() for small targets"). 
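To make the accounting argument above concrete, here is a minimal userspace C sketch of the changelog's reasoning, not the real get_scan_count(); the names (struct zone_counts, small_scan_target) are invented for illustration and the logic only shows why anonymous pages must not count towards the size of a reclaim target when no swap is available.

/* Hedged sketch - build with: cc -std=c99 -o scan_sketch scan_sketch.c */
#include <stdbool.h>
#include <stdio.h>

/* Invented stand-in for the per-zone LRU counters. */
struct zone_counts {
	unsigned long anon;	/* active + inactive anonymous pages */
	unsigned long file;	/* active + inactive file pages */
};

#define SWAP_CLUSTER_MAX 32UL	/* reclaim batch size, as in the kernel */

/*
 * Only pages that can actually be scanned should count when judging
 * whether a reclaim target is "small": without swap, anonymous pages
 * are not scanned, so they must not inflate the total.
 */
static bool small_scan_target(const struct zone_counts *zc, int priority,
			      bool have_swap)
{
	unsigned long scannable = zc->file + (have_swap ? zc->anon : 0);

	return (scannable >> priority) < SWAP_CLUSTER_MAX;
}

int main(void)
{
	struct zone_counts zc = { .anon = 1 << 20, .file = 256 };

	/* Swapless: only the 256 file pages count, so force-scan applies. */
	printf("no swap:   %d\n", small_scan_target(&zc, 12, false));
	/* With swap the anon pages count and the target is not small. */
	printf("with swap: %d\n", small_scan_target(&zc, 12, true));
	return 0;
}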
[akpm@linux-foundation.org: fix comment] Signed-off-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Reviewed-by: Michal Hocko Cc: Ying Han Cc: Balbir Singh Cc: KOSAKI Motohiro Cc: Daisuke Nishimura Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 130fa32..347bb447 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1747,23 +1747,15 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc, u64 fraction[2], denominator; enum lru_list l; int noswap = 0; - int force_scan = 0; + bool force_scan = false; unsigned long nr_force_scan[2]; - - anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) + - zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON); - file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) + - zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE); - - if (((anon + file) >> priority) < SWAP_CLUSTER_MAX) { - /* kswapd does zone balancing and need to scan this zone */ - if (scanning_global_lru(sc) && current_is_kswapd()) - force_scan = 1; - /* memcg may have small limit and need to avoid priority drop */ - if (!scanning_global_lru(sc)) - force_scan = 1; - } + /* kswapd does zone balancing and needs to scan this zone */ + if (scanning_global_lru(sc) && current_is_kswapd()) + force_scan = true; + /* memcg may have small limit and need to avoid priority drop */ + if (!scanning_global_lru(sc)) + force_scan = true; /* If we have no swap space, do not bother scanning anon pages. */ if (!sc->may_swap || (nr_swap_pages <= 0)) { @@ -1776,6 +1768,11 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc, goto out; } + anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) + + zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON); + file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) + + zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE); + if (scanning_global_lru(sc)) { free = zone_page_state(zone, NR_FREE_PAGES); /* If we have very few page cache pages, -- cgit v1.1 From 564ea9dd5ab042cb2fe8373f4d627073706e1d4f Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Thu, 25 Aug 2011 15:59:12 -0700 Subject: vmscan: clear ZONE_CONGESTED for zone with good watermark commit 439423f6894aa0dec22187526827456f5004baed upstream. Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing ZONE_CONGESTED after it balances a zone and this patch fixes a bug where that was failing to happen. Without this patch, processes can stall in wait_iff_congested unnecessarily. For users, this can look like an interactivity stall but some workloads would see it as sudden drop in throughput. ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any task. It's possible ZONE_CONGESTED isn't cleared in some cases: 1. the zone is already balanced just entering balance_pgdat() for order-0 because concurrent tasks free memory. In this case, later check will skip the zone as it's balanced so the flag isn't cleared. 2. high order balance fallbacks to order-0. quote from Mel: At the end of balance_pgdat(), kswapd uses the following logic; If reclaiming at high order { for each zone { if all_unreclaimable skip if watermark is not met order = 0 loop again /* watermark is met */ clear congested } } i.e. it clears ZONE_CONGESTED if it the zone is balanced. if not, it restarts balancing at order-0. 
However, if the higher zones are balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as that only happens after a zone is shrunk. This can mean that wait_iff_congested() stalls unnecessarily. This patch makes kswapd clear ZONE_CONGESTED during its initial highmem->dma scan for zones that are already balanced. Signed-off-by: Shaohua Li Acked-by: Mel Gorman Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 347bb447..6b0f8a6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2456,6 +2456,9 @@ loop_again: high_wmark_pages(zone), 0, 0)) { end_zone = i; break; + } else { + /* If balanced, clear the congested flag */ + zone_clear_flag(zone, ZONE_CONGESTED); } } if (i < 0) -- cgit v1.1 From 5e5b3d2ed3aee6f8bbe38c0945876aacce11ff03 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Fri, 8 Jul 2011 14:14:34 +1000 Subject: vmscan: add shrink_slab tracepoints commit 095760730c1047c69159ce88021a7fa3833502c8 upstream. Stable note: This patch makes later patches easier to apply but otherwise has little to justify it. It is a diagnostic patch that was part of a series addressing excessive slab shrinking after GFP_NOFS failures. There is detailed information on the series' motivation at https://lkml.org/lkml/2011/6/2/42. It is impossible to understand what the shrinkers are actually doing without instrumenting the code, so add some tracepoints to allow insight to be gained. Signed-off-by: Dave Chinner Signed-off-by: Al Viro Signed-off-by: Mel Gorman --- mm/vmscan.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 6b0f8a6..abc7981 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_control *shrink, unsigned long long delta; unsigned long total_scan; unsigned long max_pass; + int shrink_ret = 0; max_pass = do_shrinker_shrink(shrinker, shrink, 0); delta = (4 * nr_pages_scanned) / shrinker->seeks; @@ -274,9 +275,12 @@ unsigned long shrink_slab(struct shrink_control *shrink, total_scan = shrinker->nr; shrinker->nr = 0; + trace_mm_shrink_slab_start(shrinker, shrink, total_scan, + nr_pages_scanned, lru_pages, + max_pass, delta, total_scan); + while (total_scan >= SHRINK_BATCH) { long this_scan = SHRINK_BATCH; - int shrink_ret; int nr_before; nr_before = do_shrinker_shrink(shrinker, shrink, 0); @@ -293,6 +297,8 @@ unsigned long shrink_slab(struct shrink_control *shrink, } shrinker->nr += total_scan; + trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan, + shrinker->nr); } up_read(&shrinker_rwsem); out: -- cgit v1.1 From 6a5091a09f9278f8f821e3f33ac748633d143cea Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Fri, 8 Jul 2011 14:14:35 +1000 Subject: vmscan: shrinker->nr updates race and go wrong commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream. Stable note: Not tracked in Bugzilla. This patch reduces excessive reclaim of slab objects reducing the amount of information that has to be brought back in from disk. shrink_slab() allows shrinkers to be called in parallel so the struct shrinker can be updated concurrently. It does not provide any exclusion for such updates, so we can get the shrinker->nr value increasing or decreasing incorrectly. As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire, sometimes updating with the scan count that wasn't used, sometimes losing it altogether. Worse is when a shrinker does work and that update is lost due to racy updates, which means the shrinker will do the work again! Fix this by making the total_scan calculations independent of shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to other updates via cmpxchg loops. Signed-off-by: Dave Chinner Signed-off-by: Al Viro Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++------------- 1 file changed, 32 insertions(+), 13 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index abc7981..bb7df1e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_control *shrink, unsigned long total_scan; unsigned long max_pass; int shrink_ret = 0; + long nr; + long new_nr; + /* + * copy the current shrinker scan count into a local variable + * and zero it so that other concurrent shrinker invocations + * don't also do this scanning work. + */ + do { + nr = shrinker->nr; + } while (cmpxchg(&shrinker->nr, nr, 0) != nr); + + total_scan = nr; max_pass = do_shrinker_shrink(shrinker, shrink, 0); delta = (4 * nr_pages_scanned) / shrinker->seeks; delta *= max_pass; do_div(delta, lru_pages + 1); - shrinker->nr += delta; - if (shrinker->nr < 0) { + total_scan += delta; + if (total_scan < 0) { printk(KERN_ERR "shrink_slab: %pF negative objects to " "delete nr=%ld\n", - shrinker->shrink, shrinker->nr); - shrinker->nr = max_pass; + shrinker->shrink, total_scan); + total_scan = max_pass; } /* @@ -269,13 +281,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, * never try to free more than twice the estimate number of * freeable entries. */ - if (shrinker->nr > max_pass * 2) - shrinker->nr = max_pass * 2; - - total_scan = shrinker->nr; - shrinker->nr = 0; + if (total_scan > max_pass * 2) + total_scan = max_pass * 2; - trace_mm_shrink_slab_start(shrinker, shrink, total_scan, + trace_mm_shrink_slab_start(shrinker, shrink, nr, nr_pages_scanned, lru_pages, max_pass, delta, total_scan); @@ -296,9 +305,19 @@ unsigned long shrink_slab(struct shrink_control *shrink, cond_resched(); } - shrinker->nr += total_scan; - trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan, - shrinker->nr); + /* + * move the unused scan count back into the shrinker in a + * manner that handles concurrent updates. If we exhausted the + * scan, there is no need to do an update. + */ + do { + nr = shrinker->nr; + new_nr = total_scan + nr; + if (total_scan <= 0) + break; + } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr); + + trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr); } up_read(&shrinker_rwsem); out: -- cgit v1.1 From 7554e3446a916363447a81a29f9300d3f2fbf503 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Fri, 8 Jul 2011 14:14:36 +1000 Subject: vmscan: reduce wind up shrinker->nr when shrinker can't do work commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream. Stable note: Not tracked in Bugzilla. This patch reduces excessive reclaim of slab objects reducing the amount of information that has to be brought back in from disk. The third and fourth paragram in the series describes the impact. When a shrinker returns -1 to shrink_slab() to indicate it cannot do any work given the current memory reclaim requirements, it adds the entire total_scan count to shrinker->nr. 
The idea behind this is that when the shrinker is next called and can do work, it will do the work of the previously aborted shrinker call as well. However, if a filesystem is doing lots of allocation with GFP_NOFS set, then we get many, many more aborts from the shrinkers than we do successful calls. The result is that shrinker->nr winds up to its maximum permissible value (twice the current cache size) and then when the next shrinker call that can do work is issued, it has enough scan count built up to free the entire cache twice over. This manifests itself in the cache going from full to empty in a matter of seconds, even when only a small part of the cache needs to be emptied to free sufficient memory. Under metadata intensive workloads on ext4 and XFS, I'm seeing the VFS caches increase memory consumption up to 75% of memory (no page cache pressure) over a period of 30-60s, and then the shrinker empties them down to zero in the space of 2-3s. This cycle repeats over and over again, with the shrinker completely trashing the inode and dentry caches every minute or so the workload continues. This behaviour was made obvious by the shrink_slab tracepoints added earlier in the series, and made worse by the patch that corrected the concurrent accounting of shrinker->nr. To avoid this problem, stop repeated small increments of the total scan value from winding shrinker->nr up to a value that can cause the entire cache to be freed. We still need to allow it to wind up, so use the delta as the "large scan" threshold check - if the delta is more than a quarter of the entire cache size, then it is a large scan and allowed to cause lots of windup because we are clearly needing to free lots of memory. If it isn't a large scan then limit the total scan to half the size of the cache so that windup never increases to consume the whole cache. Reducing the total scan limit further does not allow enough wind-up to maintain the current levels of performance, whilst a higher threshold does not prevent the windup from freeing the entire cache under sustained workloads. Signed-off-by: Dave Chinner Signed-off-by: Al Viro Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index bb7df1e..d375de3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -277,6 +277,21 @@ unsigned long shrink_slab(struct shrink_control *shrink, } /* + * We need to avoid excessive windup on filesystem shrinkers + * due to large numbers of GFP_NOFS allocations causing the + * shrinkers to return -1 all the time. This results in a large + * nr being built up so when a shrink that can do some work + * comes along it empties the entire cache due to nr >>> + * max_pass. This is bad for sustaining a working set in + * memory. + * + * Hence only allow the shrinker to scan the entire cache when + * a large delta change is calculated directly. + */ + if (delta < max_pass / 4) + total_scan = min(total_scan, max_pass / 2); + + /* * Avoid risking looping forever due to too large nr value: * never try to free more than twice the estimate number of * freeable entries. -- cgit v1.1 From 4d4724067d512e7f17010112da8ec64917c192e7 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Mon, 31 Oct 2011 17:09:31 -0700 Subject: vmscan: limit direct reclaim for higher order allocations commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream. Stable note: Not tracked on Bugzilla.
THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. Paragraph 3 of this changelog is the motivation for this patch. When suffering from memory fragmentation due to unfreeable pages, THP page faults will repeatedly try to compact memory. Due to the unfreeable pages, compaction fails. Needless to say, at that point page reclaim also fails to create free contiguous 2MB areas. However, that doesn't stop the current code from trying, over and over again, and freeing a minimum of 4MB (2UL << sc->order pages) at every single invocation. This resulted in my 12GB system having 2-3GB free memory, a corresponding amount of used swap and very sluggish response times. This can be avoided by having the direct reclaim code not reclaim from zones that already have plenty of free memory available for compaction. If compaction still fails due to unmovable memory, doing additional reclaim will only hurt the system, not help. [jweiner@redhat.com: change comment to explain the order check] Signed-off-by: Rik van Riel Acked-by: Johannes Weiner Acked-by: Mel Gorman Cc: Andrea Arcangeli Reviewed-by: Minchan Kim Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index d375de3..e1ae88b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2059,6 +2059,22 @@ static void shrink_zones(int priority, struct zonelist *zonelist, continue; if (zone->all_unreclaimable && priority != DEF_PRIORITY) continue; /* Let kswapd poll it */ + if (COMPACTION_BUILD) { + /* + * If we already have plenty of memory + * free for compaction, don't free any + * more. Even though compaction is + * invoked for any non-zero order, + * only frequent costly order + * reclamation is disruptive enough to + * become a noticable problem, like + * transparent huge page allocations. + */ + if (sc->order > PAGE_ALLOC_COSTLY_ORDER && + (compaction_suitable(zone, sc->order) || + compaction_deferred(zone))) + continue; + } /* * This steals pages from memory cgroups over softlimit * and returns the number of reclaimed pages and -- cgit v1.1 From 4682e89d1455d66e2536d9efb2875d61a1f1f294 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 31 Oct 2011 17:09:33 -0700 Subject: vmscan: abort reclaim/compaction if compaction can proceed commit e0c23279c9f800c403f37511484d9014ac83adec upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. If compaction can proceed, shrink_zones() stops doing any work but its callers still call shrink_slab() which raises the priority and potentially sleeps. This is unnecessary and wasteful so this patch aborts direct reclaim/compaction entirely if compaction can proceed. 
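As a hedged illustration of that control flow, the userspace sketch below mirrors the shape of the change only; struct zone_state, shrink_zones_sketch and the hard-coded zone list are invented for this example and are not the kernel's data structures. It shows the two pieces together: costly-order reclaim skips a zone that is already ready for compaction, and the caller stops raising the priority once any zone reported that condition.

/* Hedged sketch - build with: cc -std=c99 -o abort_sketch abort_sketch.c */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* as in the kernel */

/* Invented per-zone snapshot, not the kernel's struct zone. */
struct zone_state {
	const char *name;
	bool compaction_ready;	/* plenty of free memory for compaction */
};

/*
 * Simplified shape of the reworked shrink_zones(): skip zones that are
 * already ready for compaction of a costly-order request and tell the
 * caller whether it can stop reclaiming altogether.
 */
static bool shrink_zones_sketch(const struct zone_state *zones, int nr,
				int order)
{
	bool abort_reclaim = false;

	for (int i = 0; i < nr; i++) {
		if (order > PAGE_ALLOC_COSTLY_ORDER &&
		    zones[i].compaction_ready) {
			abort_reclaim = true;
			continue;	/* do not reclaim from this zone */
		}
		printf("reclaiming from %s\n", zones[i].name);
	}
	return abort_reclaim;
}

int main(void)
{
	const struct zone_state zones[] = {
		{ "DMA32",  false },
		{ "Normal", true  },
	};
	int priority = 12;

	/* Caller side: stop raising the priority once compaction can run. */
	while (priority-- > 0) {
		if (shrink_zones_sketch(zones, 2, 9))	/* order-9, THP-sized */
			break;
	}
	return 0;
}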
Signed-off-by: Mel Gorman Acked-by: Rik van Riel Reviewed-by: Minchan Kim Acked-by: Johannes Weiner Cc: Josh Boyer Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index e1ae88b..b146b42 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2037,14 +2037,19 @@ restart: * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. + * + * This function returns true if a zone is being reclaimed for a costly + * high-order allocation and compaction is either ready to begin or deferred. + * This indicates to the caller that it should retry the allocation or fail. */ -static void shrink_zones(int priority, struct zonelist *zonelist, +static bool shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { struct zoneref *z; struct zone *zone; unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; + bool should_abort_reclaim = false; for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(sc->gfp_mask), sc->nodemask) { @@ -2061,19 +2066,20 @@ static void shrink_zones(int priority, struct zonelist *zonelist, continue; /* Let kswapd poll it */ if (COMPACTION_BUILD) { /* - * If we already have plenty of memory - * free for compaction, don't free any - * more. Even though compaction is - * invoked for any non-zero order, - * only frequent costly order - * reclamation is disruptive enough to - * become a noticable problem, like - * transparent huge page allocations. + * If we already have plenty of memory free for + * compaction in this zone, don't free any more. + * Even though compaction is invoked for any + * non-zero order, only frequent costly order + * reclamation is disruptive enough to become a + * noticable problem, like transparent huge page + * allocations. */ if (sc->order > PAGE_ALLOC_COSTLY_ORDER && (compaction_suitable(zone, sc->order) || - compaction_deferred(zone))) + compaction_deferred(zone))) { + should_abort_reclaim = true; continue; + } } /* * This steals pages from memory cgroups over softlimit @@ -2092,6 +2098,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist, shrink_zone(priority, zone, sc); } + + return should_abort_reclaim; } static bool zone_reclaimable(struct zone *zone) @@ -2156,7 +2164,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(sc->mem_cgroup); - shrink_zones(priority, zonelist, sc); + if (shrink_zones(priority, zonelist, sc)) + break; + /* * Don't shrink slabs when reclaiming memory from * over limit cgroups -- cgit v1.1 From a15a3971cc49eefbde40b397a446c0fa9c5fed9c Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Mon, 31 Oct 2011 17:06:47 -0700 Subject: mm: change isolate mode from #define to bitwise type commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but has no other impact. Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. 
INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Mel Gorman Cc: Rik van Riel Cc: Michal Hocko Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 37 ++++++++++++++++++++----------------- 1 file changed, 20 insertions(+), 17 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index b146b42..9267ba1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1012,23 +1012,27 @@ keep_lumpy: * * returns 0 on success, -ve errno on failure. */ -int __isolate_lru_page(struct page *page, int mode, int file) +int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) { + bool all_lru_mode; int ret = -EINVAL; /* Only take pages on the LRU. */ if (!PageLRU(page)) return ret; + all_lru_mode = (mode & (ISOLATE_ACTIVE|ISOLATE_INACTIVE)) == + (ISOLATE_ACTIVE|ISOLATE_INACTIVE); + /* * When checking the active state, we need to be sure we are * dealing with comparible boolean values. Take the logical not * of each. */ - if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode)) + if (!all_lru_mode && !PageActive(page) != !(mode & ISOLATE_ACTIVE)) return ret; - if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file) + if (!all_lru_mode && !!page_is_file_cache(page) != file) return ret; /* @@ -1076,7 +1080,8 @@ int __isolate_lru_page(struct page *page, int mode, int file) */ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, struct list_head *src, struct list_head *dst, - unsigned long *scanned, int order, int mode, int file) + unsigned long *scanned, int order, isolate_mode_t mode, + int file) { unsigned long nr_taken = 0; unsigned long nr_lumpy_taken = 0; @@ -1201,8 +1206,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, static unsigned long isolate_pages_global(unsigned long nr, struct list_head *dst, unsigned long *scanned, int order, - int mode, struct zone *z, - int active, int file) + isolate_mode_t mode, + struct zone *z, int active, int file) { int lru = LRU_BASE; if (active) @@ -1448,6 +1453,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, unsigned long nr_taken; unsigned long nr_anon; unsigned long nr_file; + isolate_mode_t reclaim_mode = ISOLATE_INACTIVE; while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); @@ -1458,15 +1464,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, } set_reclaim_mode(priority, sc, false); + if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM) + reclaim_mode |= ISOLATE_ACTIVE; + lru_add_drain(); spin_lock_irq(&zone->lru_lock); if (scanning_global_lru(sc)) { - nr_taken = isolate_pages_global(nr_to_scan, - &page_list, &nr_scanned, sc->order, - sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ? 
- ISOLATE_BOTH : ISOLATE_INACTIVE, - zone, 0, file); + nr_taken = isolate_pages_global(nr_to_scan, &page_list, + &nr_scanned, sc->order, reclaim_mode, zone, 0, file); zone->pages_scanned += nr_scanned; if (current_is_kswapd()) __count_zone_vm_events(PGSCAN_KSWAPD, zone, @@ -1475,12 +1481,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned); } else { - nr_taken = mem_cgroup_isolate_pages(nr_to_scan, - &page_list, &nr_scanned, sc->order, - sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ? - ISOLATE_BOTH : ISOLATE_INACTIVE, - zone, sc->mem_cgroup, - 0, file); + nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list, + &nr_scanned, sc->order, reclaim_mode, zone, + sc->mem_cgroup, 0, file); /* * mem_cgroup_isolate_pages() keeps track of * scanned pages on its own. -- cgit v1.1 From 19faec0520b3b16dfd58cde30938a3c4d3dcdd5b Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Mon, 31 Oct 2011 17:06:51 -0700 Subject: mm: compaction: make isolate_lru_page() filter-aware commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream. Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list leading to poor reclaim decisions which has a variable performance impact. In async mode, compaction doesn't migrate dirty or writeback pages. So, it's meaningless to pick the page and re-add it to lru list. Of course, when we isolate the page in compaction, the page might be dirty or writeback but when we try to migrate the page, the page would be not dirty, writeback. So it could be migrated. But it's very unlikely as isolate and migration cycle is much faster than writeout. So, this patch helps cpu overhead and prevent unnecessary LRU churning. Signed-off-by: Minchan Kim Acked-by: Johannes Weiner Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: KOSAKI Motohiro Acked-by: Mel Gorman Acked-by: Rik van Riel Reviewed-by: Michal Hocko Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 9267ba1..6835735 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1045,6 +1045,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) ret = -EBUSY; + if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page))) + return ret; + if (likely(get_page_unless_zero(page))) { /* * Be careful not to clear PageLRU until after we're -- cgit v1.1 From 5e02dde6aee7c4492b3a62ad93e7f1120877a019 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Mon, 31 Oct 2011 17:06:55 -0700 Subject: mm: zone_reclaim: make isolate_lru_page() filter-aware commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream. Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list leading to poor reclaim decisions which has a variable performance impact. In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless, we have isolated mapped page and re-add it into LRU's head. It's unnecessary CPU overhead and makes LRU churning. Of course, when we isolate the page, the page might be mapped but when we try to migrate the page, the page would be not mapped. So it could be migrated. But race is rare and although it happens, it's no big deal. 
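The isolation changes in the last three patches share one idea: the restrictions a caller cares about become OR-able flags and a single predicate applies them. The following is a minimal userspace mock-up of that idea, assuming invented flag values, a fake_page structure and a can_isolate() helper that do not match the kernel's definitions.

/* Hedged sketch - build with: cc -std=c99 -o isolate_sketch isolate_sketch.c */
#include <stdbool.h>
#include <stdio.h>

/* Invented flag values; the kernel's isolate_mode_t differs in detail. */
typedef unsigned int isolate_mode_t;
#define ISOLATE_INACTIVE	0x1	/* take inactive pages */
#define ISOLATE_ACTIVE		0x2	/* take active pages */
#define ISOLATE_CLEAN		0x4	/* skip dirty or writeback pages */
#define ISOLATE_UNMAPPED	0x8	/* skip pages mapped into page tables */

struct fake_page {
	bool active, dirty, writeback, mapped;
};

/* One predicate applies whatever restrictions the caller OR'd together. */
static bool can_isolate(const struct fake_page *p, isolate_mode_t mode)
{
	if (p->active && !(mode & ISOLATE_ACTIVE))
		return false;
	if (!p->active && !(mode & ISOLATE_INACTIVE))
		return false;

	/* Reclaim that may not write back only wants clean pages. */
	if ((mode & ISOLATE_CLEAN) && (p->dirty || p->writeback))
		return false;

	/* Reclaim that may not unmap only wants unmapped pages. */
	if ((mode & ISOLATE_UNMAPPED) && p->mapped)
		return false;

	return true;
}

int main(void)
{
	struct fake_page mapped_page = { .active = false, .mapped = true };
	isolate_mode_t zone_reclaim_mode = ISOLATE_INACTIVE | ISOLATE_UNMAPPED;

	/* zone reclaim with may_unmap == 0 now leaves mapped pages alone. */
	printf("isolate mapped page? %d\n",
	       can_isolate(&mapped_page, zone_reclaim_mode));
	return 0;
}

The design point is that adding a new restriction later (such as the async-migrate filtering a few patches below) only needs a new flag and a new clause in the predicate, not another tri-state mode.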
Signed-off-by: Minchan Kim Acked-by: Johannes Weiner Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: KOSAKI Motohiro Reviewed-by: Michal Hocko Cc: Mel Gorman Cc: Rik van Riel Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 6835735..0c78bd3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1048,6 +1048,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page))) return ret; + if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) + return ret; + if (likely(get_page_unless_zero(page))) { /* * Be careful not to clear PageLRU until after we're @@ -1471,6 +1474,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, reclaim_mode |= ISOLATE_ACTIVE; lru_add_drain(); + + if (!sc->may_unmap) + reclaim_mode |= ISOLATE_UNMAPPED; + if (!sc->may_writepage) + reclaim_mode |= ISOLATE_CLEAN; + spin_lock_irq(&zone->lru_lock); if (scanning_global_lru(sc)) { @@ -1588,19 +1597,26 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, struct page *page; struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc); unsigned long nr_rotated = 0; + isolate_mode_t reclaim_mode = ISOLATE_ACTIVE; lru_add_drain(); + + if (!sc->may_unmap) + reclaim_mode |= ISOLATE_UNMAPPED; + if (!sc->may_writepage) + reclaim_mode |= ISOLATE_CLEAN; + spin_lock_irq(&zone->lru_lock); if (scanning_global_lru(sc)) { nr_taken = isolate_pages_global(nr_pages, &l_hold, &pgscanned, sc->order, - ISOLATE_ACTIVE, zone, + reclaim_mode, zone, 1, file); zone->pages_scanned += pgscanned; } else { nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order, - ISOLATE_ACTIVE, zone, + reclaim_mode, zone, sc->mem_cgroup, 1, file); /* * mem_cgroup_isolate_pages() keeps track of -- cgit v1.1 From a7e32d7a2a801b7838b4159e9d73ea86f68ae002 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:38 -0800 Subject: mm: compaction: make isolate_lru_page() filter-aware again commit c82449352854ff09e43062246af86bdeb628f0c3 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. This had to be partially reverted because some dirty pages can be migrated by compaction without blocking. This patch updates "mm: compaction: make isolate_lru_page" by skipping over pages that migration has no possibility of migrating to minimise LRU disruption. 
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Reviewed-by: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 0c78bd3..45c40d6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1045,8 +1045,39 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) ret = -EBUSY; - if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page))) - return ret; + /* + * To minimise LRU disruption, the caller can indicate that it only + * wants to isolate pages it will be able to operate on without + * blocking - clean pages for the most part. + * + * ISOLATE_CLEAN means that only clean pages should be isolated. This + * is used by reclaim when it is cannot write to backing storage + * + * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages + * that it is possible to migrate without blocking + */ + if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) { + /* All the caller can do on PageWriteback is block */ + if (PageWriteback(page)) + return ret; + + if (PageDirty(page)) { + struct address_space *mapping; + + /* ISOLATE_CLEAN means only clean pages */ + if (mode & ISOLATE_CLEAN) + return ret; + + /* + * Only pages without mappings or that have a + * ->migratepage callback are possible to migrate + * without blocking + */ + mapping = page_mapping(page); + if (mapping && !mapping->a_ops->migratepage) + return ret; + } + } if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) return ret; -- cgit v1.1 From 5d62e5ca429b85ecadaa5042bdb1d8b88d4bfe80 Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Mon, 31 Oct 2011 17:08:39 -0700 Subject: kswapd: avoid unnecessary rebalance after an unsuccessful balancing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream. Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This patch reduces kswapd CPU usage. In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully") , Mel Gorman said kswapd is better to sleep after a unsuccessful balancing if there is tighter reclaim request pending in the balancing. But in the following scenario, kswapd do something that is not matched our expectation. The patch fixes this issue. 1, Read pgdat request A (classzone_idx, order = 3) 2, balance_pgdat() 3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed 4, balance_pgdat() returns but failed since returned order = 0 5, pgdat of request A assigned to balance_pgdat(), and do balancing again. While the expectation behavior of kswapd should try to sleep. 
Signed-off-by: Alex Shi Reviewed-by: Tim Chen Acked-by: Mel Gorman Tested-by: Pádraig Brady Cc: Rik van Riel Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 45c40d6..5cc0f92 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2844,7 +2844,9 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) static int kswapd(void *p) { unsigned long order, new_order; + unsigned balanced_order; int classzone_idx, new_classzone_idx; + int balanced_classzone_idx; pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; @@ -2875,7 +2877,9 @@ static int kswapd(void *p) set_freezable(); order = new_order = 0; + balanced_order = 0; classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; + balanced_classzone_idx = classzone_idx; for ( ; ; ) { int ret; @@ -2884,7 +2888,8 @@ static int kswapd(void *p) * new request of a similar or harder type will succeed soon * so consider going to sleep on the basis we reclaimed at */ - if (classzone_idx >= new_classzone_idx && order == new_order) { + if (balanced_classzone_idx >= new_classzone_idx && + balanced_order == new_order) { new_order = pgdat->kswapd_max_order; new_classzone_idx = pgdat->classzone_idx; pgdat->kswapd_max_order = 0; @@ -2899,7 +2904,8 @@ static int kswapd(void *p) order = new_order; classzone_idx = new_classzone_idx; } else { - kswapd_try_to_sleep(pgdat, order, classzone_idx); + kswapd_try_to_sleep(pgdat, balanced_order, + balanced_classzone_idx); order = pgdat->kswapd_max_order; classzone_idx = pgdat->classzone_idx; pgdat->kswapd_max_order = 0; @@ -2916,7 +2922,9 @@ static int kswapd(void *p) */ if (!ret) { trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); - order = balance_pgdat(pgdat, order, &classzone_idx); + balanced_classzone_idx = classzone_idx; + balanced_order = balance_pgdat(pgdat, order, + &balanced_classzone_idx); } } return 0; -- cgit v1.1 From 9203b3fa57cc16bf1ad1be7c64b01b5e45cc6151 Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Mon, 31 Oct 2011 17:08:45 -0700 Subject: kswapd: assign new_order and new_classzone_idx after wakeup in sleeping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream. Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This patch reduces kswapd CPU usage. There 2 places to read pgdat in kswapd. One is return from a successful balance, another is waked up from kswapd sleeping. The new_order and new_classzone_idx represent the balance input order and classzone_idx. But current new_order and new_classzone_idx are not assigned after kswapd_try_to_sleep(), that will cause a bug in the following scenario. 1: after a successful balance, kswapd goes to sleep, and new_order = 0; new_classzone_idx = __MAX_NR_ZONES - 1; 2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL 3: in the balance_pgdat() running, a new balance wakeup happened with order = 5, and classzone_idx = ZONE_NORMAL 4: the first wakeup(order = 3) finished successufly, return order = 3 but, the new_order is still 0, so, this balancing will be treated as a failed balance. And then the second tighter balancing will be missed. So, to avoid the above problem, the new_order and new_classzone_idx need to be assigned for later successful comparison. 
Signed-off-by: Alex Shi Acked-by: Mel Gorman Reviewed-by: Minchan Kim Tested-by: Pádraig Brady Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5cc0f92..870cbcf 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2908,6 +2908,8 @@ static int kswapd(void *p) balanced_classzone_idx); order = pgdat->kswapd_max_order; classzone_idx = pgdat->classzone_idx; + new_order = order; + new_classzone_idx = classzone_idx; pgdat->kswapd_max_order = 0; pgdat->classzone_idx = pgdat->nr_zones - 1; } -- cgit v1.1 From d50462a3a29fc5f53ef4a5d74eb693b4d4cb1512 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:45 -0800 Subject: mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. This patch addresses a problem where the fix regressed THP allocation success rates. In commit e0887c19 ("vmscan: limit direct reclaim for higher order allocations"), Rik noted that reclaim was too aggressive when THP was enabled. In his initial patch he used the number of free pages to decide if reclaim should abort for compaction. My feedback was that reclaim and compaction should be using the same logic when deciding if reclaim should be aborted. Unfortunately, this had the effect of reducing THP success rates when the workload included something like streaming reads that continually allocated pages. The window during which compaction could run and return a THP was too small. This patch combines Rik's two patches together. compaction_suitable() is still used to decide if reclaim should be aborted to allow compaction is used. However, it will also ensure that there is a reasonable buffer of free pages available. This improves upon the THP allocation success rates but bounds the number of pages that are freed for compaction. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 44 +++++++++++++++++++++++++++++++++++++++----- 1 file changed, 39 insertions(+), 5 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 870cbcf..eadab09 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2075,6 +2075,42 @@ restart: throttle_vm_writeout(sc->gfp_mask); } +/* Returns true if compaction should go ahead for a high-order request */ +static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) +{ + unsigned long balance_gap, watermark; + bool watermark_ok; + + /* Do not consider compaction for orders reclaim is meant to satisfy */ + if (sc->order <= PAGE_ALLOC_COSTLY_ORDER) + return false; + + /* + * Compaction takes time to run and there are potentially other + * callers using the pages just freed. 
Continue reclaiming until + * there is a buffer of free pages available to give compaction + * a reasonable chance of completing and allocating the page + */ + balance_gap = min(low_wmark_pages(zone), + (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / + KSWAPD_ZONE_BALANCE_GAP_RATIO); + watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order); + watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0); + + /* + * If compaction is deferred, reclaim up to a point where + * compaction will have a chance of success when re-enabled + */ + if (compaction_deferred(zone)) + return watermark_ok; + + /* If compaction is not ready to start, keep reclaiming */ + if (!compaction_suitable(zone, sc->order)) + return false; + + return watermark_ok; +} + /* * This is the direct reclaim path, for page-allocating processes. We only * try to reclaim pages from zones which will satisfy the caller's allocation @@ -2092,8 +2128,8 @@ restart: * scan then give up on it. * * This function returns true if a zone is being reclaimed for a costly - * high-order allocation and compaction is either ready to begin or deferred. - * This indicates to the caller that it should retry the allocation or fail. + * high-order allocation and compaction is ready to begin. This indicates to + * the caller that it should retry the allocation or fail. */ static bool shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) @@ -2127,9 +2163,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, * noticable problem, like transparent huge page * allocations. */ - if (sc->order > PAGE_ALLOC_COSTLY_ORDER && - (compaction_suitable(zone, sc->order) || - compaction_deferred(zone))) { + if (compaction_ready(zone, sc)) { should_abort_reclaim = true; continue; } -- cgit v1.1 From da0dc52b5236e73d7d7b5a58e8e067236fd1323e Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:33 -0800 Subject: mm: vmscan: do not OOM if aborting reclaim to start compaction commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but otherwise has little to justify it. The problem it fixes was never observed but the source of the theoretical problem did not exist for very long. During direct reclaim it is possible that reclaim will be aborted so that compaction can be attempted to satisfy a high-order allocation. If this decision is made before any pages are reclaimed, it is possible that 0 is returned to the page allocator potentially triggering an OOM. This has not been observed but it is a possibility so this patch addresses it. 
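One way to read the fix is as a small return-value decision table. The sketch below is a hypothetical userspace distillation with invented names and simplified inputs (a reclaimed count, an aborted-for-compaction flag, an all-unreclaimable flag); it is not the actual do_try_to_free_pages() exit path.

/* Hedged sketch - build with: cc -std=c99 -o oom_sketch oom_sketch.c */
#include <stdbool.h>
#include <stdio.h>

/*
 * Invented summary of the exit decision: even with nothing reclaimed,
 * report progress when reclaim was cut short in favour of compaction so
 * the allocator retries instead of declaring OOM.
 */
static unsigned long reclaim_outcome(unsigned long nr_reclaimed,
				     bool aborted_for_compaction,
				     bool all_unreclaimable)
{
	if (nr_reclaimed)
		return nr_reclaimed;	/* real progress was made */
	if (aborted_for_compaction)
		return 1;		/* let compaction have a go first */
	if (!all_unreclaimable)
		return 1;		/* still worth another pass */
	return 0;			/* genuinely stuck: OOM is allowed */
}

int main(void)
{
	printf("aborted early for compaction -> %lu\n",
	       reclaim_outcome(0, true, false));
	printf("nothing reclaimable at all   -> %lu\n",
	       reclaim_outcome(0, false, true));
	return 0;
}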
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index eadab09..441f97e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2240,6 +2240,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, struct zoneref *z; struct zone *zone; unsigned long writeback_threshold; + bool should_abort_reclaim; get_mems_allowed(); delayacct_freepages_start(); @@ -2251,7 +2252,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(sc->mem_cgroup); - if (shrink_zones(priority, zonelist, sc)) + should_abort_reclaim = shrink_zones(priority, zonelist, sc); + if (should_abort_reclaim) break; /* @@ -2318,6 +2320,10 @@ out: if (oom_killer_disabled) return 0; + /* Aborting reclaim to try compaction? don't OOM, then */ + if (should_abort_reclaim) + return 1; + /* top priority shrink_zones still had more to do? don't OOM, then */ if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc)) return 1; -- cgit v1.1 From 9cad5d6a3ce8ffc5fee70c6514ccb7ce003b8792 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:49 -0800 Subject: mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. If compaction can proceed for a given zone, shrink_zones() does not reclaim any more pages from it. After commit [e0c2327: vmscan: abort reclaim/compaction if compaction can proceed], do_try_to_free_pages() tries to finish as soon as possible once one zone can compact. This was intended to prevent slabs being shrunk unnecessarily but there are side-effects. One is that a small zone that is ready for compaction will abort reclaim even if the chances of successfully allocating a THP from that zone is small. It also means that reclaim can return too early even though sc->nr_to_reclaim pages were not reclaimed. This partially reverts the commit until it is proven that slabs are really being shrunk unnecessarily but preserves the check to return 1 to avoid OOM if reclaim was aborted prematurely. [aarcange@redhat.com: This patch replaces a revert from Andrea] Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 441f97e..f8c96c7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2129,7 +2129,8 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) * * This function returns true if a zone is being reclaimed for a costly * high-order allocation and compaction is ready to begin. This indicates to - * the caller that it should retry the allocation or fail. 
+ * the caller that it should consider retrying the allocation instead of + * further reclaim. */ static bool shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) @@ -2138,7 +2139,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, struct zone *zone; unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; - bool should_abort_reclaim = false; + bool aborted_reclaim = false; for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(sc->gfp_mask), sc->nodemask) { @@ -2164,7 +2165,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, * allocations. */ if (compaction_ready(zone, sc)) { - should_abort_reclaim = true; + aborted_reclaim = true; continue; } } @@ -2186,7 +2187,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, shrink_zone(priority, zone, sc); } - return should_abort_reclaim; + return aborted_reclaim; } static bool zone_reclaimable(struct zone *zone) @@ -2240,7 +2241,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, struct zoneref *z; struct zone *zone; unsigned long writeback_threshold; - bool should_abort_reclaim; + bool aborted_reclaim; get_mems_allowed(); delayacct_freepages_start(); @@ -2252,9 +2253,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(sc->mem_cgroup); - should_abort_reclaim = shrink_zones(priority, zonelist, sc); - if (should_abort_reclaim) - break; + aborted_reclaim = shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from @@ -2320,8 +2319,8 @@ out: if (oom_killer_disabled) return 0; - /* Aborting reclaim to try compaction? don't OOM, then */ - if (should_abort_reclaim) + /* Aborted reclaim to try compaction? don't OOM, then */ + if (aborted_reclaim) return 1; /* top priority shrink_zones still had more to do? don't OOM, then */ -- cgit v1.1 From 03722816ec065d8c7a9306fc2d601251bc0c623e Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Tue, 10 Jan 2012 15:06:59 -0800 Subject: vmscan: promote shared file mapped pages commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. The specific workload being addressed here in described in paragraph four and while paragraph five says it did not help performance as such, it made a difference to major page faults. I'm aware of at least one bug for a large vendor that was due to increased major faults. Commit 645747462435 ("vmscan: detect mapped file pages used only once") greatly decreases lifetime of single-used mapped file pages. Unfortunately it also decreases life time of all shared mapped file pages. Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed in fault path") page-fault handler does not mark page active or even referenced. Thus page_check_references() activates file page only if it was used twice while it stays in inactive list, meanwhile it activates anon pages after first access. Inactive list can be small enough, this way reclaimer can accidentally throw away any widely used page if it wasn't used twice in short period. After this patch page_check_references() also activate file mapped page at first inactive list scan if this page is already used multiple times via several ptes. I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5 (~2.6.18). 
That setup is a complete mess with >100 web/mail/spam/ftp containers; they share all their files, but there are a lot of anonymous pages: ~500MB of shared file-mapped memory and 15-20GB of non-shared anonymous memory. In this situation major page faults are very costly, because all containers share the same page. In my load the kernel created disproportionate pressure on the file memory compared with the anonymous memory; they equalized only if I raised swappiness up to 150 =) These patches did not actually help my problem a lot, but I saw a noticeable (10-20 times) reduction in the count and average time of major page faults in file-mapped areas. Actually both patches are fixes for commit v2.6.33-5448-g6457474, because it was aimed at one scenario (singly used pages) but it breaks the logic in other scenarios (shared and/or executable pages). Signed-off-by: Konstantin Khlebnikov Acked-by: Pekka Enberg Acked-by: Minchan Kim Reviewed-by: KAMEZAWA Hiroyuki Cc: Wu Fengguang Cc: Johannes Weiner Cc: Nick Piggin Cc: Mel Gorman Cc: Shaohua Li Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index f8c96c7..811141e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -723,7 +723,7 @@ static enum page_references page_check_references(struct page *page, */ SetPageReferenced(page); - if (referenced_page) + if (referenced_page || referenced_ptes > 1) return PAGEREF_ACTIVATE; return PAGEREF_KEEP; -- cgit v1.1 From 4391b5f49e28bdb78ddf67495abb2f767474216d Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Tue, 10 Jan 2012 15:07:03 -0800 Subject: vmscan: activate executable pages after first usage commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages the first class citizen") was noticeably weakened in commit 645747462435d84 ("vmscan: detect mapped file pages used only once"). Currently these pages can become "first class citizens" only after a second usage. After this patch page_check_references() will activate them after the first usage, and executable code gets an even better chance to stay in memory. Signed-off-by: Konstantin Khlebnikov Cc: Pekka Enberg Cc: Minchan Kim Cc: KAMEZAWA Hiroyuki Cc: Wu Fengguang Cc: Johannes Weiner Cc: Nick Piggin Cc: Mel Gorman Cc: Shaohua Li Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 811141e..fed3022 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -726,6 +726,12 @@ static enum page_references page_check_references(struct page *page, if (referenced_page || referenced_ptes > 1) return PAGEREF_ACTIVATE; + /* + * Activate file-backed executable pages after first usage. + */ + if (vm_flags & VM_EXEC) + return PAGEREF_ACTIVATE; + return PAGEREF_KEEP; } -- cgit v1.1 From 503e973ce4a57bc949ddbf35d4b4ecd1a90263f8 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 10 Jan 2012 15:08:18 -0800 Subject: mm/vmscan.c: consider swap space when deciding whether to continue reclaim commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream. Stable note: Not tracked in Bugzilla.
This patch reduces kswapd CPU usage on swapless systems with high anonymous memory usage. It's pointless to continue reclaiming when we have no swap space and lots of anon pages in the inactive list. Without this patch, it is possible when swap is disabled to continue trying to reclaim when there are only anonymous pages in the system even though that will not make any progress. Signed-off-by: Minchan Kim Cc: KOSAKI Motohiro Acked-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Johannes Weiner Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index fed3022..7a56981 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2008,8 +2008,9 @@ static inline bool should_continue_reclaim(struct zone *zone, * inactive lists are large enough, continue reclaiming */ pages_for_compaction = (2UL << sc->order); - inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON) + - zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE); + inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE); + if (nr_swap_pages > 0) + inactive_lru_pages += zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON); if (sc->nr_reclaimed < pages_for_compaction && inactive_lru_pages > pages_for_compaction) return true; -- cgit v1.1 From d2b02236b85226c9f2cd28e32a9fdf197584e448 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 10 Jan 2012 15:08:33 -0800 Subject: mm: test PageSwapBacked in lumpy reclaim commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. Even though the subject refers to lumpy reclaim, it impacts compaction as well. Lumpy reclaim does well to stop at a PageAnon when there's no swap, but better is to stop at any PageSwapBacked, which includes shmem/tmpfs too. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 7a56981..39bb416 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1199,7 +1199,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, * anon page which don't already have a swap slot is * pointless. */ - if (nr_swap_pages <= 0 && PageAnon(cursor_page) && + if (nr_swap_pages <= 0 && PageSwapBacked(cursor_page) && !PageSwapCache(cursor_page)) break; -- cgit v1.1 From 4d01a2e38a9c27d3de083ecc561b32b6a55fc7eb Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 12 Jan 2012 17:18:06 -0800 Subject: mm: vmscan: convert global reclaim to per-memcg LRU lists commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch. Stable note: Not tracked in Bugzilla. This is a partial backport of an upstream commit addressing a completely different issue that accidentally contained an important fix. The workload this patch helps was memcached when IO is started in the background. memcached should stay resident but without this patch it gets swapped. Sometimes this manifests as a drop in throughput but mostly it was observed through /proc/vmstat. 
Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant to fix a problem whereby small scan targets on memcg were ignored, causing priority to rise too sharply. It forced scanning to take place if the target was small, memcg or kswapd. From the time it was introduced it caused excessive reclaim by kswapd, with workloads being pushed to swap that previously would have stayed resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists] by making it harder for kswapd to force scan small targets, but that patchset is not suitable for backporting. This was later changed again by commit [90126375: mm/vmscan: push lruvec pointer into get_scan_count()] into a format that looks like it would be a straightforward backport, but there is a subtle difference due to the use of lruvecs. The impact of the accidental fix is to make it harder for kswapd to force scan small targets by taking zone->all_unreclaimable into account. This patch is the closest equivalent available based on what is backported. Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 39bb416..6697b7a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1850,7 +1850,8 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc, unsigned long nr_force_scan[2]; /* kswapd does zone balancing and needs to scan this zone */ - if (scanning_global_lru(sc) && current_is_kswapd()) + if (scanning_global_lru(sc) && current_is_kswapd() && + zone->all_unreclaimable) force_scan = true; /* memcg may have small limit and need to avoid priority drop */ if (!scanning_global_lru(sc)) -- cgit v1.1 From 627c5c60b4ac673e9f4be758858073071684dce9 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 21 Mar 2012 16:34:11 -0700 Subject: cpuset: mm: reduce large amounts of memory barrier related damage v3 commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating a large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place.
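As a rough illustration of that pattern (not the actual code from this patch; the variable and function names below are placeholders, while the real patch applies the idea to the task's mems_allowed), the read side retries against a sequence counter instead of paying full barriers:

#include <linux/seqlock.h>
#include <linux/nodemask.h>

/* Placeholders for illustration; seqcount_init(&mems_seq) is assumed to
 * have run, and writers are assumed to be serialised by an outer lock. */
static seqcount_t mems_seq;
static nodemask_t allowed_mems;

static bool node_allowed_model(int nid)
{
	unsigned seq;
	bool allowed;

	do {
		/* cheap read side: only a read barrier, no mfence */
		seq = read_seqcount_begin(&mems_seq);
		allowed = node_isset(nid, allowed_mems);
	} while (read_seqcount_retry(&mems_seq, seq));

	return allowed;
}

static void update_allowed_mems_model(const nodemask_t *newmask)
{
	write_seqcount_begin(&mems_seq);	/* parallel readers will retry */
	allowed_mems = *newmask;
	write_seqcount_end(&mems_seq);
}

Readers that race with an update simply loop once more, which is the brief stall described above.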
In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were

                          3.3.0-rc3            3.3.0-rc3
                          rc3-vanilla          nobarrier-v2r1
Clients 1 UserTime             0.07 (  0.00%)       0.08 (-14.19%)
Clients 2 UserTime             0.07 (  0.00%)       0.07 (  2.72%)
Clients 4 UserTime             0.08 (  0.00%)       0.07 (  3.29%)
Clients 1 SysTime              0.70 (  0.00%)       0.65 (  6.65%)
Clients 2 SysTime              0.85 (  0.00%)       0.82 (  3.65%)
Clients 4 SysTime              1.41 (  0.00%)       1.41 (  0.32%)
Clients 1 WallTime             0.77 (  0.00%)       0.74 (  4.19%)
Clients 2 WallTime             0.47 (  0.00%)       0.45 (  3.73%)
Clients 4 WallTime             0.38 (  0.00%)       0.37 (  1.58%)
Clients 1 Flt/sec/cpu     497620.28 (  0.00%)  520294.53 (  4.56%)
Clients 2 Flt/sec/cpu     414639.05 (  0.00%)  429882.01 (  3.68%)
Clients 4 Flt/sec/cpu     257959.16 (  0.00%)  258761.48 (  0.31%)
Clients 1 Flt/sec         495161.39 (  0.00%)  517292.87 (  4.47%)
Clients 2 Flt/sec         820325.95 (  0.00%)  850289.77 (  3.65%)
Clients 4 Flt/sec        1020068.93 (  0.00%) 1022674.06 (  0.26%)

MMTests Statistics: duration
Sys Time Running Test (seconds)             135.68    132.17
User+Sys Time Running Test (seconds)         164.2    160.13
Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much improved and roughly in correlation with what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman Cc: Miao Xie Cc: David Rientjes Cc: Peter Zijlstra Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 6697b7a..1378487 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2251,7 +2251,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, unsigned long writeback_threshold; bool aborted_reclaim; - get_mems_allowed(); delayacct_freepages_start(); if (scanning_global_lru(sc)) @@ -2314,7 +2313,6 @@ out: delayacct_freepages_end(); - put_mems_allowed(); if (sc->nr_reclaimed) return sc->nr_reclaimed; -- cgit v1.1 From 909e0a4e5c3a9d3b60c1eecb34de75afa64ade95 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 8 Dec 2011 14:33:51 -0800 Subject: vmscan: fix initial shrinker size handling commit 635697c663f38106063d5659f0cf2e45afcd4bb5 upstream. Stable note: The commit [acf92b48: vmscan: shrinker->nr updates race and go wrong] aimed to reduce excessive reclaim of slab objects but had a bug in how it treated shrinker functions that returned -1. A shrinker function can return -1, which means that it cannot do anything without a risk of deadlock. For example prune_super() does this if it cannot grab a superblock reference, even if nr_to_scan=0.
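For context, a shrinker that bails out this way looks roughly like the sketch below. This is an invented example of the ->shrink()/shrink_control contract being described, not code from the patch; the cache, lock and function names are hypothetical, and the header providing struct shrinker depends on the exact kernel version:

#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/shrinker.h>	/* assumption: 3.x split-out header; older trees used <linux/mm.h> */

/* Hypothetical object cache protected by a mutex. */
static DEFINE_MUTEX(example_lock);
static unsigned long example_nr_objects;

static int example_shrink(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Cannot touch the cache without risking deadlock: report -1. */
	if (!mutex_trylock(&example_lock))
		return -1;

	if (sc->nr_to_scan)
		example_nr_objects -= min(sc->nr_to_scan, example_nr_objects);

	mutex_unlock(&example_lock);
	return example_nr_objects;	/* objects left in the cache */
}

static struct shrinker example_shrinker = {
	.shrink = example_shrink,
	.seeks = DEFAULT_SEEKS,		/* hooked up with register_shrinker() elsewhere */
};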
Currently we interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan' according to this. So the next time around this shrinker can cause really big pressure. Let's skip such shrinkers instead. Also make total_scan signed, otherwise the check (total_scan < 0) below never works. Signed-off-by: Konstantin Khlebnikov Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman --- mm/vmscan.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1378487..5326f98 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -248,12 +248,16 @@ unsigned long shrink_slab(struct shrink_control *shrink, list_for_each_entry(shrinker, &shrinker_list, list) { unsigned long long delta; - unsigned long total_scan; - unsigned long max_pass; + long total_scan; + long max_pass; int shrink_ret = 0; long nr; long new_nr; + max_pass = do_shrinker_shrink(shrinker, shrink, 0); + if (max_pass <= 0) + continue; + /* * copy the current shrinker scan count into a local variable * and zero it so that other concurrent shrinker invocations @@ -264,7 +268,6 @@ unsigned long shrink_slab(struct shrink_control *shrink, } while (cmpxchg(&shrinker->nr, nr, 0) != nr); total_scan = nr; - max_pass = do_shrinker_shrink(shrinker, shrink, 0); delta = (4 * nr_pages_scanned) / shrinker->seeks; delta *= max_pass; do_div(delta, lru_pages + 1); -- cgit v1.1
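To see the combined effect of the skip and the signed total_scan, here is a small stand-alone model of the accounting in shrink_slab() after this patch; it is an illustration only, with invented numbers and a made-up helper name:

#include <stdio.h>

/*
 * Model of the per-shrinker scan target computed by shrink_slab() after
 * this patch: a shrinker reporting max_pass <= 0 is skipped before any
 * delta is computed, and total_scan is signed so the wrap check can fire.
 */
static long scan_target(long saved_nr, long max_pass, int seeks,
			long nr_pages_scanned, long lru_pages)
{
	long long delta;
	long total_scan;

	if (max_pass <= 0)
		return 0;			/* skip this shrinker entirely */

	delta = (4LL * nr_pages_scanned) / seeks;
	delta = delta * max_pass / (lru_pages + 1);

	total_scan = saved_nr + delta;
	if (total_scan < 0)			/* now detectable: fall back */
		total_scan = max_pass;

	return total_scan;
}

int main(void)
{
	/* A healthy cache of 512 objects, 1024 of 16384 LRU pages scanned. */
	printf("scan %ld objects\n", scan_target(0, 512, 2, 1024, 16384));
	/* A shrinker that returned -1 is skipped instead of treated as huge. */
	printf("scan %ld objects\n", scan_target(0, -1, 2, 1024, 16384));
	return 0;
}

Before the fix, the -1 case would have been treated as a huge max_pass when computing the delta, so the next pass could put enormous pressure on that shrinker; with the fix it is simply skipped.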