aboutsummaryrefslogtreecommitdiffstats
path: root/mm/vmscan.c
Commit message (Collapse)AuthorAgeFilesLines
* [PATCH] find_task_by_pid() needs tasklist_lockAndrew Morton2006-03-251-0/+2
| | | | | | | | | | | | | | | | | | | | A couple of places are forgetting to take it. The kswapd case is probably unimportant. keventd_create_kthread() was racy. The whole thing is a bit flakey: you start a kernel thread, get its pid from kernel_thread() then look up its task_struct. a) It assumes that pid recycling takes a "long" time. b) We get a task_struct but no reference was taken on it. The owner of the kswapd and kthread task_struct*'s must assume that the new thread won't exit unexpectedly. Because if it does, they're left holding dead memory and any attempt to control or stop that task will crash. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] page migration reorgChristoph Lameter2006-03-221-489/+2
| | | | | | | | | | | | | | | | | | | | | | Centralize the page migration functions in anticipation of additional tinkering. Creates a new file mm/migrate.c 1. Extract buffer_migrate_page() from fs/buffer.c 2. Extract central migration code from vmscan.c 3. Extract some components from mempolicy.c 4. Export pageout() and remove_from_swap() from vmscan.c 5. Make it possible to configure NUMA systems without page migration and non-NUMA systems with page migration. I had to so some #ifdeffing in mempolicy.c that may need a cleanup. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: make shrink_all_memory try harderRafael J. Wysocki2006-03-221-0/+7
| | | | | | | | | | Make shrink_all_memory() repeat the attempts to free more memory if there seems to be no pages to free. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: emove obsolete checks from shrink_list() and fix unlikely in ↵Christoph Lameter2006-03-221-11/+2
| | | | | | | | | | | | | | | | | | | refill_inactive_zone() As suggested by Marcelo: 1. The optimization introduced recently for not calling page_referenced() during zone reclaim makes two additional checks in shrink_list unnecessary. 2. The if (unlikely(sc->may_swap)) in refill_inactive_zone is optimized for the zone_reclaim case. However, most peoples system only does swap. Undo that. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Marcelo Tosatti <marcelo.tosatti@cyclades.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: make __put_page internalNick Piggin2006-03-221-0/+2
| | | | | | | | | | Remove __put_page from outside the core mm/. It is dangerous because it does not handle compound pages nicely, and misses 1->0 transitions. If a user later appears that really needs the extra speed we can reevaluate. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: shrink_inactive_lis() nr_scan accounting fixWu Fengguang2006-03-221-4/+5
| | | | | | | | | | | In shrink_inactive_list(), nr_scan is not accounted when nr_taken is 0. But 0 pages taken does not mean 0 pages scanned. Move the goto statement below the accounting code to fix it. Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: isolate_lru_pages() scan count fixWu Fengguang2006-03-221-2/+2
| | | | | | | | | | | | In isolate_lru_pages(), *scanned reports one more scan because the scan counter is increased one more time on exit of the while-loop. Change the while-loop to for-loop to fix it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone_reclaim: additional comments and cleanupChristoph Lameter2006-03-221-4/+14
| | | | | | | | | | | | | | | | Add some comments to explain how zone reclaim works. And it fixes the following issues: - PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write out pages to swap. Currently RECLAIM_SWAP may not do that. - remove setting nr_reclaimed pages after slab reclaim since the slab shrinking code does not use that and the nr_reclaimed pages is just right for the intended follow up action. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: rename functionsAndrew Morton2006-03-221-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | We have: try_to_free_pages ->shrink_caches(struct zone **zones, ..) ->shrink_zone(struct zone *, ...) ->shrink_cache(struct zone *, ...) ->shrink_list(struct list_head *, ...) ->refill_inactive_list((struct zone *, ...) which is fairly irrational. Rename things so that we have try_to_free_pages ->shrink_zones(struct zone **zones, ..) ->shrink_zone(struct zone *, ...) ->shrink_inactive_list(struct zone *, ...) ->shrink_page_list(struct list_head *, ...) ->shrink_active_list(struct zone *, ...) Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan return nr_reclaimedAndrew Morton2006-03-221-39/+38
| | | | | | | | | | | | | | | | | | | | | | Change all the vmscan functions to retunr the number-of-reclaimed pages and remove scan_conrtol.nr_reclaimed. Saves ten-odd bytes of text and makes things clearer and more consistent. The patch also changes the behaviour of zone_reclaim() when it falls back to slab shrinking. Christoph says "Setting this to one means that we will rescan and shrink the slab for each allocation if we are out of zone memory and RECLAIM_SLAB is set. Plus if we do an order 0 allocation we do not go off node as intended. "We better set this to zero. This means the allocation will go offnode despite us having potentially freed lots of memory on the zone. Future allocations can then again be done from this zone." Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: use unsigned longsAndrew Morton2006-03-221-45/+59
| | | | | | | | | | | | | | Turn basically everything in vmscan.c into `unsigned long'. This is to avoid the possibility that some piece of code in there might decide to operate upon more than 4G (or even 2G) of pages in one hit. This might be silly, but we'll need it one day. Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: scan_control cleanupAndrew Morton2006-03-221-46/+62
| | | | | | | | | | Initialise as much of scan_control as possible at the declaration site. This tidies things up a bit and assures us that all unmentioned fields are zeroed out. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Thin out scan_control: remove nr_to_scan and priorityChristoph Lameter2006-03-221-34/+25
| | | | | | | | | | | Make nr_to_scan and priority a parameter instead of putting it into scan control. This allows various small optimizations and IMHO makes the code easier to read. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: simplify vmscan vs release refcountingNick Piggin2006-03-221-14/+11
| | | | | | | | | | | | | | | The VM has an interesting race where a page refcount can drop to zero, but it is still on the LRU lists for a short time. This was solved by testing a 0->1 refcount transition when picking up pages from the LRU, and dropping the refcount in that case. Instead, use atomic_add_unless to ensure we never pick up a 0 refcount page from the LRU, thus a 0 refcount page will never have its refcount elevated until it is allocated again. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: PageActive no testsetNick Piggin2006-03-221-2/+3
| | | | | | | | | PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear operations. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: PageLRU no testsetNick Piggin2006-03-221-9/+11
| | | | | | | | | PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear operations. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: never ClearPageLRU released pagesNick Piggin2006-03-221-7/+11
| | | | | | | | | | | | | If vmscan finds a zero refcount page on the lru list, never ClearPageLRU it. This means the release code need not hold ->lru_lock to stabilise PageLRU, so that lock may be skipped entirely when releasing !PageLRU pages (because we know PageLRU won't have been temporarily cleared by vmscan, which was previously guaranteed by holding the lock to synchronise against vmscan). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] page migration: fail if page is in a vma flagged VM_LOCKEDChristoph Lameter2006-03-141-6/+12
| | | | | | | | | | | | | | | | | page migration currently simply retries a couple of times if try_to_unmap() fails without inspecting the return code. However, SWAP_FAIL indicates that the page is in a vma that has the VM_LOCKED flag set (if ignore_refs ==1). We can check for that return code and avoid retrying the migration. migrate_page_remove_references() now needs to return a reason why the failure occured. So switch migrate_page_remove_references to use -Exx style error messages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: no zone_reclaim if PF_MALLOC is setChristoph Lameter2006-03-091-1/+2
| | | | | | | | | | | If the process has already set PF_MALLOC and is already using current->reclaim_state then do not try to reclaim memory from the zone. This is set by kswapd and/or synchrononous global reclaim which will not take it lightly if we zap the reclaim_state. Signed-off-by: Christoph Lameter <clameter@sig.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: fix zone_reclaimChristoph Lameter2006-02-241-3/+7
| | | | | | | | | | | | | - PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write out pages to swap. Currently RECLAIM_SWAP may not do that. - remove setting nr_reclaimed pages after slab reclaim since the slab shrinking code does not use that and the nr_reclaimed pages is just right for the intended follow up action. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: skip reclaim_mapped determination if we do not swapChristoph Lameter2006-02-111-34/+41
| | | | | | | | | | This puts the variables and the way to get to reclaim_mapped in one block. And allows zone_reclaim or other things to skip the determination (maybe this whole block of code does not belong into refill_inactive_zone()?) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vmscan: remove duplicate increment of reclaim_in_progressChristoph Lameter2006-02-111-2/+0
| | | | | | | | | shrink_zone() already increments reclaim_in_progress. No need to do it in balance_pgdat. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone reclaim: do not check references to a page during zone reclaimChristoph Lameter2006-02-111-1/+5
| | | | | | | | | | | shrink_list() and refill_inactive() check all ptes pointing to a page for reference bits in order to decide if the page should be put on the active list. This is not necessary for zone_reclaim since we are only interested in removing unmapped pages. Skip the checks in both functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Updates for page migrationChristoph Lameter2006-02-101-5/+20
| | | | | | | | | | | | | This adds some additional comments in order to help others figure out how exactly the code works. And fix a variable name. Also swap_page does need to ignore all reference bits when unmapping a page. Otherwise we may have to repeatedly unmap a frequently touched page. So change the try_to_unmap parameter to 1. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Direct Migration V9: Avoid writeback / page_migrate() methodChristoph Lameter2006-02-011-1/+19
| | | | | | | | | | | | | | | | | | | Migrate a page with buffers without requiring writeback This introduces a new address space operation migratepage() that may be used by a filesystem to implement its own version of page migration. A version is provided that migrates buffers attached to pages. Some filesystems (ext2, ext3, xfs) are modified to utilize this feature. The swapper address space operation are modified so that a regular migrate_page() will occur for anonymous pages without writeback (migrate_pages forces every anonymous page to have a swap entry). Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Direct Migration V9: remove_from_swap() to remove swap ptesChristoph Lameter2006-02-011-0/+9
| | | | | | | | | | | | | Add remove_from_swap remove_from_swap() allows the restoration of the pte entries that existed before page migration occurred for anonymous pages by walking the reverse maps. This reduces swap use and establishes regular pte's without the need for page faults. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Direct Migration V9: migrate_pages() extensionChristoph Lameter2006-02-011-11/+215
| | | | | | | | | | | | | | | | | | | | | | | | | Add direct migration support with fall back to swap. Direct migration support on top of the swap based page migration facility. This allows the direct migration of anonymous pages and the migration of file backed pages by dropping the associated buffers (requires writeout). Fall back to swap out if necessary. The patch is based on lots of patches from the hotplug project but the code was restructured, documented and simplified as much as possible. Note that an additional patch that defines the migrate_page() method for filesystems is necessary in order to avoid writeback for anonymous and file backed pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Reclaim slab during zone reclaimChristoph Lameter2006-02-011-0/+14
| | | | | | | | | | | | | | | | | | | If large amounts of zone memory are used by empty slabs then zone_reclaim becomes uneffective. This patch shakes the slab a bit. The problem with this patch is that the slab reclaim is not containable to a zone. Thus slab reclaim may affect the whole system and be extremely slow. This also means that we cannot determine how many pages were freed in this zone. Thus we need to go off node for at least one allocation. The functionality is disabled by default. We could modify the shrinkers to take a zone parameter but that would be quite invasive. Better ideas are welcome. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Zone reclaim: Allow modification of zone reclaim behaviorChristoph Lameter2006-02-011-2/+7
| | | | | | | | | | | | | | | | | | In some situations one may want zone_reclaim to behave differently. For example a process writing large amounts of memory will spew unto other nodes to cache the writes if many pages in a zone become dirty. This may impact the performance of processes running on other nodes. Allowing writes during reclaim puts a stop to that behavior and throttles the process by restricting the pages to the local zone. Similarly one may want to contain processes to local memory by enabling regular swap behavior during zone_reclaim. Off node memory allocation can then be controlled through memory policies and cpusets. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone_reclaim: configurable off node allocation period.Christoph Lameter2006-02-011-2/+2
| | | | | | | | | | | | | | | | | | | Currently the zone_reclaim code has a fixed window of 30 seconds of off node allocations should a local zone have no unused pagecache pages left. Reclaim will be attempted again after this timeout period to avoid repeated useless scans for memory. This is also useful to established sufficiently large off node allocation chunks to relieve the local node. It may be beneficial to adjust that time period for some special situations. For example if memory use was exceeding node capacity one may want to give up for longer periods of time. If memory spikes intermittendly then one may want to shorten the time period to reduce the number of off node allocations. This patch allows just that.... Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone_reclaim: partial scans instead of full scanChristoph Lameter2006-02-011-2/+18
| | | | | | | | | | | | | Instead of scanning all the pages in a zone, imitate real swap and scan only a portion of the pages and gradually scan more if we do not free up enough pages. This avoids a zone suddenly loosing all unused pagecache pages (we may after all access some of these again so they deserve another chance) but it still frees up large chunks of memory if a zone only contains unused pagecache pages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone_reclaim: do not unmap file backed pagesChristoph Lameter2006-02-011-0/+6
| | | | | | | | | zone_reclaim should leave that to the real swapper. We are only interested in evicting unmapped pages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone_reclaim: minor fixesChristoph Lameter2006-02-011-1/+3
| | | | | | | | | | | | | | - If we only reclaim nr_pages then its okay to stay on node. Switch from > to >= for the comparison. - vm_table[] entry for zone_reclaim_mode is a bit screwed up. - Add empty lines around shrink_zone to show that this is the central function to be called. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: improve function of sc->may_writepageChristoph Lameter2006-02-011-3/+3
| | | | | | | | | | | Make sc->may_writepage control the writeout behavior of shrink_list. Remove the laptop_mode trick from shrink_list and instead set may_writepage in try_to_free_pages properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone_reclaim: reclaim on memory only node supportChristoph Lameter2006-02-011-1/+7
| | | | | | | | | | | Zone reclaim is usually only run on the local node. Headless nodes do not have any local processors. This patch checks for headless nodes and performs zone reclaim on them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Optimize off-node performance of zone reclaimChristoph Lameter2006-02-011-14/+15
| | | | | | | | | | | | | | | | | | Ensure that the performance of off node pages stays the same as before. Off node pagefault tests showed an 18% drop in performance without this patch. - Increase the timeout to 30 seconds to reduce the overhead. - Move all code possible out of the off node hot path for zone reclaim (Sorry Andrew, the struct initialization had to be sacrificed). The read_page_state() bit us there. - Check first for the timeout before any other checks. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Zone reclaim: Reclaim logicChristoph Lameter2006-01-181-0/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some bits for zone reclaim exists in 2.6.15 but they are not usable. This patch fixes them up, removes unused code and makes zone reclaim usable. Zone reclaim allows the reclaiming of pages from a zone if the number of free pages falls below the watermarks even if other zones still have enough pages available. Zone reclaim is of particular importance for NUMA machines. It can be more beneficial to reclaim a page than taking the performance penalties that come with allocating a page on a remote zone. Zone reclaim is enabled if the maximum distance to another node is higher than RECLAIM_DISTANCE, which may be defined by an arch. By default RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same component (enclosure or motherboard) on IA64. The meaning of the NUMA distance information seems to vary by arch. If zone reclaim is not successful then no further reclaim attempts will occur for a certain time period (ZONE_RECLAIM_INTERVAL). This patch was discussed before. See http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Zone reclaim: resurrect may_swapChristoph Lameter2006-01-181-0/+7
| | | | | | | | | | | | | | | Zone reclaim has a huge impact on NUMA performance (f.e. our maximum throughput with XFS is raised from 4GB to 6GB/sec / page cache contamination of numa nodes destroys locality if one just does a large copy operation which results in performance dropping for good until reboot). This patch: Resurrect may_swap in struct scan_control Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: migration page refcounting fixNick Piggin2006-01-181-41/+30
| | | | | | | | | | | | | | | | | Migration code currently does not take a reference to target page properly, so between unlocking the pte and trying to take a new reference to the page with isolate_lru_page, anything could happen to it. Fix this by holding the pte lock until we get a chance to elevate the refcount. Other small cleanups while we're here. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SwapMig: Switch error handling in migrate_pages to use -ExxChristoph Lameter2006-01-081-22/+34
| | | | | | | | | | | Use -Exxx instead of numeric return codes and cleanup the code in migrate_pages() using -Exx error codes. Consolidate successful migration handling Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SwapMig: Extend parameters for migrate_pages()Christoph Lameter2006-01-081-9/+8
| | | | | | | | | | | | | | Extend the parameters of migrate_pages() to allow the caller control over the fate of successfully migrated or impossible to migrate pages. Swap migration and direct migration will have the same interface after this patch so that patches can be independently applied to the policy layer and the core migration code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SwapMig: Drop unused pages immediatelyChristoph Lameter2006-01-081-0/+5
| | | | | | | | | | | | Drop unused pages immediately If a page is encountered that is only referenced by the migration code then there is no reason to swap or migrate the page. Release the page by calling move_to_lru(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SwapMig: add_to_swap() avoid atomic allocationsChristoph Lameter2006-01-081-2/+2
| | | | | | | | | | | | | | | | | Add gfp_mask to add_to_swap add_to_swap does allocations with GFP_ATOMIC in order not to interfere with swapping. During migration we may have use add_to_swap extensively which may lead to out of memory errors. This patch makes add_to_swap take a parameter that specifies the gfp mask. The page migration code can then make add_to_swap use GFP_KERNEL. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SwapMig: CONFIG_MIGRATION fixesChristoph Lameter2006-01-081-76/+76
| | | | | | | | | Move move_to_lru, putback_lru_pages and isolate_lru in section surrounded by CONFIG_MIGRATION saving some codesize for single processor kernels. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Swap Migration V5: Add CONFIG_MIGRATION for page migration supportChristoph Lameter2006-01-081-9/+11
| | | | | | | | | | | | Include page migration if the system is NUMA or having a memory model that allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM). And: - Only include lru_add_drain_per_cpu if building for an SMP system. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Swap Migration V5: migrate_pages() functionChristoph Lameter2006-01-081-34/+180
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the basic page migration function with a minimal implementation that only allows the eviction of pages to swap space. Page eviction and migration may be useful to migrate pages, to suspend programs or for remapping single pages (useful for faulty pages or pages with soft ECC failures) The process is as follows: The function wanting to migrate pages must first build a list of pages to be migrated or evicted and take them off the lru lists via isolate_lru_page(). isolate_lru_page determines that a page is freeable based on the LRU bit set. Then the actual migration or swapout can happen by calling migrate_pages(). migrate_pages does its best to migrate or swapout the pages and does multiple passes over the list. Some pages may only be swappable if they are not dirty. migrate_pages may start writing out dirty pages in the initial passes over the pages. However, migrate_pages may not be able to migrate or evict all pages for a variety of reasons. The remaining pages may be returned to the LRU lists using putback_lru_pages(). Changelog V4->V5: - Use the lru caches to return pages to the LRU Changelog V3->V4: - Restructure code so that applying patches to support full migration does require minimal changes. Rename swapout_pages() to migrate_pages(). Changelog V2->V3: - Extract common code from shrink_list() and swapout_pages() Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Michael Kerrisk" <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swapChristoph Lameter2006-01-081-4/+2
| | | | | | | | | | | | | Add PF_SWAPWRITE to control a processes permission to write to swap. - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd and pdflush - Set PF_SWAPWRITE flag for kswapd and pdflush Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Swap Migration V5: LRU operationsChristoph Lameter2006-01-081-13/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the start of the `swap migration' patch series. Swap migration allows the moving of the physical location of pages between nodes in a numa system while the process is running. This means that the virtual addresses that the process sees do not change. However, the system rearranges the physical location of those pages. The main intent of page migration patches here is to reduce the latency of memory access by moving pages near to the processor where the process accessing that memory is running. The patchset allows a process to manually relocate the node on which its pages are located through the MF_MOVE and MF_MOVE_ALL options while setting a new memory policy. The pages of process can also be relocated from another process using the sys_migrate_pages() function call. Requires CAP_SYS_ADMIN. The migrate_pages function call takes two sets of nodes and moves pages of a process that are located on the from nodes to the destination nodes. Manual migration is very useful if for example the scheduler has relocated a process to a processor on a distant node. A batch scheduler or an administrator can detect the situation and move the pages of the process nearer to the new processor. sys_migrate_pages() could be used on non-numa machines as well, to force all of a particualr process's pages out to swap, if someone thinks that's useful. Larger installations usually partition the system using cpusets into sections of nodes. Paul has equipped cpusets with the ability to move pages when a task is moved to another cpuset. This allows automatic control over locality of a process. If a task is moved to a new cpuset then also all its pages are moved with it so that the performance of the process does not sink dramatically (as is the case today). Swap migration works by simply evicting the page. The pages must be faulted back in. The pages are then typically reallocated by the system near the node where the process is executing. For swap migration the destination of the move is controlled by the allocation policy. Cpusets set the allocation policy before calling sys_migrate_pages() in order to move the pages as intended. No allocation policy changes are performed for sys_migrate_pages(). This means that the pages may not faulted in to the specified nodes if no allocation policy was set by other means. The pages will just end up near the node where the fault occurred. There's another patch series in the pipeline which implements "direct migration". The direct migration patchset extends the migration functionality to avoid going through swap. The destination node of the relation is controllable during the actual moving of pages. The crutch of using the allocation policy to relocate is not necessary and the pages are moved directly to the target. Its also faster since swap is not used. And sys_migrate_pages() can then move pages directly to the specified node. Implement functions to isolate pages from the LRU and put them back later. This patch: An earlier implementation was provided by Hirokazu Takahashi <taka@valinux.co.jp> and IWAMOTO Toshihiro <iwamoto@valinux.co.jp> for the memory hotplug project. From: Magnus This breaks out isolate_lru_page() and putpack_lru_page(). Needed for swap migration. Signed-off-by: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] drop-pagecacheAndrew Morton2006-01-081-2/+1
| | | | | | | | | | | | | | | | | | | | | | Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to discard as much pagecache and/or reclaimable slab objects as it can. THis operation requires root permissions. It won't drop dirty data, so the user should run `sync' first. Caveats: a) Holds inode_lock for exorbitant amounts of time. b) Needs to be taught about NUMA nodes: propagate these all the way through so the discarding can be controlled on a per-node basis. This is a debugging feature: useful for getting consistent results between filesystem benchmarks. We could possibly put it under a config option, but it's less than 300 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: page_state optNick Piggin2006-01-061-12/+15
| | | | | | | | | | | | | | Optimise page_state manipulations by introducing interrupt unsafe accessors to page_state fields. Callers must provide their own locking (either disable interrupts or not update from interrupt context). Switch over the hot callsites that can easily be moved under interrupts off sections. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>