aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched_fair.c
Commit message (Collapse)AuthorAgeFilesLines
* sched: Fix vmark regression on big machinesMike Galbraith2010-01-211-1/+1
| | | | | | | | | | | | | | SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled, leading to many cache misses on large machines as we traverse looking for an idle shared cache to wake to. Change the enabler of select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level. Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1262612696.15495.15.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Remove the cfs_rq dependency from set_task_cpu()Peter Zijlstra2009-12-161-6/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to remove the cfs_rq dependency from set_task_cpu() we need to ensure the task is cfs_rq invariant for all callsites. The simple approach is to substract cfs_rq->min_vruntime from se->vruntime on dequeue, and add cfs_rq->min_vruntime on enqueue. However, this has the downside of breaking FAIR_SLEEPERS since we loose the old vruntime as we only maintain the relative position. To solve this, we observe that we only migrate runnable tasks, we do this using deactivate_task(.sleep=0) and activate_task(.wakeup=0), therefore we can restrain the min_vruntime invariance to that state. The only other case is wakeup balancing, since we want to maintain the old vruntime we cannot make it relative on dequeue, but since we don't migrate inactive tasks, we can do so right before we activate it again. This is where we need the new pre-wakeup hook, we need to call this while still holding the old rq->lock. We could fold it into ->select_task_rq(), but since that has multiple callsites and would obfuscate the locking requirements, that seems like a fudge. This leaves the fork() case, simply make sure that ->task_fork() leaves the ->vruntime in a relative state. This covers all cases where set_task_cpu() gets called, and ensures it sees a relative vruntime. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.191697025@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Select_task_rq_fair() must honour SD_LOAD_BALANCEPeter Zijlstra2009-12-161-0/+3
| | | | | | | | | | We should skip !SD_LOAD_BALANCE domains. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.653578430@chello.nl> CC: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Convert rq->lock to raw_spinlockThomas Gleixner2009-12-141-2/+2
| | | | | | | | | | Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>
* sched: Update normalized values on user updates via procChristian Ehrhardt2009-12-091-1/+10
| | | | | | | | | | | | | | | | | The normalized values are also recalculated in case the scaling factor changes. This patch updates the internally used scheduler tuning values that are normalized to one cpu in case a user sets new values via sysfs. Together with patch 2 of this series this allows to let user configured values scale (or not) to cpu add/remove events taking place later. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-4-git-send-email-ehrhardt@linux.vnet.ibm.com> [ v2: fix warning ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Make tunable scaling style configurableChristian Ehrhardt2009-12-091-0/+13
| | | | | | | | | | | | | | | | | As scaling now takes place on all kind of cpu add/remove events a user that configures values via proc should be able to configure if his set values are still rescaled or kept whatever happens. As the comments state that log2 was just a second guess that worked the interface is not just designed for on/off, but to choose a scaling type. Currently this allows none, log and linear, but more important it allwos us to keep the interface even if someone has an even better idea how to scale the values. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-3-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Fix missing sched tunable recalculation on cpu add/removeChristian Ehrhardt2009-12-091-0/+16
| | | | | | | | | | | Based on Peter Zijlstras patch suggestion this enables recalculation of the scheduler tunables in response of a change in the number of cpus. It also adds a max of eight cpus that are considered in that scaling. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-2-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Remove unnecessary RCU exclusionPeter Zijlstra2009-12-091-7/+2
| | | | | | | | | | | | | | | As Nick pointed out, and realized by myself when doing: sched: Fix balance vs hotplug race the patch: sched: for_each_domain() vs RCU is wrong, sched_domains are freed after synchronize_sched(), which means disabling preemption is enough. Reported-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Discard some old bitsPeter Zijlstra2009-12-091-3/+0
| | | | | | | | | WAKEUP_RUNNING was an experiment, not sure why that ever ended up being merged... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Clean up check_preempt_wakeup()Peter Zijlstra2009-12-091-40/+33
| | | | | | | | | Streamline the wakeup preemption code a bit, unifying the preempt path so that they all do the same. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Move update_curr() in check_preempt_wakeup() to avoid redundant callJupyung Lee2009-12-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | If a RT task is woken up while a non-RT task is running, check_preempt_wakeup() is called to check whether the new task can preempt the old task. The function returns quickly without going deeper because it is apparent that a RT task can always preempt a non-RT task. In this situation, check_preempt_wakeup() always calls update_curr() to update vruntime value of the currently running task. However, the function call is unnecessary and redundant at that moment because (1) a non-RT task can always be preempted by a RT task regardless of its vruntime value, and (2) update_curr() will be called shortly when the context switch between two occurs. By moving update_curr() in check_preempt_wakeup(), we can avoid redundant call to update_curr(), slightly reducing the time taken to wake up RT tasks. Signed-off-by: Jupyung Lee <jupyung@gmail.com> [ Place update_curr() right before the wake_preempt_entity() call, which is the only thing that relies on the updated vruntime ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Sanitize fork() handlingPeter Zijlstra2009-12-091-13/+15
| | | | | | | | | | | | | | | Currently we try to do task placement in wake_up_new_task() after we do the load-balance pass in sched_fork(). This yields complicated semantics in that we have to deal with tasks on different RQs and the set_task_cpu() calls in copy_process() and sched_fork() Rename ->task_new() to ->task_fork() and call it from sched_fork() before the balancing, this gives the policy a clear point to place the task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Protect sched_rr_get_param() access to task->sched_classThomas Gleixner2009-12-091-5/+1
| | | | | | | | | | | | | | | | sched_rr_get_param calls task->sched_class->get_rr_interval(task) without protection against a concurrent sched_setscheduler() call which modifies task->sched_class. Serialize the access with task_rq_lock(task) and hand the rq pointer into get_rr_interval() as it's needed at least in the sched_fair implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'sched/urgent' into sched/coreIngo Molnar2009-11-261-27/+47
|\ | | | | | | | | | | Merge reason: Pick up fixes that did not make it into .32.0 Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Strengthen buddies and mitigate buddy induced latenciesMike Galbraith2009-10-231-26/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp from collapsing due to wakeup preemption. It also switches LAST_BUDDY to exclusively do what it does best, namely mitigate the effects of aggressive wakeup preemption, which improves vmark throughput markedly, and restores mysql+oltp scalability. Since buddies are about scalability, enable them beginning at the point where we begin expanding sched_latency, namely sched_nr_latency. Previously, buddies were cleared aggressively, which seriously reduced their effectiveness. Not clearing aggressively however, produces a small drop in mysql+oltp throughput immediately after peak, indicating that LAST_BUDDY is actually doing some harm. This is right at the point where X on the desktop in competition with another load wants low latency service. Ergo, do not enable until we need to scale. To mitigate latency induced by buddies, or by a task just missing wakeup preemption, check latency at tick time. Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via CACHE_HOT_BUDDY. Supporting performance tests: tip = v2.6.32-rc5-1497-ga525b32 tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs (Three run averages except where noted.) vmark: ------ tip 108466 messages per second tip+ 125307 messages per second tip+x 125335 messages per second tipx 117781 messages per second 2.6.31.3 122729 messages per second mysql+oltp: ----------- clients 1 2 4 8 16 32 64 128 256 .......................................................................................... tip 9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08 tip+ 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47 tipx 9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18 2.6.31.3 8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86 pgsql+oltp: ----------- clients 1 2 4 8 16 32 64 128 256 .......................................................................................... tip 13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35 tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94 tip+x 13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00 tipx 13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84 2.6.31.3 13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25 Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Do less agressive buddy clearingPeter Zijlstra2009-10-141-14/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yanmin reported a hackbench regression due to: > commit de69a80be32445b0a71e8e3b757e584d7beb90f7 > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > Date: Thu Sep 17 09:01:20 2009 +0200 > > sched: Stop buddies from hogging the system I really liked de69a80b, and it affecting hackbench shows I wasn't crazy ;-) So hackbench is a multi-cast, with one sender spraying multiple receivers, who in their turn don't spray back. This would be exactly the scenario that patch 'cures'. Previously we would not clear the last buddy after running the next task, allowing the sender to get back to work sooner than it otherwise ought to have been, increasing latencies for other tasks. Now, since those receivers don't poke back, they don't enforce the buddy relation, which means there's nothing to re-elect the sender. Cure this by less agressively clearing the buddy stats. Only clear buddies when they were not chosen. It should still avoid a buddy sticking around long after its served its time. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Mike Galbraith <efault@gmx.de> LKML-Reference: <1255084986.8802.46.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | sched: Optimize branch hint in pick_next_task_fair()Tim Blechmann2009-11-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Branch hint profiling on my nehalem machine showed 90% incorrect branch hints: 15728471 158903754 90 pick_next_task_fair sched_fair.c 1555 Signed-off-by: Tim Blechmann <tim@klingt.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0BBBB1.2050100@klingt.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | sched: More generic WAKE_AFFINE vs select_idle_sibling()Peter Zijlstra2009-11-131-17/+16
| | | | | | | | | | | | | | | | | | | | | | Instead of only considering SD_WAKE_AFFINE | SD_PREFER_SIBLING domains also allow all SD_PREFER_SIBLING domains below a SD_WAKE_AFFINE domain to change the affinity target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091112145610.909723612@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | sched: Cleanup select_task_rq_fair()Peter Zijlstra2009-11-131-22/+51
| | | | | | | | | | | | | | | | | | | | Clean up the new affine to idle sibling bits while trying to grok them. Should not have any function differences. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091112145610.832503781@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | sched: Fix affinity logic in select_task_rq_fair()Mike Galbraith2009-11-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ingo Molnar reported: [ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10 [ 26.808000] caller is vmstat_update+0x26/0x70 [ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887 [ 26.816000] Call Trace: [ 26.820000] [<c1924a24>] ? printk+0x28/0x3c [ 26.824000] [<c13258a0>] debug_smp_processor_id+0xf0/0x110 [ 26.824000] mount used greatest stack depth: 1464 bytes left [ 26.828000] [<c111d086>] vmstat_update+0x26/0x70 [ 26.832000] [<c1086418>] worker_thread+0x188/0x310 [ 26.836000] [<c10863b7>] ? worker_thread+0x127/0x310 [ 26.840000] [<c108d310>] ? autoremove_wake_function+0x0/0x60 [ 26.844000] [<c1086290>] ? worker_thread+0x0/0x310 [ 26.848000] [<c108cf0c>] kthread+0x7c/0x90 [ 26.852000] [<c108ce90>] ? kthread+0x0/0x90 [ 26.856000] [<c100c0a7>] kernel_thread_helper+0x7/0x10 [ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10 [ 26.864000] caller is vmstat_update+0x3c/0x70 Because this commit: a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair() broke ->cpus_allowed. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: arjan@infradead.org Cc: <stable@kernel.org> LKML-Reference: <1257415066.12867.1.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | sched: Check for an idle shared cache in select_task_rq_fair()Mike Galbraith2009-11-041-4/+29
|/ | | | | | | | | | | | | | | | When waking affine, check for an idle shared cache, and if found, wake to that CPU/sibling instead of the waker's CPU. This improves pgsql+oltp ramp up by roughly 8%. Possibly more for other loads, depending on overlap. The trade-off is a roughly 1% peak downturn if tasks are truly synchronous. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> LKML-Reference: <1256654138.17752.7.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sysctl: remove "struct file *" argument of ->proc_handlerAlexey Dobriyan2009-09-241-2/+2
| | | | | | | | | | | | | | | | | | | | | It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2009-09-211-23/+42
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Simplify sys_sched_rr_get_interval() system call sched: Fix potential NULL derference of doms_cur sched: Fix raciness in runqueue_is_locked() sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair() sched: Remove unneeded indentation in sched_fair.c::place_entity()
| * sched: Simplify sys_sched_rr_get_interval() system callPeter Williams2009-09-211-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | By removing the need for it to know details of scheduling classes. This allows PlugSched to define orthogonal scheduling classes. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair()Mike Galbraith2009-09-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While doing some testing, I pinned mplayer, only to find it following X around like a puppy. Looking at commit c88d591, I found a cpu_allowed check that went AWOL. I plugged it back in where it looks like it needs to go, and now when I say "sit, stay!", mplayer obeys again. 'c88d591 sched: Merge select_task_rq_fair() and sched_balance_self()' accidentally dropped the check, causing wake_affine() to pull pinned tasks - put it back. [ v2: use a cheaper version from Peter ] Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Remove unneeded indentation in sched_fair.c::place_entity()Mike Galbraith2009-09-181-22/+19
| | | | | | | | | | | | | | Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1253258365.22787.33.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | Merge branch 'linus' into perfcounters/coreIngo Molnar2009-09-191-153/+261
|\ \ | |/ | | | | | | | | Merge reason: Bring in tracing changes we depend on. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINEPeter Zijlstra2009-09-171-15/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL code can break out of the domain iteration early, making us miss the SD_WAKE_AFFINE bits. Fix this by continuing iteration until there is no need for a larger domain. This also cleans up the cgroup stuff a bit, but not having two update_shares() invocations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Stop buddies from hogging the systemPeter Zijlstra2009-09-171-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clear buddies more agressively. The (theoretical, haven't actually observed any of this) problem is that when we do not select either buddy in pick_next_entity() because they are too far ahead of the left-most task, we do not clear the buddies. This means that as soon as we service the left-most task, these same buddies will be tried again on the next schedule. Now if the left-most task was a pure hog, it wouldn't have done any wakeups and it wouldn't have set buddies of its own. That leads to the old buddies dominating, which would lead to bad latencies. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Add new wakeup preemption mode: WAKEUP_RUNNINGPeter Zijlstra2009-09-171-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create a new wakeup preemption mode, preempt towards tasks that run shorter on avg. It sets next buddy to be sure we actually run the task we preempted for. Test results: root@twins:~# while :; do :; done & [1] 6537 root@twins:~# while :; do :; done & [2] 6538 root@twins:~# while :; do :; done & [3] 6539 root@twins:~# while :; do :; done & [4] 6540 root@twins:/home/peter# ./latt -c4 sleep 4 Entries: 48 (clients=4) Averages: ------------------------------ Max 4750 usec Avg 497 usec Stdev 737 usec root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features root@twins:/home/peter# ./latt -c4 sleep 4 Entries: 48 (clients=4) Averages: ------------------------------ Max 14 usec Avg 5 usec Stdev 3 usec Disabled by default - needs more testing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <new-submission>
| * sched: Rename flags to wake_flagsPeter Zijlstra2009-09-161-3/+3
| | | | | | | | | | | | | | | | For consistencies sake, rename the argument (again). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Clean up the load_idx selection in select_task_rq_fairPeter Zijlstra2009-09-161-19/+8
| | | | | | | | | | | | | | | | Clean up the code a little. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Optimize cgroup vs wakeup a bitPeter Zijlstra2009-09-161-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | We don't need to call update_shares() for each domain we iterate, just got the largets one. However, we should call it before wake_affine() as well, so that that can use up-to-date values too. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Implement a gentler fair-sleepers featureIngo Molnar2009-09-161-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add back FAIR_SLEEPERS and GENTLE_FAIR_SLEEPERS. FAIR_SLEEPERS is the old logic: credit sleepers with their sleep time. GENTLE_FAIR_SLEEPERS dampens this a bit: 50% of their sleep time gets credited. The hope here is to still give the benefits of fair-sleepers logic (quick wakeups, etc.) while not allow them to have 100% of their sleep time as if they were running. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Add SD_PREFER_LOCALPeter Zijlstra2009-09-161-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | And turn it on for NUMA and MC domains. This improves locality in balancing decisions by keeping up to capacity amount of tasks local before looking for idle CPUs. (and twice the capacity if SD_POWERSAVINGS_BALANCE is set.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Add a few SYNC hint knobs to play withPeter Zijlstra2009-09-151-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we use overlap to weaken the SYNC hint, but allow it to set the hint as well. echo NO_SYNC_WAKEUP > /debug/sched_features echo SYNC_MORE > /debug/sched_features preserves pipe-test behaviour without using the WF_SYNC hint. Worth playing with on more workloads... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Add WF_FORKPeter Zijlstra2009-09-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Avoid the cache buddies from biasing the time distribution away from fork()ers. Normally the next buddy will be the preferred scheduling target, but this makes fork()s prefer to run the new child, whereas we prefer to run the parent, since that will generate more work. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Rename sync argumentsPeter Zijlstra2009-09-151-2/+4
| | | | | | | | | | | | | | | | | | | | In order to extend the functions to have more than 1 flag (sync), rename the argument to flags, and explicitly define a WF_ space for individual flags. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Rename select_task_rq() argumentPeter Zijlstra2009-09-151-7/+7
| | | | | | | | | | | | | | | | | | In order to be able to rename the sync argument, we need to rename the current flag argument. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Tweak wake_idxPeter Zijlstra2009-09-151-3/+18
| | | | | | | | | | | | | | | | | | | | When merging select_task_rq_fair() and sched_balance_self() we lost the use of wake_idx, restore that and set them to 0 to make wake balancing more aggressive. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Fix task affinity for select_task_rq_fairPeter Zijlstra2009-09-151-3/+2
| | | | | | | | | | | | | | | | | | While merging select_task_rq_fair() and sched_balance_self() I made a mistake that leads to testing the wrong task affinty. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: for_each_domain() vs RCUPeter Zijlstra2009-09-151-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | for_each_domain() uses RCU to serialize the sched_domains, except it doesn't actually use rcu_read_lock() and instead relies on disabling preemption -> FAIL. XXX: audit other sched_domain code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Weaken SD_POWERSAVINGS_BALANCEPeter Zijlstra2009-09-151-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | One of the problems of power-saving balancing is that under certain scenarios it is too slow and allows tons of real work to pile up. Avoid this by ignoring the powersave stuff when there's real work to be done. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Merge select_task_rq_fair() and sched_balance_self()Peter Zijlstra2009-09-151-170/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem with wake_idle() is that is doesn't respect things like cpu_power, which means it doesn't deal well with SMT nor the recent RT interaction. To cure this, it needs to do what sched_balance_self() does, which leads to the possibility of merging select_task_rq_fair() and sched_balance_self(). Modify sched_balance_self() to: - update_shares() when walking up the domain tree, (it only called it for the top domain, but it should have done this anyway), which allows us to remove this ugly bit from try_to_wake_up(). - do wake_affine() on the smallest domain that contains both this (the waking) and the prev (the wakee) cpu for WAKE invocations. Then use the top-down balance steps it had to replace wake_idle(). This leads to the dissapearance of SD_WAKE_BALANCE and SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE. SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective. Touch all topology bits to replace the old with new SD flags -- platforms might need re-tuning, enabling SD_BALANCE_WAKE conditionally on a NUMA distance seems like a good additional feature, magny-core and small nehalem systems would want this enabled, systems with slow interconnects would not. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Hook sched_balance_self() into sched_class::select_task_rq()Peter Zijlstra2009-09-151-1/+6
| | | | | | | | | | | | | | | | | | Rather ugly patch to fully place the sched_balance_self() code inside the fair class. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Move sched_balance_self() into sched_fair.cPeter Zijlstra2009-09-151-0/+145
| | | | | | | | | | | | | | | | | | | | | | Move the sched_balance_self() code into sched_fair.c This facilitates the merger of sched_balance_self() and sched_fair::select_task_rq(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Complete buddy switchesMike Galbraith2009-09-151-1/+2
| | | | | | | | | | | | | | | | | | Add a NEXT_BUDDY feature flag to aid in debugging. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Split WAKEUP_OVERLAPPeter Zijlstra2009-09-151-3/+4
| | | | | | | | | | | | | | | | | | It consists of two conditions, split them out in separate toggles so we can test them independently. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | perf_counter, sched: Add sched_stat_runtime tracepointIngo Molnar2009-09-131-0/+1
|/ | | | | | | | | | | | | This allows more precise tracking of how the scheduler accounts (and acts upon) a task having spent N nanoseconds of CPU time. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Fix sched::sched_stat_wait tracepoint fieldIngo Molnar2009-09-101-2/+1
| | | | | | | | | | | | | | | | | | | This weird perf trace output: cc1-9943 [001] 2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns] Is caused by setting one component field of the delta to zero a bit too early. Move it to later. ( Note, this does not affect the NEW_FAIR_SLEEPERS interactivity bug, it's just a reporting bug in essence. ) Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikos Chantziaras <realnc@arcor.de> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <4AA93D34.8040500@arcor.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>