aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched/rt.c
Commit message (Collapse)AuthorAgeFilesLines
* sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_listKirill Tkhai2013-06-191-66/+16
| | | | | | | | | | | | | | | | | | | | | | | [ Peter, this is based off of some of my work, I ran it though a few tests and it passed. I also reviewed it, and added my SOB as I am somewhat a co-author to it. ] Based on the patch by Steven Rostedt from previous year: https://lkml.org/lkml/2012/4/18/517 1)Simplify pull_rt_task() logic: search in pushable tasks of dest runqueue. The only pullable tasks are the tasks which are pushable in their local rq, and no others. 2)Remove .leaf_rt_rq_list member of struct rt_rq and functions connected with it: nobody uses it since now. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/287571370557898@web7d.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Use an accessor to read the rq clockFrederic Weisbecker2013-05-281-4/+4
| | | | | | | | | | | | | | | | Read the runqueue clock through an accessor. This prepares for adding a debugging infrastructure to detect missing or redundant calls to update_rq_clock() between a scheduler's entry and exit point. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Remove redundant update_runtime notifierNeil Zhang2013-05-281-40/+0
| | | | | | | | | | | | | | | migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Signed-off-by: Neil Zhang <zhangwm@marvell.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Use this_rq() helperNathan Zimmer2013-05-101-1/+1
| | | | | | | | | | | | | | | | It is a few instructions more efficent to and slightly more readable to use this_rq()-> instead of cpu_rq(smp_processor_id())-> . Size comparison of kernel/sched/fair.o: text data bss dec hex filename 27972 122 26 28120 6dd8 fair.o.before 27956 122 26 28104 6dc8 fair.o.after Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1368116643-87971-1-git-send-email-nzimmer@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2013-02-191-9/+17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.
| * sched/rt: Add a tuning knob to allow changing SCHED_RR timesliceClark Williams2013-02-071-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a /proc/sys/kernel scheduler knob named sched_rr_timeslice_ms that allows global changing of the SCHED_RR timeslice value. User visable value is in milliseconds but is stored as jiffies. Setting to 0 (zero) resets to the default (currently 100ms). Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched/rt: Further simplify pick_rt_task()Kirill Tkhai2013-02-031-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Function next_prio() has been removed and pull_rt_task() is the only user of pick_next_highest_task_rt() at the moment. pull_rt_task is not interested in p->nr_cpus_allowed, its only interest is the fact that cpu is allowed to execute p. If nr_cpus_allowed == 1, cpu != task_cpu(p) and cpu is allowed then it means that task p is in the middle of the migration techniques; the task waits until it is moved by migration thread. So, lets pull it earlier. Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> CC: linux-rt-users <linux-rt-users@vger.kernel.org> Link: http://lkml.kernel.org/r/70871359644177@web16d.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched/rt: Do not account zero delta_exec in update_curr_rt()Kirill Tkhai2013-01-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several places of consecutive calls of dequeue_task_rt() and put_prev_task_rt() in the scheduler. For example, function rt_mutex_setprio() does it. The both calls lead to update_curr_rt(), the second of it receives zeroed delta_exec. The only effective action in this case is call of sched_rt_avg_update(), which can change rq->age_stamp and rq->rt_avg. But it is possible in case of ""floating"" rq->clock. This fact is not reasonable to be accounted. Another actions do nothing. Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> CC: linux-rt-users <linux-rt-users@vger.kernel.org> Link: http://lkml.kernel.org/r/931541359550236@web1g.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched/rt: Avoid updating RT entry timeout twice within one tick periodYing Xue2013-01-251-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, it means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Fan Du <fan.du@windriver.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched/rt: Add reschedule check to switched_from_rt()Kirill Tkhai2013-01-241-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Reschedule rq->curr if the first RT task has just been pulled to the rq. Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tkhai Kirill <tkhai@yandex.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/118761353614535@web28f.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | sched/rt: Use root_domain of rt_rq not current processorShawn Bohrer2013-01-251-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Mike Galbraith <bitbucket@online.de> Cc: peterz@infradead.org Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSWPeter Zijlstra2012-09-131-5/+0
| | | | | | | | | | | | Now that the last architecture to use this has stopped doing so (ARM, thanks Catalin!) we can remove this complexity from the scheduler core. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Unthrottle rt runqueues in __disable_runtime()Peter Boonstoppel2012-09-041-0/+1
| | | | | | | | | | | | | | | | | | migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttledMike Galbraith2012-08-131-0/+13
| | | | | | | | | | | Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched/rt: Fix lockdep annotation within find_lock_lowest_rq()Peter Zijlstra2012-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Roland Dreier reported spurious, hard to trigger lockdep warnings within the scheduler - without any real lockup. This bit gives us the right clue: > [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90 > [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290 if you look at that code you'll find the double_lock_balance() in question is the one in find_lock_lowest_rq() [yay for inlining]. Now find_lock_lowest_rq() has a bug.. it fails to use double_unlock_balance() in one exit path, if this results in a retry in push_rt_task() we'll call double_lock_balance() again, at which point we'll run into said lockdep confusion. Reported-by: Roland Dreier <roland@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/rt: Fix SCHED_RR across cgroupsColin Cross2012-05-301-5/+10
| | | | | | | | | | | | | | | | | | | | task_tick_rt() has an optimization to only reschedule SCHED_RR tasks if they were the only element on their rq. However, with cgroups a SCHED_RR task could be the only element on its per-cgroup rq but still be competing with other SCHED_RR tasks in its parent's cgroup. In this case, the SCHED_RR task in the child cgroup would never yield at the end of its timeslice. If the child cgroup rt_runtime_us was the same as the parent cgroup rt_runtime_us, the task in the parent cgroup would starve completely. Modify task_tick_rt() to check that the task is the only task on its rq, and that the each of the scheduling entities of its ancestors is also the only entity on its rq. Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'Peter Zijlstra2012-05-301-15/+21
| | | | | | | | | Since nr_cpus_allowed is used outside of sched/rt.c and wants to be used outside of there more, move it to a more natural site. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'tip/sched/core' of ↵Ingo Molnar2012-04-141-30/+26
|\ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into sched/core Pull a scheduler optimization commit from Steven Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in ↵Kirill Tkhai2012-04-121-30/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | set_cpus_allowed_rt() Migration status depends on a difference of weight from 0 and 1. If weight > 1 (<= 1) and old weight <= 1 (> 1) then task becomes pushable (or not pushable). We are not insterested in its exact values, is it 3 or 4, for example. Now if we are changing affinity from a set of 3 cpus to a set of 4, the- task will be dequeued and enqueued sequentially without important difference in comparison with initial state. The only difference is in internal representation of plist queue of pushable tasks and the fact that the task may won't be the first in a sequence of the same priority tasks. But it seems to me it gives nothing. Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tkhai Kirill <tkhai@yandex.ru> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | sched/rt: Improve pick_next_highest_task_rt()Michael J Wang2012-03-271-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | Avoid extra work by continuing on to the next rt_rq if the highest prio task in current rt_rq is the same priority as our candidate task. More detailed explanation: if next is not NULL, then we have found a candidate task, and its priority is next->prio. Now we are looking for an even higher priority task in the other rt_rq's. idx is the highest priority in the current candidate rt_rq. In the current 3.3 code, if idx is equal to next->prio, we would start scanning the tasks in that rt_rq and replace the current candidate task with a task from that rt_rq. But the new task would only have a priority that is equal to our previous candidate task, so we have not advanced our goal of finding a higher prio task. So we should avoid the extra work by continuing on to the next rt_rq if idx is equal to next->prio. Signed-off-by: Michael J Wang <mjwang@broadcom.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* printk/sched: Introduce special printk_sched() for those awkward momentsPeter Zijlstra2012-03-121-1/+7
| | | | | | | | | | | | | | | | | There's a few awkward printk()s inside of scheduler guts that people prefer to keep but really are rather deadlock prone. Fudge around it by storing the text in a per-cpu buffer and poll it using the existing printk_tick() handler. This will drop output when its more frequent than once a tick, however only the affinity thing could possible go that fast and for that just one should suffice to notify the admin he's done something silly.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched/rt: Do not throttle when PI boostingPeter Zijlstra2012-03-011-2/+18
| | | | | | | | | | | | | | When a runqueue has rt_runtime_us = 0 then the only way it can accumulate rt_time is via PI boosting. That causes the runqueue to be throttled and replenishing does not change anything due to rt_runtime_us = 0. So avoid that situation by clearing rt_time and skip the throttling alltogether. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [ Changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-7x70cypsotjb4jvcor3edctk@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched/rt: Keep period timer ticking when rt throttling is activePeter Zijlstra2012-03-011-5/+8
| | | | | | | | | | | | | | When a runqueue is throttled we cannot disable the period timer because that timer is the only way to undo the throttling. We got stale throttling entries when a rq was throttled and then the global sysctl was disabled, which stopped the timer. Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Added changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-nuj34q52p6ro7szapuz84i0v@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Make initial SCHED_RR timeslace DEF_TIMESLICEHiroshi Shimamoto2012-02-221-2/+2
| | | | | | | | | | | | | Current the initial SCHED_RR timeslice of init_task is HZ, which means 1s, and is not same as the default SCHED_RR timeslice DEF_TIMESLICE. Change that initial timeslice to the DEF_TIMESLICE. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> [ s/DEF_TIMESLICE/RR_TIMESLICE/g ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4F3C9995.3010800@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSWChanho Min2012-01-271-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This issue happens under the following conditions: 1. preemption is off 2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined 3. RT scheduling class 4. SMP system Sequence is as follows: 1.suppose current task is A. start schedule() 2.task A is enqueued pushable task at the entry of schedule() __schedule prev = rq->curr; ... put_prev_task put_prev_task_rt enqueue_pushable_task 4.pick the task B as next task. next = pick_next_task(rq); 3.rq->curr set to task B and context_switch is started. rq->curr = next; 4.At the entry of context_swtich, release this cpu's rq->lock. context_switch prepare_task_switch prepare_lock_switch raw_spin_unlock_irq(&rq->lock); 5.Shortly after rq->lock is released, interrupt is occurred and start IRQ context 6.try_to_wake_up() which called by ISR acquires rq->lock try_to_wake_up ttwu_remote rq = __task_rq_lock(p) ttwu_do_wakeup(rq, p, wake_flags); task_woken_rt 7.push_rt_task picks the task A which is enqueued before. task_woken_rt push_rt_tasks(rq) next_task = pick_next_pushable_task(rq) 8.At find_lock_lowest_rq(), If double_lock_balance() returns 0, lowest_rq can be the remote rq. (But,If preemption is on, double_lock_balance always return 1 and it does't happen.) push_rt_task find_lock_lowest_rq if (double_lock_balance(rq, lowest_rq)).. 9.find_lock_lowest_rq return the available rq. task A is migrated to the remote cpu/rq. push_rt_task ... deactivate_task(rq, next_task, 0); set_task_cpu(next_task, lowest_rq->cpu); activate_task(lowest_rq, next_task, 0); 10. But, task A is on irq context at this cpu. So, task A is scheduled by two cpus at the same time until restore from IRQ. Task A's stack is corrupted. To fix it, don't migrate an RT task if it's still running. Signed-off-by: Chanho Min <chanho.min@lge.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched/rt: Code cleanup, remove a redundant function callShan Hai2011-12-061-1/+1
| | | | | | | | | | | The second call to sched_rt_period() is redundant, because the value of the rt_runtime was already read and it was protected by the ->rt_runtime_lock. Signed-off-by: Shan Hai <haishan.bai@gmail.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1322535836-13590-2-git-send-email-haishan.bai@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Use rt.nr_cpus_allowed to recover select_task_rq() cyclesMike Galbraith2011-12-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rt.nr_cpus_allowed is always available, use it to bail from select_task_rq() when only one cpu can be used, and saves some cycles for pinned tasks. See the line marked with '*' below: # taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3) ------------------------------------------------------------------------------------------------ Virgin Patched samples pcnt function samples pcnt function _______ _____ ___________________________ _______ _____ ___________________________ 2880.00 10.2% __schedule 3136.00 11.3% __schedule 1634.00 5.8% pipe_read 1615.00 5.8% pipe_read 1458.00 5.2% system_call 1534.00 5.5% system_call 1382.00 4.9% _raw_spin_lock_irqsave 1412.00 5.1% _raw_spin_lock_irqsave 1202.00 4.3% pipe_write 1255.00 4.5% copy_user_generic_string 1164.00 4.1% copy_user_generic_string 1241.00 4.5% __switch_to 1097.00 3.9% __switch_to 929.00 3.3% mutex_lock 872.00 3.1% mutex_lock 846.00 3.0% mutex_unlock 687.00 2.4% mutex_unlock 804.00 2.9% pipe_write 682.00 2.4% native_sched_clock 713.00 2.6% native_sched_clock 643.00 2.3% system_call_after_swapgs 653.00 2.3% _raw_spin_unlock_irqrestore 617.00 2.2% sched_clock_local 633.00 2.3% fsnotify 612.00 2.2% fsnotify 605.00 2.2% sched_clock_local 596.00 2.1% _raw_spin_unlock_irqrestore 593.00 2.1% system_call_after_swapgs 542.00 1.9% sysret_check 559.00 2.0% sysret_check 467.00 1.7% fget_light 472.00 1.7% fget_light 462.00 1.6% finish_task_switch 461.00 1.7% finish_task_switch 437.00 1.5% vfs_write 442.00 1.6% vfs_write 431.00 1.5% do_sync_write 428.00 1.5% do_sync_write * 413.00 1.5% select_task_rq_fair 404.00 1.5% _raw_spin_lock_irq 386.00 1.4% update_curr 402.00 1.4% update_curr 385.00 1.4% rw_verify_area 389.00 1.4% do_sync_read 377.00 1.3% _raw_spin_lock_irq 378.00 1.4% vfs_read 369.00 1.3% do_sync_read 340.00 1.2% pipe_iov_copy_from_user 360.00 1.3% vfs_read 316.00 1.1% __wake_up_sync_key 342.00 1.2% hrtick_start_fair 313.00 1.1% __wake_up_common Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Move all scheduler bits into kernel/sched/Peter Zijlstra2011-11-171-0/+2045
There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>