Understanding sched_class
This article is Part 3 in a 3-Part Series.
- Part 1 - Linux Scheduler Series: Introduction
- Part 2 - What is the Linux Scheduler?
- Part 3 - Understanding sched_class
I’ve skipped a bunch of stuff to get here because the scheduler assignment is
due soon. In this section, I will analyze struct sched_class and talk briefly
about what each function does. I’ve reproduced struct sched_class below.
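For orientation, here is an abridged sketch that lists only the hooks discussed in this article. It is not the full kernel definition: the real struct has many more members, and the exact parameter lists differ between kernel versions, so treat the signatures below as assumptions and check sched.h in your own tree.

```c
#include <stddef.h>

struct rq;            /* per-CPU runqueue, opaque in this sketch */
struct task_struct;   /* process descriptor, opaque in this sketch */

/* Abridged sketch: only the hooks covered in this article. */
struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

    struct task_struct *(*pick_next_task)(struct rq *rq,
                                          struct task_struct *prev);
    void (*put_prev_task)(struct rq *rq, struct task_struct *p);

    int  (*select_task_rq)(struct task_struct *p, int task_cpu,
                           int sd_flag, int flags);

    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
};
```

Each scheduling policy fills in one instance of this struct with its own functions, and the core scheduler only ever calls through the pointers.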
enqueue_task and dequeue_task
enqueue_task and dequeue_task are used to put a task on the runqueue and remove
a task from the runqueue, respectively. Each of these functions is passed the
task to be enqueued/dequeued, as well as the runqueue it should be added
to/removed from. In addition, these functions are given a bit vector of flags
that describe why enqueue or dequeue is being called. Here are the various
flags, which are described in sched.h:
The flags argument can be tested using the bitwise & operation. For example,
if the task was just migrated from another CPU, flags & ENQUEUE_MIGRATED
evaluates to a nonzero value (the flag’s own bit), and to 0 otherwise.
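Here is a minimal userspace sketch of that test. The flag names mirror the kernel’s, but the numeric values are assumptions chosen for the example, and the two helper functions are hypothetical. Since the result of & is the flag’s bit value rather than a fixed constant, the check is written as a comparison against zero.

```c
#include <stdbool.h>

/* Illustrative values only; the real definitions live in the kernel's sched.h. */
#define ENQUEUE_WAKEUP   0x01
#define ENQUEUE_MIGRATED 0x40

/* Returns true if the enqueue was caused by a migration from another CPU. */
bool enqueued_by_migration(int flags)
{
    return (flags & ENQUEUE_MIGRATED) != 0;
}

/* Returns true if the enqueue was caused by a wakeup. */
bool enqueued_by_wakeup(int flags)
{
    return (flags & ENQUEUE_WAKEUP) != 0;
}
```

Several flags can be set at once, so each one should be tested separately instead of comparing flags against a single value.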
These functions are called for a variety of reasons:
- When a child process is first forked, enqueue_task is called to put it on a runqueue. When a process exits, dequeue_task takes it off the runqueue.
- When a process goes to sleep, dequeue_task takes it off the runqueue. For example, this happens when the process needs to wait for a lock or IO event. When the IO event occurs, or the lock becomes available, the process wakes up. It must then be re-enqueued with enqueue_task.
- Process migration - if a process must be migrated from one CPU’s runqueue to another, it’s dequeued from its old runqueue and enqueued on a different one using these functions.
- When set_cpus_allowed is called to change the task’s processor affinity, it may need to be enqueued on a different CPU’s runqueue.
- When the priority of a process is boosted to avoid priority inversion. In this case, p used to have a low-priority sched_class, but is being promoted to a sched_class with high priority. This action occurs in rt_mutex_setprio.
- From __sched_setscheduler. If a task’s sched_class has changed, it’s dequeued using its old sched_class and enqueued with the new one.
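The last two cases follow the same pattern: dequeue with the old class, switch the class pointer, enqueue with the new class. The sketch below is a toy userspace model of that pattern, not kernel code. All of the types and names (toy_sched_class, toy_change_class, the nr_fair and nr_rt counters) are made up for illustration.

```c
#include <stddef.h>

struct rq;
struct task;

/* Toy stand-in for struct sched_class with only two hooks. */
struct toy_sched_class {
    void (*enqueue_task)(struct rq *rq, struct task *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task *p, int flags);
};

struct task {
    const struct toy_sched_class *sched_class;
    int on_rq;      /* 1 while the task sits on a runqueue */
};

struct rq {
    int nr_fair;    /* tasks queued by the toy "fair" class */
    int nr_rt;      /* tasks queued by the toy "rt" class */
};

void fair_enqueue(struct rq *rq, struct task *p, int flags)
{ (void)flags; rq->nr_fair++; p->on_rq = 1; }

void fair_dequeue(struct rq *rq, struct task *p, int flags)
{ (void)flags; rq->nr_fair--; p->on_rq = 0; }

void rt_enqueue(struct rq *rq, struct task *p, int flags)
{ (void)flags; rq->nr_rt++; p->on_rq = 1; }

void rt_dequeue(struct rq *rq, struct task *p, int flags)
{ (void)flags; rq->nr_rt--; p->on_rq = 0; }

const struct toy_sched_class toy_fair_class = { fair_enqueue, fair_dequeue };
const struct toy_sched_class toy_rt_class   = { rt_enqueue, rt_dequeue };

/* Move p to new_class: the old class dequeues it, the new class enqueues it. */
void toy_change_class(struct rq *rq, struct task *p,
                      const struct toy_sched_class *new_class)
{
    int queued = p->on_rq;

    if (queued)
        p->sched_class->dequeue_task(rq, p, 0);
    p->sched_class = new_class;
    if (queued)
        p->sched_class->enqueue_task(rq, p, 0);
}
```

The important detail is the ordering: the dequeue has to go through the old class’s hook, because only that class knows how the task is stored in its own data structures.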
pick_next_task
pick_next_task is called by the core scheduler to determine which of rq’s
tasks should be running. The name is a bit misleading. This function is not
supposed to return the task that should run after the currently running task;
instead, it’s supposed to return the task_struct that should be running now,
in this instant.
The kernel will context switch from the task specified by prev to the task
returned by pick_next_task.
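To make the "running now" point concrete, here is a toy pick function for a made-up priority class. It is only a sketch: the struct layouts, the array-based runqueue, and the rule that a lower prio value wins are all assumptions for the example. Notice that it can return prev itself when prev is still the best choice.

```c
#include <stddef.h>

#define TOY_MAX_TASKS 8

struct task {
    int prio;   /* lower value = higher priority in this toy model */
};

struct rq {
    struct task *runnable[TOY_MAX_TASKS];  /* includes the running task */
    int nr_running;
};

/* Return the task that should be on the CPU right now, or NULL if idle. */
struct task *toy_pick_next_task(struct rq *rq, struct task *prev)
{
    struct task *best = NULL;
    int i;

    (void)prev;  /* unused here: the choice depends only on the runqueue */
    for (i = 0; i < rq->nr_running; i++) {
        if (!best || rq->runnable[i]->prio < best->prio)
            best = rq->runnable[i];
    }
    return best;
}
```

If the function hands back prev, there is nothing to switch to and the current task simply keeps running.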
put_prev_task
put_prev_task is called whenever a task is to be taken off the CPU. The
behavior of this function is up to the specific sched_class. Some schedulers
do very little in this function. For example, the realtime scheduler uses this
function as an opportunity to perform simple bookkeeping. On the other hand,
CFS’s put_prev_task_fair needs to do a bit more work. As an optimization, CFS
keeps the currently running task out of its RB tree. It uses the put_prev_task
hook as an opportunity to put the currently running task (that is, the task
specified by p) back in the RB tree.
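The idea of keeping the running task out of the queue and re-inserting it when it leaves the CPU can be sketched in userspace as follows. This is not how CFS is implemented: a sorted array stands in for the RB tree, and every name here (toy_queue_insert, toy_put_prev_task, the struct fields) is hypothetical.

```c
#include <stddef.h>

#define TOY_MAX_TASKS 8

struct task {
    unsigned long vruntime;
};

struct rq {
    struct task *queue[TOY_MAX_TASKS];  /* sorted by vruntime; stands in for the RB tree */
    int nr_queued;
    struct task *curr;                  /* running task, kept out of queue */
};

/* Insert p in sorted order; returns 0 on success, -1 if the queue is full. */
int toy_queue_insert(struct rq *rq, struct task *p)
{
    int i;

    if (rq->nr_queued == TOY_MAX_TASKS)
        return -1;
    i = rq->nr_queued;
    while (i > 0 && rq->queue[i - 1]->vruntime > p->vruntime) {
        rq->queue[i] = rq->queue[i - 1];  /* shift larger entries right */
        i--;
    }
    rq->queue[i] = p;
    rq->nr_queued++;
    return 0;
}

/* Toy put_prev_task: the task leaving the CPU goes back into the queue. */
void toy_put_prev_task(struct rq *rq, struct task *p)
{
    if (rq->curr == p && toy_queue_insert(rq, p) == 0)
        rq->curr = NULL;
}
```

Because the running task’s vruntime changes while it runs, keeping it outside the ordered structure avoids re-sorting on every update; it only needs to be placed once, when it comes off the CPU.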
The sched_class’s put_prev_task is called by the function put_prev_task, which
is defined in sched.h.
It seems a bit silly, but the sched_class’s pick_next_task is expected to call
put_prev_task by itself! This is documented in the following comment:
Note that this was not the case in prior kernels; put_prev_task used to be
called by the core scheduler before it called pick_next_task.
task_tick
This is one of the most important scheduler functions. It is called whenever
a timer interrupt happens, and its job is to perform bookkeeping and set the
need_resched flag if the currently-running process needs to be preempted:
The need_resched flag can be set by the function resched_curr, found in
core.c:
With SMP, there’s a need_resched flag for every CPU. Thus, resched_curr
might involve sending an APIC inter-processor interrupt to another processor
(you don’t want to go here). The takeaway is that you should just use
resched_curr to set need_resched, rather than trying to do this yourself.
Note: in prior kernel versions, resched_curr used to be called resched_task.
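As a rough sketch of what a tick handler might look like, here is a toy round-robin version in userspace. The time_slice field, the TOY_TIMESLICE quantum, and the toy_ prefixed functions are all hypothetical; in a real sched_class you would call the kernel’s resched_curr instead of the stand-in below.

```c
struct task {
    int time_slice;      /* ticks left before the task should yield */
};

struct rq {
    struct task *curr;
    int need_resched;    /* toy stand-in for the need_resched flag */
};

#define TOY_TIMESLICE 4  /* hypothetical quantum, in ticks */

/* Toy stand-in for resched_curr(): the only place the flag is set. */
void toy_resched_curr(struct rq *rq)
{
    rq->need_resched = 1;
}

/* Called on every timer tick for the running task. */
void toy_task_tick(struct rq *rq, struct task *curr)
{
    if (curr->time_slice > 0)
        curr->time_slice--;

    if (curr->time_slice == 0) {
        curr->time_slice = TOY_TIMESLICE;  /* refill for its next turn */
        toy_resched_curr(rq);              /* ask for a reschedule */
    }
}
```

The tick handler itself never switches tasks. It only does the accounting and raises the flag, and the actual switch happens later when the core scheduler notices the flag.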
select_task_rq
The core scheduler invokes this function to figure out which CPU to assign a
task to. This is used for distributing processes across multiple CPUs; the core
scheduler will call enqueue_task, passing the runqueue corresponding to the CPU
that is returned by this function. CPU assignment obviously occurs when a
process is first forked, but CPU reassignment can happen for a large variety of
reasons. Here are some instances where select_task_rq is called:
- When a process is first forked.
- When a task is woken up after having gone to sleep.
- In response to any of the syscalls in the execv family. This is an optimization, since it doesn’t hurt the cache to migrate a process that’s about to call exec.
- And many more places…
You can check why select_task_rq was called by looking at sd_flag. The possible
values of the flag are enumerated in sched.h:
For instance, sd_flag == SD_BALANCE_FORK whenever select_task_rq is called to
determine the CPU of a newly forked task.
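Here is one way a scheduler could branch on sd_flag. This is a sketch under several assumptions: the flag values are illustrative, the per-CPU load array is invented for the example, and the policy itself (stay put on wakeup, otherwise pick the least loaded CPU) is just one plausible choice, not what any kernel class does.

```c
/* Illustrative values; the real ones are defined in the kernel's sched.h. */
#define SD_BALANCE_EXEC 0x0004
#define SD_BALANCE_FORK 0x0008
#define SD_BALANCE_WAKE 0x0010

#define TOY_NR_CPUS 4

/*
 * Toy select_task_rq: load[] is a hypothetical per-CPU load figure and
 * prev_cpu is the CPU the task last ran on. Returns the chosen CPU.
 */
int toy_select_task_rq(const int load[TOY_NR_CPUS], int prev_cpu, int sd_flag)
{
    int cpu, best = 0;

    /* On wakeup, stay where the task's cache contents may still be warm. */
    if (sd_flag == SD_BALANCE_WAKE)
        return prev_cpu;

    /* On fork or exec there is little cache state to lose, so spread out. */
    for (cpu = 1; cpu < TOY_NR_CPUS; cpu++) {
        if (load[cpu] < load[best])
            best = cpu;
    }
    return best;
}
```

This toy ignores CPU affinity entirely, which a real implementation must not do.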
Note that select_task_rq should return a CPU that p is allowed to run on.
Each task_struct has a member called cpus_allowed, of type cpumask_t. This
member represents the task’s CPU affinity, i.e. which CPUs it can run on. It’s
possible to iterate over these CPUs with the macro for_each_cpu, defined in
include/linux/cpumask.h.
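The iteration pattern can be imitated in userspace with a plain bitmask. In the sketch below, toy_cpumask_t and toy_for_each_cpu are simplified stand-ins I made up to mirror cpumask_t and for_each_cpu; the kernel versions support far more CPUs and have a different implementation. The load array is again a hypothetical input.

```c
#define TOY_NR_CPUS 8

typedef unsigned long toy_cpumask_t;   /* bit n set = CPU n is allowed */

/* Toy analogue of for_each_cpu: visit every CPU whose bit is set in mask. */
#define toy_for_each_cpu(cpu, mask) \
    for ((cpu) = 0; (cpu) < TOY_NR_CPUS; (cpu)++) \
        if ((mask) & (1UL << (cpu)))

/* Return the least loaded CPU in allowed, or -1 if the mask is empty. */
int toy_pick_allowed_cpu(toy_cpumask_t allowed, const int load[TOY_NR_CPUS])
{
    int cpu, best = -1;

    toy_for_each_cpu(cpu, allowed) {
        if (best < 0 || load[cpu] < load[best])
            best = cpu;
    }
    return best;
}
```

For example, with a mask of 0x0E (CPUs 1, 2 and 3 allowed), CPU 0 is never considered even if it is completely idle.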