Understanding sched_class
This article is Part 3 in a 3-Part Series.
- Part 1 - Linux Scheduler Series: Introduction
- Part 2 - What is the Linux Scheduler?
- Part 3 - Understanding sched_class
I’ve skipped a bunch of stuff to get here because the scheduler assignment is due soon. In this section, I will analyze struct sched_class and talk briefly about what each function does. I’ve reproduced struct sched_class below.
struct sched_class {
        const struct sched_class *next;

        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*yield_task) (struct rq *rq);
        bool (*yield_to_task) (struct rq *rq, struct task_struct *p,
                               bool preempt);

        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p,
                                    int flags);

        /*
         * It is the responsibility of the pick_next_task() method that will
         * return the next task to call put_prev_task() on the @prev task or
         * something equivalent.
         *
         * May return RETRY_TASK when it finds a higher prio class has runnable
         * tasks.
         */
        struct task_struct * (*pick_next_task) (struct rq *rq,
                                                struct task_struct *prev,
                                                struct pin_cookie cookie);
        void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
        int  (*select_task_rq)(struct task_struct *p, int task_cpu,
                               int sd_flag, int flags);
        void (*migrate_task_rq)(struct task_struct *p);

        void (*task_woken) (struct rq *this_rq, struct task_struct *task);

        void (*set_cpus_allowed)(struct task_struct *p,
                                 const struct cpumask *newmask);

        void (*rq_online)(struct rq *rq);
        void (*rq_offline)(struct rq *rq);
#endif

        void (*set_curr_task) (struct rq *rq);
        void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
        void (*task_fork) (struct task_struct *p);
        void (*task_dead) (struct task_struct *p);

        /*
         * The switched_from() call is allowed to drop rq->lock, therefore we
         * cannot assume the switched_from/switched_to pair is serialized by
         * rq->lock. They are however serialized by p->pi_lock.
         */
        void (*switched_from) (struct rq *this_rq, struct task_struct *task);
        void (*switched_to) (struct rq *this_rq, struct task_struct *task);
        void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                              int oldprio);

        unsigned int (*get_rr_interval) (struct rq *rq,
                                         struct task_struct *task);

        void (*update_curr) (struct rq *rq);

#define TASK_SET_GROUP  0
#define TASK_MOVE_GROUP 1

#ifdef CONFIG_FAIR_GROUP_SCHED
        void (*task_change_group) (struct task_struct *p, int type);
#endif
};
enqueue_task and dequeue_task
/* Called to enqueue task_struct p on runqueue rq. */
void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
/* Called to dequeue task_struct p from runqueue rq. */
void dequeue_task(struct rq *rq, struct task_struct *p, int flags);
enqueue_task and dequeue_task are used to put a task on the runqueue and remove a task from the runqueue, respectively. Each of these functions is passed the task to be enqueued/dequeued, as well as the runqueue it should be added to or removed from. In addition, these functions are given a bit vector of flags that describes why enqueue or dequeue is being called. Here are the various flags, which are described in sched.h:
/*
 * {de,en}queue flags:
 *
 * DEQUEUE_SLEEP  - task is no longer runnable
 * ENQUEUE_WAKEUP - task just became runnable
 *
 * SAVE/RESTORE - an otherwise spurious dequeue/enqueue, done to ensure tasks
 *                are in a known state which allows modification. Such pairs
 *                should preserve as much state as possible.
 *
 * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
 *        in the runqueue.
 *
 * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
 * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
 * ENQUEUE_MIGRATED  - the task was migrated during wakeup
 *
 */
The flags argument can be tested using the bitwise & operator. For example, if the task was just migrated from another CPU, flags & ENQUEUE_MIGRATED evaluates to a nonzero (true) value.
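As a quick illustration, here is a minimal, hypothetical helper (not a real kernel function) that decodes the flags it is given:

/* Hypothetical helper: decode enqueue flags with bitwise AND. */
static void demo_show_enqueue_flags(int flags)
{
        if (flags & ENQUEUE_WAKEUP)
                printk(KERN_DEBUG "task just became runnable\n");
        if (flags & ENQUEUE_MIGRATED)
                printk(KERN_DEBUG "task was migrated during wakeup\n");
        if (flags & ENQUEUE_HEAD)
                printk(KERN_DEBUG "task goes at the front of the queue\n");
}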
These functions are called for a variety of reasons:
- When a child process is first forked, enqueue_task is called to put it on a runqueue. When a process exits, dequeue_task takes it off the runqueue.
- When a process goes to sleep, dequeue_task takes it off the runqueue. For example, this happens when the process needs to wait for a lock or IO event. When the IO event occurs, or the lock becomes available, the process wakes up. It must then be re-enqueued with enqueue_task.
- Process migration - if a process must be migrated from one CPU’s runqueue to another, it’s dequeued from its old runqueue and enqueued on the new one.
- When set_cpus_allowed is called to change the task’s processor affinity, it may need to be enqueued on a different CPU’s runqueue.
- When the priority of a process is boosted to avoid priority inversion. In this case, p used to have a low-priority sched_class, but is being promoted to a sched_class with high priority. This action occurs in rt_mutex_setprio.
- From __sched_setscheduler. If a task’s sched_class has changed, it’s dequeued using its old sched_class and enqueued with the new one.
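To make this concrete, below is a minimal sketch of what an enqueue/dequeue pair might look like for a hypothetical scheduling class that keeps its runnable tasks on a plain linked list. The demo_* names are made up for illustration; real implementations like CFS’s enqueue_task_fair do far more bookkeeping.

/* Hypothetical fields: rq->demo_head is a list head embedded in the
 * runqueue, p->demo_list is a list node embedded in the task_struct. */
static void enqueue_task_demo(struct rq *rq, struct task_struct *p, int flags)
{
        if (flags & ENQUEUE_HEAD)
                list_add(&p->demo_list, &rq->demo_head);      /* front */
        else
                list_add_tail(&p->demo_list, &rq->demo_head); /* tail (default) */
        add_nr_running(rq, 1); /* keep the core scheduler's count in sync */
}

static void dequeue_task_demo(struct rq *rq, struct task_struct *p, int flags)
{
        list_del_init(&p->demo_list);
        sub_nr_running(rq, 1);
}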
pick_next_task
/* Pick the task that should be currently running. */
struct task_struct *pick_next_task(struct rq *rq, struct task_struct *prev,
                                   struct pin_cookie cookie);
pick_next_task is called by the core scheduler to determine which of rq’s tasks should be running. The name is a bit misleading: this function is not supposed to return the task that should run after the currently running task; instead, it’s supposed to return the task_struct that should be running now, in this instant. The kernel will context switch from the task specified by prev to the task returned by pick_next_task. (A sketch combining pick_next_task with put_prev_task appears at the end of the next subsection.)
put_prev_task
/* Called right before p is going to be taken off the CPU. */
void put_prev_task(struct rq *rq, struct task_struct *p);
put_prev_task is called whenever a task is to be taken off the CPU. The behavior of this function is up to the specific sched_class. Some schedulers do very little in this function. For example, the realtime scheduler uses this function as an opportunity to perform simple bookkeeping. On the other hand, CFS’s put_prev_task_fair needs to do a bit more work. As an optimization, CFS keeps the currently running task out of its RB tree. It uses the put_prev_task hook as an opportunity to put the currently running task (that is, the task specified by p) back in the RB tree.
The sched_class’s put_prev_task hook is invoked via a wrapper function, also named put_prev_task, which is defined in sched.h.
It seems a bit silly, but the sched_class’s pick_next_task is expected to call put_prev_task by itself! This is documented in the following comment:
/*
* It is the responsibility of the pick_next_task() method that will
* return the next task to call put_prev_task() on the @prev task or
* something equivalent.
*/
Note that this was not the case in prior kernels; put_prev_task used to be called by the core scheduler before it called pick_next_task.
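Putting the two together, here is a minimal sketch of a pick_next_task that honors this contract, reusing the hypothetical list-based class from above (the demo_* fields are still made up):

static struct task_struct *
pick_next_task_demo(struct rq *rq, struct task_struct *prev,
                    struct pin_cookie cookie)
{
        if (list_empty(&rq->demo_head))
                return NULL; /* nothing to run; a lower class will pick */

        /* Honor the contract: call put_prev_task() on @prev before
         * returning the task that should run now. */
        put_prev_task(rq, prev);

        return list_first_entry(&rq->demo_head, struct task_struct,
                                demo_list);
}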
task_tick
/* Called from the timer interrupt handler. p is the currently running task
* and rq is the runqueue that it's on.
*/
void task_tick(struct rq *rq, struct task_struct *p, int queued);
This is one of the most important scheduler functions. It is called whenever a timer interrupt happens, and its job is to perform bookkeeping and set the need_resched flag if the currently-running process needs to be preempted.
The need_resched flag can be set by the function resched_curr, found in core.c:
/* Mark rq's currently-running task to be rescheduled. */
void resched_curr(struct rq *rq)
With SMP, there’s a need_resched flag for every CPU. Thus, resched_curr might involve sending an APIC inter-processor interrupt to another processor (you don’t want to go here). The takeaway is that you should just use resched_curr to set need_resched, and don’t try to do this yourself.
Note: in prior kernel versions, resched_curr used to be called resched_task.
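For example, a round-robin-style class might implement task_tick along these lines. The demo_time_slice field and DEMO_TIMESLICE constant are made-up names for illustration; the real round-robin logic lives in task_tick_rt in rt.c.

static void task_tick_demo(struct rq *rq, struct task_struct *p, int queued)
{
        if (--p->demo_time_slice > 0)
                return;                         /* slice not used up yet */

        p->demo_time_slice = DEMO_TIMESLICE;    /* refill the slice */
        resched_curr(rq);                       /* mark p to be preempted */
}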
select_task_rq
/* Returns an integer corresponding to the CPU that this task should run on */
int select_task_rq(struct task_struct *p, int task_cpu, int sd_flag, int flags);
The core scheduler invokes this function to figure out which CPU to assign a task to. This is used for distributing processes across multiple CPUs; the core scheduler will call enqueue_task, passing the runqueue corresponding to the CPU that is returned by this function. CPU assignment obviously occurs when a process is first forked, but CPU reassignment can happen for a large variety of reasons. Here are some instances where select_task_rq is called:
- When a process is first forked.
- When a task is woken up after having gone to sleep.
- In response to any of the syscalls in the execv family. This is an optimization, since it doesn’t hurt the cache to migrate a process that’s about to call exec.
- And many more places…
You can check why select_task_rq was called by looking at sd_flag. The possible values of the flag are enumerated in sched.h:
#define SD_LOAD_BALANCE         0x0001  /* Do load balancing on this domain. */
#define SD_BALANCE_NEWIDLE      0x0002  /* Balance when about to become idle */
#define SD_BALANCE_EXEC         0x0004  /* Balance on exec */
#define SD_BALANCE_FORK         0x0008  /* Balance on fork, clone */
#define SD_BALANCE_WAKE         0x0010  /* Balance on wakeup */
#define SD_WAKE_AFFINE          0x0020  /* Wake task to waking CPU */
#define SD_SHARE_CPUCAPACITY    0x0080  /* Domain members share cpu power */
#define SD_SHARE_POWERDOMAIN    0x0100  /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES  0x0200  /* Domain members share cpu pkg resources */
#define SD_SERIALIZE            0x0400  /* Only a single load balancing instance */
#define SD_ASYM_PACKING         0x0800  /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING       0x1000  /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP              0x2000  /* sched_domains of this level overlap */
#define SD_NUMA                 0x4000  /* cross-node balancing */
For instance, sd_flag == SD_BALANCE_FORK whenever select_task_rq is called to determine the CPU of a newly forked task.
Note that select_task_rq should return a CPU that p is allowed to run on. Each task_struct has a member called cpus_allowed, of type cpumask_t. This member represents the task’s CPU affinity - i.e. which CPUs it can run on. It’s possible to iterate over these CPUs with the macro for_each_cpu, defined in cpumask.h.
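As an illustration, a hypothetical select_task_rq (the demo name is made up) might handle forks by picking the first CPU the task is allowed to run on, and leave the task where it is otherwise:

static int select_task_rq_demo(struct task_struct *p, int task_cpu,
                               int sd_flag, int flags)
{
        int cpu;

        if (sd_flag == SD_BALANCE_FORK) {
                /* Only return a CPU in p's affinity mask. */
                for_each_cpu(cpu, &p->cpus_allowed)
                        return cpu;
        }
        return task_cpu;        /* default: stay on the current CPU */
}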
set_curr_task
/* Called when a task changes its scheduling class or changes its task group. */
void set_curr_task(struct rq *rq);