The Completely Fair Scheduler (CFS) governs CPU allocation in the Linux kernel by tracking virtual execution time. Rather than relying on fixed time slices, CFS maintains a running average of how long each schedulable unit has consumed processor resources. The system continuously selects the entity with the lowest accumulated virtual runtime for execution, ensuring proportional fairness across competing processes.
Every processor maintains its own dedicated execution queue. Runnable entity are distributed acros these per-CPU queues. During each periodic timer interrupt, the kernel updates the active entity’s virtual runtime counter. If the running duration exceeds its allocated slice, or if its virtual runtime diverges significantly from the queue’s minimum, a reschedule flag is triggered. Actual context switching is deferred until the interrupt handler completes and the kernel evaluates the pending flag.
Key Abstractions and Queues
The scheduler operates on schedulable units rather than raw tasks. Each unit tracks its eligibility state and virtual runtime accumulation.
- Execution Unit: Represents a schedulable process or thread group. Contains flags indicating queue membership and a monotonic counter for virtual CPU time.
- Processor Queue: A per-CPU container managing all pending execution units. It holds a reference to the currently executing entity.
- Fair Queue Subsystem: Embedded within the processor queue, this subsystem organizes entities using a min-heap binary search tree. The leftmost node always represents the unit requiring the most immediate CPU access. It maintains pointers to the active entity, the tree root, and a cached reference to the minimum node for O(1) access, alongside a counter for active entities.
Core Data Structures
struct sched_ctrl {
int is_queued;
u64 virt_exec_ns;
};
struct thread_ctx {
struct sched_ctrl scheduler;
};
struct cpu_run_queue {
struct fair_sub_queue fair_q;
struct thread_ctx *active_thread;
};
struct fair_sub_queue {
struct rb_root time_tree;
struct rb_node *cached_min;
struct sched_ctrl *current_entity;
struct cpu_run_queue *parent_queue;
unsigned int active_count;
};
Interrupt-Driven Accounting
The hardware timer triggers the system tick. Deep within the interrupt processing chain, execution reaches the accounting routine responsible for updating runtime metrics.
void process_timer_tick(struct fair_sub_queue *fq)
{
record_execution_time(fq);
if (fq->active_count <= 1)
return;
evaluate_scheduling_threshold(fq, fq->current_entity);
}
The time accumulation function advances the virtual runtime of the active entity. Group scheduling logic is omitted to maintain focus on the core mechanism.
void record_execution_time(struct fair_sub_queue *fq)
{
struct sched_ctrl *active = fq->current_entity;
active->virt_exec_ns += measured_delta;
}
Preemption evaluation compares the consumed time against the ideal slice. If the threshold is breached, or if the active entity's virtual runtime lags behind the queue minimum by more than one slice, a reschedule request is queued.
void evaluate_scheduling_threshold(struct fair_sub_queue *fq, struct sched_ctrl *active)
{
if (elapsed_delta > target_slice) {
mark_reschedule_needed(fq->parent_queue);
return;
}
struct sched_ctrl *next_candidate = retrieve_min_entity(fq);
u64 runtime_gap = active->virt_exec_ns - next_candidate->virt_exec_ns;
if (runtime_gap > target_slice)
mark_reschedule_needed(fq->parent_queue);
}
The flag-setting routine is straightforward. It toggles a per-thread state bit that signals the kernel to invoke the scheduler upon returning to user or kernel space.
void mark_reschedule_needed(struct cpu_run_queue *q)
{
struct thread_ctx *target = q->active_thread;
set_thread_state(target, TIF_RESCHED_REQ);
}
Context Switch Execution
Once the interrupt stack unwinds, the kernel checks the reschedule flag. If set, the main scheduler loop activates. The current thread is prepared for suspension, and the selector identifies the next eligible unit.
void execute_schedule_cycle(void)
{
struct cpu_run_queue *rq = get_local_rq();
struct thread_ctx *prev = rq->active_thread;
prepare_prev_for_sleep(rq, prev);
struct thread_ctx *next = select_next_unit(rq);
perform_context_swap(rq, prev, next);
}
The selection chain traverses from the high-level scheduler down to the fair queue optimizer, ultimately extracting the leftmost node from the time-ordered tree.
struct sched_ctrl *retrieve_min_entity(struct fair_sub_queue *fq)
{
struct rb_node *min_node = fq->cached_min;
if (!min_node)
return NULL;
return container_of(min_node, struct sched_ctrl, tree_link);
}
The tree lookup guarantees that the unit with the lowest virtual runtime is always chosen first, maintaining proportional fairness across the run queue. Once selected, the kernel swaps stack pointers, updates memory maps, and resumes execution on the newly elected thread.