Kubernetes Scheduler Priority Framework Deep Dive

The Kubernetes scheduler's priority phase assigns a numerical score to each node that passed predicate filtering, so the scheduler can pick the best-scoring node for pod placement. Predicates enforce hard constraints; priority functions apply soft scoring logic, and the scores from multiple functions are weighted and summed.

Core Execution Flow: PrioritizeNodes

The entry point resides in pkg/scheduler/core/generic_scheduler.go:

priorityList, err := PrioritizeNodes(
    pod,
    g.cachedNodeInfoMap,
    metaPrioritiesInterface,
    g.prioritizers,
    filteredNodes,
    g.extenders,
)

This function orchestrates scoring across configured priority plugins. Its signature reveals key responsibilities:

func PrioritizeNodes(
    pod *v1.Pod,
    nodeNameToInfo map[string]*schedulercache.NodeInfo,
    meta interface{},
    priorityConfigs []algorithm.PriorityConfig,
    nodes []*v1.Node,
    extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error)

  • pod: Target workload being scheduled.
  • nodeNameToInfo: Cached node metadata including labels, resources, and running pods.
  • meta: Precomputed contextual data (e.g., the pod's affinity rules, tolerations, and non-zero resource requests) to avoid redundant calculations during parallel execution.
  • priorityConfigs: List of registered priority plugins, each with name, weight, and execution logic.
  • nodes: Subset of cluster nodes that passed all predicate checks.
  • extenders: External scheduler extenders whose scores are added to those of the built-in plugins.

The return type schedulerapi.HostPriorityList is a slice of HostPriority structs:

type HostPriority struct {
    Host  string
    Score int
}
type HostPriorityList []HostPriority

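Inside PrioritizeNodes, each plugin fills one such list, where every element is one node’s raw or normalized score from that plugin. The list that PrioritizeNodes returns has the same shape, but each element holds the node's weighted total across all plugins.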
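
To make the data shape concrete, here is a minimal, self-contained sketch of how a caller might pick the highest-scoring host from such a list. The types are local stand-ins for the schedulerapi structs, and pickBestHost is a hypothetical helper written for this example, not the scheduler's own selection routine; it simply keeps the first host when several tie.

```go
package main

import (
	"errors"
	"fmt"
)

// HostPriority and HostPriorityList are local stand-ins mirroring the
// structs shown above, so the example compiles on its own.
type HostPriority struct {
	Host  string
	Score int
}

type HostPriorityList []HostPriority

// pickBestHost is a hypothetical helper: it returns the host with the
// highest score, keeping the first one seen when several hosts tie.
func pickBestHost(list HostPriorityList) (string, error) {
	if len(list) == 0 {
		return "", errors.New("empty priority list")
	}
	best := list[0]
	for _, hp := range list[1:] {
		if hp.Score > best.Score {
			best = hp
		}
	}
	return best.Host, nil
}

func main() {
	list := HostPriorityList{
		{Host: "node-a", Score: 12},
		{Host: "node-b", Score: 27},
		{Host: "node-c", Score: 27},
	}
	host, err := pickBestHost(list)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(host) // node-b: the first of the two hosts tied at 27
}
```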

Execution Strategy: Legacy vs. Map-Reduce

Kubernetes supports two implementation patterns for priority plugins:

Legacy Function Pattern

Deprecated but retained for backward compatibility, this pattern uses a monolithic function that receives every candidate node in a single call and returns the complete score list:

type PriorityFunction func(
    pod *v1.Pod,
    nodeNameToInfo map[string]*schedulercache.NodeInfo,
    nodes []*v1.Node,
) (schedulerapi.HostPriorityList, error)

An example is InterPodAffinityPriority, which computes inter-pod affinity scores by counting colocated matching pods and normalizing results into the [0,10] range using min-max scaling:

// Simplified logic
for _, node := range nodes {
    count := countMatchingAffinityPods(pod, node)
    // Normalize to [0,10]. If every node has the same count, the score
    // stays 0, which also avoids dividing by zero.
    score := 0
    if maxCount > minCount {
        score = int(float64(schedulerapi.MaxPriority) *
            float64(count-minCount) / float64(maxCount-minCount))
    }
    result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: score})
}
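
As a worked example of the min-max scaling described above, the following self-contained sketch scales a slice of raw counts into [0,10]. The function name minMaxScale and the local maxPriority constant are assumptions made for illustration; the real priority function works on per-node maps rather than a plain slice.

```go
package main

import "fmt"

// maxPriority is a local stand-in for schedulerapi.MaxPriority.
const maxPriority = 10

// minMaxScale maps each raw count into [0, maxPriority] using min-max
// scaling. If every count is identical, all scores are left at 0.
func minMaxScale(counts []float64) []int {
	scores := make([]int, len(counts))
	if len(counts) == 0 {
		return scores
	}
	minCount, maxCount := counts[0], counts[0]
	for _, c := range counts[1:] {
		if c < minCount {
			minCount = c
		}
		if c > maxCount {
			maxCount = c
		}
	}
	spread := maxCount - minCount
	if spread == 0 {
		return scores // nothing to distinguish the nodes; avoid dividing by zero
	}
	for i, c := range counts {
		scores[i] = int(maxPriority * (c - minCount) / spread)
	}
	return scores
}

func main() {
	fmt.Println(minMaxScale([]float64{2, 5, 8})) // [0 5 10]
	fmt.Println(minMaxScale([]float64{4, 4, 4})) // [0 0 0]
}
```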

Modern Map-Reduce Pattern

The preferred approach decouples computation into two phases:

  • Map: Runs per-node in parallel; accepts *v1.Pod, meta, and *schedulercache.NodeInfo; returns a single HostPriority.
  • Reduce: Runs once per plugin after all Map tasks complete; accepts the full HostPriorityList and applies normalization or aggregation.
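
The two phases can be sketched generically as follows. This is a simplified illustration, not the scheduler's code: mapFunc, reduceFunc, and runPlugin are hypothetical names, nodes are plain strings, and one goroutine per node stands in for the bounded worker pool a real scheduler would use.

```go
package main

import (
	"fmt"
	"sync"
)

type HostPriority struct {
	Host  string
	Score int
}

type HostPriorityList []HostPriority

// mapFunc and reduceFunc are simplified, hypothetical signatures; the real
// ones also receive the pod, the metadata, and cached node info.
type mapFunc func(node string) (HostPriority, error)
type reduceFunc func(result HostPriorityList) error

// runPlugin scores every node concurrently with mapFn, then calls reduceFn
// once over the complete list.
func runPlugin(nodes []string, mapFn mapFunc, reduceFn reduceFunc) (HostPriorityList, error) {
	result := make(HostPriorityList, len(nodes))
	errs := make([]error, len(nodes))

	var wg sync.WaitGroup
	for i, node := range nodes {
		wg.Add(1)
		go func(i int, node string) {
			defer wg.Done()
			// Each goroutine writes only to its own index, so no lock is needed.
			result[i], errs[i] = mapFn(node)
		}(i, node)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	if reduceFn != nil {
		if err := reduceFn(result); err != nil {
			return nil, err
		}
	}
	return result, nil
}

func main() {
	labelMatches := map[string]int{"node-a": 1, "node-b": 4, "node-c": 2}
	mapFn := func(node string) (HostPriority, error) {
		return HostPriority{Host: node, Score: labelMatches[node]}, nil
	}
	// Reduce: cap every raw score at 3 so no node dominates.
	reduceFn := func(result HostPriorityList) error {
		for i := range result {
			if result[i].Score > 3 {
				result[i].Score = 3
			}
		}
		return nil
	}
	out, err := runPlugin([]string{"node-a", "node-b", "node-c"}, mapFn, reduceFn)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // [{node-a 1} {node-b 3} {node-c 2}]
}
```

Because Reduce sees the whole list, it is the natural place for anything that depends on other nodes' scores, which a per-node Map call cannot know.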

A concrete implementation:

// Map: Node affinity scoring based on label matches
func CalculateNodeAffinityPriorityMap(
    pod *v1.Pod,
    meta interface{},
    nodeInfo *schedulercache.NodeInfo,
) (schedulerapi.HostPriority, error) {
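    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    affinity := getEffectiveAffinity(pod, meta)
    var score int32

    // A pod without affinity, or without node affinity, scores 0 on every node.
    if affinity != nil && affinity.NodeAffinity != nil {
        for _, term := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
            if term.Weight == 0 {
                continue
            }
            selector, err := v1helper.NodeSelectorRequirementsAsSelector(term.Preference.MatchExpressions)
            if err != nil {
                return schedulerapi.HostPriority{}, err
            }
            if selector.Matches(labels.Set(node.Labels)) {
                score += term.Weight
            }
        }
    }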

    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: int(score),
    }, nil
}

// Reduce: Normalize raw scores to [0,10]
var CalculateNodeAffinityPriorityReduce = NormalizeReduce(schedulerapi.MaxPriority, false)

func NormalizeReduce(maxPriority int, reverse bool) algorithm.PriorityReduceFunction {
    return func(_ *v1.Pod, _ interface{}, _ map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error {
        maxScore := 0
        for _, hp := range result {
            if hp.Score > maxScore { maxScore = hp.Score }
        }

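        if maxScore == 0 {
            // Nothing to scale. In reverse mode a raw score of 0 is the best
            // possible result, so every node gets the top score.
            if reverse {
                for i := range result {
                    result[i].Score = maxPriority
                }
            }
            return nil
        }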

        for i := range result {
            normalized := maxPriority * result[i].Score / maxScore
            if reverse { normalized = maxPriority - normalized }
            result[i].Score = normalized
        }
        return nil
    }
}
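
To see what the normalization does to concrete numbers, here is a self-contained sketch with the unused pod, meta, and node-info parameters dropped. The function normalize is a hypothetical simplification written for this example. In this sketch, when every raw score is zero, reverse mode assigns the maximum to all nodes and forward mode leaves the zeros in place.

```go
package main

import "fmt"

type HostPriority struct {
	Host  string
	Score int
}

type HostPriorityList []HostPriority

// normalize rescales scores in place so the highest raw score becomes
// maxPriority. With reverse set, the scale is flipped so the lowest raw
// score ends up with the highest normalized score.
func normalize(maxPriority int, reverse bool, result HostPriorityList) {
	maxScore := 0
	for _, hp := range result {
		if hp.Score > maxScore {
			maxScore = hp.Score
		}
	}
	if maxScore == 0 {
		if reverse {
			for i := range result {
				result[i].Score = maxPriority
			}
		}
		return
	}
	for i := range result {
		// Integer division truncates: 10*1/3 is 3, not 3.33.
		normalized := maxPriority * result[i].Score / maxScore
		if reverse {
			normalized = maxPriority - normalized
		}
		result[i].Score = normalized
	}
}

func main() {
	forward := HostPriorityList{{"node-a", 1}, {"node-b", 3}, {"node-c", 0}}
	normalize(10, false, forward)
	fmt.Println(forward) // [{node-a 3} {node-b 10} {node-c 0}]

	reversed := HostPriorityList{{"node-a", 1}, {"node-b", 3}, {"node-c", 0}}
	normalize(10, true, reversed)
	fmt.Println(reversed) // [{node-a 7} {node-b 0} {node-c 10}]
}
```

The reverse form suits plugins whose raw value counts something undesirable, where a lower count should translate into a higher score.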

Result Aggregation

After all plugins finish (via goroutines and wait groups), final scores are computed by combining weighted outputs:

result := make(schedulerapi.HostPriorityList, len(nodes))
for i := range nodes {
    result[i] = schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0}
    for j := range priorityConfigs {
        result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
    }
}

Here, results[j][i] denotes the score assigned to node i by plugin j. Weights let administrators tune each plugin's influence: for instance, doubling the weight of LeastRequestedPriority makes free capacity count twice as much in the final sum as a plugin left at weight 1.
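
A small worked example of the weighted sum, using local stand-in types and made-up scores. The helper name combine and the two anonymous plugins are assumptions for illustration only.

```go
package main

import "fmt"

type HostPriority struct {
	Host  string
	Score int
}

type HostPriorityList []HostPriority

// combine sums the per-plugin scores for each node, scaled by plugin weight.
// results[j][i] is plugin j's score for node i; weights[j] is plugin j's weight.
func combine(nodes []string, results []HostPriorityList, weights []int) HostPriorityList {
	final := make(HostPriorityList, len(nodes))
	for i, name := range nodes {
		final[i].Host = name
		for j := range results {
			final[i].Score += results[j][i].Score * weights[j]
		}
	}
	return final
}

func main() {
	nodes := []string{"node-a", "node-b"}
	results := []HostPriorityList{
		{{Host: "node-a", Score: 10}, {Host: "node-b", Score: 4}}, // plugin 0
		{{Host: "node-a", Score: 2}, {Host: "node-b", Score: 7}},  // plugin 1
	}

	fmt.Println(combine(nodes, results, []int{1, 1})) // [{node-a 12} {node-b 11}]
	fmt.Println(combine(nodes, results, []int{1, 2})) // [{node-a 14} {node-b 18}]
}
```

With equal weights node-a wins by one point; doubling the second plugin's weight flips the outcome to node-b, which is the kind of tuning the weights exist for.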

Key Design Insights

  • Normalization is mandatory: Each plugin's scores must fall within [0,10] before weighting; unbounded intermediate values (e.g., raw match counts) are acceptable only when paired with a Reduce step. The weighted total across plugins can exceed 10.
  • Concurrency model: Map operations execute in parallel across nodes (using workqueue.ParallelizeUntil), while each plugin's Reduce runs once, over that plugin's complete score list, after its Map calls finish.
  • Metadata reuse: The meta parameter enables expensive precomputation (e.g., parsing affinity rules) once per scheduling cycle rather than per node.
  • Extensibility: Extenders can inject custom scoring logic outside the core scheduler binary via HTTP callbacks.
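
The metadata-reuse point can be illustrated with a type assertion and a fallback path. Everything here is a hypothetical stand-in: priorityMeta models a single precomputed field, parsePreferredLabels pretends to be the expensive step, and the pod is reduced to a string.

```go
package main

import "fmt"

// priorityMeta is a hypothetical stand-in for the scheduler's precomputed
// metadata; only one field is modeled here.
type priorityMeta struct {
	preferredLabels map[string]string
}

// parseCalls counts how often the expensive path runs (illustration only).
var parseCalls int

// parsePreferredLabels pretends to be an expensive computation on the pod.
func parsePreferredLabels(podZone string) map[string]string {
	parseCalls++
	return map[string]string{"zone": podZone}
}

// preferredLabelsFor uses the precomputed metadata when it has the expected
// type and falls back to recomputing when it does not.
func preferredLabelsFor(podZone string, meta interface{}) map[string]string {
	if m, ok := meta.(*priorityMeta); ok && m != nil {
		return m.preferredLabels
	}
	return parsePreferredLabels(podZone)
}

func main() {
	// Computed once per scheduling cycle, before any node is scored.
	meta := &priorityMeta{preferredLabels: parsePreferredLabels("us-east-1a")}

	for _, node := range []string{"node-a", "node-b", "node-c"} {
		labels := preferredLabelsFor("us-east-1a", meta)
		fmt.Println(node, labels["zone"])
	}
	fmt.Println("parse calls:", parseCalls) // parse calls: 1
}
```

Passing meta as interface{} keeps the Map signature generic, at the cost of a runtime type assertion in every plugin that wants to use it.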

Posted on Mon, 01 Jun 2026 17:06:41 +0000 by mitchell_1078