Kubernetes Scheduling Mechanisms: Affinity, Taints, and List-Watch

Kubernetes relies on a decoupled, event-driven architecture to manage workload placement across cluster nodes. At the heart of this system lies the List-Watch mechanism, which enables components to react to state changes in real time without polling. This mechanism allows the API Server to notify components like the Scheduler, Controller Manager, and kubelet whenever resources such as Pods are created, updated, or deleted—ensuring consistent cluster state without tight coupling.
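
To make the list-then-watch pattern concrete, here is a minimal in-memory sketch. It is not the real client library: FakeApiServer, LocalCache, list_pods, and watch_pods are hypothetical names invented for illustration, and a real watch is a long-lived streaming connection rather than a function call. The idea it tries to show is that a component takes one full snapshot, remembers the resource version of that snapshot, and afterwards applies only the events that happened later.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str               # "ADDED", "MODIFIED", or "DELETED"
    name: str
    obj: dict
    resource_version: int


@dataclass
class FakeApiServer:
    """Hypothetical in-memory stand-in for the API Server (assumption, not a real client)."""
    objects: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    version: int = 0

    def apply(self, kind, name, obj=None):
        # Every change bumps the resource version and is appended to the event log.
        self.version += 1
        if kind == "DELETED":
            obj = self.objects.pop(name)
        else:
            self.objects[name] = obj
        self.log.append(Event(kind, name, obj, self.version))

    def list_pods(self):
        # LIST: a full snapshot plus the version that snapshot corresponds to.
        return dict(self.objects), self.version

    def watch_pods(self, since):
        # WATCH: only the events recorded after `since`.
        return [e for e in self.log if e.resource_version > since]


class LocalCache:
    """Keeps a local copy in sync: one LIST, then incremental WATCH events."""

    def __init__(self, server):
        self.server = server
        self.cache, self.version = server.list_pods()

    def sync(self):
        for event in self.server.watch_pods(self.version):
            if event.kind == "DELETED":
                self.cache.pop(event.name, None)
            else:
                self.cache[event.name] = event.obj
            self.version = event.resource_version


server = FakeApiServer()
server.apply("ADDED", "web-1", {"nodeName": None})
local = LocalCache(server)                          # initial LIST sees web-1
server.apply("MODIFIED", "web-1", {"nodeName": "node01"})
server.apply("ADDED", "web-2", {"nodeName": None})
local.sync()                                        # WATCH delivers the two later events

# Pods without a nodeName are the ones a scheduler would care about.
unscheduled = sorted(n for n, pod in local.cache.items() if pod["nodeName"] is None)
print(unscheduled)  # ['web-2']
```

Tracking the resource version is what lets the watcher resume from where the snapshot ended instead of re-listing everything on each change.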

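When a user submits a Pod manifest via kubectl, the API Server persists the object in etcd and emits a creation event. The Scheduler, continuously watching for Pods without an assigned nodeName, picks up this event and initiates the scheduling process. Scheduling runs in two phases: Predicates (filtering) and Priorities (scoring). Recent scheduler releases implement the same two phases as Filter and Score plugins, so the predicate and priority names below are the legacy policy names.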

Filtering Nodes with Predicate Policies

During the Predicate phase, the Scheduler evaluates candidate nodes against a set of constraints:

  • PodFitsResources: Verifies sufficient CPU, memory, and extended resources are available.
  • PodFitsHostPorts: Ensures requested host ports are not already in use.
  • PodSelectorMatches: Confirms the node's labels satisfy the Pod’s nodeSelector and any required node affinity.
  • NoDiskConflict: Prevents mounting conflicting volumes unless both are read-only.

If no node passes all predicates, the Pod remains in Pending state until a suitable node becomes available.
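
The filtering step can be sketched as a list of boolean functions that must all return true for a node to stay in the running. This is a simplified model, not the scheduler's code: the Node and Pod classes, their field names, and the three predicate functions are assumptions made for illustration, and volume conflicts are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                  # CPU not yet requested, in millicores
    free_mem_mi: int                 # memory not yet requested, in MiB
    labels: dict = field(default_factory=dict)
    used_host_ports: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    host_ports: set = field(default_factory=set)
    node_selector: dict = field(default_factory=dict)


def fits_resources(pod, node):
    # Modeled on PodFitsResources: the requests must fit in what is still free.
    return pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi


def fits_host_ports(pod, node):
    # Modeled on PodFitsHostPorts: no requested host port may already be taken.
    return pod.host_ports.isdisjoint(node.used_host_ports)


def selector_matches(pod, node):
    # Modeled on PodSelectorMatches: every nodeSelector pair must be on the node.
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


PREDICATES = [fits_resources, fits_host_ports, selector_matches]


def feasible_nodes(pod, nodes):
    """Return the names of nodes that pass every predicate."""
    return [n.name for n in nodes if all(p(pod, n) for p in PREDICATES)]


nodes = [
    Node("node01", 500, 1024, {"disk": "ssd"}),             # too little CPU and memory
    Node("node02", 4000, 8192, {"disk": "ssd"}, {8080}),    # host port 8080 already used
    Node("node03", 4000, 8192, {"disk": "ssd"}),            # passes everything
    Node("node04", 4000, 8192, {"disk": "hdd"}),            # label mismatch
]
pod = Pod(cpu_m=1000, mem_mi=2048, host_ports={8080}, node_selector={"disk": "ssd"})
print(feasible_nodes(pod, nodes))  # ['node03']
```

An empty result corresponds to the Pending case described above: nothing is bound until some node's state changes.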

Scoring Nodes with Priority Functions

Nodes that pass filtering are scored using priority functions to determine optimal placement:

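  • LeastRequestedPriority: Favors nodes where a smaller share of allocatable CPU and memory is already requested. The score is based on Pod resource requests, not on measured usage.
  • BalancedResourceAllocation: Prefers nodes where the requested fractions of CPU and memory are close to each other.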
  • ImageLocalityPriority: Rewards nodes that already have the required container images cached.

The final score for each node is a weighted sum of all priority functions, and the highest-scoring node is selected for binding.
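
As a worked example of the weighted sum, the sketch below scores two nodes with simplified versions of two of the priorities above. The formulas follow the legacy 0 to 10 scale as commonly documented; the exact formulas and score range differ between scheduler versions, so treat the numbers as illustrative. The dictionary keys and the equal weights of 1 are assumptions.

```python
def least_requested(requested, allocatable):
    """0-10 score for one resource: more unrequested capacity scores higher."""
    return (allocatable - requested) * 10 / allocatable


def least_requested_priority(node):
    # Average of the CPU and memory scores.
    cpu = least_requested(node["cpu_req"], node["cpu_alloc"])
    mem = least_requested(node["mem_req"], node["mem_alloc"])
    return (cpu + mem) / 2


def balanced_resource_allocation(node):
    # 10 when the CPU and memory fractions are equal, lower as they diverge.
    cpu_fraction = node["cpu_req"] / node["cpu_alloc"]
    mem_fraction = node["mem_req"] / node["mem_alloc"]
    return 10 - abs(cpu_fraction - mem_fraction) * 10


# (priority function, weight) pairs; the weights here are assumed to be 1.
WEIGHTED_PRIORITIES = [
    (least_requested_priority, 1),
    (balanced_resource_allocation, 1),
]


def total_score(node):
    """Weighted sum of all priority functions for one node."""
    return sum(weight * fn(node) for fn, weight in WEIGHTED_PRIORITIES)


def pick_node(nodes):
    """Name of the highest-scoring node (ties go to the first one listed here)."""
    return max(nodes, key=total_score)["name"]


# Requests are in millicores and MiB and already include the incoming Pod.
nodes = [
    {"name": "node01", "cpu_req": 2000, "cpu_alloc": 4000, "mem_req": 4096, "mem_alloc": 8192},
    {"name": "node02", "cpu_req": 1000, "cpu_alloc": 4000, "mem_req": 6144, "mem_alloc": 8192},
]
for n in nodes:
    print(n["name"], total_score(n))  # node01 15.0, then node02 10.0
print(pick_node(nodes))               # node01
```

Both nodes get the same LeastRequested score of 5.0, so the balanced-allocation term decides: node01 requests half of each resource, while node02 is lopsided toward memory.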

Node Affinity: Directing Pod Placement

Node affinity allows fine-grained control over which nodes a Pod can land on, using label selectors with two modes:

Hard Affinity (required)

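Pods will only schedule onto a node that satisfies the rule; otherwise they stay Pending. Within one nodeSelectorTerms entry, every matchExpressions item must match (AND). When several nodeSelectorTerms entries are listed, matching any one of them is enough (OR).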

apiVersion: v1
kind: Pod
metadata:
  name: hard-affinity-pod
spec:
  containers:
  - name: app
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-west-1a

Soft Affinity (preferred)

Nodes matching preferred rules are prioritized, but scheduling proceeds even if none match.

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      preference:
        matchExpressions:
        - key: environment
          operator: In
          values:
          - production
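
The weight field (1 to 100) only matters relative to other preferences: for each candidate node, the scheduler adds up the weights of the preferred terms that node satisfies and folds that sum into the node's overall score. A rough sketch of that bookkeeping follows. The second preference on a disktype label and the node labels are hypothetical, and each preference is reduced to a single In expression to keep the example short.

```python
# Each preference is simplified to one "key In values" expression.
preferences = [
    {"weight": 80, "key": "environment", "values": ["production"]},
    {"weight": 20, "key": "disktype", "values": ["ssd"]},   # hypothetical second rule
]


def preference_score(node_labels, preferences):
    """Sum the weights of all preferences that the node's labels satisfy."""
    return sum(
        p["weight"]
        for p in preferences
        if node_labels.get(p["key"]) in p["values"]
    )


nodes = {
    "node01": {"environment": "production"},
    "node02": {"environment": "staging", "disktype": "ssd"},
    "node03": {},
    "node04": {"environment": "production", "disktype": "ssd"},
}
scores = {name: preference_score(labels, preferences) for name, labels in nodes.items()}
print(scores)  # {'node01': 80, 'node02': 20, 'node03': 0, 'node04': 100}
```

Note that node03 scores 0 but is still a valid target, which is the difference from the required form.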

Pod Affinity and Anti-Affinity: Co-Location and Isolation

Unlike node affinity, Pod affinity operates based on labels of other running Pods. This enables patterns like co-locating services or avoiding single points of failure.

Pod Affinity (Co-locate)

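Ensures new Pods are scheduled into the same topology domain (here, the same zone) as Pods with matching labels. The target node does not have to be the exact node hosting those Pods; it only has to share the same topologyKey value.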

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - cache-server
      topologyKey: topology.kubernetes.io/zone

Pod Anti-Affinity (Isolate)

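Keeps new Pods out of topology domains (here, individual nodes) that already host Pods with the specified labels, improving resilience. The preferred form shown below makes this a strong preference; use requiredDuringSchedulingIgnoredDuringExecution to make it a hard rule.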

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: kubernetes.io/hostname

Topology domains (e.g., zones, hosts) are defined by the topologyKey, which references a node label. Nodes sharing the same value for this key are considered part of the same domain.
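
The grouping by topologyKey can be sketched as follows. This is a simplified model under stated assumptions: node and Pod data are plain dictionaries, the label selector is reduced to exact key/value pairs, both rules are treated as hard (required) rules, and special cases the real scheduler handles (such as the first Pod of a self-affine group) are ignored. The helper names are made up for the example.

```python
from collections import defaultdict

ZONE = "topology.kubernetes.io/zone"
HOST = "kubernetes.io/hostname"


def domains(nodes, topology_key):
    """Group node names by their value for the topology key."""
    groups = defaultdict(set)
    for name, labels in nodes.items():
        if topology_key in labels:
            groups[labels[topology_key]].add(name)
    return dict(groups)


def occupied_domains(nodes, pods, topology_key, selector):
    """Domain values that contain at least one Pod matching the selector."""
    found = set()
    for pod in pods:
        node_labels = nodes[pod["node"]]
        matches = all(pod["labels"].get(k) == v for k, v in selector.items())
        if matches and topology_key in node_labels:
            found.add(node_labels[topology_key])
    return found


def candidate_nodes(nodes, pods, topology_key, selector, anti=False):
    """Affinity keeps nodes inside occupied domains; anti-affinity keeps the rest."""
    occupied = occupied_domains(nodes, pods, topology_key, selector)
    return sorted(
        name
        for name, labels in nodes.items()
        if (labels.get(topology_key) in occupied) != anti
    )


nodes = {
    "node01": {ZONE: "us-west-1a", HOST: "node01"},
    "node02": {ZONE: "us-west-1a", HOST: "node02"},
    "node03": {ZONE: "us-west-1b", HOST: "node03"},
}
pods = [{"node": "node01", "labels": {"app": "cache-server"}}]

# Zone-level affinity: any node in the zone of the cache-server Pod qualifies.
print(candidate_nodes(nodes, pods, ZONE, {"app": "cache-server"}))             # ['node01', 'node02']
# Host-level anti-affinity: only the node running the Pod is excluded.
print(candidate_nodes(nodes, pods, HOST, {"app": "cache-server"}, anti=True))  # ['node02', 'node03']
```

The same selector gives very different results depending on the topologyKey, which is why the key choice matters as much as the labels.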

Label Selector Operators

Affinity and anti-affinity rules match labels with the following operators:

  • In, NotIn: Match against a list of values.
  • Exists, DoesNotExist: Check for presence or absence of a key.
  • Gt, Lt: Compare the label value, parsed as an integer, against a single value. These two are available only in node affinity, not in Pod affinity or anti-affinity label selectors.
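
The operator semantics can be written down as a small matcher. This is an illustrative sketch rather than the scheduler's implementation; the function names and the cpu-count label are hypothetical. It also shows how node affinity combines expressions: all expressions inside one term must match, and any one term is sufficient.

```python
def match_expression(labels, expr):
    """Evaluate one matchExpressions entry against a label dictionary."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A missing key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        # Node affinity only: the label and the single value are compared as integers.
        if key not in labels:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except ValueError:
            return False
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


def match_node_selector_terms(labels, terms):
    """Terms are ORed; the expressions inside each term are ANDed."""
    return any(
        all(match_expression(labels, e) for e in term["matchExpressions"])
        for term in terms
    )


node_labels = {"topology.kubernetes.io/zone": "us-west-1a", "cpu-count": "16"}

terms = [
    {"matchExpressions": [
        {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-west-1a"]},
        {"key": "cpu-count", "operator": "Gt", "values": ["8"]},
    ]},
]
print(match_node_selector_terms(node_labels, terms))  # True
```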

Taints and Tolerations: Node-Level Exclusion

Taints are applied to nodes to repel certain Pods, while tolerations are declared by Pods to override those restrictions.

Applying Taints

kubectl taint node node01 dedicated=gpu:NoSchedule
kubectl taint node node02 critical=true:NoExecute

Three taint effects are supported:

  • NoSchedule: New Pods cannot be scheduled, but existing ones remain.
  • PreferNoSchedule: Scheduler avoids but does not forbid placement.
  • NoExecute: Existing Pods that do not tolerate the taint are evicted; new ones are blocked.

Defining Tolerations

A Pod can tolerate a taint by declaring a matching toleration:

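tolerations:
- key: "dedicated"         # tolerates dedicated=gpu:NoSchedule on node01
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
- key: "critical"          # tolerates critical=true:NoExecute on node02
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 3600  # only valid together with effect NoExecute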

The tolerationSeconds field applies only to NoExecute tolerations. It specifies how long a Pod that tolerates the taint stays on the node after the taint is added before being evicted; if it is omitted, the Pod stays indefinitely. Using operator: Exists allows a Pod to tolerate any taint with the specified key, regardless of value.
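
The matching rules between a toleration and a taint can be sketched in a few lines. This is a simplified model with hypothetical function names, not the scheduler's code; taints and tolerations are plain dictionaries, PreferNoSchedule is ignored because it never blocks placement, and the eviction delay is approximated as the smallest tolerationSeconds among the matching tolerations.

```python
def tolerates(toleration, taint):
    """True if one toleration matches one taint."""
    # An empty effect in the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (used with Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(taints, tolerations):
    """A Pod is blocked by any untolerated NoSchedule or NoExecute taint."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


def eviction_delay(taints, tolerations):
    """Seconds before eviction: 0 means immediately, None means never."""
    delays = []
    for taint in (t for t in taints if t["effect"] == "NoExecute"):
        matching = [tol for tol in tolerations if tolerates(tol, taint)]
        if not matching:
            return 0
        delays += [tol["tolerationSeconds"] for tol in matching if "tolerationSeconds" in tol]
    return min(delays) if delays else None


node01_taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]
node02_taints = [{"key": "critical", "value": "true", "effect": "NoExecute"}]

tolerations = [
    {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"},
    {"key": "critical", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 3600},
]

print(can_schedule(node01_taints, tolerations))    # True
print(can_schedule(node01_taints, []))             # False
print(eviction_delay(node02_taints, tolerations))  # 3600
print(eviction_delay(node02_taints, []))           # 0
```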

Together, affinity rules and taints provide powerful, layered control over Pod placement—enabling use cases like dedicated hardware allocation, multi-tenant isolation, and zone-aware deployments.

Tags: kubernetes Scheduler NodeAffinity PodAffinity Taint

Posted on Sun, 24 May 2026 19:33:34 +0000 by cpd259