Kubernetes relies on a decoupled, event-driven architecture to manage workload placement across cluster nodes. At the heart of this system lies the List-Watch mechanism, which enables components to react to state changes in real time without polling. This mechanism allows the API Server to notify components like the Scheduler, Controller Manager, and kubelet whenever resources such as Pods are created, updated, or deleted—ensuring consistent cluster state without tight coupling.
When a user submits a Pod manifest via kubectl, the API Server persists the object in etcd and emits a creation event. The Scheduler, continuously watching for Pods without an assigned nodeName, picks up this event and initiates the scheduling process. This involves two phases: Predicate (filtering) and Priorities (scoring). These are the names used by the legacy scheduler policy; newer releases implement the same two stages as Filter and Score plugins in the scheduling framework.
Filtering Nodes with Predicate Policies
During the Predicate phase, the Scheduler evaluates candidate nodes against a set of constraints:
PodFitsResources: Verifies sufficient CPU, memory, and extended resources are available.
PodFitsHostPorts: Ensures requested host ports are not already in use.
PodSelectorMatches: Confirms node labels match the Pod’s selector requirements.
NoDiskConflict: Prevents mounting conflicting volumes unless both are read-only.
If no node passes all predicates, the Pod remains in Pending state until a suitable node becomes available.
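As a rough illustration of what the filtering phase inspects, the manifest below sketches a Pod whose spec carries three of the inputs discussed above: resource requests, a host port, and a node selector. The Pod name, the disktype label, and the request values are assumptions chosen for the example, not values from a real cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo            # hypothetical name
spec:
  nodeSelector:
    disktype: ssd              # hypothetical node label; nodes without it are filtered out
  containers:
    - name: app
      image: soscscs/myapp:v1
      resources:
        requests:
          cpu: "500m"          # compared against each node's unallocated CPU
          memory: "256Mi"      # compared against each node's unallocated memory
      ports:
        - containerPort: 80
          hostPort: 8080       # node must not already have host port 8080 claimed by another Pod
```

If no node carries the label, has enough unrequested capacity, and has the host port free, this Pod stays Pending, and kubectl describe pod filter-demo should show a FailedScheduling event explaining why.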
Scoring Nodes with Priority Functions
Nodes that pass filtering are scored using priority functions to determine optimal placement:
LeastRequestedPriority: Favors nodes with lower resource consumption (CPU and memory).
BalancedResourceAllocation: Prefers nodes where CPU and memory usage are balanced.
ImageLocalityPriority: Rewards nodes that already have the required container images cached.
The final score for each node is a weighted sum of all priority functions, and the highest-scoring node is selected for binding.
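The weights in that sum are configurable. In releases that use the scheduling framework, they are set per Score plugin in the scheduler configuration file rather than per priority function. The following is a minimal sketch, assuming a cluster that accepts the kubescheduler.config.k8s.io/v1 API and that the ImageLocality and NodeResourcesBalancedAllocation plugins play the roles of ImageLocalityPriority and BalancedResourceAllocation described above; the weight values are arbitrary.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          # Each plugin's normalized score is multiplied by its weight
          # before the per-node scores are summed.
          - name: ImageLocality
            weight: 2                      # illustrative value
          - name: NodeResourcesBalancedAllocation
            weight: 1                      # illustrative value
```

With these weights, a node that already has the image cached gains twice as much from that fact as it would from an equally good balance score.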
Node Affinity: Directing Pod Placement
Node affinity allows fine-grained control over which nodes a Pod can land on, using label selectors with two modes:
Hard Affinity (required)
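A Pod is scheduled only onto a node that satisfies the rule; if no node does, the Pod stays in Pending status. Multiple nodeSelectorTerms are ORed together, while the matchExpressions inside a single term must all match.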
apiVersion: v1
kind: Pod
metadata:
  name: hard-affinity-pod
spec:
  containers:
    - name: app
      image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-west-1a
Soft Affinity (preferred)
Nodes matching preferred rules are prioritized, but scheduling proceeds even if none match.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: environment
              operator: In
              values:
                - production
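The two modes can also be combined in one spec. The sketch below reuses the zone and environment labels from the fragments above to show how a hard rule narrows the candidate set while a soft rule ranks what remains; the Pod name is an assumption made for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mixed-affinity-pod     # hypothetical name
spec:
  containers:
    - name: app
      image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      # Hard rule: only nodes in this zone are candidates at all.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-west-1a
      # Soft rule: among those candidates, favor production nodes.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80             # allowed range is 1-100
          preference:
            matchExpressions:
              - key: environment
                operator: In
                values:
                  - production
```

If no node in us-west-1a is labeled environment=production, the Pod still schedules in that zone; if the zone has no eligible node at all, it stays Pending.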
Pod Affinity and Anti-Affinity: Co-Location and Isolation
Unlike node affinity, Pod affinity operates based on labels of other running Pods. This enables patterns like co-locating services or avoiding single points of failure.
Pod Affinity (Co-locate)
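Ensures new Pods are scheduled onto nodes in the same topology domain (in the example below, the same zone) as Pods with matching labels. The matching Pods do not have to be on the same node unless the topologyKey is the hostname label.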
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - cache-server
        topologyKey: topology.kubernetes.io/zone
Pod Anti-Affinity (Isolate)
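Steers new Pods away from topology domains (in the example below, individual nodes) that already host Pods with the specified labels, improving resilience. The preferred form shown here is a soft rule; the required form turns it into a hard constraint.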
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - database
          topologyKey: kubernetes.io/hostname
A topology domain (e.g., a zone or a single host) is defined by the topologyKey, which names a node label. Nodes sharing the same value for that label are considered part of the same domain.
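A common use of the hostname topology key is spreading replicas so that no two land on the same node. The Deployment below is a minimal sketch of that pattern using the hard (required) form of anti-affinity; the Deployment name, replica count, and container name are assumptions, and the image is simply reused from the earlier examples as a placeholder.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database          # the label the anti-affinity rule matches against
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: database
              # Each distinct hostname is its own domain, so at most
              # one replica is placed per node.
              topologyKey: kubernetes.io/hostname
      containers:
        - name: db
          image: soscscs/myapp:v1   # placeholder image
```

Because the rule is hard, a cluster with fewer than three schedulable nodes leaves the extra replicas Pending; switching to the preferred form trades that guarantee for availability.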
Label Selector Operators
Affinity rules support the following operators for label matching:
In, NotIn: Match against a list of values.
Exists, DoesNotExist: Check for presence or absence of a key.
Gt, Lt: Compare the label's value, parsed as an integer, against a single number. These two operators are available only in node affinity, not in Pod affinity or anti-affinity.
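To show the less common operators in context, here is a short node affinity sketch that combines Exists and Gt. Both label keys are hypothetical and would have to be applied to nodes by the cluster operator.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Exists takes no values: the label only has to be present.
            - key: example.com/gpu          # hypothetical label
              operator: Exists
            # Gt takes exactly one value, written as a string and
            # compared as an integer against the node's label value.
            - key: example.com/cpu-cores    # hypothetical label
              operator: Gt
              values:
                - "8"
```

Both expressions sit in the same term, so a node must carry the first label and have a second label value greater than 8 to qualify.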
Taints and Tolerations: Node-Level Exclusion
Taints are applied to nodes to repel certain Pods, while tolerations are declared by Pods to override those restrictions.
Applying Taints
kubectl taint node node01 dedicated=gpu:NoSchedule
kubectl taint node node02 critical=true:NoExecute
Three taint effects are supported:
NoSchedule: New Pods cannot be scheduled, but existing ones remain.
PreferNoSchedule: Scheduler avoids but does not forbid placement.
NoExecute: Existing Pods are evicted; new ones are blocked.
Defining Tolerations
A Pod can tolerate a taint by declaring a matching toleration:
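tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "critical"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 3600   # only accepted together with the NoExecute effect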
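The tolerationSeconds field specifies how long a Pod remains on a node after a NoExecute taint is applied before being evicted. It is valid only with the NoExecute effect, and omitting it means the Pod tolerates the taint indefinitely. Using operator: Exists (with no value set) allows a Pod to tolerate any taint with the specified key, regardless of value.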
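The Exists operator can be sketched as follows. The two entries are alternatives shown side by side for comparison; a real spec would normally use only one of them.

```yaml
tolerations:
  # Matches dedicated=<any value>:NoSchedule. The "value" field
  # must be left out when the operator is Exists.
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"
  # No key and no effect: matches every taint on every node.
  # Usually reserved for system-level Pods that must run everywhere.
  - operator: "Exists"
```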
Together, affinity rules and taints provide powerful, layered control over Pod placement—enabling use cases like dedicated hardware allocation, multi-tenant isolation, and zone-aware deployments.
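The dedicated hardware case can be sketched by pairing the two mechanisms. The example assumes node01 carries the dedicated=gpu:NoSchedule taint applied earlier and has additionally been given a node label dedicated=gpu (for instance with kubectl label node node01 dedicated=gpu); the label and the Pod name are assumptions made for this example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                # hypothetical name
spec:
  containers:
    - name: app
      image: soscscs/myapp:v1
  # The toleration lets this Pod onto the tainted node,
  # but does nothing to attract it there.
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  # The affinity rule keeps the Pod off every other node.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated       # assumed node label, set separately from the taint
                operator: In
                values:
                  - gpu
```

The taint keeps ordinary workloads off the GPU node, and the affinity rule keeps GPU workloads from drifting onto ordinary nodes; either mechanism alone covers only one direction.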