Kubernetes Scheduling: Understanding the Math Behind the Magic

Roman Glushach
13 min read · Sep 7, 2023


Kubernetes Scheduling

Kubernetes scheduling is responsible for assigning pods (groups of one or more containers) to nodes in a cluster. The scheduler’s primary objective is to optimize resource utilization and ensure that the application runs smoothly and efficiently. It takes into account factors such as hardware capabilities, available resources, quality of service (QoS), and affinity settings.

Efficient scheduling is crucial for maximizing resource utilization, improving application performance, and ensuring high availability in a Kubernetes cluster. By intelligently assigning pods (groups of containers) to nodes (individual machines in the cluster), Kubernetes scheduling enables workload distribution, load balancing, and fault tolerance. It plays a vital role in optimizing resource allocation, minimizing bottlenecks, and providing a seamless user experience.

How Kubernetes Scheduling Works

Scheduling Process

Kubernetes scheduling is composed of two main phases: filtering and scoring. In the filtering phase, Kubernetes eliminates nodes that are not suitable for running a pod. For example, if a pod requests 2 GB of memory, Kubernetes will filter out nodes that have less than 2 GB of free memory. In the scoring phase, Kubernetes assigns a score to each remaining node based on various criteria, such as resource utilization, pod affinity, and node affinity. The node with the highest score is selected as the best fit for the pod.

The filtering and scoring phases were historically implemented by two types of components: predicates and priorities. Predicates are boolean functions that return true or false for each node; they implement the filtering logic. Priorities are numeric functions that score each remaining node, from 0 to 10 in the legacy scheduler; the newer scheduling framework expresses the same ideas as Filter and Score plugins, with scores from 0 to 100.
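
To make the two phases concrete, here is a minimal sketch in Go with simplified Predicate and Priority function types of my own devising; the real scheduler wires equivalent logic through its plugin interfaces rather than plain function slices.

```go
// A minimal sketch of the filter-then-score loop, under simplified types.
package sketch

// Node and Pod carry only the fields this sketch needs.
type Node struct {
	Name          string
	FreeMilliCPU  int64
	FreeMemoryMiB int64
}

type Pod struct {
	Name          string
	ReqMilliCPU   int64
	ReqMemoryMiB  int64
}

// Predicate is a hard yes/no check (filtering phase).
type Predicate func(p Pod, n Node) bool

// Priority scores a node that survived filtering (scoring phase).
type Priority func(p Pod, n Node) int64

// FitsResources is an example predicate mirroring PodFitsResources.
func FitsResources(p Pod, n Node) bool {
	return p.ReqMilliCPU <= n.FreeMilliCPU && p.ReqMemoryMiB <= n.FreeMemoryMiB
}

// Schedule filters out infeasible nodes, sums the priority scores of the
// survivors, and returns the highest-scoring node.
func Schedule(p Pod, nodes []Node, preds []Predicate, prios []Priority) (Node, bool) {
	var best Node
	bestScore := int64(-1)
	found := false
	for _, n := range nodes {
		feasible := true
		for _, pred := range preds {
			if !pred(p, n) { // filtering: one failed predicate removes the node
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		var score int64
		for _, prio := range prios {
			score += prio(p, n) // scoring: combine all priority functions
		}
		if score > bestScore {
			best, bestScore, found = n, score, true
		}
	}
	return best, found
}
```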

Kubernetes has a default set of predicates and priorities that cover the most common use cases:

  • PodFitsResources: checks if a node has enough CPU and memory to run a pod
  • PodFitsHost: checks that, if a pod requests a specific node by name (spec.nodeName), the node matches that request
  • PodFitsHostPorts: checks if a node has free ports for a pod’s host port mappings
  • PodSelectorMatches: checks if a pod’s node selector matches a node’s labels
  • NoDiskConflict: checks if a pod’s volume mounts conflict with any other pod’s volume mounts on the same node

Some of the default priorities are:

  • LeastRequestedPriority: favors nodes with lower resource utilization
  • BalancedResourceAllocation: favors nodes with balanced resource utilization across CPU and memory
  • NodeAffinityPriority: favors nodes that match a pod’s node affinity preferences
  • InterPodAffinityPriority: favors nodes that match a pod’s inter-pod affinity and anti-affinity preferences
  • TaintTolerationPriority: penalizes nodes carrying PreferNoSchedule taints that the pod does not tolerate

You can also define your own custom predicates and priorities using the scheduler extender mechanism or by writing your own scheduler plugin.

Understanding the Scheduling Algorithm

Node Selection

The first step in the Kubernetes scheduling algorithm is node selection. When a new pod is created, the scheduler evaluates the available nodes in the cluster and identifies suitable candidates for hosting the pod.

This evaluation is based on several factors, including:

  • resource availability: ability of a computing system to allocate resources such as CPU, memory, storage, and network bandwidth to meet the demands of its workloads. It involves ensuring that there are sufficient resources available to run applications and services without bottlenecks or interruptions. Resource availability can be managed through techniques like resource scheduling, load balancing, and auto-scaling
  • node capacity: maximum amount of resources that a single node in a distributed system can handle. It includes factors such as processing power, memory size, storage capacity, and network interface capabilities. Understanding node capacity is important for scaling applications horizontally (adding more nodes) and vertically (increasing resources on individual nodes)
  • affinity: tendency of certain tasks or processes to co-locate or run together on the same node or set of nodes in a distributed system. This can be due to factors such as shared data access patterns, communication overhead, or performance optimization. Affinity can be used to improve application performance by minimizing network latency and maximizing resource utilization
  • anti-affinity rules: used to prevent certain tasks or processes from running together on the same node or set of nodes in a distributed system. These rules are typically used when processes have conflicting resource requirements or must stay isolated for security or compliance reasons. For example, anti-affinity rules may ensure that replicas of a mission-critical application never share a node, so a single node failure cannot take them all down
  • taints and tolerations: mechanisms used in Kubernetes to manage pod placement and eviction based on node conditions. A taint marks a node as unsuitable for pods that do not explicitly tolerate it, for example because the node is reserved for special hardware or is experiencing problems. A toleration lets a pod declare that it can accept a given taint and still be scheduled there. Together they let administrators reserve nodes for critical workloads while still allowing tolerant, less sensitive workloads onto nodes with known limitations (see the sketch after this list)
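
As a sketch of the taints-and-tolerations mechanics, the following builds a toleration for a hypothetical dedicated=gpu:NoSchedule taint using the k8s.io/api Go types; the key and value are illustrative, not built-in names.

```go
// A toleration matching a hypothetical "dedicated=gpu" NoSchedule taint;
// a pod carrying it may be scheduled onto nodes tainted this way.
package sketch

import corev1 "k8s.io/api/core/v1"

func gpuToleration() corev1.Toleration {
	return corev1.Toleration{
		Key:      "dedicated",                  // must match the taint's key
		Operator: corev1.TolerationOpEqual,     // tolerate this exact value only
		Value:    "gpu",
		Effect:   corev1.TaintEffectNoSchedule, // the taint effect being tolerated
	}
}
```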

Scoring and Prioritization

Once the candidate nodes are identified, the scheduler assigns a score to each node based on a set of predefined criteria.

These criteria include:

  • resource utilization: allocation and usage of computing resources such as CPU, memory, storage, and network bandwidth within a cluster. It is important to monitor and manage resource utilization to ensure that nodes have sufficient resources available to run applications efficiently and avoid overprovisioning or underutilization
  • node capacity: the maximum amount of resources that a node can provide to run applications. It includes factors such as the number of CPU cores, memory size, and storage capacity
  • pod affinity and anti-affinity: ability of Kubernetes to schedule pods together on the same node or separate them across different nodes based on their relationships. Affinity rules define how pods should be placed relative to each other, such as co-locating dependent pods or separating them for fault tolerance. Anti-affinity rules specify which pods should never be scheduled on the same node. Managing pod affinity and anti-affinity ensures efficient use of resources and minimizes application downtime
  • inter-pod communication requirements: exchange of data between two or more pods in a cluster. Understanding communication requirements is essential to configure networking policies, choose appropriate service discovery mechanisms, and optimize performance. Communication patterns can vary from simple HTTP requests to complex distributed tracing and service mesh architectures
  • user-defined constraints: custom limitations or restrictions that users can set on their deployments, services, or pods. These constraints can include parameters such as minimum or maximum instances, CPU or memory limits, or specific hardware requirements. By defining constraints, users can ensure that their applications operate within desired boundaries, preventing overprovisioning or overspending, and maintaining compliance with organizational policies or regulatory requirements

The scheduler then prioritizes the nodes based on their scores, with higher-scoring nodes given preference for hosting the pod.

Resource Allocation and Bin Packing

After prioritization, the scheduler performs resource allocation and bin packing to determine the optimal placement of pods on nodes.

It takes into account the resource requirements of the pods, such as CPU and memory, and ensures that the nodes have sufficient capacity to accommodate the pods.

The scheduler aims to maximize resource utilization and avoid overloading nodes, while also considering any user-defined constraints or preferences.

Pod Binding and Affinity Rules

Once a suitable node is identified for a pod, the scheduler binds the pod to that node, marking it as scheduled.

The scheduler also considers pod affinity and anti-affinity rules, which allow users to specify constraints on the placement of pods.

Affinity rules can be used to ensure that pods are co-located on the same node or spread across different nodes, based on specific requirements or dependencies.

Customizing the Scheduling Process

Node Selectors

Node selectors allow users to specify a set of key-value pairs that must match the labels assigned to nodes. By setting node selectors for pods, users can ensure that pods are scheduled only on nodes that meet specific criteria, such as hardware capabilities or geographical location.
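
A minimal sketch of a node selector, expressed with the k8s.io/api Go types rather than YAML; the disktype=ssd label is illustrative, not a built-in one.

```go
// A pod that only schedules onto nodes labeled disktype=ssd.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func ssdPinnedPod() corev1.Pod {
	return corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: corev1.PodSpec{
			// Nodes without this exact label are removed in the filtering phase.
			NodeSelector: map[string]string{"disktype": "ssd"},
			Containers:   []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
		},
	}
}
```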

Node Affinity and Anti-Affinity

Node affinity and anti-affinity rules provide more fine-grained control over the placement of pods on nodes. Users can define rules that specify preferences or constraints for pod placement based on node labels, such as preferring nodes with SSD storage or avoiding nodes with specific labels.
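
Here is a sketch of a soft (preferred) node affinity term in Go; the label and weight are illustrative. Unlike a node selector, a preferred term only adds score, so the pod can still land elsewhere when no matching node fits.

```go
// Preferred (soft) node affinity: nodes labeled disktype=ssd earn extra
// score, but the pod can still be placed elsewhere.
package sketch

import corev1 "k8s.io/api/core/v1"

func preferSSD() *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.PreferredSchedulingTerm{{
				Weight: 50, // 1-100; added to the node's score when the term matches
				Preference: corev1.NodeSelectorTerm{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "disktype",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"ssd"},
					}},
				},
			}},
		},
	}
}
```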

Pod Affinity and Anti-Affinity

Similar to node affinity and anti-affinity, pod affinity and anti-affinity rules allow users to define preferences and constraints for pod placement based on pod labels. These rules can be used, for example, to ensure that pods are co-located on the same node for efficient inter-pod communication or to spread pods across different failure domains for improved fault tolerance.
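
A sketch of a hard pod anti-affinity rule that spreads pods labeled app=web across nodes; the app label is illustrative, while kubernetes.io/hostname is the standard per-node topology key.

```go
// Required (hard) pod anti-affinity: no two pods labeled app=web may share
// a node, where "a node" is defined by the kubernetes.io/hostname key.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func spreadWebPods() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "web"},
				},
				TopologyKey: "kubernetes.io/hostname", // at most one match per node
			}},
		},
	}
}
```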

Resource Quotas

Resource quotas enable users to limit the amount of resources that can be consumed by pods in a namespace. By setting resource quotas, administrators can prevent overutilization of resources and ensure fair allocation of resources across different workloads.
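
A sketch of a namespace quota capping aggregate requests, again via the Go types; the name, namespace, and amounts are illustrative.

```go
// A ResourceQuota capping the total CPU and memory requests, and the pod
// count, across everything in the team-a namespace.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func teamQuota() corev1.ResourceQuota {
	return corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "team-a-quota", Namespace: "team-a"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceRequestsCPU:    resource.MustParse("8"),    // sum of CPU requests
				corev1.ResourceRequestsMemory: resource.MustParse("16Gi"), // sum of memory requests
				corev1.ResourcePods:           resource.MustParse("20"),   // max pod count
			},
		},
	}
}
```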

Custom Schedulers

Kubernetes also allows users to develop and deploy custom schedulers, which implement their own scheduling logic and policies. Custom schedulers can be used to incorporate domain-specific knowledge or to address unique scheduling requirements that are not covered by the default scheduler.
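
Opting a pod into a custom scheduler is a one-field change in its spec; "my-scheduler" below is a hypothetical name for your own scheduler deployment.

```go
// A pod handled by a custom scheduler instead of the default one.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func customScheduledPod() corev1.Pod {
	return corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "batch-job"},
		Spec: corev1.PodSpec{
			SchedulerName: "my-scheduler", // the default is "default-scheduler"
			Containers:    []corev1.Container{{Name: "job", Image: "busybox:1.36"}},
		},
	}
}
```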

Scheduling Algorithms

Random

The random scheduling algorithm, also known as the “dumb” scheduler, selects a random node to run a pod on.

This algorithm is simple but not very efficient, as it doesn’t take into account factors such as resource utilization, hardware capabilities, or network latency.

It is rarely used in production environments, but can be useful for testing purposes or when there are no specific requirements for pod placement.

Least Requested

The least requested scheduling algorithm favors the node with the most spare capacity relative to what has already been requested.

For each candidate node, the scheduler sums the resource requests of the pods already assigned to it and compares that sum to the node’s allocatable capacity. The node with the lowest requested fraction is selected for scheduling.

This algorithm spreads load evenly across the cluster. However, it may not always result in optimal resource utilization, especially when nodes have different capacities; a formula sketch follows.
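
For reference, the legacy LeastRequestedPriority scored each resource as (capacity - requested) * 10 / capacity and averaged the CPU and memory results; a direct transcription:

```go
// Legacy LeastRequestedPriority: per resource, score free capacity on a
// 0-10 scale, then average CPU and memory. Higher means more headroom.
// Capacities are assumed to be positive.
package sketch

func leastRequestedScore(reqMilliCPU, capMilliCPU, reqMemBytes, capMemBytes int64) int64 {
	cpu := (capMilliCPU - reqMilliCPU) * 10 / capMilliCPU
	mem := (capMemBytes - reqMemBytes) * 10 / capMemBytes
	return (cpu + mem) / 2
}
```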

Most Requested

The most requested scheduling algorithm is the opposite of the least requested algorithm: it favors the nodes whose resources are already the most heavily requested.

This packs pods onto fewer, fuller nodes, which can improve utilization and leave other nodes empty enough to be drained and removed (useful together with cluster autoscaling). The trade-off is that it can create bottlenecks on popular nodes while leaving others underutilized.

Balanced Resource Allocation

The balanced resource allocation scheduling algorithm aims to keep the different resource types on each node evenly utilized, rather than leaving a node CPU-saturated but memory-idle.

Before scheduling a new pod, the algorithm calculates the utilization fraction of each relevant resource (CPU, memory, and others) that the node would have after placement. The node whose fractions are closest to one another scores highest.

This algorithm avoids skewed nodes that have exhausted one resource while wasting another, though it is typically combined with a spreading or packing score rather than used alone; a formula sketch follows.
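
The legacy Kubernetes form of this idea scored a node by how close its CPU and memory request fractions were to each other; a direct transcription:

```go
// Legacy BalancedResourceAllocation: compare the requested fraction of CPU
// and memory and reward nodes where the two are closest, so a node is not
// left CPU-saturated but memory-idle (or vice versa).
package sketch

import "math"

func balancedScore(reqCPU, capCPU, reqMem, capMem float64) float64 {
	cpuFraction := reqCPU / capCPU
	memFraction := reqMem / capMem
	return 10 - math.Abs(cpuFraction-memFraction)*10 // 10 = perfectly balanced
}
```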

Bin Packing

The bin packing scheduling algorithm is inspired by the classic bin packing problem in computer science.

It works by grouping pods into bins based on their resource requirements, then assigning bins to nodes. Each bin represents a set of pods that can fit within the resource constraints of a single node. The algorithm starts by creating an empty bin for each node, then iteratively adds pods to the bins until all pods are scheduled or there are no more available nodes.

This algorithm optimizes resource utilization by minimizing waste and maximizing the number of scheduled pods. However, it can be computationally expensive and may not handle changes in resource availability well.
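
A minimal first-fit-decreasing sketch of the idea; this illustrates the technique rather than the default scheduler’s actual code.

```go
// First-fit decreasing: sort pods by request size, then place each on the
// first node with enough remaining capacity. Larger items first reduces
// fragmentation.
package sketch

import "sort"

type bin struct {
	name string
	free int64 // remaining capacity, e.g. in milli-CPU
}

func firstFitDecreasing(requests map[string]int64, nodes []bin) map[string]string {
	pods := make([]string, 0, len(requests))
	for p := range requests {
		pods = append(pods, p)
	}
	sort.Slice(pods, func(i, j int) bool { return requests[pods[i]] > requests[pods[j]] })

	placement := map[string]string{}
	for _, p := range pods {
		for i := range nodes {
			if nodes[i].free >= requests[p] {
				nodes[i].free -= requests[p]
				placement[p] = nodes[i].name
				break
			}
		}
	}
	return placement // pods missing from the map could not be placed
}
```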

Earliest Deadline First (EDF)

The earliest deadline first scheduling algorithm is a dynamic-priority algorithm from real-time systems, often contrasted with rate monotonic scheduling (RMS). The default Kubernetes scheduler is not deadline-aware, but custom and batch schedulers can implement EDF-style policies.

It assigns priority to pods based on their deadlines, with earlier deadlines receiving higher priority. The algorithm schedules pods in order of their deadlines, starting from the earliest. If multiple pods have the same deadline, the algorithm uses a round-robin approach to select the next pod.

EDF helps ensure timely completion of critical tasks but may cause starvation for lower-priority pods.
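
A minimal EDF selection sketch; since the default scheduler has no deadline field, the deadline here is hypothetical, e.g. parsed from a pod annotation by a custom scheduler.

```go
// nextEDF returns the pending pod with the earliest (hypothetical) deadline.
package sketch

import "time"

type pendingPod struct {
	name     string
	deadline time.Time
}

func nextEDF(queue []pendingPod) (pendingPod, bool) {
	if len(queue) == 0 {
		return pendingPod{}, false
	}
	best := queue[0]
	for _, p := range queue[1:] {
		if p.deadline.Before(best.deadline) {
			best = p
		}
	}
	return best, true
}
```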

Rate Monotonic Scheduling (RMS)

Rate monotonic scheduling is a classic fixed-priority real-time algorithm: tasks that recur more frequently (a higher rate, i.e., a shorter period) receive higher priority.

Applied to cluster scheduling, rate-based policies aim to distribute the total rate of incoming work among available nodes, ensuring predictable performance and fairness for periodic workloads.

Related real-time policies include EDF and least laxity first (LLF), which assign priorities dynamically rather than from fixed rates, as well as proportional-share schedulers.

Hierarchical

Hierarchical scheduling combines multiple scheduling algorithms to create a hierarchical structure.

For example, a cluster-level scheduler might use a combination of random, least requested, and most requested algorithms to allocate pods across nodes, while a node-level scheduler employs a rate monotonic algorithm to schedule containers within each node.

This approach allows for greater flexibility and adaptability to various workloads.

Machine Learning (ML)

Machine learning scheduling algorithms leverage artificial intelligence and machine learning techniques to predict and optimize resource utilization, workload patterns, and application behavior.

These algorithms can automatically adjust parameters, detect anomalies, and learn from experience to improve scheduling decisions.

Examples of ML scheduling algorithms include neural networks, decision trees, and genetic algorithms. They can be applied to various aspects of Kubernetes scheduling, such as pod placement, replica management, and resource allocation.

Types of Scheduling Criteria

Required Resources

Required resources refer to the minimum amount of resources (CPU, memory, etc.) required by a pod to run successfully.

Kubernetes ensures that these resources are always available on the node hosting the pod.

Requested Resources

Requested resources (the requests field in a container spec) are the amounts the scheduler reserves for a pod.

A node is considered during filtering only if its allocatable capacity, minus what existing pods have already requested, covers the new pod’s request. At runtime the pod may consume more than it requested when spare capacity exists, up to its limit, but only the request itself is guaranteed.

Limits

Limits refer to the maximum amount of resources a pod can consume.

Kubernetes enforces these limits to prevent a single pod from consuming all the resources on a node and starving other pods: a container exceeding its CPU limit is throttled, while one exceeding its memory limit is terminated (OOM-killed).
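
A sketch of requests and limits for a single container, using the k8s.io/api Go types; the amounts are illustrative.

```go
// Requests vs limits for one container: the scheduler reserves the requests;
// the kubelet enforces the limits. Because requests != limits here, the
// pod's QoS class would be Burstable (see the next section).
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func webResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("250m"),  // reserved at scheduling time
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),  // hard cap: throttled above this
			corev1.ResourceMemory: resource.MustParse("512Mi"), // hard cap: OOM-killed above this
		},
	}
}
```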

Quality of Service (QoS)

Quality of Service (QoS) refers to the level of service a pod requires. Kubernetes derives a pod’s QoS class from its resource requests and limits rather than from an explicit setting, and uses the class to group pods by priority when node resources run short.

Kubernetes provides three QoS classes:

  • BestEffort: assigned when no container in the pod sets any requests or limits. These pods are scheduled wherever there is room and are the first to be evicted under node resource pressure
  • Burstable: assigned when at least one container sets a request but the pod does not meet the Guaranteed criteria. These pods may temporarily use more than they requested when spare capacity exists, but must relinquish the excess under pressure or risk eviction
  • Guaranteed: assigned when every container sets requests equal to limits for both CPU and memory. These pods have their full request reserved and are the last in line for eviction

Mathematical Modeling and Its Relevance to Kubernetes Scheduling

Optimization Techniques Used in Kubernetes Scheduling

  • Linear Programming: a method for finding the optimal solution to a problem by maximizing or minimizing a linear objective function, subject to linear equality and inequality constraints. The default scheduler relies on the filter-and-score heuristics described above rather than solving an LP online, but LP formulations are used in scheduling research and custom schedulers to model problems such as allocating resources to containers and pods, scheduling jobs, and managing network traffic (an illustrative formulation follows this list)
  • Quadratic Programming: an extension of linear programming that allows a quadratic objective function over linear constraints. It can capture more complex placement objectives, such as trading off network latency against resource utilization when spreading containers and pods across nodes
  • Genetic Algorithms: a type of evolutionary algorithm inspired by natural selection and genetics, particularly useful for optimization problems with multiple conflicting objectives. In Kubernetes scheduling, genetic algorithms can be used to evolve placements that balance competing goals, such as minimizing response time while maximizing resource utilization
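
To make the linear programming bullet concrete, here is one illustrative (not Kubernetes-official) integer program for pod placement: x_{ij} = 1 places pod i on node j, r_i is pod i’s resource request, C_j is node j’s capacity, and minimizing the peak utilization u spreads load evenly.

```latex
\begin{align*}
\min_{x,\,u} \quad & u \\
\text{s.t.} \quad
  & \textstyle\sum_{j} x_{ij} = 1                 && \forall i && \text{(each pod placed exactly once)} \\
  & \textstyle\sum_{i} r_i \, x_{ij} \le u \, C_j && \forall j && \text{(node capacity and peak-utilization bound)} \\
  & x_{ij} \in \{0,1\}, \quad 0 \le u \le 1
\end{align*}
```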

Machine Learning Algorithms

  • Neural Networks: class of machine learning models inspired by biological nervous systems. They can be trained to predict and classify data, and have been used in Kubernetes scheduling to forecast workloads, predict resource usage, and identify patterns in system behavior
  • Decision Trees: Decision trees are a type of machine learning model that uses tree-like structures to classify data. They have been used in Kubernetes scheduling to predict container failures, diagnose issues, and recommend actions based on historical data
  • Random Forest: Random forest is an ensemble learning technique that combines multiple decision trees to improve accuracy and reduce overfitting. It has been used in Kubernetes scheduling to build predictive models of resource usage, job completion times, and other metrics

Advanced Kubernetes Scheduling Concepts

Heterogeneous Clustering

The ability of Kubernetes to manage different types of worker nodes within the same cluster. This allows administrators to take advantage of diverse hardware configurations and optimize resource utilization.

For example, a cluster might contain a mix of high-CPU, high-memory, and low-latency nodes, each optimized for specific workloads. Heterogeneous clustering enables Kubernetes to schedule pods based on the unique resources available on each node, ensuring that workloads are deployed on the most suitable hardware.

To use heterogeneous clustering, administrators label nodes with their distinguishing attributes and define node affinity rules that specify which types of nodes a pod can be scheduled on. For instance, a node might carry a label indicating a high-CPU architecture or a particular availability zone. When creating a pod, administrators specify the required node affinity terms, and Kubernetes will only schedule the pod on nodes whose labels match.

Resource Sharing

It enables multiple pods to share node resources such as CPU, memory, and storage. Resource sharing helps reduce resource waste and increase overall cluster efficiency. Administrators can shape it with ResourceQuota objects, which cap the aggregate resources the pods in a namespace may consume.

A ResourceQuota itself is always a hard cap (its spec.hard field); the softer, burstable behavior comes from the requests/limits split on individual pods:

  • hard caps: aggregate limits that a namespace’s pods cannot exceed, enforced when pods are created
  • burstable headroom: a pod whose limits exceed its requests may temporarily use spare node capacity, but must give it back (or risk eviction) under resource pressure

Administrators can also set minimum guarantees for pods through resource requests, which reserve a minimum amount of node capacity for a pod regardless of other resource demands.

Inter-Pod Affinity and Anti-Affinity

Essential concepts for managing how pods interact with each other. Inter-pod affinity ensures that related pods are scheduled together on the same node, improving communication between them. On the other hand, inter-pod anti-affinity prevents certain pods from being scheduled on the same node, reducing potential conflicts.

Administrators define affinity and anti-affinity rules using labels, the key-value pairs attached to pods, services, or deployments. An affinity rule in one pod’s spec selects other pods by label (for example, all pods sharing the same application identifier) and tells Kubernetes to schedule the new pod alongside them within a topology domain such as a node. An anti-affinity rule uses the same kind of label selector but tells Kubernetes to keep the matching pods apart.

When creating affinity or anti-affinity rules, administrators must consider factors like pod topology, network latency, and resource usage. They should aim to minimize network hops between communicating pods and maximize resource utilization by co-locating resource-intensive workloads.

Conclusion

Kubernetes scheduling is a complex and crucial aspect of managing containerized applications. By understanding the underlying math and logic behind the scheduling algorithm, users can optimize resource utilization, improve performance, and ensure the efficient operation of their Kubernetes clusters. Customizing the scheduling process further allows users to tailor the scheduling behavior to their specific requirements and constraints. With a solid understanding of Kubernetes scheduling, users can harness the full power of this container orchestration platform and unlock its potential for scalable and reliable application deployment.


Roman Glushach

Senior Software Architect & Engineer Manager at Freelance