Behavioral patterns are reusable solutions to recurring problems that arise from the interaction of components in a system. They describe how the components communicate, coordinate, and collaborate to achieve a desired outcome. Behavioral patterns are not specific to Kubernetes, but they can be applied to any distributed system that involves multiple actors and dynamic behavior.
Behavioral Patterns describe how to manage the life-cycle of Pods, which are the basic units of deployment in Kubernetes. Depending on the type of workload, a Pod might run until completion as a batch job, be scheduled to run periodically as a cron job, run as a daemon service on every node, or run as a singleton service on a specific node.
Kubernetes Resources and Controllers
- Batch Job Pattern: Manages isolated atomic units of work using the Job resource. It runs short-lived Pods reliably until completion in a distributed environment
- Bare Pod: A manually created Pod to run containers. If the node running the Pod fails, the Pod is not restarted. This method is discouraged except for development or testing purposes
- ReplicaSet: A controller used for creating and managing the lifecycle of Pods expected to run continuously (e.g., to run a web server container). It maintains a stable set of replica Pods running at any given time
- DaemonSet: A controller that runs a single Pod on every node and is used for managing platform capabilities such as monitoring, log aggregation, storage containers, and others
- Job Resource: For tasks that need to perform a predefined finite unit of work reliably and then shut down the container, Kubernetes provides the Job resource. A Kubernetes Job creates one or more Pods and ensures they run successfully
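As a sketch, a minimal Job that runs a short-lived Pod to completion might look like this (the name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: file-processor        # illustrative name
spec:
  completions: 1              # one successful Pod completes the Job
  parallelism: 1
  backoffLimit: 4             # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: OnFailure   # Job Pods must use OnFailure or Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing && sleep 5"]
```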
Types of Jobs
Jobs can be categorized based on their completions and parallelism parameters:
- Single Pod Jobs: Starts only one Pod and is completed as soon as the single Pod terminates successfully
- Fixed completion count Jobs: The Job is considered completed after the .spec.completions number of Pods has completed successfully
- Work queue Jobs: A work queue Job is considered completed when at least one Pod has terminated successfully and all other Pods have terminated too
- Indexed Jobs: Every Pod of the Job gets an associated index ranging from 0 to .spec.completions - 1. The assigned index is available to the containers through the Pod annotation batch.kubernetes.io/job-completion-index or directly via the JOB_COMPLETION_INDEX environment variable
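An Indexed Job can be sketched as follows, assuming a hypothetical workload split into five partitions (name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-worker        # illustrative name
spec:
  completionMode: Indexed     # each Pod gets an index 0..completions-1
  completions: 5
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        # JOB_COMPLETION_INDEX is injected automatically for Indexed Jobs
        command: ["sh", "-c", "echo processing partition $JOB_COMPLETION_INDEX"]
```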
Limitations and Solutions
There are limitations with Indexed Jobs. For instance, the Job’s application code cannot discover the total number of workers (i.e., the value specified in .spec.completions) for an Indexed Job.
To overcome this:
- Hardcode the total number of Pods working on a Job into the application code
- Access the value of .spec.completions in your application code by copying it to an environment variable or passing it as an argument to the container command in the Job’s template specification
If you have an unlimited stream of work items to process, other controllers like ReplicaSet are recommended for managing the Pods processing these work items.
Periodic jobs, often used for system maintenance or administrative tasks, traditionally rely on specialized scheduling software or cron.
However, these methods can be costly and difficult to maintain. Developers usually create solutions that handle both scheduling and business logic, but this can lead to high resource consumption and requires the entire application to be highly available just to keep the scheduler highly available.
Kubernetes CronJob offers a solution by scheduling Job resources using the cron format, allowing developers to focus on the work rather than scheduling. It’s similar to a Unix crontab line and manages a Job’s temporal aspects. Combined with other Kubernetes features, a CronJob becomes a powerful job-scheduling system, enabling developers to focus on implementing a containerized application responsible for the business logic, with scheduling handled by the platform. However, when implementing a CronJob container, it’s important to consider all corner and failure cases of duplicate runs, no runs, parallel runs, or cancellations.
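A hedged sketch of such a CronJob follows; the name, schedule, and command are illustrative, and concurrencyPolicy: Forbid is one way to guard against the parallel-run corner case:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup       # illustrative name
spec:
  schedule: "0 3 * * *"       # cron format: every day at 03:00
  concurrencyPolicy: Forbid   # skip a run if the previous one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: busybox
            command: ["sh", "-c", "echo cleaning up"]
```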
The Daemon Service pattern in Kubernetes is used to run prioritized, infrastructure-focused Pods on specific nodes. This is primarily used by administrators to run node-specific Pods to enhance the platform’s capabilities. In software systems, a daemon is a long-running, self-recovering program that runs as a background process. In Kubernetes, a similar concept exists in the form of a DaemonSet, which represents Pods that run on cluster nodes and provide background capabilities for the rest of the cluster.
A DaemonSet is similar to ReplicaSet and ReplicationController in that it ensures a certain number of Pods are always running. However, unlike these two, a DaemonSet is not driven by consumer load in deciding how many Pod instances to run and where to run them. Its main purpose is to keep running a single Pod on every node or on specific nodes.
Key differences in how Pods managed by a DaemonSet differ from those managed by a ReplicaSet include:
- By default, a DaemonSet places one Pod instance on every node. This can be controlled and limited to a subset of nodes using the nodeSelector or affinity fields
- A Pod created by a DaemonSet already has nodeName specified, so it doesn’t require the existence of the Kubernetes scheduler to run containers
- Pods created by a DaemonSet can run before the scheduler has started
- The unschedulable field of a node is not respected by the DaemonSet controller
- Pods created by a DaemonSet can have a RestartPolicy only set to Always or left unspecified
- Pods managed by a DaemonSet are supposed to run only on targeted nodes and are treated with higher priority by many controllers
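As an illustration, a minimal DaemonSet limited to a subset of nodes via nodeSelector might look like this (name, labels, and image are placeholders for a real node agent):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector         # illustrative name
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      nodeSelector:           # optional: restrict to labeled nodes
        role: worker          # hypothetical node label
      containers:
      - name: collector
        image: busybox        # a real agent image would go here
        command: ["sh", "-c", "while true; do echo collecting; sleep 60; done"]
```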
DaemonSets in Kubernetes are used to run system-critical Pods on certain nodes in the cluster. From Kubernetes v1.17 onwards, DaemonSet uses the default scheduler for scheduling, improving the overall experience and bringing features like taints, tolerations, Pod priority, and preemption to DaemonSet Pods.
DaemonSet Pods can be accessed in several ways:
- Service: Create a Service with the same Pod selector as a DaemonSet to reach a daemon Pod load-balanced to a random node
- DNS: Create a headless Service with the same Pod selector as a DaemonSet to retrieve multiple A records from DNS containing all Pod IPs and ports
- Node IP with hostPort: Pods in the DaemonSet can specify a hostPort and become reachable via the node IP addresses and the specified port
- External Push: The application in the DaemonSet’s Pods can push data to a well-known location or service that’s external to the Pod
- Static Pods: Managed by the Kubelet only and run on one node only. However, DaemonSets are better integrated with the rest of the platform and are recommended over static Pods
The Singleton Service pattern in Kubernetes is a method that ensures only one instance of an application is active at any given time, while still maintaining high availability. This pattern is particularly useful in scenarios where tasks need to be executed by a single service instance to avoid duplication or to maintain order.
There are two main ways to implement this pattern: out-of-application and in-application locking. Running multiple replicas of the same Pod creates an active-active topology, where all instances of a service are active. For the Singleton Service pattern, however, an active-passive topology is needed, where only one instance is active and all other instances are passive.
ReplicaSets are designed for Pod availability, not for ensuring At-Most-One semantics for Pods. This can lead to multiple copies of a Pod running concurrently in certain failure scenarios. If strong singleton guarantees are needed, consider using
StatefulSets or in-application locking options that provide more control over the leader election process.
In some cases, only a part of a containerized application needs to be a singleton. For instance, an application might have an HTTP endpoint that can be scaled to multiple instances, but also a polling component that must be a singleton. In such situations, either split the singleton component into its own deployment unit or use in-application locking to lock only the component that needs to be a singleton.
In Kubernetes, singleton instances, which are typically unaware of their constraint, can be managed by an external process. This is achieved by starting a single Pod, backed by a controller such as a ReplicaSet, which ensures high availability. The ReplicaSet controller ensures that at least one instance is always running, although occasionally there can be more instances. This mechanism favors availability over consistency, making it suitable for highly available and scalable distributed systems.
However, singletons typically favor consistency over availability. For strict singleton requirements, StatefulSets might be a better choice as they provide stronger singleton guarantees but come with increased complexity. Singleton applications running in Pods on Kubernetes typically open outgoing connections to other systems. Singleton Pods in Kubernetes can accept incoming connections through the Service resource.
Regular Services create a virtual IP and perform load balancing among all matching Pod instances. Singleton Pods managed through a StatefulSet have only one Pod and a stable network identity. It’s recommended to create a headless Service (by setting both type: ClusterIP and clusterIP: None) for singleton Pods. Headless Services don’t have a virtual IP address, kube-proxy doesn’t handle these Services, and no proxying is performed.
Headless Services with selectors create endpoint records in the API Server and generate DNS A records for the matching Pod(s). DNS lookup for the Service returns the IP address(es) of the backing Pod(s), enabling direct access to the singleton Pod via the Service DNS record.
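Such a headless Service for a singleton Pod might be declared like this (the name and selector are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-singleton          # illustrative name
spec:
  type: ClusterIP
  clusterIP: None             # headless: no virtual IP, no kube-proxy involvement
  selector:
    app: my-singleton         # DNS A record resolves directly to the matching Pod IP
  ports:
  - port: 8080
```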
For nonstrict singletons with an at-least-one-instance requirement, defining a ReplicaSet with one replica would suffice. For a strict singleton with an At-Most-One requirement and better performant service discovery, a StatefulSet and a headless Service would be preferred. Using a StatefulSet will favor consistency and ensure there is at most one instance, and occasionally none in some corner cases.
In distributed environments, service instance control is achieved through a distributed lock. When a service instance is activated, it acquires a lock and becomes active. Other instances that fail to acquire the lock wait and keep trying to get the lock in case the active service releases it. This mechanism is used in many distributed frameworks for high availability and resiliency.
For example, Apache ActiveMQ, a message broker, can run in a highly available active-passive topology where the data source provides the shared lock. The first broker instance that starts up acquires the lock and becomes active, while any other subsequently started instances become passive and wait for the lock to be released.
This strategy is similar to a Singleton in object-oriented programming: an object instance stored in a static class variable that doesn’t allow instantiation of multiple instances for the same process. In distributed systems, this means the application itself has to be written in a way that doesn’t allow more than one active instance at a time, regardless of the number of Pod instances that are started.
To achieve this in a distributed environment, we need a distributed lock implementation such as Apache ZooKeeper, HashiCorp’s Consul, Redis, or etcd. For example, ZooKeeper uses ephemeral nodes which exist as long as there is a client session and are deleted as soon as the session ends.
In Kubernetes, instead of managing a ZooKeeper cluster only for the locking feature, it would be better to use etcd capabilities exposed through the Kubernetes API and running on the main nodes. etcd is a distributed key-value store that uses the Raft protocol to maintain its replicated state and provides necessary building blocks for implementing leader election. Kubernetes offers the Lease object for node heartbeats and component-level leader election.
Kubernetes Leases are used in high-availability cluster deployments to ensure that only one control plane component, such as kube-scheduler, is active at a time, with others on standby. Apache Camel’s Kubernetes connector provides leader election and singleton capabilities, using Kubernetes APIs to leverage ConfigMaps as a distributed lock. This ensures that only one Camel route instance is active at a time.
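For reference, a Lease object (the primitive behind such leader election) might look roughly like this; the name and holder identity are illustrative:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: example-leader-lock   # illustrative lock name
  namespace: default
spec:
  holderIdentity: pod-a       # identity of the current leader (placeholder)
  leaseDurationSeconds: 15    # candidates may take over after this expires
```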
Pod Disruption Budget
PodDisruptionBudget (PDB) is a feature in Kubernetes that limits the number of Pods that can be down at the same time during maintenance. It ensures that a specific number or percentage of Pods will not be voluntarily evicted from a node at any given time. This is especially useful for applications that require a minimum number of running replicas at all times, or for critical applications that should maintain a certain percentage of total instances.
For workloads that only have one instance (singleton workloads), setting maxUnavailable to 0 or minAvailable to 100% will prevent any voluntary eviction, effectively making the Pod unevictable. This is beneficial in scenarios where the cluster operator needs to coordinate downtime with the singleton workload owner before evicting a non-highly available Pod.
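A sketch of such a PodDisruptionBudget for a singleton workload (name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: singleton-pdb         # illustrative name
spec:
  maxUnavailable: 0           # no voluntary eviction allowed
  selector:
    matchLabels:
      app: my-singleton       # hypothetical workload label
```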
The Stateless Service pattern is a technique for building applications composed of identical, ephemeral replicas, making them suitable for dynamic cloud environments. These applications can be quickly scaled and made highly available. In a microservices architecture, each service addresses a single concern, owns its data, and has a well-defined deployment boundary. Stateless services do not maintain any internal state across service interactions but store information in external storage such as a database or message queue.
Stateless services consist of identical, replaceable instances that offload state to external permanent storage systems and use load-balancers for distributing incoming requests among themselves. In Kubernetes, the concept of Deployment is used to control how an application should be updated to the next version.
A complex distributed system comprises various services, including stateful services, short-lived jobs, and highly scalable stateless services. Stateless services are ideal for handling short-lived requests as they are composed of identical, swappable, ephemeral, and replaceable instances that can scale rapidly.
Kubernetes provides several primitives to manage such applications. However, Kubernetes doesn’t enforce any direct relationship between these building blocks. It’s the user’s responsibility to combine them to match the application nature. This includes understanding how liveness checks, ReplicaSet, readiness probes, Service definitions, PVCs, and accessMode work together.
ReplicaSet in Kubernetes is a tool that ensures a specified number of identical Pod replicas are running at all times. It can create new Pods as needed to maintain the desired count, and can manage bare Pods (those without an owner reference) that match its label selector. However, this can lead to a ReplicaSet owning a nonidentical set of Pods and terminating existing bare Pods that exceed the declared replica count. To avoid this, it’s advised to ensure bare Pods do not have labels matching a ReplicaSet’s selector. Whether a ReplicaSet is created directly or through a Deployment, the end goal is the same: to create and maintain the desired number of identical Pod replicas. Deployments offer additional benefits such as controlling how replicas are upgraded and rolled back. The replicas are then scheduled to available nodes as per certain policies.
The ReplicaSet’s role is to restart containers if needed and to scale out or in when the number of replicas changes. This allows Deployment and ReplicaSet to automate the lifecycle management of stateless applications.
Stateless applications in Kubernetes can handle new requests by any Pod, and depending on the application’s connection to other systems, a Kubernetes Service may be necessary. These services often use synchronous request/response-driven protocols like HTTP and gRPC. However, since Pod IP addresses change with every restart, it’s more efficient to use a permanent IP address provided by a Kubernetes Service.
A Kubernetes Service provides a fixed IP address that remains constant throughout its lifetime. This ensures that client requests are evenly distributed across instances and are directed to healthy Pods that are ready to accept requests.
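A minimal Service of this kind might be declared as follows; the name, labels, and ports are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app               # illustrative name; also becomes the DNS name
spec:
  selector:
    app: web-app              # matches the Pod labels of the Deployment
  ports:
  - port: 80                  # stable Service port for clients
    targetPort: 8080          # port the container actually listens on
```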
In Kubernetes, Pods can use file storage through volumes, which come in various types and can store state. Of particular interest here is the persistentVolumeClaim volume type, which utilizes manually or dynamically provisioned persistent storage.
A PersistentVolume (PV) is a piece of storage in a Kubernetes cluster that exists independently of any Pod that uses it. A Pod uses a PersistentVolumeClaim (PVC) to request and bind to the PV, which points to the actual durable storage. This indirect connection allows for separation of concerns and decoupling of Pod lifecycle from PV.
In a ReplicaSet, all Pods are identical; they share the same PVC and refer to the same PV. This is different from StatefulSets, where PVCs are created dynamically for each stateful Pod replica. This is one of the key differences between how stateless and stateful workloads are managed in Kubernetes.
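For illustration, a single PVC shared by all replicas of a stateless workload could be declared like this (name and size are placeholders; ReadWriteMany support depends on the storage backend):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data           # illustrative name, referenced by every replica
spec:
  accessModes: ["ReadWriteMany"]   # multiple Pods mount the same volume
  resources:
    requests:
      storage: 1Gi
```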
Stateful applications like Apache ZooKeeper, MongoDB, Redis, or MySQL are unique and long-lived, often serving as the backbone for highly scalable stateless services.
However, they pose challenges when implemented as a distributed service with multiple instances. Kubernetes’ StatefulSets offer a solution for these applications by addressing needs such as persistent storage, networking, identity, and ordinality. This makes stateful applications first-class citizens in the cloud native world. Despite this, many legacy stateful applications are not designed for cloud native platforms.
To tackle this issue, Kubernetes allows users to implement custom controllers and model application resources through custom resource definitions and behavior through operators.
In Kubernetes, stateful applications often require dedicated persistent storage, which is managed through Persistent Volumes (PVs) and Persistent Volume Claims (PVCs).
A StatefulSet creates PVCs dynamically for each Pod during both initial creation and scaling up. This allows each Pod to have its own dedicated PVC, unlike a ReplicaSet, which refers to predefined PVCs. However, StatefulSets do not manage PVs. The storage for Pods must be pre-provisioned by an admin or dynamically provisioned by a PV provisioner based on the requested storage class.
While scaling up a StatefulSet creates new Pods and associated PVCs, scaling down only deletes the Pods and not the PVCs or PVs, to prevent data loss. If the data has been replicated or drained to other instances, you can manually delete the PVC, which then allows for PV recycling.
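A sketch of a StatefulSet with per-Pod storage via volumeClaimTemplates; names, image, and sizes are illustrative:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                    # illustrative name; Pods become db-0, db-1, ...
spec:
  serviceName: db             # governing headless Service for network identity
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD   # required by the postgres image; placeholder value
          value: example
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:       # a dedicated PVC per Pod: data-db-0, data-db-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```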
Using a ReplicaSet instead can lead to shared storage among all Pod instances, creating a single point of failure and potential data corruption during scaling. A workaround is to have a separate ReplicaSet for each instance, but this requires manual labor for scaling up.
In Kubernetes, a StatefulSet is used to create Pods with a stable identity, which is determined by the StatefulSet’s name and an ordinal index. This is particularly useful for stateful applications that often need scalable persistent storage and networking. To cater to this, a headless Service (where clusterIP is set to None) is defined.
Unlike stateless Pods that are created through a ReplicaSet and are identical, stateful Pods are unique and may need to be accessed individually. Stateful applications require a stable network identity because they store configuration details such as hostname and connection details of their peers.
Creating a Service per ReplicaSet with replicas=1 could be a solution, but it requires manual work and doesn’t provide a stable hostname. Therefore, the use of StatefulSets simplifies the management of stateful applications in Kubernetes.
Identity is a fundamental aspect of a StatefulSet: it provides predictable Pod names and identities based on the StatefulSet’s name. This identity is crucial for naming PVCs, reaching specific Pods via headless Services, and more.
The identity of each Pod can be predicted before it’s created, which can be useful for the application. Stateful applications require each instance to have its own long-lasting storage and network identity.
In contrast, a Pod created with a ReplicaSet would have a random name and wouldn’t maintain that identity across a restart.
In a distributed stateful application, each instance is unique and has a fixed position in the collection of instances, which is known as its ordinality.
This ordinality affects the sequence in which instances are scaled up or down. It’s also used for data distribution, access, and determining in-cluster behavior.
The concept of ordinality is particularly significant in StatefulSets during scaling operations.
Stateful applications have specific needs such as stable storage, networking, identity, and ordinality. They may require a certain number of instances to be always available, and some may be sensitive to ordinality or parallel deployments. Some can tolerate duplicate instances, while others cannot.
To cater to these diverse requirements, Kubernetes allows the creation of CustomResourceDefinitions (CRDs) and Operators.
Kubernetes provides the StatefulSet primitive for managing stateful applications. This is contrasted with the ReplicaSet primitive used for running stateless workloads. StatefulSet is likened to managing pets (unique servers requiring individual care), while ReplicaSet is compared to managing cattle (identical, replaceable servers). Essentially, StatefulSet is designed for managing unique Pods, whereas ReplicaSet is for managing identical, replaceable Pods.
The Service Discovery pattern in Kubernetes offers a stable endpoint for service consumers to access service providers, regardless of whether they are within or outside the cluster.
Applications deployed on Kubernetes often interact with other services within the cluster or external systems. These interactions can be initiated internally or externally. Internally initiated interactions are typically performed through a polling consumer, such as an application running within a Pod that connects to a file server, message broker, or database and starts exchanging data.
However, the more common use case for Kubernetes workloads is when we have long-running services expecting external stimulus, most commonly in the form of incoming HTTP connections from other Pods within the cluster or external systems. In these cases, service consumers need a mechanism for discovering Pods that are dynamically placed by the scheduler and sometimes elastically scaled up and down.
Kubernetes implements the Service Discovery pattern through different mechanisms to track, register, and discover endpoints of dynamic Kubernetes Pods. Service discovery from outside the cluster builds on the Service abstraction and focuses on exposing the Services externally. While NodePort provides basic exposure of Services, a highly available setup requires integration with the platform infrastructure provider.
Internal Service Discovery
In Kubernetes, each Pod in a Deployment is assigned a cluster-internal IP address. However, knowing these IP addresses in advance can be challenging for other services within different Pods that want to consume the web application endpoints.
This issue is addressed by the Kubernetes Service resource, which provides a stable entry point for a collection of Pods offering the same functionality.
A Service can be created through kubectl expose, which assigns it a clusterIP that is only accessible from within the Kubernetes cluster. This IP remains unchanged as long as the Service definition exists.
Other applications within the cluster can discover this dynamically allocated clusterIP in a few ways:
- Discovery through environment variables: When Kubernetes starts a Pod, its environment variables get populated with the details of all Services that exist at that moment. The application running that Pod would know the name of the Service it needs to consume and can be coded to read these environment variables
- Discovery through DNS lookup: Kubernetes runs a DNS server that all the Pods are automatically configured to use. When a new Service is created, it automatically gets a new DNS entry that all Pods can start using. The Service can be reached by a fully qualified domain name (FQDN) of the form service-name.namespace.svc.cluster.local
The DNS Discovery mechanism allows all Pods to look up all Services as soon as a Service is defined, overcoming the drawbacks of the environment-variable-based mechanism. However, environment variables may still be needed to look up nonstandard or unknown port numbers.
Manual Service Discovery
In Kubernetes, a Service with a selector keeps track of Pods that are ready to serve, and this list is maintained in the endpoint resources. You can view all endpoints created for a Service using the kubectl get endpoints hostnames command.
If you want to redirect connections to external IP addresses and ports, you can do so by not defining a selector for a Service and manually creating endpoint resources. This type of Service is only accessible within the cluster and can be used through environment variables or DNS lookup. The list of endpoints for this Service is manually maintained and typically points to IP addresses outside the cluster.
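A sketch of such a selector-less Service with manually maintained endpoints; the name, external IP, and port are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-db           # illustrative name; no selector is defined
spec:
  ports:
  - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db           # must match the Service name exactly
subsets:
- addresses:
  - ip: 10.0.0.50             # placeholder address of a provider outside the cluster
  ports:
  - port: 5432
```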
The Service allows you to add or remove selectors and point to either external or internal providers without having to delete the resource definition, which would result in a change of the Service IP address. This means that consumers of the service can continue using the same Service IP address while the actual service provider implementation is being migrated from on-premises to Kubernetes, without any impact on the client.
Service Discovery from Outside the Cluster
In Kubernetes, there are different methods to expose a Service outside of the cluster:
- NodePort Service: This method builds on top of a regular Service with type ClusterIP by also reserving a port on all nodes and forwarding incoming connections to the Service. This makes the Service accessible both internally (through the virtual IP address) and externally (through a dedicated port on every node). However, an external load balancer is still needed for client applications to pick a healthy node
- Service of type LoadBalancer: This exposes the service externally using a cloud provider’s load balancer. When such a Service is created, Kubernetes adds IP addresses to the .status fields, allowing an external client application to connect to the load balancer, which then selects a node and locates the Pod
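As an illustration, a LoadBalancer Service might be declared like this (name, labels, and ports are placeholders; the provisioned external address appears later under .status):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-external          # illustrative name
spec:
  type: LoadBalancer          # asks the cloud provider to provision a load balancer
  selector:
    app: web-app              # hypothetical Pod labels
  ports:
  - port: 80
    targetPort: 8080
```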
However, it’s important to note that load-balancer provisioning and service discovery can vary among cloud providers. Some providers allow you to define the load-balancer address while others do not. Similarly, some offer mechanisms for preserving the source address, while others replace it with the load-balancer address. Therefore, it’s crucial to check the specific implementation provided by your chosen cloud provider.
Application Layer Service Discovery
Kubernetes Ingress is a resource that serves as a smart router and entry point to the cluster, providing HTTP-based access to Services. It enables access through externally reachable URLs, load balancing, TLS termination, and name-based virtual hosting. Its strength lies in its ability to use a single external load balancer and IP to service multiple Services, thereby reducing infrastructure costs.
Despite being the most complex service discovery mechanism on Kubernetes, it is highly useful for exposing multiple services under the same IP address, particularly when all services use the same L7 (typically HTTP) protocol.
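A minimal Ingress sketch routing one path to a backing Service; the host, path, and service name are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress          # illustrative name
spec:
  rules:
  - host: example.com         # placeholder host for name-based virtual hosting
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service # hypothetical backing Service
            port:
              number: 80
```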
The Self Awareness pattern in Kubernetes is beneficial for applications that require runtime information such as the Pod name, Pod IP address, and the hostname. This information, along with other static or dynamic data defined at the Pod level, can be obtained through the downward API in Kubernetes.
The downward API allows metadata about the Pod to be passed to the containers and the cluster via environment variables and files. This metadata is injected into your Pod and made available locally, eliminating the need for the application to interact with the Kubernetes API, thus keeping it Kubernetes-agnostic.
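A sketch of the downward API in use, injecting the Pod name and IP as environment variables (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: self-aware            # illustrative name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo $POD_NAME $POD_IP && sleep 3600"]
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name   # injected by the kubelet, no API call needed
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
```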
However, the downward API has a limitation in that it offers a limited number of keys that can be referenced. If an application requires more data, particularly about other resources or cluster-related metadata, it must query the API Server. This method is commonly used by applications to discover other Pods in the same namespace with certain labels or annotations. The application can then form a cluster with the discovered Pods and synchronize state.
Monitoring applications also use this technique to discover Pods of interest and start instrumenting them. Numerous client libraries are available in different languages to interact with the Kubernetes API Server to obtain more self-referring information that goes beyond what the downward API provides.
Behavioral patterns in Kubernetes provide a powerful framework for designing cloud-native applications that are scalable, reliable, and maintainable. These patterns offer reusable elements and design principles that streamline the development process and promote best practices in distributed systems.