Building Scalable Applications with Kubernetes: The Role of Storage and Volumes
Building scalable applications with Kubernetes requires careful consideration of various components, including storage and volumes.
Storage in Kubernetes refers to the persistent storage of data, which means that even if a pod crashes or is recreated, the stored data remains available. Volumes are the mechanism through which that storage is mounted into pods and exposed to their containers.
Container Storage Interface (CSI)
CSI is a standard that defines an interface between container orchestrators (such as Kubernetes) and storage vendors (such as AWS, Google, or Dell). The goal of CSI is to enable consistent and portable storage provisioning, management, and consumption across different container platforms and storage systems.
Before CSI, each storage vendor had to write a custom plugin for each container orchestrator they wanted to support. This resulted in a lot of duplication, inconsistency, and compatibility issues. Moreover, the storage plugins had to be shipped with the container orchestrator itself, which made it hard to update them independently.
CSI solves these problems by decoupling the storage logic from the container orchestrator. With CSI, storage vendors only need to write one plugin that implements the CSI interface, and container orchestrators only need to implement one generic driver that can communicate with any CSI plugin. This way, storage vendors can update their plugins without affecting the container orchestrator, and container orchestrators can support any storage system that has a CSI plugin.
How does CSI work?
CSI defines a set of RPC (remote procedure call) operations that a container orchestrator can invoke on a storage plugin. These operations cover the lifecycle of a volume, such as creating, attaching, detaching, mounting, unmounting, deleting, snapshotting, cloning, resizing, and so on. The container orchestrator and the storage plugin communicate via gRPC, a high-performance RPC framework.
To use CSI in Kubernetes, a driver deployment has two main components:
- CSI controller plugin: implements the cluster-side operations (provisioning, attaching, snapshotting) and communicates with the storage system
- CSI node plugin: performs the local operations on each node, such as mounting and unmounting volumes
Both are usually built from the same driver binary and deployed as separate workloads, typically a Deployment for the controller plugin and a DaemonSet for the node plugin.
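As a point of reference, a deployed driver advertises itself to the cluster through a CSIDriver object. A minimal sketch follows; the driver name ebs.csi.aws.com is only an example, and real drivers ship this object as part of their installation manifests:

```yaml
# Minimal CSIDriver registration object (example driver name; real drivers
# ship this as part of their install manifests)
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com     # must match the name the plugin reports over gRPC
spec:
  attachRequired: true      # Kubernetes must attach volumes before they can be mounted
  podInfoOnMount: false     # the node plugin does not need pod metadata at mount time
```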
To provision a volume using CSI in Kubernetes, you create two resources:
- StorageClass: defines the type of storage you want to use, such as SSD or HDD, and any other parameters that are specific to the storage system
- PersistentVolumeClaim (PVC): requests a specific amount of storage from the StorageClass. When you create a PVC, Kubernetes will dynamically create a PersistentVolume (PV) that matches your request and bind it to your PVC
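As a concrete sketch, here is a StorageClass and a PVC that requests storage from it. The provisioner name csi.example.com and the type parameter are placeholders; the real values depend on the CSI driver you use:

```yaml
# StorageClass backed by a hypothetical CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: csi.example.com            # replace with your CSI driver's name
parameters:
  type: ssd                             # driver-specific parameter
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer # provision only once a pod is scheduled
---
# PVC requesting 10Gi from that class; Kubernetes provisions and binds a matching PV
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
```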
To consume a volume using CSI in Kubernetes, you need to create a Pod that references your PVC. When you create a Pod, Kubernetes will attach the PV to the node where the Pod is scheduled, mount the volume to the Pod’s container(s), and make it available for your application. When you delete a Pod, Kubernetes will unmount the volume from the Pod’s container(s), detach the PV from the node, and release the PVC for reuse or deletion.
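A Pod that consumes the claim from the previous example might look like the following; the image and mount path are arbitrary:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx:1.25               # example image
      volumeMounts:
        - name: data
          mountPath: /var/lib/data    # where the volume appears inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-claim         # the PVC created earlier
```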
Cluster-Scoped Volumes
Pod-Scoped Volumes
StatefulSet
Volume Plugins
Storage Encryption
Storage encryption ensures that sensitive data stored within Kubernetes clusters remains secure by encrypting it at rest and in transit.
Encrypted Volumes
Regular volumes with an additional layer of encryption. The encryption occurs during volume creation using a Key Management Service (KMS) or a secret. When a pod uses an encrypted volume, its contents remain encrypted at rest and in transit between the pod and the persistent volume.
Types of encrypted volumes:
- FlexVolume: a legacy out-of-tree plugin mechanism, now deprecated in favor of CSI, whose drivers can use a KMS to create an encrypted volume and store the encryption key securely
- Container Storage Interface (CSI): leverages the encryption capabilities provided by CSI drivers
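For example, the AWS EBS CSI driver lets you request encryption through StorageClass parameters. The parameter names below are specific to that driver and the KMS key ARN is a placeholder, so check your own driver's documentation for the equivalent options:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"      # the driver encrypts each provisioned volume at rest
  kmsKeyId: arn:aws:kms:us-east-1:111122223333:key/example-key-id   # placeholder; omit to use the default key
```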
Encrypted Persistent Volumes
Persistent volumes can be encrypted using the same methods as encrypted volumes. Because they are usually created dynamically from a PVC, encryption is requested when creating the PVC (or configured on its StorageClass) rather than on the PV itself. This approach gives more flexibility over which type of storage backs the encrypted data.
Encrypted ConfigMaps
ConfigMaps store configuration data as plain text, and Kubernetes does not encrypt them by default; genuinely sensitive values belong in Secrets. Both ConfigMaps and Secrets can, however, be covered by the API server's encryption-at-rest feature: the data is encrypted when it is written to etcd (disk) and decrypted transparently when it is read back through the API.
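A minimal encryption-at-rest configuration, passed to the API server with the --encryption-provider-config flag, might look like the sketch below. The key shown is a placeholder; generate your own random 32-byte, base64-encoded key:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - configmaps
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: c2VjcmV0LWtleS1wbGFjZWhvbGRlci0zMi1ieXRlcw==   # placeholder, not a real key
      - identity: {}    # fallback so data written before encryption was enabled can still be read
```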
Storage Snapshots
Allows taking point-in-time backups of Persistent Volumes (PVs). Snapshots capture the state of a PV at a particular moment, enabling easy recovery in case of data loss or corruption. The Kubernetes snapshot mechanism (the VolumeSnapshot API, served by a CSI driver that supports snapshots) creates a read-only copy of the PV's data without affecting the original.
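A snapshot is requested declaratively, much like a PVC. A minimal sketch, assuming your CSI driver supports snapshots and a VolumeSnapshotClass named csi-snapclass exists:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-claim-snap
spec:
  volumeSnapshotClassName: csi-snapclass    # configured for your CSI driver
  source:
    persistentVolumeClaimName: data-claim   # the PVC to snapshot
```

To restore, create a new PVC whose spec.dataSource references the VolumeSnapshot.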
Storage Migration
Storage migration refers to moving data between different storage systems or formats within a Kubernetes cluster. It helps optimize resource utilization, consolidate storage resources, or migrate data to newer, faster storage technologies.
Kubernetes offers various tools and techniques for executing storage migrations efficiently:
- StorageClass: describes a class of storage, including its provisioner and driver-specific performance parameters. Administrators can create multiple StorageClasses representing different storage systems or tiers and update them as necessary
- Pod migration: during a storage migration, running pods may need to move to nodes that can reach the target storage. Kubernetes provides various strategies for this, such as recreate, rolling updates, and drain. The choice of strategy depends on factors like downtime tolerance, application affinity, and availability requirements (see the sketch after this list)
- Volume attachment/detachment: Kubernetes supports online volume attachment and detachment, allowing volumes to be migrated without disrupting running applications
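When the drain strategy is used, a PodDisruptionBudget is one way to express the availability requirement mentioned above. The sketch below uses placeholder names and keeps at least two replicas of an application running while nodes are drained:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # evictions are blocked if they would drop below this count
  selector:
    matchLabels:
      app: my-app        # placeholder label
```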
Multi-Node Storage
Multi-node storage enables sharing storage resources across multiple worker nodes within a Kubernetes cluster. It facilitates load balancing, increases scalability, and enhances high availability for mission-critical applications.
Shared Disk
In shared disk architecture, all worker nodes connect to a centralized storage array via a SAN (Storage Area Network). Each node can read and write data directly from the shared disk, eliminating the need for data replication. While providing excellent performance and low latency, shared disk setups require careful planning around issues like SCSI reserve/release and LUN masking.
Shared Nothing
The shared-nothing model distributes storage across multiple nodes, each with its own local storage. Data is replicated between nodes to maintain consistency. Shared-nothing architectures avoid a single point of failure and scale out horizontally more easily. Popular distributed databases like MySQL Galera Cluster, as well as many PostgreSQL deployments, often use this approach.
Network Attached Storage (NAS)
Dedicated storage hardware or software connected to a network, allowing clients (including Kubernetes nodes) to mount shares and access files over standard networking protocols (e.g., NFS, SMB, Gluster). NAS devices typically handle file serving, backup, archiving, and user home directories. Within Kubernetes, NAS integration usually occurs through a Persistent Volume (PV) provisioned using a NAS plugin.
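For example, an NFS export on a NAS device can be offered to the cluster as a statically provisioned PV; the server address and export path below are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-share
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany                 # NFS allows many nodes to mount the same share
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nas.example.internal    # placeholder NAS address
    path: /exports/shared           # placeholder export path
```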
Container-Attached Storage (CAS)
Storage solution tightly integrated with containers, offering improved performance and reduced overhead compared to traditional network-attached storage approaches. CAS plugins expose block devices or filesystems directly to containers, skipping the need for a separate network hop.
Common CAS implementations in Kubernetes include:
- EmptyDir: simple, ephemeral storage tied to the pod's lifetime and backed by the node the pod runs on. Suitable for scratch data shared between containers in a pod, but its contents are lost when the pod terminates (see the sketch after this list)
- GCEPersistentDisk: volume type backed by Google Cloud Platform's Persistent Disks. Offers highly available, durable block storage residing outside the node's lifecycle
- Amazon Web Services’ Elastic Block Store (EBS): provides persistent, highly available block-level storage volumes for EC2 instances
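For illustration, here is an emptyDir volume shared between two containers in the same pod; the images and commands are arbitrary:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "echo hello > /scratch/msg && sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
    - name: reader
      image: busybox:1.36
      command: ["sh", "-c", "sleep 5; cat /scratch/msg; sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}    # node-local scratch space, deleted together with the pod
```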
Hybrid Storage Solutions
Hybrid storage combines multiple storage types (SSD, HDD, flash) or tiers (cache, capacity) into a unified storage system. This approach seeks to balance performance, cost, and capacity considerations in modern data centers. Kubernetes supports various hybrid storage configurations through its diverse storage plugins and orchestration capabilities.
Some examples include:
- Tiered storage: combine multiple storage media with differing performance characteristics (e.g., fast SSDs for hot data, slower HDDs for cold data). Kubernetes administrators can create tiered storage classes, mapping each tier to appropriate storage media, as sketched after this list
- Cache-and-capacity separation: decouple caching (fast, lower capacity) and capacity (larger, slower) storage needs. Frequently accessed data stays cached near the application, while less frequently accessed data resides on cheaper, higher-capacity storage
- All-flash arrays: use solely flash-based storage systems for performance-sensitive workloads. Popular all-flash array vendors such as Pure Storage, NetApp, and Dell Technologies provide CSI drivers and other Kubernetes integrations for their arrays
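A tiered setup can be expressed as separate StorageClasses mapped to different media. The sketch below uses a placeholder provisioner name and driver parameters; real parameter names vary by driver:

```yaml
# Hot tier: SSD-backed class for latency-sensitive data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hot-ssd
provisioner: csi.example.com    # placeholder CSI driver
parameters:
  type: ssd
---
# Cold tier: HDD-backed class for infrequently accessed data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cold-hdd
provisioner: csi.example.com
parameters:
  type: hdd
```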
Conclusion
Effectively managing storage is critical for any successful Kubernetes deployment. Understanding how Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) fit together is key to selecting the right tool for your storage needs: PVs represent the actual storage resources in the cluster, while PVCs give applications a flexible, portable way to request them. Follow best practices such as preferring dynamic provisioning over hand-crafted PVs, planning capacity ahead, monitoring usage, and labeling resources properly to keep your storage strategy manageable. With these concepts mastered, you'll be well equipped to tackle complex storage challenges in your Kubernetes environments.