Kafka’s High-Level Architecture: Simplifying Data Streams at Scale

Roman Glushach
8 min read · Jun 30, 2023


Kafka Architecture

Kafka is an open-source platform that was originally developed at LinkedIn in 2010 and later donated to the Apache Software Foundation. It is designed to provide a scalable, reliable, and efficient way of processing and delivering data streams in real time.

Kafka is a distributed streaming platform that lets you publish and subscribe to streams of events, store them in a fault-tolerant way, and process them as they occur. It is horizontally scalable: you can add brokers to the cluster to increase capacity without downtime. It is also highly available: it can tolerate the failure of some nodes without losing data. Kafka is widely used by companies such as Netflix, LinkedIn, Uber, and Spotify for use cases such as messaging, analytics, event sourcing, and stream processing.

Kafka can be used for various purposes, such as:

  • Messaging: Kafka can act as a message broker that decouples producers and consumers of data, allowing them to communicate asynchronously and reliably
  • Data integration: Kafka can connect different sources and sinks of data, such as databases, applications, services, and systems, and enable data transformation and enrichment along the way
  • Data processing: Kafka can support complex data processing pipelines that involve filtering, aggregation, transformation, and analysis of data streams using frameworks like Spark, Flink, or Kafka Streams
  • Data storage: Kafka can store data streams in a durable and fault-tolerant manner, using a log-based data structure that preserves the order and completeness of the data

Architecture

Kafka High-Level Architecture Overview

Kafka’s architecture can be divided into four main layers.

Core layer

The core layer is the heart of Kafka. It consists of four main components: brokers, topics, producers, and consumers.

  • Brokers are the servers that store and serve the events in Kafka topics. Brokers are also responsible for managing the replication of partitions across brokers for fault tolerance. Each broker has a unique identifier and can host one or more partitions from one or more topics. Cluster metadata and configuration are maintained in ZooKeeper, a separate distributed coordination service (newer Kafka versions replace ZooKeeper with KRaft, a built-in Raft-based controller quorum). ZooKeeper also elects one broker as the controller, which handles administrative tasks such as creating or deleting topics, adding or removing brokers, assigning partitions to brokers, and monitoring broker health
  • Topics are the logical names that represent streams of events in Kafka. Topics are divided into one or more partitions, which are distributed across different brokers for scalability and reliability. Each partition has one leader replica and zero or more follower replicas. The leader replica is responsible for handling all read and write requests from producers and consumers for that partition, while the follower replicas replicate the data from the leader replica for fault-tolerance. Topics can have different configurations depending on their requirements. For example, they can have different retention policies that determine how long the events are stored in Kafka before they are deleted or compacted. They can also have different replication factors that determine how many replicas each partition has for fault-tolerance
  • Producers are the applications that publish events to Kafka topics. An event is a record of something that happened, such as a user clicking a button, a sensor reading a temperature, or a transaction being completed. An event has a key, a value, and a timestamp; producers set the key and the value, while the timestamp is typically assigned automatically at send time. Producers also choose which topic, and optionally which partition, to send each event to, using different partitioning strategies: for example, a round-robin approach to spread load evenly across partitions, or a hash of the event key so that events with the same key always go to the same partition, which preserves their order within that partition (a minimal producer sketch follows this list)
  • Consumers are the applications that subscribe to Kafka topics and consume the events from them. Consumers can belong to one or more consumer groups, which are logical names that represent a set of consumers that work together to consume the same topic. Each consumer group has a unique identifier and maintains its own offset for each partition. An offset is a numerical value that indicates the position of the next event to be consumed from a partition. Consumers can commit their offsets to Kafka periodically to keep track of their progress and resume from where they left off in case of failures or restarts. Consumers can use different strategies to assign partitions to themselves within a consumer group. For example, they can use a range assignment algorithm to assign partitions based on their lexicographic order, or they can use a sticky assignment algorithm to assign partitions based on their previous assignments and minimize rebalancing. Rebalancing is the process of redistributing partitions among consumers when there are changes in the consumer group membership or in the topic metadata
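
To make the keyed-partitioning behavior above concrete, here is a minimal producer sketch using Kafka’s Java client. The broker address, topic name (orders), and the key and value contents are illustrative assumptions, not anything prescribed by the article or by Kafka itself.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge before treating the write as successful
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Events with the same key ("customer-42") are hashed to the same partition,
            // which preserves their relative order
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "{\"item\":\"book\",\"qty\":1}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored at partition %d, offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```

Because the record is keyed by customer id, every event for that customer lands in the same partition and therefore stays in order relative to that customer’s other events.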

Storage layer

The storage layer is responsible for storing and retrieving the data in Kafka.

  • Log is a persistent, append-only data structure that stores the messages for each partition. Each message has an offset, a unique and monotonically increasing identifier within its partition. The messages are stored in files on disk, called segments, which are rolled over periodically (or once they reach a configured size) to avoid overly large files. Message batches can also be compressed to save disk space and improve throughput (a topic-configuration sketch follows this list)
  • Index is a data structure that maps the offsets to the physical positions of the messages in the log files. It allows Kafka to quickly locate and fetch the messages by offset. The index is stored in memory-mapped files on disk, which leverage the operating system’s cache mechanism for fast access
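
As a rough illustration of how the log settings mentioned above are controlled per topic, the following sketch creates a topic with explicit retention and segment settings using the Java AdminClient. The topic name, partition count, replication factor, and config values are assumptions chosen for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateLogConfiguredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3 (illustrative values)
            NewTopic topic = new NewTopic("clickstream", 6, (short) 3);
            topic.configs(Map.of(
                    "retention.ms", "604800000",   // keep events for 7 days
                    "segment.bytes", "1073741824", // roll a new log segment at 1 GiB
                    "cleanup.policy", "delete"     // delete old segments (vs. "compact")
            ));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```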

Processing layer

The processing layer is responsible for transforming and enriching the data in Kafka.

  • Streams are unbounded and continuous sequences of data records that represent events happening over time. They can be processed by applying various operations, such as filtering, mapping, aggregating, joining, windowing, etc. The processing logic can be expressed using either a high-level DSL (domain-specific language) or a low-level processor API
  • Tables are bounded and mutable collections of data records that represent the latest state of entities or aggregates. They can be derived from streams by applying stateful operations, such as aggregations or joins with other tables, and they can be queried through Kafka Streams’ interactive queries (or pull and push queries in ksqlDB); the sketch below derives a table from a stream
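
The stream/table duality described above can be sketched with Kafka Streams’ high-level DSL. The topic names, application id, and the page-view counting logic are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counts");  // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream: the unbounded flow of raw page-view events, keyed by user id
        KStream<String, String> pageViews = builder.stream("page-views");

        // Table: the continuously updated count of views per user,
        // derived from the stream by a stateful aggregation
        KTable<String, Long> viewsPerUser = pageViews
                .filter((userId, page) -> page != null)
                .groupByKey()
                .count();

        // Write the table's changelog back to a topic
        viewsPerUser.toStream().to("views-per-user",
                Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The KStream models the unbounded stream of page views, while the KTable produced by count() holds the latest count per user and is updated as new events arrive.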

Connectors layer

The connectors layer is responsible for integrating Kafka with external systems and applications.

  • Source connectors are applications that read data from external sources and write it to Kafka topics
  • Sink connectors are applications that read data from Kafka topics and write it to external destinations

Both can integrate with a wide variety of external systems, such as databases, files, and APIs.

Connectors can be configured and managed through Kafka Connect’s REST API (or third-party graphical tools). They also inherit Kafka’s scalability and fault tolerance, and can provide exactly-once semantics where the connector supports it.

How does Kafka work?

Now that you have a basic understanding of the Kafka components, let’s see how they work together to provide a distributed streaming platform.

  • When a producer wants to send an event to a topic, it first connects to a broker and asks for the metadata of the topic. The metadata describes the topic’s partitions, including their leader and follower brokers and their availability. The producer then chooses a partition based on some partitioning strategy (such as round-robin, hash-based, or custom) and sends the event to that partition’s leader broker, which appends it to the end of the partition and assigns it an offset (a unique, monotonically increasing identifier). The leader replicates the event to the partition’s follower brokers, which append it to their copies. Depending on the producer’s acks setting, the producer receives an acknowledgment once the event is written to the leader (acks=1) or to all in-sync replicas (acks=all).
  • When a consumer wants to read events from a topic, it connects to a broker and subscribes to the topic as part of a consumer group (identified by its group.id), a logical grouping of consumers that share the same subscription. The group coordinator, one of the brokers, assigns each partition of the topic to exactly one consumer in the group, based on an assignment strategy (such as range or round-robin). The consumer then fetches events for its assigned partitions from the partition leaders. It also maintains its current offset for each partition, which indicates how far it has consumed, and periodically commits its offsets so it can resume from where it left off after failures or rebalances (a consumer sketch follows this list).
  • When a broker fails or recovers, Kafka elects new partition leaders from the remaining in-sync replicas so the data stays available; when brokers are added or removed, partitions and their replicas can be reassigned across the cluster to keep it balanced. Separately, when consumers join or leave a consumer group, or the topic metadata changes, Kafka triggers a consumer rebalance that redistributes partitions among the group’s members. Together, these mechanisms keep the cluster balanced and resilient and let consumers keep consuming with minimal interruption.
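
Tying the consumer-side flow together, here is a minimal consumer sketch with Kafka’s Java client that joins a group, polls its assigned partitions, and commits offsets explicitly. The broker address, group id, and topic name are illustrative assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "order-processors");         // illustrative consumer group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");           // commit offsets explicitly below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Joining the group triggers partition assignment (and rebalances as members change)
            consumer.subscribe(List.of("orders"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                // Commit the consumed offsets so the group can resume here after a restart
                consumer.commitSync();
            }
        }
    }
}
```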

Benefits of Kafka

  • High throughput: Kafka can handle millions of events per second, with low latency and high efficiency
  • Scalability: Kafka can scale horizontally by adding more brokers to the cluster, without requiring any downtime or data loss
  • Reliability: Kafka can tolerate failures of some brokers or partitions, without compromising the availability or consistency of the data
  • Durability: Kafka can store events on disk for a configurable retention period, which can be hours, days, weeks, or even forever. This allows you to replay or reprocess the events if needed
  • Flexibility: Kafka can support various types of events, such as logs, metrics, transactions, notifications, etc. It offers client libraries for many languages, such as Java, Python, and Ruby, and can connect with other systems, such as databases, Hadoop, and Spark, using connectors or APIs
  • Real-time: Kafka can enable real-time processing of events, such as streaming analytics, anomaly detection, fraud prevention, etc.

Conclusion

Kafka’s architecture makes it a powerful and flexible platform for handling virtually any kind of data stream in real time. By understanding its components and best practices, you can leverage its capabilities to build scalable, reliable, and efficient data pipelines for your use cases.
