How to Design and Implement Scalable Kafka Architecture for High-Performance Streaming Applications

Roman Glushach
10 min read · Jun 27, 2023


Kafka Architecture

Kafka is a distributed, scalable, elastic, and fault-tolerant event-streaming platform that enables you to process large volumes of data in real time. Kafka Streams is a library that simplifies application development by building on the Kafka producer and consumer libraries and leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity.

Architecture

Kafka architecture consists of two main layers:

  • Storage layer: responsible for storing and replicating the events that are produced and consumed by applications. The storage layer is based on the abstraction of a distributed commit log, a data structure that stores a sequence of records in a persistent and fault-tolerant way. A record in Kafka is a key-value pair that also carries a timestamp and optional headers. A record can represent any type of event, such as a customer order, a payment, a click on a website, or a sensor reading
  • Compute layer: responsible for processing and transforming events as they flow into and out of the system. The compute layer consists of four core components: the producer API, the consumer API, the streams API, and the connector API. These components allow you to interact with the storage layer and perform various operations on the events

A Kafka topic is a logical name for a stream of records that share a common theme or category. For example, you can have a topic named “orders” that contains all the events related to orders in your system. A topic can be divided into multiple partitions, which are the units of parallelism and scalability in Kafka. Each partition is an ordered, immutable sequence of records whose replicas are placed on one or more brokers (servers) in the cluster. A partition has a leader broker and zero or more follower brokers that replicate the data from the leader for fault tolerance.

The producer API allows you to write events to one or more topics in Kafka. You can specify various parameters such as the key, value, partition, timestamp, headers, compression type, and acknowledgment level of each event. The producer API also handles batching, buffering, serialization, error handling, and load balancing of the events.
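
As a minimal sketch of the producer API (the broker address, topic name, and record contents are placeholders, and error handling is reduced to a callback), a producer that writes a single order event might look like this:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // acknowledgment level
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // compression type

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42") determines which partition the record lands on.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "{\"orderId\":1001,\"amount\":25.0}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // error handling hook
                } else {
                    System.out.printf("Wrote to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```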

The consumer API allows you to read events from one or more topics in Kafka. You can specify various parameters such as the group ID, offset, partition assignment strategy, deserialization type, poll interval, and commit behavior of each consumer. The consumer API also handles rebalancing, offset management, error handling, and concurrency control of the events.
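
A correspondingly minimal consumer sketch, assuming the same placeholder topic and a hypothetical “order-processors” group, could look like this; it polls in a loop and commits offsets manually after processing:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");          // consumer group ID
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");         // where to start when no offset exists
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // commit offsets manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                            record.key(), record.value(), record.partition(), record.offset());
                }
                consumer.commitSync(); // commit after processing for at-least-once semantics
            }
        }
    }
}
```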

The streams API allows you to perform stateful stream processing on the events in Kafka. You can define complex transformations, aggregations, joins, windows, state stores, and interactive queries on the events using a Java library that is built on top of the producer and consumer APIs. The streams API also handles scalability, fault tolerance, exactly-once semantics, and application reset of the stream processing applications.
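
As an illustrative skeleton of a Streams application (the application ID, broker address, and topic names are placeholders), the following filters records from one topic into another:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrderFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app");   // also used as the consumer group ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        // Keep only non-empty events and forward them to a downstream topic.
        orders.filter((key, value) -> value != null && !value.isEmpty())
              .to("validated-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close)); // clean shutdown
    }
}
```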

The connector API allows you to integrate Kafka with external systems such as databases, cloud services, or legacy systems. You can use source connectors to ingest data from external systems into Kafka topics, or sink connectors to export data from Kafka topics to external systems. The connector API also handles configuration management, offset tracking, and distribution of connector tasks across workers.

Streams

Kafka Streams is an abstraction over producers and consumers that lets you focus on processing your Kafka data without worrying about low-level details. You can write your code in Java or Scala, create a JAR file, and then start your standalone application that streams records to and from Kafka topics.

Kafka Streams provides two main abstractions for processing data: KStream and KTable. A KStream represents an unbounded stream of records that can be transformed with stateless operations such as mapping and filtering, as well as stateful operations such as joining, windowing, and aggregating. A KTable represents a changelog stream in which each record is treated as an upsert for its key; aggregations such as counts and grouped reductions produce KTables, which can be queried by key and joined with other tables or streams.
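
To make the distinction concrete, here is a small sketch that derives a KTable of per-customer order counts from a KStream of order events; the topic and store names are illustrative:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // KStream: every record is an independent event.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // KTable: the latest count per customer key, backed by a queryable state store.
        KTable<String, Long> ordersPerCustomer = orders
                .groupByKey()
                .count(Materialized.as("orders-per-customer"));

        // Emit the changelog of counts to an output topic.
        ordersPerCustomer.toStream()
                .to("orders-per-customer-topic", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```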

Kafka Streams also supports interactive queries, which allow you to query the state of your application from external clients or services. You can use interactive queries to expose the latest values of your KTables or the contents of your state stores.
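
Building on the hypothetical “orders-per-customer” store from the previous sketch, an interactive query against the local state might look like this (in a multi-instance deployment the key may live on another instance, which you would locate via the application's metadata APIs):

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderCountQuery {
    // Looks up the latest count for one customer from the local state store.
    public static Long countForCustomer(KafkaStreams streams, String customerId) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "orders-per-customer",                    // store name from the topology above
                        QueryableStoreTypes.keyValueStore()));
        return store.get(customerId); // null if this instance holds no data for the key
    }
}
```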

Best practices for designing scalable Kafka architecture

Main aspects to consider

  • Partitioning: Partitioning is the key to achieving scalability and parallelism in Kafka. A topic is divided into one or more partitions, which are distributed across multiple brokers. Each partition has a leader and zero or more followers that replicate the data for fault tolerance. Producers write records to partitions based on the record key or a custom partitioner. Consumers read records from partitions in parallel by forming consumer groups and assigning partitions to group members
  • Replication: Replication is the mechanism that ensures data availability and durability in Kafka. Each partition can have a replication factor that specifies how many copies of the data are maintained on different brokers. The leader of a partition handles all read and write requests, while the followers passively replicate the data from the leader. If the leader fails, one of the followers will be elected as the new leader automatically
  • Serialization: Serialization is the process of converting data from one format to another for transmission or storage. Kafka uses byte arrays as the data type for records, so you need to serialize your data before sending it to Kafka and deserialize it after receiving it from Kafka. Kafka provides built-in serializers and deserializers for common data types such as strings, integers, bytes, etc. You can also use custom serializers and deserializers for complex data types such as JSON, Avro, Protobuf, etc.
  • Compression: Compression is the technique of reducing the size of data by removing redundancy or using encoding schemes. Compression can improve the performance and efficiency of Kafka by reducing network bandwidth and disk space usage, at the cost of some additional CPU for compressing and decompressing. Kafka supports several compression codecs such as gzip, snappy, lz4, and zstd. You can configure compression at the producer level or at the topic level
  • Retention: Retention is the policy that determines how long data is kept in Kafka before being deleted or compacted. Retention can be based on time or size limits. Time-based retention deletes records that are older than a specified duration. Size-based retention deletes records when the total size of a partition exceeds a specified limit. You can also use log compaction to retain only the latest value for each record key. A sketch that applies several of these settings when creating a topic follows this list
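
As a sketch that ties these aspects together (the topic name, partition count, and retention values are illustrative, not recommendations), a topic can be created programmatically with the AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class CreateOrdersTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic orders = new NewTopic("orders", 12, (short) 3)   // 12 partitions, replication factor 3
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, "604800000",   // time-based retention: 7 days
                            TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4",     // topic-level compression
                            TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")); // replicas required to acknowledge a write
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```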

Best practices

  • Meaningful record keys: Record keys are important for determining how data is partitioned and processed in Kafka. You should use meaningful record keys that reflect the semantics of your data and your processing logic. For example, if you want to count the number of orders per customer, you should use customer ID as the record key. Avoid using null or random keys as they may cause uneven data distribution or unnecessary repartitioning
  • Avoid creating too many or too few topics: Too many topics can increase the overhead of managing metadata and broker connections. Too few topics can result in data skew and contention among consumers. A good rule of thumb is to have one topic per logical data type or domain
  • Choose batch and fetch sizes that balance throughput and latency: Larger producer batches increase throughput by allowing more records to be sent and compressed together, but they add latency while a batch fills. Smaller batches deliver records to consumers faster, at the cost of batching efficiency. Tune the producer's batch.size and linger.ms (and the consumer's fetch settings) against your latency budget rather than relying on a single fixed value
  • Choose a number of partitions that matches the level of parallelism you need: The number of partitions determines how many consumers can consume data from a topic in parallel. If you have more partitions than consumers, some partitions will be idle. If you have more consumers than partitions, some consumers will be starved. A good rule of thumb is to have at least as many partitions as the maximum number of expected consumers
  • Choose a replication factor that meets your availability and durability requirements: The replication factor determines how many copies of each partition are maintained on different brokers. A higher replication factor can improve availability by allowing consumers to switch to another replica if one fails. It can also improve durability by reducing the risk of data loss if a broker crashes. However, a higher replication factor can also increase disk space usage and network traffic. A good rule of thumb is to have a replication factor between 2 and 4.
  • Choose appropriate state stores: State stores are local databases that store the state of your streaming application, such as KTables or custom aggregations. Kafka Streams provides two types of state stores: RocksDB and in-memory. RocksDB is a persistent key-value store that supports fast and efficient access to large amounts of data. In-memory state stores are faster but less durable and more memory-intensive. You should choose the state store type that suits your performance and reliability requirements
  • Tune your configuration parameters: Kafka Streams exposes many configuration parameters that allow you to customize the behavior and performance of your streaming application. You should tune these parameters according to your use case and environment. For example, you can adjust the buffer sizes, batch sizes, commit intervals, cache sizes, poll timeouts, etc. to optimize the throughput and latency of your application. You can also enable metrics and logging to monitor and troubleshoot your application. A configuration sketch follows this list
  • Test and benchmark your application: Testing and benchmarking are essential steps for ensuring the quality and performance of your streaming application. You should test your application with realistic data and load scenarios, and measure the key metrics such as throughput, latency, resource utilization, error rate, etc. You should also compare the results with your expectations and requirements, and identify any bottlenecks or issues that need to be resolved
  • Secure your Kafka cluster and data: You should enable authentication, authorization, and encryption for your Kafka cluster, using mechanisms such as SSL, SASL, or Kerberos. You should also protect your data from unauthorized access, tampering, or loss, using features such as ACLs, quotas, replication, and backups
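
As an example of the kind of tuning described above (the values are illustrative starting points rather than recommendations, and the configuration names assume a recent Kafka Streams version), a Streams configuration might look like this:

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class TunedStreamsConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-analytics");      // placeholder application ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker address
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);                  // parallelism within one instance
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);               // how often offsets and state are committed
        props.put(StreamsConfig.STATESTORE_CACHE_MAX_BYTES_CONFIG, 32 * 1024 * 1024); // record cache in front of state stores
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);                // warm copies of state for faster failover
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // exactly-once semantics
        return props;
    }
}
```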

Common problems and solutions

Data Duplication

Kafka provides at-least-once delivery by default, so retries can produce duplicate records. Enable the idempotent producer so the broker de-duplicates retried sends, and use exactly-once processing guarantees in Kafka Streams where duplicates are unacceptable. When integrating Kafka with other systems, use Kafka Connect, whose offset tracking helps avoid re-ingesting or re-exporting the same data.
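
A minimal sketch of the producer-side settings involved (to be merged with the serializer and other settings shown earlier):

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class IdempotentProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // broker de-duplicates retried sends
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // required for idempotence
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // safe to retry because duplicates are filtered
        return props;
    }
}
```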

Data loss or cluster failure

You can configure your Kafka cluster for fault tolerance. Use a replication factor of at least 3 so that your data is stored on multiple brokers, set min.insync.replicas and produce with acks=all so writes are acknowledged only after they are replicated, and spread brokers across racks or availability zones. Kafka clients balance traffic themselves by connecting directly to partition leaders, so keeping partitions and their leaders evenly distributed across brokers is what spreads load across the cluster.

Data Skew

Data skew can occur when data is not evenly distributed across partitions, which can impact the performance of your Kafka cluster. To solve this problem, you can use partitioning strategies that evenly distribute data across partitions, such as key-based partitioning or round-robin partitioning.
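
If one key is known to dominate the traffic, a custom partitioner is one way to isolate it. The following sketch assumes non-null string keys, a topic with more than one partition, and a hypothetical hot key; it is illustrative rather than a general-purpose solution:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.util.Map;

/**
 * Routes one known "hot" customer to a dedicated partition and hashes all other
 * keys over the remaining partitions, so a single heavy key cannot skew one partition.
 */
public class HotKeyAwarePartitioner implements Partitioner {
    private static final String HOT_CUSTOMER = "customer-42"; // hypothetical heavy key

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size(); // assumes at least 2 partitions
        if (HOT_CUSTOMER.equals(key)) {
            return numPartitions - 1; // reserve the last partition for the hot key
        }
        // Hash all other keys over the remaining partitions.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

You would register it on the producer with the partitioner.class setting.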

Slow rebalance time

Slow rebalances usually come from consumers that are slow to rejoin the group or from large amounts of state that must be restored. Use static group membership and cooperative rebalancing so that restarts and scaling events move as few partitions as possible, and in Kafka Streams configure standby replicas so state does not have to be rebuilt from changelog topics after a failover. You should also monitor the amount of data on your topics and in your state stores so that rebalance and restore times do not creep up unnoticed.
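
On the consumer side, a configuration sketch along these lines (the group and instance IDs are placeholders) enables static membership and cooperative rebalancing:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

import java.util.Properties;

public class RebalanceFriendlyConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        // Static membership: a restarted consumer with the same instance ID
        // rejoins without triggering a full rebalance.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "order-processor-1"); // unique per instance
        // Cooperative rebalancing moves only the partitions that actually change owners.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return props;
    }
}
```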

Network congestion

Optimize your network configuration by using high-speed network interfaces, configuring your network for low latency, and using compression to reduce the amount of data being transmitted.

Redundant data storage

Use Kafka to retain data only for as long as your consumers need it, and migrate long-lived data to a relational or non-relational database, depending on your specific requirements. You can also offload older data to HDFS or blob storage for long-term retention, for example through sink connectors or tiered storage where your Kafka distribution supports it.

Late or out-of-order events

You can use event-time semantics and windowing operations in Kafka Streams to handle late or out-of-order events. Event-time semantics means that you use the timestamp embedded in the event rather than the processing time to determine the order of events. Windowing operations allow you to group events into fixed or dynamic time intervals and apply aggregations or joins on them. You can also specify grace periods and retention periods for windows to handle late arrivals or updates.
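
A small sketch of a windowed aggregation (the window size and grace period are illustrative) that counts orders per customer in five-minute event-time windows while accepting events that arrive up to ten minutes late:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.time.Duration;

public class WindowedOrderCounts {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Count orders per customer in 5-minute windows based on event time,
        // accepting records that arrive up to 10 minutes after a window closes.
        KTable<Windowed<String>, Long> counts = builder
                .stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(10)))
                .count();

        // 'counts' can be emitted to a downstream topic or queried interactively.
        return builder;
    }
}
```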

Schema evolution or data format changes

You can use schema registry and serialization frameworks such as Avro, Protobuf, or JSON Schema to handle schema evolution or data format changes. Schema registry is a service that stores and manages the schemas of your data. Serialization frameworks allow you to serialize and deserialize your data with schema compatibility checks and conversions. You can integrate schema registry and serialization frameworks with Kafka Streams using custom serializers and deserializers.
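
As a sketch of the producer side of such a setup, assuming Confluent's Avro serializer is on the classpath and a Schema Registry is reachable at a placeholder URL, the relevant configuration looks like this:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class AvroProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        // The serializer registers and looks up schemas here and checks compatibility on write.
        props.put("schema.registry.url", "http://localhost:8081");                  // placeholder registry address
        return props;
    }
}
```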

Security vulnerabilities

You can secure your Kafka cluster by using encryption, authentication, and authorization. You can also monitor your Kafka cluster for security issues and take corrective action when necessary.

Inadequate monitoring

You can implement a comprehensive monitoring solution that tracks key metrics such as throughput, latency, and error rates. You can also set up alerts to notify you when performance issues arise and take corrective action when necessary.

Improper implementation

An improper implementation can lead to inefficient data processing and technical debt. To avoid this, follow best practices when designing and implementing your Kafka architecture, and monitor your Kafka cluster for performance and availability issues so that you can take corrective action when necessary.

Final words

Designing and implementing a scalable Kafka architecture for high-performance streaming applications requires careful planning and implementation. Event-driven architecture patterns, Kafka Streams, and the practices described above are some of the tools for getting strong performance out of Kafka; for certain workloads, alternative platforms such as Apache Pulsar are also worth evaluating.
