Kafka System Design
Kafka is an open-source, distributed event streaming platform. It allows us to build real-time data feeds and event-driven applications. Event data describes what happened, when, and who was involved. Event streaming captures such events in real time from sources like databases, sensors, mobile devices, cloud services, and software applications.
At its core, Kafka is designed to process high-volume data streams, handling millions of messages quickly. To achieve this, the Kafka system design relies on data replication and high throughput to provide fault tolerance, scalability, and data durability.
Understanding Kafka comes with a list of benefits. In this post, we will discuss the Kafka system design.
What makes the Kafka system design stand out
To understand Kafka, let's look at the features that set its system design apart from other distributed systems.
- Decoupling and asynchronous processing – Kafka is built around asynchronous processing: producers send messages and consumers subscribe to the topics they care about. This creates a decoupled architecture in which producers and consumers are entirely independent of each other and can be developed, deployed, and managed separately, preventing block-wait scenarios. Producers publish data to Kafka without knowing who the consumers are, and consumers read messages without needing the producers' addresses (a minimal producer sketch follows this list).
- Configurable message retention – Kafka has retention policies that ensure message persistence and resilience. We can instruct Kafka to retain messages for a specified period or until a size threshold is reached. This way, messages remain available and are not lost even when consumers are offline, and the persisted data can be replayed for analytics to understand how our system behaves. Note that although Kafka persists data, it can't be used as a database; its system design is not meant to replace a database architecture.
- Partitioning and replication – Kafka uses partitioning to split data into smaller chunks spread across multiple machines. If one Kafka server fails, clients can switch to another server and keep processing messages. On top of that, we can scale horizontally by adding more partitions to distribute workloads across multiple servers, and by replicating partitions across servers we get a fault-tolerant system.
- Scalability – Kafka handles large volumes of data and allows an application to scale on demand, making it suitable for large-scale distributed systems that require high throughput and low latency. Partitions are replicated on different servers, and availability can be improved by adding more broker instances.
- Real-time processing – Data is processed as soon as it is available, which allows us to build low-latency, real-time applications.
- Ordering and sequential I/O – Kafka appends messages to partitions using sequential I/O access patterns, so messages within a partition are processed in the order they are received. Reading and writing blocks of data sequentially rather than randomly is a large part of what makes Kafka so fast.
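The following is a minimal sketch of the asynchronous, decoupled producer side described in the first point above, using the official Java clients library. The topic name "orders", the record contents, and the localhost broker address are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: the producer never talks to consumers.
            // The callback only confirms that the broker persisted the record.
            producer.send(new ProducerRecord<>("orders", "order-1", "created"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("stored at partition %d, offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // closing the producer flushes any pending sends
    }
}
```

Any number of consumers can later read the "orders" topic without the producer knowing they exist, which is the decoupling the list above describes.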
Kafka system design architecture
Kafka's architecture is designed to achieve the benefits and features above. It is a distributed system built on the publish-subscribe model, allowing multiple producers to send data to multiple consumers.
As we discussed earlier, Kafka spreads workloads evenly across partitions and their replicas. The servers that hold this data are called brokers in Kafka's terminology, and they work together as a group to provide high availability. A group of such brokers is called a Kafka cluster, where each broker in the cluster is a standalone server.
Each broker runs a single Kafka server with its own system design components. Let's discuss how the Kafka architecture delivers on its system design through these core components.
● Topics
A message (record) is the smallest unit in the Kafka ecosystem. Each message creates a single record entry in a Kafka topic, so topics categorize messages in a Kafka cluster.
To allow message persistence, a topic is split into partitions that order the records as immutable sequences of messages. Each message in a partition has a unique offset, which represents the position of the message in that partition.
Partitions are replicated across brokers in a Kafka cluster. Dividing a topic into multiple partitions gives the Kafka system design the ability to handle large data streams while providing high availability and fault tolerance. For each Kafka topic, the Kafka cluster keeps a partitioned log.
Every partition is a structured commit log: an ordered, immutable sequence of messages that is continually appended to. The offset is a sequential ID number assigned to every record in a partition and serves as the unique identifier of each record within that partition.
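As a sketch of how topics, partitions, replication, and retention come together, the code below uses Kafka's Admin API to create a topic with six partitions, a replication factor of three, and a seven-day retention policy. The topic name, the counts, and the broker address are assumptions for illustration.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local cluster

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread the log across brokers; replication factor 3
            // keeps copies on three brokers; retention.ms is the configurable
            // retention policy mentioned earlier (7 days here).
            NewTopic topic = new NewTopic("my-topic", 6, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```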
● Producers and Consumers
Producers write messages to Kafka topics. The producer's job is very straightforward: create a message, attach a record key and metadata, and send it to a specific topic and partition.
A Kafka producer preserves ordering within a partition by using the record key to decide which partition each message belongs to: messages with the same key always land in the same partition, as sketched below.
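A minimal sketch of key-based partitioning: both records below share the key "user-42", so the default partitioner hashes them to the same partition and their relative order is preserved there. The topic, key, values, and broker address are assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class KeyedSendExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so "checkout" is stored after "login".
            RecordMetadata login = producer.send(
                    new ProducerRecord<>("my-topic", "user-42", "login")).get();
            RecordMetadata checkout = producer.send(
                    new ProducerRecord<>("my-topic", "user-42", "checkout")).get();
            System.out.printf("login -> partition %d, checkout -> partition %d%n",
                    login.partition(), checkout.partition());
        }
    }
}
```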
While producers write messages to a specific topic, consumers read these messages from one or more topics.
Consumers subscribe to the topics relevant to them and read messages from those topics' partitions. Each consumer tracks an offset for every partition it consumes, which is what makes Kafka reliable: if a consumer fails or restarts, it can easily resume reading from where it left off.
Several consumers can also work together to share the work of reading a topic; this is a consumer group. Each consumer group receives its own copy of the records on a topic, and within the group the partitions (and therefore the messages) are distributed evenly among the consumers, further increasing Kafka's throughput and fault tolerance. A minimal consumer sketch follows.
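In this sketch, every instance started with the same group id shares the topic's partitions, so each record is processed by only one member of the group. The group id, topic name, and broker address are assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "order-processors"); // instances with this id share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // poll() returns records from the partitions assigned to this group member;
                // committed offsets let a restarted consumer resume where it left off.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```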
● ZooKeeper
To hold its system design together, Kafka uses ZooKeeper. ZooKeeper acts as the coordinator of the cluster, keeping track of cluster membership, access control, replicas, partitions, and topic configs. This role underpins Kafka's performance and broker connectivity: ZooKeeper notices when a new broker joins or an existing broker fails, so that requests can be directed to the broker that currently leads each partition.
ZooKeeper reads, writes, and observes updates to this cluster metadata. Historically it also tracked each consumer's last committed offset (modern Kafka versions store committed offsets in an internal Kafka topic instead), so that if a consumer client crashes, it can quickly recover from the last acknowledged position and continue with the next offset in the same partition sequence.
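As a sketch, a broker is pointed at ZooKeeper through its server.properties configuration; the addresses below assume a local, single-node setup.

```properties
# server.properties (broker side) - assumed local, single-node setup
broker.id=0
listeners=PLAINTEXT://localhost:9092
# Address of the ZooKeeper ensemble that coordinates the cluster
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000
```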
How the Kafka system design exposes messages
Kafka itself is written in Java and Scala. Official and community client libraries exist for common programming languages such as Java, Python, Go, and JavaScript, and additional proxies and bridges make it possible to integrate over REST/HTTP and WebSocket.
Communication between Kafka clients (producers and consumers) and the brokers uses a binary protocol over TCP (Transmission Control Protocol) to serialize and deserialize API data. The protocol defines all APIs as request-response message pairs. The TCP connection guarantees delivery, requests are processed in the order they are sent, and responses are returned in that same order.
To support this communication, Kafka stores data in the cluster by topic, and each entry in a topic is called a record. Each record contains a key, a value, and an associated timestamp. The protocol message format has a header and a payload; the payload carries the record's key-value pair, the topic it belongs to, the partition ID, the timestamp, and the key and value sizes. A consumed record might look like this:
Topic: my-topic
Partition: 0
Offset: 12345
Key: my-key
Value: {"name": "Rose", "age": 24, "email": "roseemail@example.com"}
Timestamp: 1677687782
Headers: [Header(key="content-type", value="application/json")]
Kafka exposes five core APIs:
- Producer API – Sends message records to a Kafka topic in the Kafka cluster. It allows clients to generate and publish data streams to the Kafka cluster.
- Consumer API – Allows clients to read data from subscribed topics in a Kafka cluster.
- Streams API – Used to build stream processing applications that consume real-time data streams, transforming data from input topics into output topics (a minimal sketch follows this list).
- Connector API – Creates connectors that import and export data between Kafka and other systems such as data stores, message queues, and streaming platforms.
- Admin API – Manages a Kafka cluster. We use it for administrative tasks such as creating, configuring, monitoring, and inspecting topics, brokers, and other Kafka objects.
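As a sketch of the Streams API, the application below reads records from one topic, upper-cases each value, and writes the result to another topic. The application id, topic names, and broker address are placeholders.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from the input topic, transform each value, write to the output topic.
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```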
Wrapping Up
Kafka's system design lends itself to ETL pipelines that move data between different systems and applications. This allows us to:
- Extract data from a variety of sources
- Transform data into analysis-ready data models that Kafka can use, and
- Load data into data warehouses using Kafka
Alongside that, Kafka integrates with GUI tools that offer a user-friendly interface for managing Kafka clusters, monitoring Kafka metrics, and creating and managing topics. Such GUI tools include:
- Conduktor
- Kafdrop Kafka Web UI
- Redpanda Console UI
Kafka is designed to move large amounts of data efficiently in a short amount of time, something traditional message brokers struggle to achieve at high throughput while remaining scalable. Its system design supports large queues, message batching, and consumers with different consumption requirements, all with distributed high throughput.