Apache Kafka: Part 2 - Understanding the Core Building Blocks

Disclaimer: Most of the content mentioned in this series comes directly from the Confluent Apache Kafka Fundamentals Course available at training.confluent.io. This series aims to make that knowledge more accessible and digestible.

The Kafka Architecture Puzzle

In Part 1, we learned why Kafka matters. Now let’s understand how it works.

Think of Kafka like a city’s postal system, but for data. You have senders (producers), recipients (consumers), post offices (brokers), and organized delivery routes (topics and partitions). Each component has a specific role, and together they create a system that can handle millions of messages reliably.

Let’s break down each piece of this puzzle.


Messages: The Fundamental Unit

What is a Message?

A message (also called an event or record) is the fundamental unit of data in Apache Kafka. It’s like a letter in our postal system analogy.

Every Kafka message consists of two main components:

The Key (Optional but Powerful)

{
  "key": "user-12345"
}

The key helps determine which partition a message will go to in a Kafka topic. By assigning a key to a message, Kafka ensures that all messages with the same key are sent to the same partition.

Why does this matter?

  • Message ordering: All messages with the same key stay in order
  • Data locality: Related data gets grouped together
  • Consumer efficiency: Consumers can process related messages together

The Value (The Actual Content)

{
  "value": {
    "userId": "user-12345",
    "action": "purchase",
    "productId": "laptop-abc",
    "timestamp": "2025-02-19T10:30:00Z",
    "amount": 999.99
  }
}

The value is the actual content of the message. It can be any type of data:

  • JSON (most common)
  • Plain text
  • Avro (schema-based)
  • Binary data

Immutability is Key: Kafka messages are immutable, meaning once they’re written to a topic, they can’t be modified. This ensures data integrity and allows consumers to replay messages at any time.


Topics: Organizing the Chaos

What is a Topic?

A topic is like a category or folder where messages are organized. Think of it as a specific mailbox or delivery route in our postal system.

Topic: "user-purchases"
├── Message 1: User A bought a laptop
├── Message 2: User B bought headphones  
├── Message 3: User A bought a mouse
└── Message 4: User C bought a keyboard

Key Characteristics

Logical Grouping: Topics allow data to be categorized so that:

  • Producer systems can publish messages to a particular topic
  • Consumer systems can read messages from that topic

Time-Based Storage: Messages are stored with a timestamp and kept for a configurable period (retention policy).

Replay Capability: This retention feature allows consumers to read messages at their own pace and replay messages if needed.
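
To make this more concrete, here is a minimal sketch of creating a topic programmatically with Kafka's Java AdminClient. The topic name, partition count, replication factor, and broker address are illustrative assumptions, not values from the course.

// Hypothetical sketch: create a topic with the Java AdminClient
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

try (AdminClient admin = AdminClient.create(props)) {
    // "user-purchases" with 3 partitions and a replication factor of 3 (illustrative values)
    NewTopic topic = new NewTopic("user-purchases", 3, (short) 3);
    admin.createTopics(Collections.singletonList(topic)).all().get();
}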

Real-World Topic Examples

E-commerce Platform:
├── user-registrations
├── product-views  
├── shopping-cart-events
├── order-payments
├── inventory-updates
└── shipping-notifications

Banking System:
├── account-transactions
├── fraud-alerts
├── loan-applications  
├── credit-score-updates
└── regulatory-reports

Partitions: The Secret to Scale

What are Partitions?

Here’s where Kafka gets really clever. Topics are split into partitions, allowing Kafka to horizontally scale and manage large volumes of data.

Topic: "user-purchases" (3 partitions)

Partition 0: [Msg1] [Msg4] [Msg7] [Msg10]
Partition 1: [Msg2] [Msg5] [Msg8] [Msg11]  
Partition 2: [Msg3] [Msg6] [Msg9] [Msg12]

How Partitions Work

Ordered Sequence: Each partition is an ordered, immutable sequence of messages. Kafka ensures messages are written in the order they’re received within each partition.

Parallel Processing: When a producer sends messages, they’re distributed across different partitions. This means multiple consumers can read messages concurrently from different partitions, speeding up processing.

Distribution: Each partition is stored on a broker (server/node).

Fault Tolerance: If one broker goes down, other brokers can take over the partitions and continue delivering data, ensuring high availability.

Quiz Time: In Kafka, what is the purpose of a Partition in a topic?

Answer: To enable parallel processing and scalability! 🎯

Partition Assignment Strategy

When a producer sends a message:

  1. With Key: Messages with the same key always go to the same partition
  2. Without Key: Messages are spread across partitions for balance (round-robin in older clients; newer clients use a "sticky" partitioner that fills a batch for one partition before moving on)
  3. Custom Logic: You can implement custom partitioning logic
// Messages with same user ID go to same partition
Key: "user-12345" → Always Partition 1
Key: "user-67890" → Always Partition 2
Key: "user-12345" → Always Partition 1 (again)
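
Under the hood, Kafka's default partitioner hashes the serialized key (with murmur2) and takes the result modulo the number of partitions. The sketch below imitates that idea with a simpler hash purely for illustration; it is not Kafka's actual implementation.

// Conceptual sketch of key-based partition selection (not Kafka's real code)
int numPartitions = 3;
String key = "user-12345";

// Kafka hashes the serialized key bytes with murmur2; hashCode() is used here only to
// illustrate the "hash modulo number of partitions" idea
int partition = (key.hashCode() & 0x7fffffff) % numPartitions; // mask keeps the value non-negative

System.out.println("Key " + key + " always maps to partition " + partition);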

Offsets: Keeping Track

What is an Offset?

Each message within a partition is assigned an offset - a sequential number that uniquely identifies it within that partition. Think of it like a page number in a book.

Partition 0:
Offset 0: [Message A]
Offset 1: [Message B]  
Offset 2: [Message C]
Offset 3: [Message D]

Purpose: The offset helps consumers keep track of which messages they have already read. It’s like a bookmark that remembers where you left off.
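
To illustrate the bookmark idea, a consumer can jump to any retained offset and re-read from there. This is a minimal sketch assuming an already configured KafkaConsumer; the topic, partition, and offset values are made up.

// Hypothetical replay: point a consumer at a specific offset and read forward
// (assumes a configured KafkaConsumer<String, String> named consumer)
TopicPartition partition0 = new TopicPartition("user-purchases", 0);

consumer.assign(Collections.singletonList(partition0)); // read this partition directly
consumer.seek(partition0, 2);                           // resume from offset 2 (Message C above)

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));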


Producers: The Data Senders

What is a Producer?

A producer is any application or system that sends data to the Kafka cluster. It’s like someone dropping off mail at the post office.

// Simplified producer example
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // address of a broker in your cluster
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

ProducerRecord<String, String> record = new ProducerRecord<>(
    "user-purchases",           // topic
    "user-12345",               // key
    "{\"product\": \"laptop\"}" // value
);

producer.send(record);   // asynchronous; delivery happens in the background
producer.close();        // flush pending messages and release resources

Producer Superpowers

Asynchronous Processing: Producers can continue processing other tasks while Kafka handles message delivery. No waiting around!
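
For example, send() returns immediately, and you can attach a callback that fires once the broker acknowledges (or rejects) the message. This is a minimal sketch that reuses the producer and record from the example above.

// Asynchronous send with a completion callback (sketch)
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        System.err.println("Delivery failed: " + exception.getMessage());
    } else {
        System.out.println("Stored in partition " + metadata.partition()
            + " at offset " + metadata.offset());
    }
});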

Partitioning Control: Producers can send records to specific partitions within a topic for load balancing and ensuring related messages stay together.

Serialization Support: The Producer API allows data to be serialized into formats like JSON, Avro, or plain text before being sent to Kafka.

Fault Tolerance: Built-in mechanisms ensure data reliability. Producers can wait for acknowledgment from Kafka brokers to confirm message delivery.

Compression: Supports message compression to reduce network bandwidth, making data transfer more efficient.
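
These behaviors are mostly driven by producer configuration. The snippet below shows a few commonly used settings; the specific values are illustrative, not recommendations from the course.

// Common producer settings for reliability and efficiency (illustrative values)
props.put("acks", "all");                 // wait for all in-sync replicas to acknowledge
props.put("enable.idempotence", "true");  // avoid duplicates when retries happen
props.put("compression.type", "snappy");  // compress batches to save network bandwidth
props.put("linger.ms", "5");              // wait briefly so more records fit in each batch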

The Power of Decoupling: Producers can keep generating and pushing data, even if consumers are temporarily unavailable. This decoupling is what makes Kafka so resilient.


Consumers: The Data Receivers

What is a Consumer?

A Kafka consumer is an application or process that reads messages from Kafka topics. It’s like someone picking up mail from their mailbox.

// Simplified consumer example
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // address of a broker in your cluster
props.put("group.id", "purchase-readers");        // any name identifying this consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("user-purchases"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Received: " + record.value());
    }
}

Consumer Groups: Teamwork Makes the Dream Work

Consumer Groups: Consumers operate as part of a group. Kafka assigns each partition to exactly one consumer in the group; if there are more consumers than partitions, the extra consumers sit idle.

Topic: "orders" (3 partitions)
Consumer Group: "order-processors"

Partition 0 → Consumer A
Partition 1 → Consumer B  
Partition 2 → Consumer C

Key Rules:

  • Each partition is consumed by only one consumer in a group at a time
  • This maintains consistency and enables parallel processing
  • Each consumer maintains its offset (position in the partition)

Fault Tolerance: If one consumer fails, another can start reading from its partition, ensuring continuous processing without interruptions.

Quiz Time: How does a Kafka Consumer keep track of messages it has consumed?

Answer: By maintaining an offset for each partition it reads from! 📖
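
Offsets can be committed automatically (the default) or explicitly after processing. Here is a minimal sketch of the explicit approach, assuming auto-commit has been disabled on the consumer; process() stands in for hypothetical application logic.

// Commit offsets only after the records have been processed (sketch)
// assumes props.put("enable.auto.commit", "false") when creating the consumer
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical application logic
    }
    consumer.commitSync(); // record this group's progress for each partition
}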

The Beauty of Decoupling

Producer Independence: Producers can keep generating data even if consumers are temporarily unavailable.

Consumer Flexibility: Consumers process data at their own speed, ensuring they’re not overwhelmed if data arrives too quickly.

Resilience: If a consumer goes down, it doesn’t affect the producer. When the consumer comes back online, it can resume reading from where it left off.


Brokers: The Workhorses

What is a Kafka Broker?

A Kafka broker (or node) is a server that stores data and serves clients. It’s like a post office in our analogy.

Broker Responsibilities

Data Storage: Listens for incoming messages and stores them in partitions

Client Serving: Serves messages to consumers when requested

Partition Management: Manages topic partitions and their storage

Request Handling: Handles producer and consumer requests efficiently

Each broker in a Kafka cluster is identified by a unique ID.
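
As a small illustration, the Java AdminClient can list the brokers in a cluster along with their IDs. This is a hedged sketch that reuses the props from the earlier examples.

// Hypothetical sketch: list brokers and their IDs
try (AdminClient admin = AdminClient.create(props)) {
    for (Node node : admin.describeCluster().nodes().get()) {
        System.out.println("Broker " + node.id() + " at " + node.host() + ":" + node.port());
    }
}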


Clusters: The Big Picture

What is a Kafka Cluster?

A Kafka cluster is a group of servers (brokers) that work together to manage the flow of data. It’s like a network of post offices working together.

Kafka Cluster
├── Broker 1 (ID: 1)
├── Broker 2 (ID: 2)  
├── Broker 3 (ID: 3)
└── Broker 4 (ID: 4)

Cluster Superpowers

High Availability: If one broker fails, others continue operating

Durability: Data is replicated across multiple brokers for redundancy

Scalability: Add more brokers to handle increased load

Fault Tolerance: Automatic recovery when brokers fail

Load Distribution: Partitions are distributed across brokers for balanced load

How It All Works Together

  1. Producers send messages to topics
  2. Topics are split into partitions
  3. Partitions are distributed across brokers in the cluster
  4. Consumers read messages from partitions
  5. Offsets track consumer progress

Replication: Insurance for Your Data

Why Replication Matters

Replication ensures data availability and durability. In Kafka, replication means keeping copies of each partition's data on multiple brokers.

How Replication Works

Replicas: Each partition is copied to several brokers; each copy is called a replica.

Leader and Followers:

  • Leader: The only replica that handles reads and writes
  • Followers: Replicate data from the leader

Automatic Failover: If the leader fails, one of the followers is automatically promoted to be the new leader.

Partition "orders-0" (Replication Factor: 3)

Broker 1: [Leader Replica]    ← Handles reads/writes
Broker 2: [Follower Replica]  ← Replicates from leader  
Broker 3: [Follower Replica]  ← Replicates from leader

Common Configuration: Replication Factor of 3 (one leader + two followers)
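
To see this layout for yourself, the AdminClient can describe a topic and show which broker leads each partition and where the replicas live. A minimal sketch, assuming an existing AdminClient named admin and a topic called "orders":

// Hypothetical sketch: inspect leader and replica placement for a topic
TopicDescription description = admin.describeTopics(Collections.singletonList("orders"))
        .all().get().get("orders");

for (TopicPartitionInfo partition : description.partitions()) {
    System.out.println("Partition " + partition.partition()
        + ": leader broker " + partition.leader().id()
        + ", replicas " + partition.replicas());
}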

Retention Policy

Kafka provides a data retention policy that defines how long data is stored before deletion. This can be configured per topic or as a default global retention period.

  • Time-based: Keep messages for 7 days
  • Size-based: Keep up to 1GB of messages per partition
  • Compacted: Keep only the latest message for each key
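
These policies map to topic-level configuration keys. Below is a minimal sketch of setting them when creating a topic; the values simply mirror the examples above and are not recommendations.

// Illustrative retention settings expressed as topic configs
Map<String, String> retention = new HashMap<>();
retention.put("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)); // keep messages for 7 days
retention.put("retention.bytes", String.valueOf(1024L * 1024 * 1024));   // or cap each partition at ~1GB
// retention.put("cleanup.policy", "compact"); // alternatively, keep only the latest value per key

NewTopic topic = new NewTopic("user-purchases", 3, (short) 3).configs(retention);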


Putting It All Together

Let’s trace a message through the entire Kafka system:

  1. Producer creates a message with key “user-123” and value “purchase event”
  2. Kafka determines the message goes to partition 2 of topic “user-events” (based on key hash)
  3. Broker 1 (leader for partition 2) receives and stores the message at offset 1047
  4. Brokers 2 and 3 (followers) replicate the message for fault tolerance
  5. Consumer Group A has Consumer X reading from partition 2
  6. Consumer X polls and receives the message, processes it, and commits its offset (recording that everything through 1047 has been processed)
  7. Consumer X is now ready to read the next message at offset 1048

The Magic: All of this happens automatically, at scale, with fault tolerance, in milliseconds. That’s the power of Kafka’s architecture!


What’s Next?

Now you understand Kafka’s core building blocks and how they work together. In Part 3 of this series, we’ll explore:

  • Kafka Development Tools - Producer API, Consumer API, Kafka Streams
  • Building Real Applications - Practical examples and code
  • Performance Tuning - Throughput, latency, and configuration
  • Exactly Once Semantics - Ensuring message delivery guarantees

Ready to start building with Kafka? Let’s get our hands dirty!


Key Takeaways

  • Messages are key-value pairs that flow through Kafka
  • Topics organize messages into logical categories
  • Partitions enable parallel processing and scalability
  • Offsets track consumer progress through partitions
  • Producers send data asynchronously with fault tolerance
  • Consumers read data in groups for parallel processing
  • Brokers store and serve data across a distributed cluster
  • Replication ensures data safety and high availability

Understanding these building blocks is crucial for working effectively with Kafka. Each component plays a vital role in creating a system that can handle massive scale with reliability and performance.

The architecture might seem complex, but each piece serves a purpose in creating the robust, scalable event streaming platform that powers modern applications!
