Apache Kafka: Part 5 - Administration and Monitoring

Welcome to Part 5 of my Apache Kafka series! Today we’ll dive into the critical aspects of Kafka administration and monitoring that separate development environments from production-ready systems.

Full disclosure: This content comes from the Confluent Apache Kafka Fundamentals Course at training.confluent.io. Catch up on Part 1, Part 2, Part 3, and Part 4 if you haven’t already!


The Role of a Kafka Administrator

A Kafka administrator ensures high availability, optimal performance, and secure operations of Kafka clusters. This role combines system administration, performance engineering, and security management.

Core Responsibilities

  • Cluster Management: Configuration, scaling, and health monitoring
  • Performance Tuning: Optimizing throughput, latency, and resource utilization
  • Security Management: Authentication, authorization, and encryption
  • Monitoring and Troubleshooting: Proactive issue detection and resolution
  • Data Management: Replication, backup strategies, and retention policies

Cluster Management

Static vs Dynamic Configuration

Static Configuration requires a broker restart to take effect and should generally be kept consistent across brokers (apart from per-broker values such as broker.id):

# server.properties (requires restart)
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/var/kafka-logs
num.network.threads=8
num.io.threads=16

Dynamic Configuration allows runtime changes without restarts:

# Change topic level configuration
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.ms=604800000

# Change broker level configuration
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 1 \
  --alter --add-config log.segment.bytes=134217728

This is really useful when you need to tweak settings without taking down your cluster.

Broker Configuration Management

Key broker configurations for production:

# Performance tuning
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

# Memory and storage
log.segment.bytes=1073741824
log.retention.hours=168
log.retention.bytes=1073741824
log.cleanup.policy=delete

# Replication and durability
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
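
The replication settings above have a simple arithmetic behind them. As a rough sketch (the class and method names here are mine, just for illustration): with acks=all, a partition stays writable as long as min.insync.replicas replicas are in sync, and committed data survives as long as at least one replica remains.

```java
// Sketch of the fault-tolerance arithmetic for the settings above:
// replication.factor=3, min.insync.replicas=2, unclean leader election off.
public class FaultTolerance {
    // With acks=all, writes keep succeeding while at least
    // min.insync.replicas replicas are in sync, so writes tolerate
    // (replicationFactor - minInsyncReplicas) simultaneous broker failures.
    static int toleratedWriteFailures(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    // Committed data survives as long as one replica remains: RF - 1 failures.
    static int toleratedDataLossFailures(int replicationFactor) {
        return replicationFactor - 1;
    }

    public static void main(String[] args) {
        System.out.println("Writable with failures: " + toleratedWriteFailures(3, 2));      // 1
        System.out.println("No data loss up to failures: " + toleratedDataLossFailures(3)); // 2
    }
}
```

So with this configuration you can lose one broker with no impact on producers, and a second broker without losing committed data (though writes with acks=all will then block).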

Monitoring Strategies

Essential Metrics to Track

Throughput Monitoring

# Measure consumer throughput (messages per second)
kafka-consumer-perf-test.sh \
  --bootstrap-server localhost:9092 \
  --topic my-topic --messages 1000000

Broker Resource Utilization

  • CPU Usage: Should typically stay below 80%
  • Memory: Monitor heap usage and GC patterns
  • Disk I/O: Track read/write rates and queue depths
  • Network: Monitor bandwidth utilization

Replication Health

# Check for under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

If you see under-replicated partitions, that’s a red flag: a follower is falling behind or a broker is down, and fault tolerance is reduced until replication catches up.

JMX Monitoring

Kafka exposes comprehensive metrics via JMX (Java Management Extensions):

# Enable JMX on broker startup
export JMX_PORT=9999
kafka-server-start.sh config/server.properties
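
Once JMX is enabled, any JMX client can read the broker’s MBeans. As a self-contained sketch (against a real broker you would connect to its JMX port and query kafka.server MBeans instead), this reads a standard JVM memory MBean from the local process using the same API:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Minimal sketch of reading a metric over JMX. For a broker you would
// connect remotely (e.g. to port 9999 as exported above) and query
// kafka.server MBeans; here we read a built-in JVM MBean from the local
// platform MBeanServer so the example runs anywhere.
public class JmxMetricRead {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        CompositeData heap = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
        long usedBytes = (Long) heap.get("used");
        System.out.println("heap used bytes: " + usedBytes);
    }
}
```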

Key JMX Metrics

Gauge Metrics (point-in-time measurements):

  • kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  • kafka.controller:type=KafkaController,name=ActiveControllerCount

Meter Metrics (rate measurements):

  • kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
  • kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
  • kafka.network:type=RequestMetrics,name=RequestsPerSec
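
These metric identifiers are standard JMX ObjectNames: a domain, then key=value properties. If you’re writing a custom exporter, the JDK’s javax.management.ObjectName can split them into labels for you:

```java
import javax.management.ObjectName;

// Kafka metric identifiers are ordinary JMX ObjectNames.
// A custom exporter can decompose them into labels like this.
public class MetricNameParse {
    public static void main(String[] args) throws Exception {
        ObjectName n = new ObjectName(
            "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");
        System.out.println(n.getDomain());            // kafka.server
        System.out.println(n.getKeyProperty("type")); // BrokerTopicMetrics
        System.out.println(n.getKeyProperty("name")); // BytesInPerSec
    }
}
```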

Monitoring Tools Integration

Prometheus + Grafana Setup

# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['localhost:9999']
    metrics_path: /metrics
    scrape_interval: 15s

Note that Prometheus scrapes HTTP, not JMX directly: this setup assumes each broker runs the Prometheus JMX Exporter (typically as a Java agent) exposing metrics on port 9999.

Popular Monitoring Tools

  • Confluent Control Center: Enterprise-grade monitoring and management
  • JConsole: Built-in JVM monitoring tool
  • Grafana + Prometheus: Open source monitoring stack
  • DataDog: Cloud-based monitoring service
  • CloudWatch: AWS-native monitoring (for MSK)

Performance Tuning

Producer Performance Optimization

// High throughput producer configuration
Properties props = new Properties();
props.put("batch.size", 65536);           // Larger batches
props.put("linger.ms", 10);               // Wait for batching
props.put("compression.type", "lz4");     // Fast compression
props.put("buffer.memory", 67108864);     // 64MB buffer
props.put("acks", "1");                   // Balance durability/speed
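
A quick sanity check on those numbers: buffer.memory bounds how many un-sent batches the producer can hold in memory, so it’s worth confirming the buffer comfortably fits many batches. A back-of-the-envelope calculation (the class name is mine, just for illustration):

```java
// Back-of-the-envelope check on the producer settings above:
// buffer.memory caps the total memory for batches awaiting send.
public class ProducerBufferMath {
    public static void main(String[] args) {
        long bufferMemory = 67_108_864L; // 64 MB, as configured above
        int batchSize = 65_536;          // 64 KB, as configured above
        long maxBufferedBatches = bufferMemory / batchSize;
        System.out.println("max buffered batches: " + maxBufferedBatches); // 1024
    }
}
```

1024 buffered batches gives plenty of headroom; if the buffer fills, send() blocks for up to max.block.ms.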

Consumer Performance Tuning

// Optimized consumer settings
props.put("fetch.min.bytes", 50000);      // Reduce fetch requests
props.put("fetch.max.wait.ms", 500);      // Max wait for min bytes
props.put("max.partition.fetch.bytes", 1048576); // 1MB per partition
props.put("session.timeout.ms", 30000);   // Longer session timeout
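
One consequence of max.partition.fetch.bytes worth planning for: each fetch can return up to that many bytes per assigned partition, so a consumer assigned many partitions needs heap sized accordingly (fetch.max.bytes also caps a single request). A rough budget, with a hypothetical partition count:

```java
// Rough memory budget for the consumer settings above.
public class ConsumerFetchBudget {
    public static void main(String[] args) {
        int maxPartitionFetchBytes = 1_048_576; // 1 MB, as configured above
        int assignedPartitions = 24;            // hypothetical assignment
        long worstCaseBytes = (long) maxPartitionFetchBytes * assignedPartitions;
        System.out.println("worst-case bytes buffered per fetch cycle: " + worstCaseBytes); // 25165824 (24 MB)
    }
}
```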

Throughput vs Latency Trade-offs

Throughput Focus: Increase batch.size, linger.ms, and use compression

Latency Focus: Decrease batch.size, set linger.ms=0, use acks=1

Managing Throughput

# Producer settings for maximum throughput
batch.size=65536
linger.ms=20
compression.type=snappy
acks=1

Managing Latency

# Producer settings for minimum latency
batch.size=0
linger.ms=0
acks=1
compression.type=none

Acknowledgment Levels

// Configure acknowledgment strategy (pick one; later puts overwrite earlier ones)
props.put("acks", "0");    // No acknowledgment (fastest, risky)
props.put("acks", "1");    // Leader acknowledgment only (balanced)
props.put("acks", "all");  // All in-sync replicas (safest, slowest)

Pick the right one based on your use case. If you’re dealing with financial data, you probably want “all”. For metrics that can tolerate some loss, “1” might be fine.


Security Management

Authentication Methods

# SASL/PLAIN authentication
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="admin" password="admin-secret";

SSL/TLS Encryption

# SSL configuration
security.protocol=SSL
ssl.keystore.location=/path/to/kafka.server.keystore.jks
ssl.keystore.password=keystore-password
ssl.key.password=key-password
ssl.truststore.location=/path/to/kafka.server.truststore.jks
ssl.truststore.password=truststore-password

Access Control Lists (ACLs)

# Grant read access to a topic
kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:alice \
  --operation Read --topic my-topic

# Grant write access to a topic
kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:bob \
  --operation Write --topic my-topic

ACLs let you control who can do what with your topics. Pretty important for production environments!


Data Management

Replication Configuration

# Create topic with replication factor 3
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic critical-data \
  --partitions 6 --replication-factor 3

Retention Policies

# Time based retention (7 days)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.ms=604800000

# Size based retention (1GB)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.bytes=1073741824

High Availability and Fault Tolerance

What Makes Kafka Highly Available?

  1. Distributed Architecture: Brokers are spread across multiple machines, racks, or data centers
  2. Automatic Failover: Leader election happens automatically
  3. Data Replication: Multiple copies prevent data loss
  4. Partition Distribution: Load spreads across brokers

Handling Broker Failures

# Check cluster health after failure
kafka-broker-api-versions.sh --bootstrap-server localhost:9092

# Verify partition leadership
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic

Automatic Recovery: When a broker fails, Kafka automatically elects new leaders for affected partitions, ensuring continued operation without manual intervention.


Operational Best Practices

Upgrades and Patches

Rolling upgrade process:

  1. Stop one broker
  2. Update Kafka binaries
  3. Update configuration if needed
  4. Start the broker
  5. Verify cluster health
  6. Repeat for the next broker

The key here is to do rolling upgrades so you never take down the whole cluster at once.

Backup Strategies

# Mirror topics to another cluster (classic MirrorMaker)
kafka-mirror-maker.sh \
  --consumer.config source-consumer.properties \
  --producer.config target-producer.properties \
  --whitelist "important-.*"

Newer Kafka releases ship MirrorMaker 2 (connect-mirror-maker.sh), built on Kafka Connect, which is the recommended cross-cluster replication tool going forward.

Capacity Planning

Monitor these metrics for scaling decisions:

  • Disk usage growth rate
  • Network bandwidth utilization
  • CPU and memory trends
  • Consumer lag patterns

What’s Next?

In Part 6, we’ll explore KRaft. This is Kafka’s new consensus protocol that eliminates ZooKeeper dependency and simplifies cluster management. It’s a big deal for Kafka operations!


Key Takeaways

  • Monitoring is Critical: Use JMX metrics with tools like Prometheus and Grafana
  • Performance Tuning: Balance throughput vs latency based on use case requirements
  • Security Layers: Implement authentication, authorization, and encryption
  • High Availability: Leverage replication and automatic failover capabilities
  • Operational Excellence: Plan for upgrades, monitoring, and capacity management

Proper administration and monitoring transform Kafka from a development tool into a production-ready, enterprise-grade streaming platform.

See you in Part 6!
