Apache Kafka: Part 5 - Administration and Monitoring
Welcome to Part 5 of my Apache Kafka series! Today we’ll dive into the critical aspects of Kafka administration and monitoring that separate development environments from production-ready systems.
Full disclosure: This content comes from the Confluent Apache Kafka Fundamentals Course at training.confluent.io. Catch up on Part 1, Part 2, Part 3, and Part 4 if you haven’t already!
The Role of a Kafka Administrator
A Kafka administrator ensures high availability, optimal performance, and secure operations of Kafka clusters. This role combines system administration, performance engineering, and security management.
Core Responsibilities
- Cluster Management: Configuration, scaling, and health monitoring
- Performance Tuning: Optimizing throughput, latency, and resource utilization
- Security Management: Authentication, authorization, and encryption
- Monitoring and Troubleshooting: Proactive issue detection and resolution
- Data Management: Replication, backup strategies, and retention policies
Cluster Management
Static vs Dynamic Configuration
Static Configuration requires broker restarts and must be consistent across all brokers:
# server.properties (requires restart)
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/var/kafka-logs
num.network.threads=8
num.io.threads=16
Dynamic Configuration allows runtime changes without restarts:
# Change topic-level configuration
kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name my-topic \
--alter --add-config retention.ms=604800000
# Change broker-level configuration
kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type brokers --entity-name 1 \
--alter --add-config log.segment.bytes=134217728
This is really useful when you need to tweak settings without taking down your cluster.
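The magic numbers in those commands are easier to sanity-check than to memorize. A small sketch in plain Java (my own arithmetic helpers, not a Kafka API) decoding them:

```java
// Decode the values used above: retention.ms=604800000 is 7 days,
// and log.segment.bytes=134217728 is 128 MiB. Plain arithmetic only.
import java.time.Duration;

public class ConfigMath {
    // Convert a day count to the milliseconds value retention.ms expects
    static long daysToRetentionMs(int days) {
        return Duration.ofDays(days).toMillis();
    }

    // Convert mebibytes to the byte count size-based configs expect
    static long mibToBytes(long mib) {
        return mib * 1024 * 1024;
    }

    public static void main(String[] args) {
        System.out.println(daysToRetentionMs(7)); // 604800000
        System.out.println(mibToBytes(128));      // 134217728
    }
}
```

Doing the conversion in code (or at least writing the unit in a comment) beats pasting raw numbers into production configs.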
Broker Configuration Management
Key broker configurations for production:
# Performance tuning
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
# Memory and storage
# 1 GiB segments, 7-day time retention, 1 GiB size cap per partition
log.segment.bytes=1073741824
log.retention.hours=168
log.retention.bytes=1073741824
log.cleanup.policy=delete
# Replication and durability
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
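The replication settings above have a simple failure budget behind them. A sketch of that arithmetic (my own simplification, not a Kafka API): with acks=all, producers keep succeeding as long as at least min.insync.replicas replicas are alive, so the tolerated broker failures for writes are replication.factor minus min.insync.replicas.

```java
// Failure budget implied by the durability settings above.
public class DurabilityMath {
    // With acks=all, writes succeed while at least minInsyncReplicas are alive
    static int toleratedFailuresForWrites(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    public static void main(String[] args) {
        // default.replication.factor=3, min.insync.replicas=2
        System.out.println(toleratedFailuresForWrites(3, 2)); // 1
    }
}
```

A second simultaneous failure doesn’t lose data already replicated, but acks=all producers start receiving not-enough-replicas errors until the ISR recovers.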
Monitoring Strategies
Essential Metrics to Track
Throughput Monitoring
# Measure consume throughput in messages per second
kafka-consumer-perf-test.sh \
--bootstrap-server localhost:9092 \
--topic my-topic --messages 1000000
Broker Resource Utilization
- CPU Usage: Should typically stay below 80%
- Memory: Monitor heap usage and GC patterns
- Disk I/O: Track read/write rates and queue depths
- Network: Monitor bandwidth utilization
Replication Health
# Check for under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --under-replicated-partitions
If you see under-replicated partitions, that’s a red flag that something needs attention.
JMX Monitoring
Kafka exposes comprehensive metrics via JMX (Java Management Extensions):
# Enable JMX on broker startup
export JMX_PORT=9999
kafka-server-start.sh config/server.properties
Key JMX Metrics
Gauge Metrics (point-in-time measurements):
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Meter Metrics (rate measurements):
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.network:type=RequestMetrics,name=RequestsPerSec
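These MBeans are read through the standard javax.management API. As a runnable local illustration, the sketch below queries the JVM’s own platform MBean server; a broker’s Kafka metrics would be fetched the same way, just over a remote JMX connection to port 9999.

```java
// Minimal sketch of reading an MBean attribute via javax.management.
// Queries the local JVM; swap in a remote JMXConnector for a real broker.
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxPeek {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // A built-in JVM MBean; Kafka's MBeans (kafka.server:...) work identically
        ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
        Object cpus = server.getAttribute(os, "AvailableProcessors");
        System.out.println("AvailableProcessors = " + cpus);
    }
}
```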
Monitoring Tools Integration
Prometheus + Grafana Setup
Prometheus can’t scrape JMX directly, so run the Prometheus JMX Exporter as a Java agent on each broker and scrape its HTTP port:
# Start the broker with the JMX exporter agent (HTTP metrics on port 7071)
KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-metrics.yml" \
  kafka-server-start.sh config/server.properties
# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:7071']
Popular Monitoring Solutions
- Confluent Control Center: Enterprise-grade monitoring and management
- JConsole: Built-in JVM monitoring tool
- Grafana + Prometheus: Open-source monitoring stack
- DataDog: Cloud-based monitoring service
- CloudWatch: AWS-native monitoring (for MSK)
Performance Tuning
Producer Performance Optimization
// High throughput producer configuration
Properties props = new Properties();
props.put("batch.size", 65536); // Larger batches
props.put("linger.ms", 10); // Wait for batching
props.put("compression.type", "lz4"); // Fast compression
props.put("buffer.memory", 67108864); // 64MB buffer
props.put("acks", "1"); // Balance durability/speed
Consumer Performance Tuning
// Optimized consumer settings
props.put("fetch.min.bytes", 50000); // Reduce fetch requests
props.put("fetch.max.wait.ms", 500); // Max wait for min bytes
props.put("max.partition.fetch.bytes", 1048576); // 1MB per partition
props.put("session.timeout.ms", 30000); // Longer session timeout
Throughput vs Latency Trade-offs
Throughput Focus: Increase batch.size and linger.ms, and enable compression
Latency Focus: Decrease batch.size, set linger.ms=0, use acks=1
Managing Throughput
# Producer settings for maximum throughput
batch.size=65536
linger.ms=20
compression.type=snappy
acks=1
Managing Latency
# Producer settings for minimum latency
batch.size=0
linger.ms=0
acks=1
compression.type=none
Acknowledgment Levels
// Configure the acknowledgment strategy (pick exactly one)
props.put("acks", "0");   // No acknowledgment (fastest, risky)
props.put("acks", "1");   // Leader acknowledgment (balanced)
props.put("acks", "all"); // All in-sync replicas (safest, slowest)
Pick the right one based on your use case. If you’re dealing with financial data, you probably want “all”. For metrics that can tolerate some loss, “1” might be fine.
Security Management
Authentication Methods
# SASL/PLAIN authentication
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
username="admin" password="admin-secret";
SSL/TLS Encryption
# SSL configuration
security.protocol=SSL
ssl.keystore.location=/path/to/kafka.server.keystore.jks
ssl.keystore.password=keystore-password
ssl.key.password=key-password
ssl.truststore.location=/path/to/kafka.server.truststore.jks
ssl.truststore.password=truststore-password
Access Control Lists (ACLs)
# Grant read access to a topic
kafka-acls.sh --bootstrap-server localhost:9092 \
--add --allow-principal User:alice \
--operation Read --topic my-topic
# Grant write access to a topic
kafka-acls.sh --bootstrap-server localhost:9092 \
--add --allow-principal User:bob \
--operation Write --topic my-topic
ACLs let you control who can do what with your topics. Pretty important for production environments!
Data Management
Replication Configuration
# Create topic with replication factor 3
kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic critical-data \
--partitions 6 --replication-factor 3
Retention Policies
# Time-based retention (7 days)
kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name my-topic \
--alter --add-config retention.ms=604800000
# Size-based retention (1GB)
kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name my-topic \
--alter --add-config retention.bytes=1073741824
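When both limits are set, data becomes eligible for deletion as soon as either one is exceeded. A minimal model of that rule (my own simplification of the delete cleanup policy, not Kafka code):

```java
// Simplified model of delete-policy retention: either limit triggers deletion.
public class RetentionMath {
    static boolean eligibleForDeletion(long segmentAgeMs, long partitionBytes,
                                       long retentionMs, long retentionBytes) {
        return segmentAgeMs > retentionMs || partitionBytes > retentionBytes;
    }

    public static void main(String[] args) {
        long retentionMs = 604_800_000L; // 7 days
        long retentionBytes = 1L << 30;  // 1073741824 bytes = 1 GiB
        // Data only one day old, but the partition holds 2 GiB: the size cap wins
        System.out.println(eligibleForDeletion(86_400_000L, 2L << 30,
                                               retentionMs, retentionBytes)); // true
    }
}
```

The practical consequence: a high-volume topic can shed data long before its time-based retention expires, so size the retention.bytes cap against your actual ingest rate.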
High Availability and Fault Tolerance
What Makes Kafka Highly Available?
- Distributed Architecture: Data and load are spread across many brokers (and, with stretch clusters, across data centers)
- Automatic Failover: Leader election happens automatically
- Data Replication: Multiple copies prevent data loss
- Partition Distribution: Load spreads across brokers
Handling Broker Failures
# Check cluster health after failure
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
# Verify partition leadership
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --topic my-topic
Automatic Recovery: When a broker fails, Kafka automatically elects new leaders for affected partitions, ensuring continued operation without manual intervention.
Operational Best Practices
Upgrades and Patches
# Rolling upgrade process
1. Stop one broker
2. Update Kafka binaries
3. Update configuration if needed
4. Start the broker
5. Verify cluster health
6. Repeat for next broker
The key here is to do rolling upgrades so you never take down the whole cluster at once.
Backup Strategies
# Mirror topics to another cluster (classic MirrorMaker shown here;
# newer releases prefer MirrorMaker 2 via connect-mirror-maker.sh)
kafka-mirror-maker.sh \
--consumer.config source-consumer.properties \
--producer.config target-producer.properties \
--whitelist "important-.*"
Capacity Planning
Monitor these metrics for scaling decisions:
- Disk usage growth rate
- Network bandwidth utilization
- CPU and memory trends
- Consumer lag patterns
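Consumer lag, the last bullet, is just the gap between a partition’s log-end offset and the group’s committed offset. A sketch with a hypothetical helper (the offsets here are made-up example values):

```java
// Consumer lag per partition: how far the committed offset trails the log end.
public class LagMath {
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    public static void main(String[] args) {
        // Producers are at offset 1,000,500; the group has committed 1,000,000
        System.out.println(lag(1_000_500L, 1_000_000L)); // 500 messages behind
    }
}
```

A lag that grows steadily rather than oscillating around a bound means consumers can’t keep up, and is usually the first signal to add partitions or consumer instances.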
What’s Next?
In Part 6, we’ll explore KRaft, Kafka’s new consensus protocol that removes the ZooKeeper dependency and simplifies cluster management. It’s a big deal for Kafka operations!
Key Takeaways
- Monitoring is Critical: Use JMX metrics with tools like Prometheus and Grafana
- Performance Tuning: Balance throughput vs latency based on use case requirements
- Security Layers: Implement authentication, authorization, and encryption
- High Availability: Leverage replication and automatic failover capabilities
- Operational Excellence: Plan for upgrades, monitoring, and capacity management
Proper administration and monitoring transform Kafka from a development tool into a production-ready, enterprise-grade streaming platform.
See you in Part 6!
Part 5 of 6