Master system health, performance metrics, and real-time troubleshooting for production Kafka clusters.
The Critical Role of Kafka Monitoring
Production Kafka deployments operate at massive scale—millions of messages flowing through partitions across distributed brokers. Without comprehensive monitoring and debugging capabilities, you're flying blind. System degradation, performance bottlenecks, and data loss can occur silently until they cascade into catastrophic failures.
Effective Kafka monitoring provides real-time visibility into cluster health, consumer lag, broker performance, and potential issues before they impact your data pipelines. This section explores industry-standard monitoring strategies, essential metrics, and debugging techniques that separate reliable production systems from those destined for failure.
Essential Kafka Metrics
Understanding which metrics matter is fundamental to operational excellence. Kafka exposes hundreds of metrics through JMX (Java Management Extensions). Here are the critical ones:
Broker-Level Metrics
Under-Replicated Partitions (URP): The number of partitions with fewer in-sync replicas than configured. URP > 0 indicates potential data loss risk. This is your early warning system for cluster health degradation.
Controller Availability: Whether a broker is acting as the controller. Controller loss can cause cluster-wide coordination failures. Monitor for unexpected controller elections.
Disk I/O and CPU Usage: Sustained high disk I/O from disk-intensive operations (log compaction, segment rolling) or high CPU under heavy load often indicates capacity constraints.
Network In/Out Bytes: Total bytes received and transmitted. Sustained high rates indicate heavy traffic; sudden drops suggest producer/consumer issues.
Fetch Request Latency: Time to serve fetch requests from consumers. Elevated latency indicates slow disk reads, network congestion, or broker resource starvation.
Topic and Partition Metrics
Messages In Rate: Messages produced per second to a topic. Useful for baseline understanding and detecting production anomalies.
Bytes In/Out Rate: Throughput in bytes per second. Compare against SLAs to identify performance degradation.
Bytes Fetched Per Second: Consumer read rate. Helps identify imbalances between producer and consumer speeds.
Log Size: Physical size of partition logs on disk. A log that grows indefinitely suggests retention policy issues or a lack of cleanup.
Offset Lag: Difference between the latest offset and the committed offset. Critical for consumer lag tracking.
Consumer Group Metrics
Consumer Lag: Difference between the latest partition offset and the consumer group's committed offset. High lag indicates slow consumers or processing delays. This metric drives prioritization for debugging.
Lag Trend: Whether lag is growing (consumers falling behind), stable, or decreasing. Growing lag for critical topics demands immediate investigation.
Max Lag per Partition: Identifies the partition on which a consumer group lags furthest behind. Helps pinpoint skewed partitions or a stuck consumer instance.
Last Offset Committed Timestamp: Time since the last offset commit. A stale timestamp indicates a consumer has stopped processing.
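The lag and lag-trend metrics above reduce to simple arithmetic over offset snapshots. A minimal sketch in Python (the offsets are illustrative; in practice they come from the AdminClient API or kafka-consumer-groups.sh output):

```python
# Sketch: computing per-partition consumer lag and its trend from two
# polling snapshots. Offset values here are made up for illustration.

def lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus committed offset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

def lag_trend(previous_total, current_total):
    """Classify the lag trend between two polling intervals."""
    if current_total > previous_total:
        return "growing"      # consumers falling behind: investigate
    if current_total < previous_total:
        return "decreasing"
    return "stable"

snapshot_1 = lag({0: 52000, 1: 50500}, {0: 50000, 1: 48500})  # total 4000
snapshot_2 = lag({0: 53000, 1: 51500}, {0: 50200, 1: 48700})  # total 5600

print(lag_trend(sum(snapshot_1.values()), sum(snapshot_2.values())))  # growing
```

A real monitor would poll on an interval and alert when the trend stays "growing" for several consecutive checks, not on a single sample.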
Setting Up JMX Monitoring
Kafka brokers export metrics via JMX by default. To enable remote JMX monitoring, configure the broker's startup script:
# In kafka-server-start.sh or equivalent
# NOTE: authentication and SSL are disabled here for simplicity;
# only do this on a trusted, firewalled network.
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=broker-hostname"
# Start the broker with JMX enabled
./kafka-server-start.sh ../config/server.properties
Once JMX is enabled, tools can connect to port 9999 to access metrics. Many monitoring platforms (Prometheus, Datadog, New Relic) include Kafka JMX exporters that automatically scrape these metrics and aggregate them into dashboards.
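As one illustration, a minimal jmx_exporter configuration that maps the under-replicated-partitions MBean into a Prometheus gauge might look like this (the host, port, and metric name are assumptions; adjust to your deployment):

```yaml
# Illustrative jmx_exporter config; run the exporter as a Java agent on
# each broker and point Prometheus at its HTTP port.
hostPort: broker-hostname:9999
lowercaseOutputName: true
rules:
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_under_replicated_partitions
    type: GAUGE
```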
Recommended Monitoring Stack (2026)
Prometheus: Industry-standard time-series database for metrics collection. The jmx_exporter bridges JMX metrics into Prometheus format.
Grafana: Visualization layer for creating dashboards. Pairs perfectly with Prometheus. Kafka-specific dashboard templates are widely available.
Alertmanager: Triggers alerts when metrics cross thresholds (e.g., consumer lag > 10000, URP > 0).
ELK Stack (Elasticsearch, Logstash, Kibana): For structured log analysis and correlation with metrics during incidents.
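Tying the stack together, Prometheus alerting rules for the thresholds mentioned above could be sketched as follows (the metric names depend on your exporter mapping and are assumptions here):

```yaml
# Sketch of Prometheus alerting rules routed through Alertmanager.
groups:
  - name: kafka-health
    rules:
      - alert: UnderReplicatedPartitions
        expr: kafka_server_under_replicated_partitions > 0
        for: 5m
        labels:
          severity: page        # URP > 0 is a data-loss risk: page on-call
      - alert: ConsumerLagHigh
        expr: kafka_consumergroup_lag > 10000
        for: 10m
        labels:
          severity: warn
```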
Consumer Lag Monitoring and Troubleshooting
Consumer lag is the single most important metric for production systems. It directly indicates whether your data pipelines can keep up with incoming data. Growing consumer lag is an urgent signal.
Detecting Lag Issues
Check consumer group offset status using the command-line tools:
# Describe a consumer group (shows committed offsets and lag)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-app-group --describe
# Sample output:
# GROUP         TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID
# my-app-group  events  0          50000           52000           2000  consumer-1
# my-app-group  events  1          48500           50500           2000  consumer-2
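For ad-hoc automation, that output is easy to post-process. A hedged sketch that sums lag per topic from the tool's plain-text columns (the whitespace-delimited layout is an assumption about the output format):

```python
# Sketch: summing lag per topic from `kafka-consumer-groups.sh --describe`
# output, e.g. to feed a cron-driven alert.

sample = """\
GROUP         TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID
my-app-group  events  0          50000           52000           2000  consumer-1
my-app-group  events  1          48500           50500           2000  consumer-2
"""

def total_lag_by_topic(describe_output):
    totals = {}
    for line in describe_output.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) >= 6:
            topic, lag = fields[1], int(fields[5])
            totals[topic] = totals.get(topic, 0) + lag
    return totals

print(total_lag_by_topic(sample))  # {'events': 4000}
```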
Root Causes and Solutions
Slow Consumers (Processing Delays): Consumer application is processing slowly. Check application logs, CPU/memory usage, and database query times. Consider increasing consumer instances or optimizing the processing logic.
Consumer Crashes or Rebalancing: Consumer processes are crashing or rebalancing frequently. Monitor for exceptions in consumer logs, check for resource exhaustion, and review rebalancing frequency metrics.
Broker Performance Degradation: Brokers are slow to serve fetch requests. Check broker CPU, disk I/O, and network bandwidth. Review under-replicated partitions and leader election status.
Network Bottlenecks: Network saturation between brokers and consumers. Check network throughput metrics and consider increasing network capacity or splitting load across multiple clusters.
Partition Imbalance: One partition has much higher lag than others. Indicates uneven key distribution or a stuck consumer. Check partition key cardinality and rebalance topics if needed.
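The last cause, partition imbalance, is easy to detect mechanically: flag any partition whose lag dwarfs the group median. A small sketch (the 5x factor is an arbitrary illustrative threshold):

```python
# Sketch: flagging a skewed partition, a common signature of hot keys
# or a stuck consumer instance.
import statistics

def skewed_partitions(lag_by_partition, factor=5):
    median = statistics.median(lag_by_partition.values())
    return [p for p, lag in lag_by_partition.items()
            if median > 0 and lag > factor * median]

lags = {0: 120, 1: 150, 2: 9000, 3: 130}
print(skewed_partitions(lags))  # [2]
```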
Debugging Kafka Problems: Common Scenarios
Scenario 1: Messages Disappearing
Producers report success but consumers never see messages. Check producer acks configuration—if acks=1 or acks=0, messages may be lost on broker failure. Enable acks=all for durability. Check broker retention settings and log compaction policies. Verify the topic exists and partitions are healthy (check for under-replicated partitions).
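The durability settings above can be collected into a single producer configuration. A sketch, shown as the config dict you would pass to a client such as confluent-kafka (the library choice is an assumption; the setting names are standard Kafka producer configs):

```python
# Sketch: a durability-oriented producer configuration.
durable_producer_config = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                # wait for all in-sync replicas to ack
    "enable.idempotence": True,   # no duplicates on producer retries
    "retries": 2147483647,        # retry transient failures indefinitely
}

assert durable_producer_config["acks"] == "all"
```

With acks=all, a write succeeds only once every in-sync replica has it, which closes the window where acks=1 can silently drop messages on leader failure.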
Scenario 2: Partition Stuck Without Leader
One partition shows all replicas as out-of-sync, with no leader election occurring. This often happens when the ZooKeeper quorum is unstable or too many brokers crash simultaneously. Check ZooKeeper connectivity and broker startup logs. As a last resort, enabling unclean.leader.election.enable allows an out-of-sync replica to become leader, restoring availability at the cost of possible data loss.
Scenario 3: Rebalancing Storm
Consumer group continuously rebalances, causing processing interruptions. Common causes: session timeouts due to slow consumers (increase session.timeout.ms), too-aggressive heartbeat settings, or resource exhaustion on consumer machines. Analyze rebalancing logs and tune consumer session parameters.
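The session parameters mentioned above must hold a simple relationship: heartbeats should fire several times per session timeout, or brief pauses will trigger rebalances. A sketch with illustrative values (not recommendations):

```python
# Sketch: consumer session settings tuned against rebalance storms.
consumer_session_config = {
    "session.timeout.ms": 45000,     # broker waits this long for heartbeats
    "heartbeat.interval.ms": 15000,  # typically <= 1/3 of session timeout
    "max.poll.interval.ms": 300000,  # max allowed time between poll() calls
}

# Invariant to preserve when tuning: several heartbeats per session window.
assert (consumer_session_config["heartbeat.interval.ms"]
        <= consumer_session_config["session.timeout.ms"] // 3)
```

Note that slow per-record processing trips max.poll.interval.ms, not session.timeout.ms, so diagnose which timeout is actually expiring before raising either.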
Scenario 4: Duplicate Message Processing
Consumers process the same message more than once. Idempotence is critical: ensure producers set enable.idempotence=true and consumers handle duplicates gracefully. If using exactly-once semantics, verify that transactional writes and the read_committed isolation level are configured correctly. Check for consumer offset commits happening out of order.
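Handling duplicates gracefully usually means consumer-side deduplication by a stable message ID. A minimal sketch, assuming each message carries a unique key (in production the seen-ID store would need to be durable and bounded, e.g. a TTL cache or database table):

```python
# Sketch: consumer-side deduplication, one way to tolerate redelivery
# when exactly-once semantics are not in use.
def process_once(messages, handler, seen_ids=None):
    seen_ids = set() if seen_ids is None else seen_ids
    for msg_id, payload in messages:
        if msg_id in seen_ids:
            continue              # duplicate delivery: skip reprocessing
        handler(payload)
        seen_ids.add(msg_id)
    return seen_ids

out = []
process_once([("a", 1), ("b", 2), ("a", 1)], out.append)
print(out)  # [1, 2]
```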
Advanced Debugging Tools and Techniques
Kafka Broker Logs
The server.log file on each broker contains detailed information about rebalancing, leader elections, and errors. Set log level to DEBUG during investigation (increase log4j verbosity in config/log4j.properties), but revert to INFO or WARN for production to avoid performance impact.
Offset Management Tools
kafka-consumer-groups.sh --reset-offsets: Manually adjust consumer group offsets for replaying messages or skipping corrupted data. Use with caution—can cause data loss or reprocessing if misconfigured.
kafka-run-class.sh kafka.tools.JmxTool: Direct JMX metric inspection without external monitoring infrastructure.
Burrow (LinkedIn's Consumer Lag Monitor): Dedicated tool for tracking consumer lag across multiple clusters and groups. Provides historical lag trends and alerting.
Network and Protocol Analysis
For protocol-level debugging, capture Kafka traffic with tcpdump and analyze it in Wireshark. This reveals corruption, timeouts, and malformed requests. Treat it as last-resort debugging, needed only when basic metrics don't explain the issue.
Monitoring Best Practices for 2026
Define SLOs for Critical Topics: Establish service-level objectives (maximum acceptable lag, error rates, latency percentiles). Make these visible in dashboards and trigger alerts at 80% of threshold.
Implement Multi-Cluster Monitoring: Large organizations run multiple Kafka clusters. Centralize metrics collection and alerting to identify patterns across clusters.
Track End-to-End Latency: Measure time from message production to consumption, not just broker metrics. This reveals application-level bottlenecks.
Correlation with Application Metrics: Link Kafka metrics with application performance metrics (request latency, error rates, database query times). Debugging is faster when you see the full picture.
Alerting Hierarchy: Tier alerts by severity. Page on-call engineers for critical issues (URP > 0, cluster offline), let others roll up to daily reports.
Capacity Planning Based on Trends: Use historical metrics to forecast when you'll hit resource limits. Provision proactively rather than reactively.
Incident Runbooks: Document decision trees for common alerts. When consumer lag spikes, what's your first check? Codify this in runbooks accessible during incidents.
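End-to-end latency tracking, mentioned above, is straightforward if producers embed a timestamp in each record and consumers record receipt time. A sketch using nearest-rank percentiles (the timestamp values are illustrative):

```python
# Sketch: end-to-end latency percentiles from embedded produce timestamps.
def latency_percentiles(produce_ts, consume_ts, percentiles=(50, 95, 99)):
    latencies = sorted(c - p for p, c in zip(produce_ts, consume_ts))
    n = len(latencies)
    # Nearest-rank selection; a real system would use HDR histograms.
    return {pct: latencies[min(n - 1, int(n * pct / 100))]
            for pct in percentiles}

produced = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
consumed = [5, 16, 27, 38, 49, 60, 71, 82, 93, 104]
print(latency_percentiles(produced, consumed))
```

Watching the p99 of this number catches application-level slowness that broker-side fetch latency alone would miss.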
Tools and Ecosystems
The Kafka monitoring ecosystem is mature and diverse. Popular platforms include Confluent Cloud (fully managed with built-in monitoring), self-hosted Prometheus + Grafana, Datadog, New Relic, and Splunk. Each approach has trade-offs between operational overhead and feature richness. For 2026, cloud-native monitoring (integrating with Kubernetes and container orchestration) is becoming standard.
Regardless of tooling, the principles remain constant: measure broker health, track consumer lag obsessively, correlate metrics with application behavior, and maintain runbooks for rapid incident response. Production Kafka requires vigilance. The clusters that run smoothly aren't lucky—they're monitored.