As with any new, fast-moving technology stack, Kafka’s monitoring and operational tools are often a step behind or missing significant functionality. There are, however, a couple of robust open source projects that can be made to work in specific circumstances. One is Burrow from LinkedIn, written in Go; another is Kafka Manager from Yahoo, written in Scala. Both are active projects and are relied upon in significant and complex Kafka environments. There are others as well, such as KafkaOffsetMonitor, but be careful with projects that haven’t been updated in a long time: they probably target Kafka v0.8 and are not compatible with newer versions of Kafka. Version support is a big problem for Consumer Group monitoring, because what works for a v0.8 application might not work for a v0.10 application.
In our last blog, we discussed what Consumer Lag is. In this blog, we will discuss the limitations of some of the open source alternatives. In the next part of this series, we will discuss how OpsClarity addresses these limitations and provides a comprehensive solution to monitor Kafka and Kafka Consumer Lag.
Pros and Cons of Using Burrow
Let’s take Burrow as an example. I’m going to put my finger on a couple of sore spots of the kind one might find in any software, but let me also say that Burrow is high quality software and a solid first generation monitoring solution for Kafka. Burrow was originally written for the first Kafka v0.8 Consumer Group reference implementation, which used Zookeeper for metadata, and naturally it was built around the behaviors one sees in that implementation. Offsets are tracked by running a Kafka client that reads each consumer offset partition continuously and counts the number of messages consumed by each client. This surprised me when I first dug into why Burrow was so busy, so I’ll repeat it: Burrow creates one Kafka client for every partition being monitored. If you have a thousand partitions being consumed by various Consumer Groups, Burrow will start up and run a thousand Kafka clients. I’m sure there are good reasons for doing this instead of calling the OffsetFetchRequest API periodically (perhaps this approach offers more intermediate data points, and so greater precision and control over granularity), but the relative cost is quite high. As a caution, you’ll have to keep an eye on Burrow’s capacity, watching for lag on these internal clients and for their impact on the rest of Kafka.
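To make the periodic-polling alternative concrete, here is a minimal sketch of the lag arithmetic only. It assumes some fetcher (not shown) has already taken a snapshot of each partition’s log end offset and the group’s committed offset, for instance through periodic offset fetch requests. The function names and the dictionary shape are my own illustration and are not taken from Burrow or Kafka Manager.

```python
from typing import Dict, Tuple

# Hypothetical snapshot shape for illustration: (topic, partition) -> offset.
Offsets = Dict[Tuple[str, int], int]


def partition_lag(log_end: Offsets, committed: Offsets) -> Offsets:
    """Lag per partition: log end offset minus the group's committed offset."""
    lag = {}
    for tp, end in log_end.items():
        # Assumption: a partition with no committed offset counts as unread
        # from offset 0. A real monitor might use the log start offset instead.
        done = committed.get(tp, 0)
        # Clamp at zero: the two snapshots are not taken atomically, so a
        # commit can briefly appear ahead of the recorded log end offset.
        lag[tp] = max(end - done, 0)
    return lag


def total_lag(log_end: Offsets, committed: Offsets) -> int:
    """Sum of lag across all partitions in the snapshot."""
    return sum(partition_lag(log_end, committed).values())


# One polling cycle's worth of made-up numbers.
log_end = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1400, ("orders", 1): 900}

print(partition_lag(log_end, committed))  # {('orders', 0): 100, ('orders', 1): 0}
print(total_lag(log_end, committed))      # 100
```

Polling like this yields one data point per partition per cycle, which is the trade-off against the continuous-consumer approach: fewer intermediate points, but no long-running client per partition.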
Kafka Manager works similarly and cautions:
Kafka managed consumer offset is now consumed by KafkaManagedOffsetCache from the “__consumer_offsets” topic. Note, this has not been tested with large number of offsets being tracked. There is a single thread per cluster consuming this topic so it may not be able to keep up on large # of offsets being pushed to the topic.
Home Grown or DIY Monitoring
Another challenge with operating your own monitoring stack for Kafka is integration. First, you have to manage the integration with different generations of applications. If some applications store offsets in Zookeeper while others store offsets in a Coordinator, you’ll need two separate configuration instances in Burrow. As new versions of Kafka come out and new applications are deployed, these moving targets will eventually require you to update your Burrow instance to remain compatible. And of course you must operate Burrow’s built-in notification system and integrate it with your alerting system. There isn’t much that can be done to integrate the UI. Something like Kafka Manager gives you a single view of a cluster but doesn’t give you, say, Zookeeper monitoring or host monitoring. So bringing these views together is left as an exercise in customized integration.
Burrow Health Monitor and False Negatives
Finally, I will point out that the Burrow health monitor, the thing that says whether lag is getting worse, is extremely conservative. It requires a string of consecutive lag increases, say 10 increases in a row, before Burrow alerts you that the Consumer Group is becoming slow. I say this is conservative because, while it certainly covers one case, it does not catch a lagging Consumer Group that reads messages in bursts. Perhaps it was designed this way to avoid false positives in practice; no one could argue that lag getting continuously worse is not a problem. But if you use the newer Simple Consumer Group reference implementation, it can be quite difficult or close to impossible to trigger this required condition, because the Consumer Group has a burst fetch at the heart of its message loop. If you have a slow application, this burst fetch is very likely to produce occasional lag measurements where lag is better than the moment before, even though the overall trend is worse. Say, for example, the group has the following lag: 100 messages behind, then 200, 190, 300, 400, 390, 500, 600… An operator looking at that graph would say the Consumer Group is falling behind. A line drawn through the data points would show a noisy upward trend. Lag is growing, although there are occasional bursts where lag dips briefly relative to the previous measurement. But in this example Burrow will report that this Consumer Group is in good health, because in one place lag went from 200 down to 190. To see Burrow’s perspective, think in terms of the change in lag between measurements: +100, +100, -10, +110, +100, -10… Any measurement where lag fails to increase looks like progress to Burrow. And it will continue to say things are fine, even as lag grows a million fold, so long as the run of increases keeps getting interrupted.
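The example can be checked with a few lines of code. This is a simplified sketch of the rule as I have described it (alert only after a run of consecutive increases), not Burrow’s actual implementation, set beside a plain least-squares slope, which is one way to quantify the trend an operator would see on the graph. Both function names are hypothetical.

```python
from typing import Sequence


def consecutive_increase_alert(lags: Sequence[int], run: int) -> bool:
    """Simplified rule from the text: alert only if lag rose on `run`
    consecutive measurements. Any dip resets the streak."""
    streak = 0
    for prev, cur in zip(lags, lags[1:]):
        streak = streak + 1 if cur > prev else 0
        if streak >= run:
            return True
    return False


def trend_slope(lags: Sequence[int]) -> float:
    """Least-squares slope of lag against measurement index.
    Needs at least two measurements."""
    n = len(lags)
    mean_x = (n - 1) / 2
    mean_y = sum(lags) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(lags))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


lags = [100, 200, 190, 300, 400, 390, 500, 600]

# The longest run of increases is 2, so even a threshold of 3 never fires.
print(consecutive_increase_alert(lags, run=3))  # False
# The fitted line climbs by roughly 68 messages per measurement.
print(round(trend_slope(lags), 1))              # 67.9
```

A slope-based check is only one possible alternative, and it brings its own tuning questions, such as window size and what slope counts as unhealthy.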