Message queues and brokers are ubiquitous in real-time applications. These intermediate modules (e.g., Apache Kafka, RabbitMQ, AWS SQS) improve system reliability by decoupling producers from consumers, freeing them from any synchronization requirements. A primary operational concern with such systems is whether the consumers are keeping up with the producers, i.e., whether messages are piling up in the queue or log while waiting to be processed.
Many of these services provide a metric that represents the lag between the consumer and the producer, for example, ApproximateNumberOfMessagesVisible in AWS SQS, and consumer lag in Kafka. Typically, DevOps engineers use a static threshold-based monitor for queue-size metrics. For example, they set a monitor to trigger an alert if the number of unprocessed messages in the queue exceeds 1,000 or 5,000. Using a static threshold in this manner is severely prone to false alerts, especially when the messages sent by the producer are sporadic and arrive in bursts. Moreover, it is extremely tricky to figure out an appropriate static threshold, as it can be very sensitive to the producer and consumer characteristics. Even within the same Kafka system, the appropriate static threshold can differ widely across topics. The problem can be alleviated to a certain degree through the use of window averaging (such as a moving average or exponential smoothing) before applying the static threshold.
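To see why smoothing alone does not solve the problem, here is a minimal sketch (not OpsClarity code; the function names, the alpha value, and the example series are our own illustration) that applies exponential smoothing to a queue-size series and then a static threshold. A single healthy burst that the consumer quickly drains still fires several alerts:

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average of a queue-size series."""
    smoothed, avg = [], values[0]
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

def static_threshold_alerts(values, threshold=1000):
    """Indices where the (smoothed) queue size exceeds the static threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

# A bursty but healthy queue: the producer dumps 5,000 messages once,
# and the consumer drains the backlog within a few intervals.
queue = [10, 12, 9, 5000, 3000, 1200, 300, 40, 11, 9]
alerts = static_threshold_alerts(ewma(queue), threshold=1000)
# Four false alerts fire (indices 3-6) even though the queue recovers on its own.
```

Smoothing suppresses single-sample spikes, but any burst that persists for a few intervals still crosses the threshold, and the "right" threshold remains guesswork.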
But this still does not address the core question of determining an appropriate alerting threshold. Taking a step back, it is easy to see that the primary concern is not the number of unprocessed messages in the queue, but whether the consumer is keeping up with the producer. For example, depending on the application, a producer could sporadically dump a million messages to the message queue. Such a producer is likely to trigger any static threshold monitor, even one that uses an averaging window. But the more pertinent question we should be asking is: will the consumer be able to trundle through the message dump before the next big dump arrives? Our queue-length monitor does precisely this and triggers an alert only if there is adequate evidence that the consumer is not able to keep up with the producer and needs intervention from the DevOps engineer. This means that our monitors will not trigger an unnecessary alert even if the producer is highly sporadic.
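The "will it drain in time?" question can be sketched as a simple rate comparison. This is an illustrative simplification, not the OpsClarity model itself; the function name and the numbers in the example are hypothetical, and real rates would be estimated from the metric stream:

```python
def will_drain_in_time(backlog, consume_rate, produce_rate, horizon):
    """
    Return True if the current backlog is expected to clear within
    `horizon` intervals, given average consume and produce rates
    (messages per interval). A non-positive net drain rate means the
    consumer is falling behind and the backlog never clears.
    """
    net_drain = consume_rate - produce_rate
    if net_drain <= 0:
        return False
    return backlog / net_drain <= horizon

# A million-message dump, a consumer that drains 50k msgs/min faster
# than the steady-state producer, and roughly an hour until the next dump:
ok = will_drain_in_time(backlog=1_000_000, consume_rate=60_000,
                        produce_rate=10_000, horizon=60)
```

Under these assumed rates the backlog clears in about 20 minutes, well before the next dump, so no alert is warranted despite the momentary million-message spike.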
Moreover, our monitors will trigger an alert even if the queue size is ramping up in an undulating manner. As an example, we show two different time series above, both representing queue-size metrics. The blue line shows a queue where the producer sporadically dumps a large number of messages, but the consumer is able to process them quickly. OpsClarity monitors will not trigger any alerts for the queue represented by the blue line. On the other hand, the queue represented by the green line appears to be doing well until about the 3-hour mark. After that point, it is clear that the queue size is gradually ramping up, albeit in a stochastic and undulating manner. For the green queue, on average, the consumer is not able to keep up with the producer, and consequently our monitor will fire an alert.
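One simple way to separate the two behaviors (again, an illustrative sketch under our own assumptions, not the actual OpsClarity model) is to fit a trend to a window of the queue-size series: an undulating-but-ramping queue shows a sustained positive slope, while a spiky-but-healthy queue does not:

```python
def ols_slope(series):
    """Ordinary least-squares slope of a queue-size series over time."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def ramping_up(series, slope_limit=1.0):
    """Flag a sustained ramp even when the series undulates."""
    return ols_slope(series) > slope_limit

# Like the green queue: dips never return to baseline, slope stays positive.
green = [100, 180, 140, 260, 210, 340, 290, 430, 380, 520]
# Like the blue queue: huge bursts, but each one drains back to baseline.
spiky = [100, 5000, 120, 90, 4800, 110, 95, 5100, 105, 100]
```

Here `ramping_up(green)` is true and `ramping_up(spiky)` is false, even though the spiky queue reaches far larger absolute values; a static threshold would get this exactly backwards.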
In the next post we will discuss the Latency models leveraged by OpsClarity.