We briefly discussed the issues with monitoring latency in our first post on Anomaly Detection alerts. As the example plots below show, latency exhibits occasional sharp spikes in which it rises far above its normal range. These spikes have many causes, frequently beyond the control of DevOps engineers, and they normally do not pose any significant risk to the system or the customer experience. Standard methods, such as a static threshold or a simple moving average (SMA), will nonetheless fire an alert on each spike, which increases alert noise, distracts the DevOps engineers, and reduces their productivity.

OpsClarity monitors avoid this issue by looking at the persistence of unusual changes in latency. If the change is fleeting, it is noted as a spike but does not result in an alert or page. The change results in an alert only if it persists for a certain length of time. In the example plots below, the blue line will not result in any alerts, as its latency spikes are very narrow and contained in time. The green line, on the other hand, will trigger an alert, since it shows a significant increase in latency that persists over a long window of time.
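The persistence rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpsClarity's actual model: the function name `classify_excursions`, the fixed `baseline`, the `factor` that defines "unusual", and the `min_persist` sample count are all hypothetical.

```python
from typing import List, Tuple


def classify_excursions(
    latencies: List[float],
    baseline: float,
    factor: float = 2.0,   # assumed: "unusual" means above factor * baseline
    min_persist: int = 5,  # assumed: samples a change must last to alert
) -> List[Tuple[int, int, str]]:
    """Return (start_index, length, label) for each run above the threshold.

    A run shorter than min_persist is labeled "spike" (noted, no page);
    a run of at least min_persist samples is labeled "alert".
    """
    threshold = baseline * factor
    events = []
    run_start = None
    for i, value in enumerate(latencies):
        if value > threshold:
            if run_start is None:
                run_start = i  # a new excursion begins here
        elif run_start is not None:
            length = i - run_start
            label = "alert" if length >= min_persist else "spike"
            events.append((run_start, length, label))
            run_start = None
    if run_start is not None:
        # The series ended while latency was still elevated.
        length = len(latencies) - run_start
        label = "alert" if length >= min_persist else "spike"
        events.append((run_start, length, label))
    return events


# Narrow spikes, like the blue line: noted, but never alerted on.
blue = [100, 100, 900, 100, 100, 100, 850, 100]
# A sustained increase, like the green line: long enough to alert.
green = [100, 100, 100, 300, 300, 300, 300, 300, 300, 100, 100]

blue_events = classify_excursions(blue, baseline=100)
green_events = classify_excursions(green, baseline=100)
```

With these made-up series, `blue_events` contains two one-sample spikes and `green_events` contains a single six-sample alert. A real monitor would estimate the baseline from history rather than take it as a constant.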

Our model also accounts for the severity of the change in latency when deciding on the persistence requirements for an alert. For example, latency at twice the normal value must persist for a longer interval to qualify for an alert than latency at five times the normal value.
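One way to picture this trade-off between severity and persistence is a required duration that shrinks as the latency ratio grows. The sketch below is an assumption for illustration only. The inverse relationship, the names `required_persistence` and `should_alert`, and the parameters `base_samples` and `trigger_ratio` are hypothetical and are not taken from OpsClarity's model.

```python
import math
from typing import List


def required_persistence(ratio: float, base_samples: int = 12,
                         min_samples: int = 1) -> int:
    """Samples an elevation must last before alerting (assumed rule).

    ratio is latency / normal latency. The requirement shrinks as the
    excess (ratio - 1) grows, so severe changes alert sooner.
    """
    if ratio <= 1.0:
        raise ValueError("ratio must be greater than 1 for an elevation")
    return max(min_samples, math.ceil(base_samples / (ratio - 1.0)))


def should_alert(latencies: List[float], baseline: float,
                 trigger_ratio: float = 2.0, base_samples: int = 12) -> bool:
    """True if some run of elevated samples lasts as long as its severity requires."""
    run_length = 0
    run_sum = 0.0  # sum of ratios in the current run, used for the mean
    for value in latencies:
        ratio = value / baseline
        if ratio >= trigger_ratio:
            run_length += 1
            run_sum += ratio
            mean_ratio = run_sum / run_length
            if run_length >= required_persistence(mean_ratio, base_samples):
                return True
        else:
            # Latency returned to normal, so the run is over.
            run_length = 0
            run_sum = 0.0
    return False


# Twice normal for 6 samples: not persistent enough under these assumptions.
mild_short = [100] * 5 + [200] * 6 + [100] * 5
# Five times normal for 3 samples: severe enough to alert quickly.
severe_short = [100] * 5 + [500] * 3 + [100] * 5
```

Under these assumed parameters, a doubling of latency must last 12 samples before it alerts, while a fivefold increase alerts after 3. The run's mean ratio is used rather than its peak so that a single extreme sample does not shorten the requirement for an otherwise mild elevation.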

As an example, the red line in the plot below will also generate an alert, even though its increase in latency resolved after some time. This is because the increase in latency was very significant (almost reaching 1 second). Such a large increase warrants an alert even if it does not persist for long, as it could have a severe impact on application response and user experience.


In the next post, we will discuss the data science models OpsClarity leverages to alert on errors.

In this series we look at each of the following models in detail:

  1. Queue Length in Messaging Systems (Kafka, SQS, etc.)
  2. Detecting Latency
  3. Monitoring for error counts and rates
  4. Detecting and accounting for seasonality
  5. Alerting on disk free space