Metrics that represent errors pose a special problem. "Error" is often used in a generic sense: it can mean something serious, where any non-zero value warrants investigation (e.g., 5xx errors), or it can mean something with an acceptable baseline value, where only an unusual change indicates a serious problem (e.g., 4xx errors, page faults, Memcache evictions).

We have developed a model that accounts for both use cases and automatically arrives at the right thresholds for alerting. For example, 5xx errors represent a server-side issue, and for a normally operating system their count is typically zero. If you apply our error count monitor to a 5xx error count metric, it will alert as soon as a few non-zero values are observed within a short interval of time. This is justified, as 5xx errors warrant immediate investigation to resolve any server-side issues. We do not alert on a single non-zero value, as even a well-performing system is bound to produce an occasional 5xx error during the normal course of operation. As an example, we show a plot of 5xx errors below. For this plot, our monitor will trigger alerts for the clusters of errors around the 1.2-hour mark and the 4.5-hour mark, but it will not trigger an alert for the lone error near the 2.75-hour mark.
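The idea can be sketched with a simple sliding window: alert when several non-zero samples cluster within a short interval, but stay quiet for an isolated single error. This is a minimal illustration with hypothetical parameters (`window`, `min_nonzero`), not the actual OpsClarity model:

```python
from collections import deque

def make_error_cluster_monitor(window=5, min_nonzero=2):
    """Return a function that consumes one error-count sample at a
    time and reports True when an alert should fire."""
    recent = deque(maxlen=window)  # sliding window of recent samples

    def observe(count):
        recent.append(count)
        # Alert only when multiple non-zero samples fall in the window.
        nonzero = sum(1 for c in recent if c > 0)
        return nonzero >= min_nonzero

    return observe

monitor = make_error_cluster_monitor(window=5, min_nonzero=2)
# A lone 5xx error does not alert; a cluster of them does.
stream = [0, 0, 1, 0, 0, 0, 0, 0, 2, 3, 1, 0]
alerts = [monitor(c) for c in stream]
```

With this stream, the single error at the third sample produces no alert, while the run of errors near the end does.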

Unlike 5xx errors, 4xx errors represent client-side issues, such as malformed URLs, forbidden actions, lack of authentication, etc. On a server handling a decent amount of traffic, it is not unusual to have a non-zero number of 4xx errors at any given time. Our error count monitor automatically accounts for this baseline and issues an alert only if the number of 4xx errors is significantly over the baseline level and persists for an unreasonable amount of time. In the plot below, we show two examples of such an error count. Our model will trigger an alert for the red line, but will not generate any alert for the green line, as the minor bump in its error count does not persist for an unreasonable interval of time.
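This baseline-aware behavior can be sketched as "excess over baseline must persist." The function below is an illustrative simplification with made-up parameters (`factor`, `min_duration`), not the actual model:

```python
def sustained_excess_alert(samples, baseline, factor=2.0, min_duration=3):
    """Return True if `samples` stay above factor * baseline for at
    least `min_duration` consecutive points; brief bumps are ignored."""
    threshold = factor * baseline
    run = 0  # length of the current run of above-threshold samples
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_duration:
            return True
    return False

baseline = 20  # typical 4xx count per interval (hypothetical)

# A brief bump above baseline, like the green line: no alert.
green = [22, 19, 45, 21, 20, 18]
# A sustained elevation, like the red line: alert.
red = [22, 48, 55, 60, 52, 25]
```

A production model would estimate the baseline from history rather than take it as a constant, but the persistence check is the same in spirit.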



In the next post, we will discuss how OpsClarity leverages AI and machine learning to tease out seasonality patterns.

In this series we look at each of the following models in detail:

  1. Queue Length in Messaging Systems (Kafka, SQS, etc.)
  2. Detecting Latency
  3. Monitoring for error counts and rates
  4. Detecting and accounting for seasonality
  5. Alerting on disk free space