Standard models such as SMA and EWMA fail when a metric has a trend or seasonality. We have already seen the effect of trend on metrics representing queue size. Many metrics that represent important business concerns also exhibit strong seasonal behavior. For example, the number of active users on an e-commerce site shows both daily and weekly seasonality: it peaks during evening hours and bottoms out late at night, and weekend peaks may be higher than weekday peaks.

Metrics such as latency may also reflect some seasonal behavior, but we can safely ignore it there, because the seasonal component is not particularly dominant in a well-designed system that can handle varying amounts of traffic. The story is different for metrics such as active user count and request count, where the value at the peaks may be an order of magnitude higher than the value at the troughs. In these cases, the seasonal component is too dominant for standard anomaly detection (AD) methods to work unless they account and correct for the seasonality. We have developed models precisely for this use case. With this technique, we can flag an unusually high request count that occurs during a period of low activity, even when that count is lower than the global peak.
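
The post does not describe the model itself, so the following is only a minimal sketch of one common way to account for seasonality, not the OpsClarity implementation. It assumes equally spaced samples and a single known cycle length, builds a per-slot baseline (for example, the median for each hour of the day), and flags points whose residual from that baseline is large relative to a robust estimate of spread. The function name `seasonal_anomalies`, the parameters `period` and `threshold`, and the synthetic data are all hypothetical.

```python
import math
from statistics import median


def seasonal_anomalies(values, period, threshold=5.0):
    """Return indices of points that deviate from their seasonal baseline.

    values:    equally spaced samples, e.g. hourly request counts
    period:    samples per seasonal cycle (24 for a daily cycle on hourly data)
    threshold: allowed deviation, in robust standard deviations (an assumption)
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full cycles to estimate a baseline")

    # Group samples by their position in the cycle (e.g. hour of day).
    slots = [[] for _ in range(period)]
    for i, v in enumerate(values):
        slots[i % period].append(v)

    # The per-slot median is a baseline that a single spike cannot drag far.
    baseline = [median(s) for s in slots]
    residuals = [v - baseline[i % period] for i, v in enumerate(values)]

    # Robust scale: median absolute deviation, rescaled to be comparable
    # to a standard deviation for normally distributed residuals.
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    scale = max(1.4826 * mad, 1e-9)

    return [i for i, r in enumerate(residuals)
            if abs(r - center) > threshold * scale]


# Synthetic hourly request counts for seven days: trough of about 100 at
# 04:00, peak of about 1900 at 16:00, plus a small day-to-day offset.
day_offset = [-15, -10, -5, 0, 5, 10, 15]
values = []
for i in range(7 * 24):
    hour = i % 24
    base = round(1000 - 900 * math.cos(2 * math.pi * (hour - 4) / 24))
    values.append(base + day_offset[i // 24])

clean = list(values)
values[27] += 600  # spike at 03:00 on the second day, during low activity

anomalies = seasonal_anomalies(values, period=24)
print(anomalies)                  # [27]
print(values[27] < max(values))   # True: the spike is below the global peak
```

A fixed global threshold on the raw counts would miss this spike, because its value stays well below the daily peak. Comparing each point with the typical value for its own slot in the cycle is what exposes it. A production model would also need to handle weekly effects, trend, and missing data, which this sketch ignores.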

As an example, the plot below shows a seven-day time series for a request count metric. It clearly demonstrates the daily cyclical nature of request counts, along with weekend effects. Because our model accounts for the seasonal effects, we can identify and alert on the unusual jump in the request count near the one-day mark. The spike might be the result of an expected surge in the number of users caused by a special, time-based promotion, or it could be the result of a code deployment that misconfigured some settings, causing a surge of redirections that inflates the request count. In either case, it is good to catch this event from both business and DevOps perspectives.

[Figure: Requests]

In the final post of this series, we will discuss how OpsClarity optimizes alerts for free disk space using machine learning.

In this series we look at each of the following models in detail:

  1. Queue Length in Messaging Systems (Kafka, SQS, etc.)
  2. Detecting Latency
  3. Monitoring for error counts and rates
  4. Detecting and accounting for seasonality
  5. Alerting on disk free space