In our previous blog post (Drowning in Alerts: Blame it on Statistical Models for Anomaly Detection), we discussed how standard anomaly detection constructs, such as simple moving averages (SMA) and exponential smoothing, do not work well for Ops monitoring.

At OpsClarity, we view monitoring as a multi-layered activity. For example, you may discover a behavior in your system or application that is interesting or warrants attention, but is not serious enough to issue a page that pulls the on-call DevOps engineer away from other tasks. In other instances, there are issues that you will want the on-call DevOps engineer responsible for the system to respond to and investigate immediately.

Consider, for example, the latency of a node server. Latency is a metric that frequently exhibits stochastic behavior that can’t always be precisely predicted. You expect latency to show occasional spikes that resolve on their own without any external intervention. Such spikes can stem from a multitude of causes, including periodic jobs/crons that run on the server, log rotation, garbage collection, etc. Sending a page that alerts the on-call DevOps engineer to such spikes is not a preferred course of action. You would be interrupting the DevOps engineer’s work with an issue that would have resolved on its own.
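To see why naive per-sample alerting is so noisy on this kind of metric, consider the following minimal Python sketch. The numbers (a roughly 100 ms baseline, a 200 ms static threshold, three transient spikes) are purely illustrative and not drawn from any real system; the point is that every short-lived spike crosses the threshold and would fire a page, even though the metric recovers on its own by the next sample.

```python
import random

random.seed(42)

# Simulate 60 minutes of per-minute latency samples (ms): a steady
# ~100 ms baseline plus a few short-lived spikes of the kind caused
# by cron jobs, log rotation, or garbage collection.
latency = [random.gauss(100, 5) for _ in range(60)]
for minute in (12, 31, 47):      # transient spikes that self-resolve
    latency[minute] += 250

THRESHOLD_MS = 200               # illustrative static threshold

pages = [m for m, value in enumerate(latency) if value > THRESHOLD_MS]
print(f"Per-sample threshold pages {len(pages)} times, at minutes {pages}")
# Each of these spikes is gone by the next sample, so every page here
# interrupts the on-call engineer for an issue that resolved itself.
```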

But what if the latency of a node server shows a marked change that persists? If latency, for example, increases by 100% and does not recover over the next several minutes, you would want your on-call DevOps engineer to investigate immediately, as it could have a material impact on customer experience and on your ability to meet SLAs.
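One simple way to express the distinction between a transient blip and a sustained shift is to require that the elevated latency persist for several consecutive samples before paging. The sketch below only illustrates that idea under assumed values (a known 100 ms baseline, a 2x ratio, a five-sample persistence window); it is not OpsClarity's actual model.

```python
from collections import deque

def should_page(samples, baseline_ms, ratio=2.0, persist=5):
    """Page only if the last `persist` samples are all at least
    `ratio` times the baseline, i.e. the increase is sustained."""
    if len(samples) < persist:
        return False
    recent = list(samples)[-persist:]
    return all(value >= ratio * baseline_ms for value in recent)

baseline = 100.0                  # ms; assumed known or learned
window = deque(maxlen=10)

# A single spike does not page...
for value in [98, 102, 350, 101, 99]:
    window.append(value)
print(should_page(window, baseline))  # False: the spike recovered

# ...but a doubling that holds for several minutes does.
for value in [210, 215, 220, 212, 218]:
    window.append(value)
print(should_page(window, baseline))  # True: sustained ~100% increase
```

In practice the baseline itself has to be learned from the data and adjusted for seasonality, which is exactly the kind of modeling the later posts in this series discuss.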

Between these two examples lies the case in which the number of occasional latency spikes shows an uncharacteristic change. You would want to know about such a change from an overall system health point of view, even if it does not warrant a page or an alert. For example, what if, under normal operation, you expect one latency spike every hour or so, but, for some reason, you are observing four or five latency spikes every hour? You would want such a change to be reported in some form (if not an alert or a page), because it could reflect a change in system configuration that is causing sub-optimal behavior. Reporting such an event as an alert or page would, of course, increase alert noise for an event that most likely doesn't require immediate attention. At OpsClarity, we believe such an event is best handled when the DevOps engineers are not busy with other pressing issues, and that alerts and pages should be issued only for events that require immediate attention from DevOps personnel.
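Here is a hypothetical sketch of this in-between case: rather than paging on individual spikes, count them per hour and surface a lower-priority report when the count departs from the expected rate. The expected rate of one spike per hour and the cutoff of three are assumptions made for illustration only.

```python
def spike_count(samples, baseline_ms, spike_ratio=2.0):
    """Count samples that exceed `spike_ratio` times the baseline."""
    return sum(1 for value in samples if value >= spike_ratio * baseline_ms)

EXPECTED_SPIKES_PER_HOUR = 1     # assumed "normal" spike rate
REPORT_CUTOFF = 3                # uncharacteristic, but not page-worthy

def classify_hour(samples, baseline_ms):
    """Return a low-priority report string instead of paging when the
    hourly spike count is well above what we normally expect."""
    spikes = spike_count(samples, baseline_ms)
    if spikes >= REPORT_CUTOFF:
        return f"report: {spikes} spikes this hour vs ~{EXPECTED_SPIKES_PER_HOUR} expected"
    return "ok"

# Example: four spikes in one hour of per-minute samples.
hour = [100.0] * 60
for minute in (5, 20, 35, 50):
    hour[minute] = 260.0
print(classify_hour(hour, baseline_ms=100.0))
```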

For events such as the marked and persistent increase in latency mentioned above, alerts and pages are fully justified. Otherwise, our tools are designed to drastically reduce alerting noise that is unnecessarily disruptive and distracting. We know that significantly reducing unnecessary alerting noise substantially increases DevOps efficiency, because engineers are no longer constantly pulled away from the task at hand.

For events such as the increase in latency spikes, we are currently working on an analytics insights feature that will report them in a form that is easily understandable by anyone concerned about overall system health, trends, and efficiency, from DevOps engineers to the CTO.

In the next few posts, we will look at how our monitors deal with alerting noise and create events only when immediate attention is warranted. We will examine data sets where this problem manifests itself with traditional models, such as static thresholds and SMA, and show how the models used by OpsClarity do a better job of monitoring such metrics.

In this series, we will look at each of the following use cases in detail:

  1. Queue length in messaging systems (Kafka, SQS, etc.)
  2. Detecting latency anomalies
  3. Monitoring error counts and rates
  4. Detecting and accounting for seasonality
  5. Alerting on disk free space