Running low on available disk space (diskfree) can have a significant, often catastrophic, impact on the applications and services running on a system. Every DevOps engineer therefore knows that it is crucial to carefully monitor disk usage on all critical systems, especially those that tend to consume disk space rapidly, such as heavily used Hadoop stores, applications with extensive logging, or Kafka clusters with long retention periods.

The most common monitors for diskfree rely on a static threshold, set by a DevOps engineer with intricate knowledge of the system and the applications running on it. For example, an engineer may set a static threshold at 5%, meaning the monitor triggers an alert whenever diskfree falls below 5%. In our experience, this approach is ineffective for several reasons. It is not trivial to determine the appropriate threshold for a given system. On some systems you may want a higher threshold, particularly if you know the rate of disk consumption can spike. Conversely, other systems may have very high but very stable disk usage; on those, you do not want an alert even when diskfree sits below the threshold, as long as usage remains stable and unchanging.
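To make the contrast concrete, a conventional static-threshold check amounts to a single comparison against a fixed percentage. The sketch below is purely illustrative; the 5% threshold, mount point, and function names are our own assumptions rather than any particular product's implementation.

```python
# Hypothetical sketch of a static-threshold diskfree monitor.
# The 5% threshold and the root mount point are illustrative assumptions.
import shutil

def diskfree_percent(path="/"):
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def static_threshold_alert(path="/", threshold_pct=5.0):
    # Fires whenever free space dips below the fixed threshold,
    # regardless of whether usage is stable or actively growing.
    return diskfree_percent(path) < threshold_pct

if __name__ == "__main__":
    if static_threshold_alert():
        print("ALERT: diskfree below 5%")
```

Note that this check has no notion of how fast the disk is filling up, which is exactly the limitation described above.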

Based on our experience with customer systems, we have developed a disk monitor that takes into account both the current disk utilization and the time characteristics of the disk usage pattern. For example, our monitor will not trigger an alert when diskfree is low as long as free space is not actively dropping; this significantly reduces alert noise for DevOps engineers. Conversely, the monitor will trigger an alert even when diskfree is high if we see a sharp increase in the rate of disk consumption that makes a disk-full scenario much more likely.
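Conceptually, this kind of trend-aware monitoring can be approximated by fitting the recent diskfree trajectory and extrapolating to the point where free space would reach zero. The sketch below is a simplified illustration of that idea, not OpsClarity's actual model; the sample format, the simple least-squares fit, and the function name are assumptions made for the example.

```python
# Simplified sketch: estimate the disk-consumption trend from recent
# samples and extrapolate the time remaining until the disk is full.
# A straight-line fit is an illustrative assumption; the production
# model described in this post is more sophisticated.
from typing import List, Optional, Tuple

def hours_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """samples: (timestamp_in_hours, diskfree_percent) pairs, oldest first."""
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    cov = sum((t - mean_t) * (f - mean_f) for t, f in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return None
    slope = cov / var            # change in diskfree percent per hour
    if slope >= 0:
        return None              # stable or recovering: no predicted disk-full
    latest_free = samples[-1][1]
    return latest_free / -slope  # hours until diskfree reaches 0% at this rate
```

A disk whose diskfree is low but flat yields no prediction at all (and hence no alert), while a disk that is draining quickly yields a short time-to-full even when plenty of space remains.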

As an example, we show three different diskfree plots below. Our monitor will not issue an alert for the disk represented by the blue line, even though the free space on this disk occasionally falls below 5%, because its usage is fairly stable and unlikely to change drastically. On the other hand, we will issue a warning alert for the disk represented by the green line around the five-day mark, when the estimated time until the disk runs out of space drops below three days. This alert is escalated to a critical alert if the model predicts that the disk will be full in less than eight hours. For the disk represented by the red line, the monitor issues a warning alert somewhere between the five-day and six-day mark, when it predicts that the disk will be full in less than three days if the usage trend continues. The model issues this alert even though diskfree is over 30%, because of the high rate at which disk space is being consumed. In this way, our model transforms the raw diskfree value into a more intuitive "time until disk full" value, creating meaningful alerts that work well on systems with very different characteristics.

[Figure: DiskFree plots for the three example disks (blue, green, red)]
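Given an estimated time-to-full, the alerting decision described above reduces to comparing that estimate against warning and critical horizons (three days and eight hours in the scenarios shown). The snippet below is a hedged sketch of that mapping; the names and default horizons are ours, chosen to match the example.

```python
# Sketch of mapping an estimated time-to-full to an alert severity,
# using the horizons from the example above. Names and values are
# illustrative assumptions, not product configuration.
WARNING_HORIZON_HOURS = 72.0   # warn if predicted full within three days
CRITICAL_HORIZON_HOURS = 8.0   # escalate if predicted full within eight hours

def classify_alert(hours_until_full):
    if hours_until_full is None:
        return "ok"        # stable or recovering disk: stay quiet even at low diskfree
    if hours_until_full <= CRITICAL_HORIZON_HOURS:
        return "critical"
    if hours_until_full <= WARNING_HORIZON_HOURS:
        return "warning"
    return "ok"
```

This classification is independent of the absolute diskfree value, which is why the red disk in the example can alert at over 30% free while the blue disk stays quiet below 5%.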

As another example of our diskfree alert, we have added a screenshot of our product showing the model in action.

[Screenshot: diskfree alert in the OpsClarity product]

In this post we have provided a taste of how OpsClarity combines deep operational experience and insight with robust analytics to provide extraordinarily useful monitoring tools that are not only easy to set up, with minimal configuration, but also significantly improve monitoring efficiency by reducing alert noise.

In this series we look at each of the following models in detail:

  1. Queue length in messaging systems (Kafka, SQS, etc.)
  2. Detecting latency
  3. Monitoring for error counts and rates
  4. Detecting and accounting for seasonality
  5. Alerting on disk free space