Data science, artificial intelligence (AI), and machine learning have transformed e-commerce, personalization, digital marketing, and search. Large-scale data analysis has become a powerful tool for businesses to create competitive differentiation. Seemingly left out of this, however, is IT operations – the place where all of this incredible computation for data science actually runs. This is beginning to change.

OpsClarity is leading the charge to bring the efficiency and sophistication of current data science and machine learning techniques to IT Ops and monitoring. Our vision has been to bring cutting-edge innovations in data science and machine learning from the consumer world to the Ops world. To that end, OpsClarity has built an intelligent assistant for DevOps engineers that learns application and system environments, detects and correlates failures, and makes recommendations to drive increased focus and productivity – all while everything is continuously changing at speeds never anticipated. With the advent of containers and dynamic orchestration systems like Kubernetes and Mesos, data-science-led discovery and correlation of concerns across hundreds of dynamic microservices becomes even more critical. Simply collecting metrics and visually correlating graphs does not scale when your environment has this level of entropy and churn. Machine learning and data science are well suited to learning failure patterns and detecting them quickly and accurately as they recur.

The OpsClarity platform has several data science and machine learning constructs that are specifically designed to manage the hyper-scale, hyper-change microservices architectures of large, complex, distributed, data-intensive applications. The platform has been built with the specific goal of significantly improving the troubleshooting workflow for these applications. It was designed from the ground up to handle the massive volume of data generated by modern web-scale applications by applying data science along with advanced correlation and anomaly detection algorithms. However, since every application and metric is different, the same algorithms or anomaly detection techniques cannot be applied to all the metrics that are collected. Based on the context and history of the application and its metrics, the platform constantly learns system behavior, understands the context, and chooses the appropriate combination of algorithms to apply. This intelligence is built into the platform’s engine, called the Operational Knowledge Graph.

OpsClarity applies data science and machine learning at several levels:

  1. Automatically detect changes in the environment and configure metric collection
  2. Aggregate services into logical clusters
  3. Develop a deep understanding of each component (such as Kafka, Docker, Spark, and Cassandra)
  4. Collect the most critical metrics for each service
  5. Apply a very specific anomaly detection model for the critical metrics, rather than using generic statistical models
  6. Correlate events and concerns across multiple services to reduce alert noise and facilitate rapid troubleshooting
  7. Correlate failures across multiple services to isolate the root cause (see the sketch below)
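
As a rough illustration of the last point, the sketch below ranks anomalous services by how many other anomalous services depend on them in the topology. The dependency graph, service names, and scoring heuristic are assumptions for exposition, not the actual OpsClarity algorithm.

```python
# Illustrative sketch: rank likely root causes by correlating anomalies across
# a service dependency graph. The graph, service names, and scoring heuristic
# are assumptions for exposition, not the actual OpsClarity algorithm.
from collections import defaultdict

# Directed edges point from a service to the services it depends on.
DEPENDENCIES = {
    "web": ["api"],
    "api": ["kafka", "cassandra"],
    "kafka": ["zookeeper"],
    "cassandra": [],
    "zookeeper": [],
}

def rank_root_causes(anomalous, deps):
    """Score each anomalous service by how many anomalous services depend on it."""
    dependents = defaultdict(set)          # reverse graph: who depends on me?
    for svc, needs in deps.items():
        for needed in needs:
            dependents[needed].add(svc)

    def impact(service):
        seen, stack = set(), [service]
        while stack:                        # walk everything that depends on `service`
            for dep in dependents[stack.pop()]:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return len(seen & anomalous)

    # A failing service that explains many downstream anomalies ranks first.
    return sorted(anomalous, key=impact, reverse=True)

print(rank_root_causes({"web", "api", "kafka"}, DEPENDENCIES))
# -> ['kafka', 'api', 'web']: kafka best explains the observed failures
```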

OpsClarity Operational Knowledge Graph

At a high level, the OpsClarity Operational Knowledge Graph captures the kind of domain knowledge that an experienced operations engineer would have built up over time. Our goal is to use this knowledge to drive the next level of operational analytics features, including faster troubleshooting workflows, more accurate alerting, and richer visualizations.

The Operational Knowledge Graph (OKG) synthesizes data to create an understanding of the infrastructure that goes significantly beyond hosts, metrics, and tags. It automatically builds a knowledge graph of the application and infrastructure, and it includes many important innovations (a simplified sketch of this data model follows the list below):

  1. Understanding of the operational data model
    1. What are the important processes running on hosts or inside containers?
    2. How are hosts / containers clustered into services?
    3. How are services grouped into applications?
  2. Visibility into how services and applications communicate with each other, otherwise known as the application and service topology
  3. Deep domain knowledge on the individual services, including:
    1. Knowledge of the most critical metrics and events to collect for each service
    2. Knowledge of the specific kinds of failure patterns to detect from the metrics for each service, enabling customized baselining and anomaly detection algorithms for each service
    3. Knowledge of how failures propagate across services, providing identification of the probable root causes of issues faster
  4. Historical snapshots of all the data sets mentioned above, providing a view of the system state at any point in time
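
To make this data model concrete, here is a minimal sketch of a knowledge-graph-style structure for services, their hosts, curated metrics, and topology edges. The class and field names are illustrative assumptions; the actual Operational Knowledge Graph is not documented at this level of detail.

```python
# Illustrative sketch of a knowledge-graph-style data model for hosts, services,
# and applications. Class and field names are assumptions for exposition.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ServiceNode:
    name: str                                              # e.g. "api"
    service_type: str                                       # e.g. "kafka", "nginx"
    hosts: List[str] = field(default_factory=list)          # hosts/containers in the cluster
    key_metrics: List[str] = field(default_factory=list)    # curated metrics to collect
    talks_to: List[str] = field(default_factory=list)       # observed downstream services

@dataclass
class KnowledgeGraph:
    services: Dict[str, ServiceNode] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str) -> None:
        """Record that `src` communicates with `dst` (a topology edge)."""
        self.services[src].talks_to.append(dst)

# Example: a tiny slice of an application topology.
okg = KnowledgeGraph()
okg.services["api"] = ServiceNode("api", "nginx", hosts=["host-1"], key_metrics=["5xx_rate"])
okg.services["kafka"] = ServiceNode("kafka", "kafka", hosts=["host-2", "host-3"],
                                    key_metrics=["under_replicated_partitions", "consumer_lag"])
okg.add_edge("api", "kafka")
```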

Automated Topology and Change Detection

OpsClarity leverages data science to automatically discover the application topology, including all external services that your application may be using, and always keeps it up to date. One can see how the applications and the various services in the topology communicate with each other, and as the infrastructure scales up or down, the topology view updates in real time. This is achieved through sophisticated agents that integrate with the underlying OS kernel and detect any changes in processes and network connections. OpsClarity analyzes process signatures and command-line parameters to determine not only what type of service is running, but also the version of the service and which other services it is communicating with. Furthermore, OpsClarity pulls data from orchestration platforms like Kubernetes and Mesos to get additional information as well as to apply intelligence for the respective overlay networks.
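
As a rough illustration of the signature-matching idea, the sketch below maps a process command line to a service type using a few hand-written patterns. The patterns and function names are assumptions for exposition; a real agent would combine process, network, and orchestrator metadata.

```python
# Illustrative sketch of classifying a running process from its command line.
# The signature patterns are simplified assumptions, not the OpsClarity agent's rules.
import re

SIGNATURES = [
    (re.compile(r"kafka\.Kafka\b"), "kafka"),
    (re.compile(r"org\.apache\.zookeeper\.server"), "zookeeper"),
    (re.compile(r"\bredis-server\b"), "redis"),
    (re.compile(r"org\.apache\.cassandra\.service\.CassandraDaemon"), "cassandra"),
]

def classify(cmdline: str) -> str:
    """Map a raw process command line to a known service type."""
    for pattern, service in SIGNATURES:
        if pattern.search(cmdline):
            return service
    return "unknown"

print(classify("java -Xmx4g kafka.Kafka /etc/kafka/server.properties"))  # -> "kafka"
```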

[Figure: automatically discovered application and service topology]

Automated Anomaly Detection

Anomaly detection is a well-studied problem in the signal processing and statistics fields. It deals with finding abnormal data points in a given set of data, and there has been a tremendous amount of research on the topic over the last several decades.

In the context of IT operations, the data is usually streams of metrics collected from all levels of the infrastructure – hosts, containers, services, applications. The challenge of anomaly detection is finding a region of time where a given metric stream has abnormal behavior compared to other regions of time. Due to the huge variety in the types of metrics, there is no single technique that works across all metrics and in all application environments. The most popular technique for anomaly detection – moving average + standard deviation – works well on data that has a normal distribution.
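
For reference, a minimal sketch of the moving-average-plus-standard-deviation approach is shown below; the window size and sigma multiplier are illustrative choices, not OpsClarity defaults.

```python
# Minimal sketch of the moving-average + standard-deviation detector: flag
# points that fall more than k standard deviations from a rolling mean.
import numpy as np

def rolling_zscore_anomalies(series, window=30, k=3.0):
    """Return indices where the value deviates more than k sigma from the rolling mean."""
    series = np.asarray(series, dtype=float)
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies

# Works reasonably on roughly normal data; on skewed or multi-modal metrics
# (the common case in operations) it produces many false positives.
signal = np.random.normal(100, 5, 500)
signal[400] = 160  # inject a spike
print(rolling_zscore_anomalies(signal))  # contains 400, often plus a few false positives
```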


But most metric data is not normally distributed, and in that case the technique generates a lot of false alerts.

Other well-known techniques, like the Holt-Winters approach, try to find seasonality in the data, but they only work well in a limited number of cases.
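
As an illustration, the sketch below fits a Holt-Winters (triple exponential smoothing) model and flags points with large residuals, assuming the statsmodels library is available. As noted above, this only helps when the metric has a stable seasonal cycle.

```python
# Sketch of a Holt-Winters residual check, assuming statsmodels is installed.
# It captures seasonality when the metric has a stable daily/weekly cycle,
# but breaks down on irregular workloads.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

t = np.arange(24 * 14)                       # two weeks of hourly samples
series = 100 + 20 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 2, t.size)

model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=24).fit()
residuals = series - model.fittedvalues

# Flag points whose residual is far from what the seasonal model expects.
threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
print(anomalies)
```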

The OpsClarity approach to anomaly detection is unique in that we do not rely on a single technique, nor do we look at the problem purely as a time-series analysis problem. Instead, we combine our domain knowledge with the context we have from the OKG to create several highly specialized anomaly models. The OKG tells us what services run on a host and which metrics are the most important to collect for each service. For each service and metric combination, we have a statistical model used to detect anomalies from the metric. The models are tuned using curated domain knowledge to detect the exact failure patterns for each specific metric.
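
One way to picture this per-service, per-metric model selection is a simple registry keyed by (service, metric). The entries and model names below are illustrative assumptions, not the actual OpsClarity catalog.

```python
# Illustrative sketch of choosing an anomaly model per (service, metric) pair
# from curated domain knowledge. Registry contents are assumptions for exposition.
MODEL_REGISTRY = {
    ("zookeeper", "pending_syncs"): "nonzero_for_extended_period",
    ("kafka", "consumer_lag"): "queue_growth",
    ("elb", "5xx_count"): "dynamic_error_baseline",
    ("host", "disk_used_pct"): "disk_full_forecast",
}

def pick_model(service_type: str, metric: str) -> str:
    """Fall back to a generic baseline when no curated model exists."""
    return MODEL_REGISTRY.get((service_type, metric), "generic_rolling_baseline")

print(pick_model("zookeeper", "pending_syncs"))  # -> "nonzero_for_extended_period"
print(pick_model("nginx", "requests_per_sec"))   # -> "generic_rolling_baseline"
```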

One simple example is the anomaly model for the “pending syncs” metric for Zookeeper. The OKG knows that in a healthy Zookeeper cluster the number of pending syncs should be at 0. So, the anomaly model for pending syncs triggers an anomaly if the number of pending syncs is non-zero for an extended period of time.
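
A minimal sketch of such a rule is shown below; the sustained-window length is an illustrative choice rather than a documented default.

```python
# Minimal sketch of the "pending syncs" rule described above: raise an anomaly
# only when the metric stays non-zero for a sustained window.
def pending_syncs_anomaly(samples, min_consecutive=5):
    """`samples` is the recent stream of zk_pending_syncs values, oldest first."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > 0 else 0
        if streak >= min_consecutive:
            return True
    return False

print(pending_syncs_anomaly([0, 0, 2, 3, 1, 4, 2]))  # True: non-zero for 5 straight samples
print(pending_syncs_anomaly([0, 1, 0, 2, 0, 1, 0]))  # False: never sustained
```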

[Figures: disk and Kafka anomaly detection examples]

We have built custom models to monitor important operational and business-impacting metrics such as latency, throughput, requests, and errors for services that are of prime importance to data-first applications, including Kafka, Storm, Redis, Memcached, Spark, and Cassandra. Some of these models include:

  1. Latency model: This model detects unusual behavior in latency metrics for several services.
  2. Connection count model: This model tracks the number of connections to various services and alerts on unusual behavior.
  3. Error count model: This model dynamically adapts to an acceptable baseline for errors and alerts if there are more errors than usual, for example 4xx and 5xx errors on ELBs.
  4. Queue size model: This model alerts if the consumer is not able to keep up with the producer in a queue service like Kafka, AWS SQS, etc.
  5. Disk Full model: This model alerts if the hard disk consumption shows an increased consumption rate or if the disk is predicted to be full within a preset amount of time (see the sketch after this list).
  6. Seasonality corrected threshold: This model detects unusual behavior, in a seasonally adjusted fashion, in metrics related to incoming traffic for a number of services.
  7. Mean shift model: This model alerts if there are successive occurrences of sudden shifts in the level of the metric under consideration.
  8. Dynamic threshold model: This model alerts if a metric is persistently above a temporal dynamic threshold.
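
To illustrate the prediction half of the Disk Full model, the sketch below fits a linear trend to recent usage samples and alerts when extrapolation crosses capacity inside a preset horizon. The sampling interval, window, and 48-hour horizon are assumptions for exposition.

```python
# Illustrative sketch of the disk-full idea from item 5 above: fit a linear
# trend to recent usage and alert if extrapolation crosses 100% within a horizon.
import numpy as np

def disk_full_eta_hours(usage_pct, sample_interval_hours=1.0):
    """Estimate hours until 100% given recent disk-usage-percent samples."""
    t = np.arange(len(usage_pct)) * sample_interval_hours
    slope, intercept = np.polyfit(t, usage_pct, 1)   # simple linear trend
    if slope <= 0:
        return float("inf")                          # usage flat or shrinking
    return (100.0 - usage_pct[-1]) / slope

usage = [61, 62, 64, 65, 67, 68, 70, 72]             # hourly samples, in percent
eta = disk_full_eta_hours(usage)
if eta < 48:                                          # preset horizon: two days
    print(f"ALERT: disk predicted full in ~{eta:.0f} hours")
```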

[Figure: Kafka metrics]

Conclusion

OpsClarity has developed an intelligent, intuitive, and less cumbersome way of monitoring modern applications and data frameworks. We have developed a radically new operations monitoring solution based on data science and machine learning that automates many of the time-consuming manual steps an operations engineer has to execute on a daily basis. If you are an operations engineer on the front lines of managing cloud-native, modern web-scale apps, here are some of the most important ways we can help:

  1. Automatically discover your application topology: You can now see how your application and various services in the topology communicate with each other. And as your infrastructure scales up/down, your topology view also updates in real-time.
  2. We configure and collect all your service metrics automatically (no more managing config files across all your hosts!) to enable quick browsing of your critical service metrics in one place.
  3. Since all metrics are not the same, you cannot apply the same baselining algorithms to all your metrics. We have developed a library of complex algorithms based on the context of your specific metrics (which we are constantly learning) and use them to automatically baseline your metrics.
  4. In order to manage the false positive noise, we have sophisticated anomaly-ranking algorithms. You can now focus on the most critical anomalies and not have to deal with noise from false positives.
  5. We provide automated health status for all applications and services.
  6. We have a DVR-like view that allows you to go back in time and replay failures to understand how they propagated through the system. This is very useful for quick anomaly correlation and for deep-diving into a specific failure.