Every day we are presented with products and services that keep getting smarter. Your phone knows when you have a meeting, knows the traffic conditions and route options, and tells you when to leave and the best route to take. The websites you visit understand your behaviors and buying preferences and tailor specific offers to meet your needs. And your credit card company automatically detects fraudulent purchases and immediately alerts you – it no longer waits for you to report an issue.


The systems that run these services rely on sophisticated data science algorithms that process massive amounts of data to understand what is happening in real time and infer what should happen next. The success of these systems depends on advanced algorithms, but it depends equally on the operations (Ops) teams who must keep them running optimally and detect and resolve issues quickly. Unfortunately, today’s Ops teams lack the same advanced, data-science-driven technologies for monitoring and troubleshooting the business applications and systems they support. They need tools that help them automatically map system relationships, learn expected behaviors, and proactively detect issues.


So, we asked ourselves, “What if we could make operations monitoring as intelligent and sophisticated as the applications these teams support?” Our driving idea was born from many conversations we’ve had with operations experts who expressed their frustration with current tools. They said things like, “I’m monitoring a million metrics a minute and can plot them in Graphite/Grafana, but then it’s a hunt-and-peck exercise to find issues.” Or they said, “Don’t tell me everything that’s happening with my system, just tell me what I should focus on now.” While progress has been made on collecting metrics, we still hear the same story – there is too much data, too many graphs and dashboards, and too few insights delivered from operations software.


We then got to work building a next-generation operations analytics and monitoring platform that anticipates and meets the needs of DevOps teams supporting today’s web-scale application environments. We focused on three primary challenges that are the hardest to solve with existing tools.


1. Integrate, process, and analyze all the data that is important to operations.

The age of web-scale applications has created a massively complex challenge for modern software operations. Today’s applications are cloud-native and powered by containerized microservices – and both the number of microservices and the number of instances per microservice are expanding exponentially. All of this has created an environment where applications and infrastructures are constantly changing and there is an explosion of operational data to understand and analyze – including metrics, events, and alerts from cloud infrastructures, containers, microservices, open-source frameworks, and custom applications.

Our platform uses data science and advanced correlation and anomaly algorithms to address this problem by synthesizing the massive volume of data generated by modern web-scale applications. We have developed a library of complex algorithms based on the context of your specific metrics (which we are constantly learning) to automatically understand, synthesize, and baseline this data so it is available to you in a way that drives focus and delivers new insights.
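
To give a flavor of what baselining a metric can mean, here is a minimal sketch of one common approach: compare each new value against the mean and standard deviation of a recent window and flag large deviations. This is an illustration only, not the algorithms used in our platform. The function name, window size, threshold, and sample latencies are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that deviate from a rolling baseline.

    Hypothetical helper for illustration: a value is flagged when it lies
    more than `threshold` standard deviations from the mean of the
    previous `window` values.
    """
    history = deque(maxlen=window)  # the most recent `window` values
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            # A perfectly flat window has zero spread; skip it rather
            # than flag every tiny change.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Made-up request latencies in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100, 99]
print(detect_anomalies(latencies_ms))  # [6]
```

Note that the spike itself enters the window and inflates the baseline for the next few points, which is one reason a fixed rolling window is rarely enough on real operational data.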


2. Visualize the data in a way that creates immediate understanding.

Visualization is key to understanding massive data sets. For example, map and navigation data is vast and complex, but it is not presented in charts and tables. Instead, online maps use a novel visualization paradigm that makes the data instantly understandable through a hierarchical view, overlaid with traffic on each street and the accidents and diversions along your route. And all of this information is constantly updating, even as your location is constantly changing.

Having built navigation systems for years in his past life, my co-founder realized how we could apply this visualization model to operations data. Our hierarchical view provides a layered visualization of application topology, overlaid with component health for an immediate understanding of overall system health. This visual paradigm helps Ops teams identify what matters most and what requires the most immediate attention. Think of it as a guide navigating you through the troubleshooting process, so you no longer have to rely on intuition or search for needles in a haystack.


3. Provide the most powerful investigative tools for root cause analysis.

One thing we’ve heard many times from Ops professionals is that one good alert is always better than ten alerts that don’t matter. To identify that one important alert, you need to have a deep understanding of all system components and their relationships and behaviors, so you can proactively detect anomalies and diagnose their importance. This is where data science can help.

Our data-science-driven approach synthesizes all metric data and applies aggregated health models to show health status for all services overlaid on an always up-to-date application topology. This enables you to identify problem areas in your dynamic infrastructure at a glance. You can also dig deeper by analyzing anomalies that have been proactively detected and prioritized, or by exploring the event log, which captures all relevant events from across the entire stack. You can quickly troubleshoot by viewing all of these events in the context of the application topology, exploring events on connected services, replaying system state, and using an extensive set of event filters to drill down into the specific events you need to debug.
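
As a rough illustration of overlaying health on a topology, the sketch below rolls each service's status up to the worst status among the service and everything it depends on. It is a simplified stand-in, not the health models used in our platform. The service names, the three status levels, and the helper functions are assumptions made for the example.

```python
# Hypothetical status levels, ordered from best to worst.
SEVERITY = {"healthy": 0, "degraded": 1, "failing": 2}


def reachable(topology, start):
    """Return `start` plus every service it depends on, directly or not."""
    seen = set()
    stack = [start]
    while stack:
        service = stack.pop()
        if service in seen:  # already visited; also guards against cycles
            continue
        seen.add(service)
        stack.extend(topology.get(service, []))
    return seen


def aggregate_health(topology, own_health):
    """Map each service to the worst status in its dependency tree."""
    return {
        service: max(
            (own_health[dep] for dep in reachable(topology, service)),
            key=SEVERITY.__getitem__,
        )
        for service in own_health
    }


# Made-up topology: each service maps to the services it calls.
topology = {"web": ["api"], "api": ["db", "cache"], "db": [], "cache": []}
own_health = {
    "web": "healthy",
    "api": "healthy",
    "db": "degraded",
    "cache": "healthy",
}

for service, status in aggregate_health(topology, own_health).items():
    print(f"{service}: {status}")
```

In this example the degraded database surfaces on `api` and `web` as well, while `cache` stays healthy, which is the "problem areas at a glance" effect in miniature.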


The OpsClarity Intelligent Operations Platform.

Today we are announcing our innovative new solution – the OpsClarity Intelligent Operations Platform. We truly believe it is going to change the game for operations and bring a new level of sophistication and data-science-driven insights to DevOps teams responsible for managing web-scale applications. Today, this operational sophistication is enjoyed only by a handful of Internet giants, but with OpsClarity it will become the norm for all Ops professionals. We invite you to try it and share your feedback at www.opsclarity.com.