Application variety, volatility and complexity have exploded in the last couple of years. With the rise of cloud-native, containerized microservices, distributed data processing frameworks, and continuous delivery practices, the number of application components has grown dramatically compared to first-generation service-oriented architecture (SOA) applications, while the lifespan of individual instances has become much shorter. Additionally, these distributed microservices are developed independently by small teams, each integrating multiple open-source frameworks of its choice.

DevOps teams are stuck with two broad categories of monitoring tools, Infrastructure Monitoring tools and Application Performance Monitoring (APM) tools, each of which works satisfactorily for traditional SOA-based architectures. These architectures are primarily characterized by monolithic applications connected to each other using SOA principles. Such applications traditionally run on specialized hardware from vendors such as Dell, HP, IBM and EMC, and rely on relational data stores. Traditional infrastructure monitoring and APM tools are adept at monitoring these SOA-based applications and the corresponding infrastructure, but fall short of addressing the challenges of monitoring modern, complex application architectures. Let us understand why.

Infrastructure Monitoring: Infrastructure monitoring tools (like Nagios, Graphite, etc.) primarily rely on collecting metrics and alerts from all the underlying applications and hosts. The alerting tools provide a dashboard with the alerts aggregated together, while the graphing tools provide a mechanism for plotting individual metrics against a timeline. This approach works for SOA-based architectures, where the number of processes and hops is limited. However, monitoring modern, dynamic and complex applications requires collecting and correlating metrics, events and alerts from tens or hundreds of open-source frameworks and from ephemeral containerized infrastructure. Monitoring these environments with current tools results in hundreds or even thousands of graphs. Collecting more granular data (which is fairly easy today) and plotting graphs against these data sets does not lead to more visibility or easier troubleshooting. The DevOps engineer is now saddled with noisy alerts and numerous dashboards, not knowing which spike (or trough) or alert to look at or, even worse, which set of graphs to look at together as a group to evaluate cause and effect.

Application Performance Monitoring (APM): APM tools do a great job of understanding end-to-end transactions, starting from the point of end-user interaction, through the application stack (a lot of custom code), down to the data store. They do this quite well, but the catch is that they do so only for a limited set of traditional applications written using custom code: APM transaction tracing is limited to code written in Java, .NET, Ruby, PHP, Node.js and a few other languages. These languages account for 90%+ of applications written between 2006 and 2013, so it is no surprise that APM is in high demand. However, the key is to look at how applications have been built since 2013. APM tools fail to provide the same correlated transaction tracing for the new open-source data-processing and microservice frameworks (Akka, Play, Grails, Spark, Storm, Kafka, etc.) that have been the basis of new application development. It is here that DevOps teams are left in the dark.


Monitoring via Automated Discovery and Guided Troubleshooting

Recent advances in data science applied to monitoring and troubleshooting have the potential to ease this pain, and OpsClarity has been executing on this vision. The term “data science” has been widely misused by vendors, but the technology, when properly applied, can provide invaluable visibility and rapid troubleshooting for modern IT environments. Let’s take a look at the OpsClarity approach:

  1. OpsClarity automatically discovers your application topology, including your applications and all external services that your application may be using, without requiring any manual intervention or configuration. You can see how your application and various services in the topology communicate with each other.
  2. OpsClarity configures and collects all your service metrics automatically (no more managing config files across all your hosts!), so you can quickly browse your critical service metrics in one place.
  3. Since not all metrics behave the same way, you cannot apply the same baselining algorithm to all of them. We have developed a library of algorithms that are selected based on the context of each specific metric (a context we are constantly learning) and use them to automatically baseline your metrics. This allows us to detect anomalies in your metrics automatically.
  4. We provide automated health status for all applications and services. You can think of this as a “traffic overlay” on top of your application topology, and a quick indicator of the problematic areas of your infrastructure.
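
To make the baselining idea in step 3 concrete, here is a minimal sketch of one common technique: comparing each new sample against the median of a trailing window, scaled by the median absolute deviation (MAD). This is an illustration only, not OpsClarity's actual algorithm; the function name `mad_anomalies` and the `window` and `threshold` defaults are assumptions chosen for the example.

```python
from statistics import median


def mad_anomalies(values, window=20, threshold=3.5):
    """Return indices of samples that deviate strongly from a trailing baseline.

    Hypothetical helper for illustration: the baseline is the median of the
    previous `window` samples, and the deviation is measured in units of the
    scaled median absolute deviation (a robust stand-in for standard deviation).
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        base = median(history)
        mad = median([abs(v - base) for v in history])
        # 1.4826 makes the MAD comparable to a standard deviation
        # for normally distributed data.
        scale = 1.4826 * mad
        if scale == 0:
            # Perfectly flat history: flag any sample that differs from it.
            if values[i] != base:
                anomalies.append(i)
            continue
        if abs(values[i] - base) / scale > threshold:
            anomalies.append(i)
    return anomalies


# A steady request-latency pattern with a single spike at index 30.
latency_ms = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11] * 3 + [50, 11, 10]
print(mad_anomalies(latency_ms))  # prints [30]
```

A median-based baseline is used here because a single spike barely shifts it, so the samples that follow the spike are not misclassified. A real system would pick a different model per metric type, for example one that accounts for daily seasonality.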

Now, with this intelligent map of your entire application, overlaid with real-time health and service dependencies, you not only get immediate visibility; when trouble arises, you are also guided to the most likely point of origin of the fire, rather than having to wade through thousands of alerts or hundreds of graphs.
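
As a rough illustration of how a topology with a health overlay can point to a likely origin, the sketch below uses a deliberately simple heuristic: an unhealthy service is a candidate origin if none of the services it depends on are themselves unhealthy. This is an assumption made for the example, not a description of OpsClarity's correlation logic; `likely_origins`, the `depends_on` mapping and the service names are all hypothetical.

```python
def likely_origins(depends_on, unhealthy):
    """Return unhealthy services whose own dependencies are all healthy.

    depends_on: dict mapping a service name to the services it calls.
    unhealthy:  set of service names currently flagged as unhealthy.
    """
    origins = []
    for service in sorted(unhealthy):
        dependencies = depends_on.get(service, [])
        # If a downstream dependency is also unhealthy, this service is
        # more likely a victim than the source of the problem.
        if not any(dep in unhealthy for dep in dependencies):
            origins.append(service)
    return origins


# Hypothetical topology: web -> api -> (kafka, db), kafka -> zookeeper.
topology = {
    "web": ["api"],
    "api": ["kafka", "db"],
    "kafka": ["zookeeper"],
    "db": [],
    "zookeeper": [],
}
print(likely_origins(topology, {"web", "api", "kafka"}))  # prints ['kafka']
```

Three services are flagged here, but only one is worth investigating first. That is the difference between a flat list of alerts and a health overlay that knows the dependencies.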

The above approach provides an immediate, accurate and up-to-date visual topology of your entire infrastructure within minutes, with no manual configuration, no plotting of individual graphs and no manual tagging of metrics into logical groups. In addition, the intelligent, contextual health overlay, which is generated by applying advanced anomaly detection and correlation, guides you to the most likely area of concern and then lets you quickly determine the root cause.

In subsequent posts, we will provide more detail on each of these steps, on how we are able to create this topology automatically, and on why the above approach leads to up to 10x faster troubleshooting.

At OpsClarity, we believe it is time for DevOps teams to break free of tools that overload them with more and more graphs, rather than providing intelligence and guiding them to the ones they should really be focusing their attention on!