Application architectures and deployment paradigms typically evolve every 5 to 10 years, and we are in the midst of a major shift right now. Modern applications increasingly rely on stateless microservices, are often paired with stateful data services (such as NoSQL stores, Kafka, and Hadoop), and are deployed on containers or serverless platforms. As the application substrate changes, so does the way these applications are best monitored.
The focus of this post is not to discuss the merits of the new architectural paradigms, but to analyze how IT operations and monitoring tools need to evolve to keep up with them. For the last two decades, infrastructure (or server) monitoring, application performance monitoring, and network monitoring have been accepted as the essentials for efficient and effective management of IT and datacenter operations. However, with the new class of microservices, data-first applications, serverless architectures, and continuous integration practices, are these traditional monitoring approaches being rendered obsolete?
Let’s take a look at a few of the architectural and deployment paradigms that are changing the application substrate:
- Microservices Architecture: Monolithic applications are being broken into suites of small, modular, independently deployable services, which run as independent processes and communicate with each other either synchronously or asynchronously. In this architecture, traditional APM approaches based on transaction tracing can be used to look inside individual microservices, but they typically lack complete visibility into the composite application itself. APM solutions struggle with the degree of decoupling between the microservices and with the asynchronous pub/sub communication patterns the microservices might be using. Whether they are first-generation APM tools like Wily (now CA) or ones that arrived late last decade, such as New Relic or AppDynamics, most APMs are designed to support monolithic or SOA-based architectures, not microservices architectures.
- Logical Data Centers: There has been a gradual trend over the last decade towards commoditization of servers (no more specialized Sun, IBM or Dell servers). With the advent of the Software Defined Datacenter (SDDC), IaaS, and orchestration systems like Kubernetes and Mesos DC/OS, enterprises can spin up logical data centers on the fly. Orchestration technologies like Kubernetes add another layer of abstraction on top of server resources such as compute, storage, memory and network bandwidth. While server and infrastructure performance and availability still matter, application developers and DevOps engineers are primarily concerned with delivering business-critical applications rather than worrying about server capacity and availability. Orchestration engines and managed cloud offerings make infrastructure performance a secondary concern. Since most infrastructure/server monitoring tools look solely at infrastructure metrics like CPU, memory and disk usage, their relevance for applications and services deployed on logical data centers or public clouds is diminishing. Tools like Nagios, Sensu, Icinga and even more modern alternatives like Datadog fall into the server/infrastructure monitoring category. Though some of these tools also capture application metrics, one needs to rely on manual tagging or config files to correlate application metrics with server metrics.
- Open Source: The IT stack is being redrawn atop powerful open-source projects, with developers in almost every enterprise opting for an “open-first” approach to building applications. Even large traditional enterprises like Capital One, Walmart, GE, Pfizer and Bank of America are fleeing the safety of established tech vendors for the promise of greater control and capability with open software. With the wide variety and maturity of readily available open-source frameworks, developers are spending less time writing custom code and more time composing several open-source frameworks into logical applications. This movement to composable applications is reducing the efficacy of monitoring tools that rely on byte-code injection, such as APM.
- Data-First Applications: With the maturity of NoSQL, Hadoop and stream-processing frameworks (Spark, Storm, Flink), we are seeing a new breed of applications designed to capture value from fast-moving streaming data before it reaches Hadoop or SQL/NoSQL stores. The goal of these fast-data applications is to extract value from the data the moment it arrives, and they are powering use cases such as intrusion and fraud detection and real-time customer targeting and personalization. These applications are usually built on the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). Because they are organized as a data pipeline, monitoring them requires correlation across the different components rather than looking at each component of the stack in isolation.
While these new application paradigms are gaining adoption, DevOps engineers are finding themselves saddled with outdated, traditional server/infrastructure and APM monitoring tools. Let’s look at why these approaches are limited, and why we need a monitoring approach that treats service performance as a first-class citizen. Rather than working bottom-up, such an approach should allow correlation across multiple services (microservices) as well as vertical drill-down into the layers of the underlying infrastructure.
Application Performance Monitoring (APM)
APM provides insight into end-user experience across a distributed application and its components. APM tools are able to monitor and trace transactions across tiers of the application and are critical to monitoring “code-first” or monolithic applications. However, APM comes up woefully short when applied to modern applications that are increasingly composed of several open-source frameworks, because it cannot look beyond the code layer. The first problem is that it can’t see into the inner workings of distributed open-source or data frameworks. For example, detecting an Apache Kafka broker as a remote call is necessary but not sufficient. Automatic discovery of Kafka clusters, automatic configuration of metric collection, and automatic detection of performance anomalies are must-haves for monitoring Kafka infrastructure health, and APM struggles to provide them. The second problem is traditional APM’s inability to provide correlated troubleshooting for the new breed of applications. Since distributed transaction tracing is not feasible for these applications, there is no correlated, end-to-end performance view across the various open-source components that can serve as the foundation for troubleshooting. Moreover, with an asynchronous pub/sub messaging architecture, transaction tracing becomes an excessively arduous task.
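To make "automatic detection of performance anomalies" a little more concrete, here is a minimal sketch using only the Python standard library. It flags samples that deviate sharply from a trailing window of recent values. The metric (consumer lag), the sample numbers, the function name `find_anomalies` and the window and threshold settings are all illustrative assumptions, not something prescribed by Kafka or by any particular monitoring product.

```python
from statistics import mean, stdev

def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that sit more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical consumer-lag readings for one consumer group (made-up data).
lag = [100, 104, 98, 101, 103, 99, 102, 950, 100, 101]
spikes = find_anomalies(lag)
print(spikes)
```

A real system would likely use something more robust than a trailing z-score, but the point stands: the baseline is learned from the metric itself rather than configured by hand per broker or per topic.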
Infrastructure/Server Monitoring
Infrastructure/server monitoring provides an understanding of system resource usage, which can help you improve capacity planning and provide a better end-user experience. It also provides active insights into a server’s system resources, such as CPU usage, memory consumption, I/O, network, disk usage and processes. In the traditional datacenter, where servers are allocated to specific applications, an application’s health depends in large part on the health of the underlying servers, and infrastructure monitoring ensures that your server is capable of hosting your applications. Some modern infrastructure tools also collect application metrics. However, the correlation between applications and servers is either nonexistent or achieved through tedious manual tagging, which can soon become outdated. Tagging is a useful concept, but using it as the mechanism for defining metadata for complex and dynamic modern applications soon becomes excessively cumbersome.
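A small sketch shows how quickly hand-maintained tags can drift once an orchestrator starts moving workloads around. Everything here is hypothetical: the service and host names, the `static_tags` mapping standing in for a manually edited tag file, and the `live_placement` mapping standing in for whatever an orchestrator would report.

```python
def stale_tags(static_tags, live_placement):
    """Return services whose hand-maintained host tag no longer matches
    any host the service is actually running on."""
    stale = {}
    for service, tagged_host in static_tags.items():
        actual_hosts = live_placement.get(service, set())
        if tagged_host not in actual_hosts:
            stale[service] = (tagged_host, sorted(actual_hosts))
    return stale

# Tags someone typed into a config file some time ago (hypothetical).
static_tags = {"checkout": "host-a", "payments": "host-b", "kafka": "host-c"}

# Where the services run now, after rescheduling (hypothetical).
live_placement = {
    "checkout": {"host-a"},
    "payments": {"host-d", "host-e"},
    "kafka": {"host-c"},
}

drift = stale_tags(static_tags, live_placement)
print(drift)
```

In this made-up case the `payments` tag still points at `host-b`, so any dashboard joining application metrics to server metrics through that tag would be looking at the wrong machine.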
Need for Service-Centric Monitoring
Monitoring for modern applications needs to provide a top-down view, because the applications and services are the most critical elements. It is important to understand the health of these applications, whether they are custom code, packaged applications, or open-source frameworks. IT/Ops or DevOps teams need a consolidated view and assessment of the forest before they drill down into the trees. It is also important to understand how each service depends on others, as failures typically don’t occur in isolation. Once a service failure is identified, the next step is to determine whether it originates in the application/service tier or in the underlying infrastructure, be it containers, the OS, or the server itself. This requires a monitoring solution that is cloud-native or container-native, so that it understands the nature of the infrastructure your services are running on and can provide real-time monitoring even as that infrastructure shifts dynamically.
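The top-down triage described above can be sketched as a walk over a dependency graph: start at the failing service, follow unhealthy dependencies downward, and stop at the deepest unhealthy node. This is only an illustration under stated assumptions. The `Node` class, the layer names, the `root_causes` function and the example topology are all hypothetical, and a real tool would discover the graph and health states automatically.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    layer: str                      # "service", "container", "os" or "server"
    healthy: bool = True
    depends_on: list = field(default_factory=list)  # names of dependencies

def root_causes(graph, start):
    """Starting from an unhealthy node, follow unhealthy dependencies and
    return the names of the deepest unhealthy nodes found."""
    causes = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        unhealthy_deps = [d for d in graph[name].depends_on
                          if not graph[d].healthy]
        if not unhealthy_deps:
            causes.append(name)     # nothing below is failing: fault is here
        for dep in unhealthy_deps:
            visit(dep)

    if not graph[start].healthy:
        visit(start)
    return causes

# Hypothetical topology: checkout -> payments -> kafka -> container -> host.
nodes = [
    Node("checkout", "service", False, ["payments", "c1"]),
    Node("payments", "service", False, ["kafka", "c2"]),
    Node("kafka", "service", False, ["c3"]),
    Node("c1", "container", True, ["h1"]),
    Node("c2", "container", True, ["h1"]),
    Node("c3", "container", False, ["h2"]),
    Node("h1", "server", True),
    Node("h2", "server", False),
]
graph = {n.name: n for n in nodes}

causes = root_causes(graph, "checkout")
print([(c, graph[c].layer) for c in causes])
```

Here the walk from the failing `checkout` service ends at `h2` in the server layer. If `h2` and `c3` were healthy, the same walk would stop at `kafka` in the service tier instead, which is exactly the distinction between a service failure and an infrastructure failure.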