This is part 1 of our Docker Monitoring blog series. In this part we discuss the basics of Docker and container monitoring. In the next part, we will discuss how the OpsClarity Intelligent Monitoring platform automatically tracks Docker containers and monitors the services inside them, even as they shift and the Docker images change, without you having to manually configure anything!
The world is in a thunderous stampede to adopt containers. After all, they have been a disruptive and powerful innovation that both promises and delivers astonishing advances in system manageability, cost-efficiency, portability, security and more. But, as Solomon Hykes will tell you, the innovation of containers is not the actual containers; that technology has been around for nearly a decade. The innovation of containers is what you can do with them.
Containers have empowered IT professionals to bring their microservices architectures to life by providing a platform that enables distributed applications to be built, shipped, run, and quickly orchestrated across multiple infrastructures in a scale-out fashion. What started with microservices is now quickly spreading to the data tier, with businesses containerizing Spark, Cassandra, Kafka, and their SQL and NoSQL databases.
Relationship between containers and applications
Containers run services. Service monitoring is of primary importance in both containerized and non-containerized worlds. When we talk about the resource usage of a container or a host (CPU, memory, disk, network, etc.), it is always in the context of a service that we’re running. For example, if you’re running Kafka and running out of disk space, you’d say “My host/container is running out of disk space BECAUSE Kafka is filling up the disk quickly”. Similarly, if you’re running Spark and running into memory issues, you’d say “A host/container is running out of memory BECAUSE a Spark job has too many shuffles, leading to garbage collection issues”. As you can see, resource usage is contextual and always ties back to a service running inside a host/container. The typical monitoring goal is to understand how computing resources are being used at any given moment and which services are contributing to that usage.
Orchestrators, Images and Networks
There are three popular orchestration engines for Docker:
- Kubernetes, or “k8s,” which automates the deployment, scaling, and operations of application containers across clusters of hosts;
- Docker Swarm, Docker’s native scaling solution for distributing compute resources to distributed applications by using clustering capabilities to turn a group of Docker engines into a single, virtual Docker Engine;
- Apache Mesos, which Docker recommends for orchestrating mega-clusters needed for giant scaling.
In order to effectively operate your containerized infrastructure, you need to be able to see and understand how the services provided by these technologies are performing inside containers. To do that, we’ll need to figure out how to automatically configure metric data collection for these services, with zero friction, and without damaging the built-in security layer Docker provides.
Keeping the ephemerality of containers in mind, let’s look at the monitoring problem for containers and understand how to monitor them.
You may recall that containerization ships images instead of whole virtual machines. Docker images are essentially the shippable components that package an application together with all of its necessary dependencies.
Every Docker image is stamped with an image ID such as “2b8fd9751c4c0f5dd266fcae00707e67a2545ef34f9a29354585f93dac906749.”
An image can also be referred to by the Repository in combination with a tag. For example: busybox:latest
root@vagrant-ubuntu-trusty-64:/shared# docker images
REPOSITORY          TAG       IMAGE ID        CREATED        SIZE
busybox             latest    2b8fd9751c4c    3 weeks ago    1.093 MB
nginx               latest    b1fcb97bc5f6    7 weeks ago    182.8 MB
tomcat              latest    a22c8a8d9ca8    8 weeks ago    357.2 MB
jplock/zookeeper    latest    b21362c42439    9 weeks ago    155.4 MB
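To make the naming rules concrete, here is a minimal Python sketch of how a repository:tag reference and a short image ID relate to the values above. The function names are our own, the registry name localhost:5000 is a made-up example, and the parsing is deliberately simplified (it does not handle digest references such as name@sha256:...).

```python
from typing import Tuple


def split_image_reference(ref: str) -> Tuple[str, str]:
    """Split 'repository[:tag]' into (repository, tag); the tag defaults to 'latest'."""
    # A colon marks a tag only when it comes after the last slash;
    # an earlier colon belongs to a registry host:port prefix.
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    if last_colon > last_slash:
        return ref[:last_colon], ref[last_colon + 1:]
    return ref, "latest"


def short_id(full_id: str) -> str:
    """Return the 12-character short form that `docker images` prints."""
    # Newer Docker versions prefix full IDs with 'sha256:'; drop it if present.
    digest = full_id.split(":", 1)[-1]
    return digest[:12]


print(split_image_reference("busybox:latest"))
print(split_image_reference("localhost:5000/nginx"))  # hypothetical private registry
print(short_id("2b8fd9751c4c0f5dd266fcae00707e67a2545ef34f9a29354585f93dac906749"))
```

The last call yields 2b8fd9751c4c, which is the IMAGE ID shown for busybox in the listing.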
Networks are created to provide isolation for containers. Docker creates a default network to provide isolation for every container, but it also allows for the creation of user-defined networks.
The default setup creates three networks on every host: bridge, none, and host. When a new container is created, you can choose which network that container should join. By default, Docker sets up the container on the bridge network, which corresponds to the docker0 interface. You can run the ifconfig command to see this interface on a Docker-enabled host.
Within each network, the container receives a unique IP address through which it can communicate.
To retrieve information about a network, say the bridge network, you can use the docker network inspect bridge command. As you can see below, a subnet is created and a container that we just launched is added to this network by default and receives an IP address in that subnet.
root@vagrant-ubuntu-trusty-64:/shared# docker network inspect bridge
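The command prints a JSON document. As a rough sketch of how a monitoring script might read it, the Python below parses a trimmed, hand-written sample that follows the general shape of that JSON (an IPAM section with the subnet, and a Containers section with each container's address). The sample is not the output from the host above: the container ID, the name "web" and the addresses are invented for illustration, and real output contains more fields.

```python
import ipaddress
import json

# Hand-written sample in the general shape of `docker network inspect bridge` output.
SAMPLE = """
[
  {
    "Name": "bridge",
    "Driver": "bridge",
    "IPAM": {"Config": [{"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}]},
    "Containers": {
      "c1f0e2d3a4b5": {"Name": "web", "IPv4Address": "172.17.0.2/16"}
    }
  }
]
"""


def container_addresses(inspect_json: str) -> dict:
    """Map container name -> (IP address, whether it lies in one of the network's subnets)."""
    network = json.loads(inspect_json)[0]
    subnets = [ipaddress.ip_network(cfg["Subnet"]) for cfg in network["IPAM"]["Config"]]
    result = {}
    for info in network["Containers"].values():
        # IPv4Address is reported in CIDR form, e.g. "172.17.0.2/16".
        ip = ipaddress.ip_interface(info["IPv4Address"]).ip
        result[info["Name"]] = (str(ip), any(ip in subnet for subnet in subnets))
    return result


print(container_addresses(SAMPLE))
```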
Now that we have covered the basics of Docker, let’s dive into the challenges of monitoring such a dynamic, ephemeral and containerized environment.
Application metrics and container metrics
To collect container-specific metrics such as the CPU, memory, network and disk usage of a Docker container, there is the well-established Docker Stats API. It provides detailed information about the performance of the container. However, the real value of monitoring comes when you can monitor the application (or service) running inside the container with availability checks and collect application-specific metrics, just as you do today when running on a host. Being able to collect and analyze service-level metrics is the only way to have a production-ready monitoring setup for Docker.
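As an illustration of what the container-level numbers look like, here is a small Python sketch that turns one stats sample into CPU and memory percentages. It assumes the field names used by the Docker Engine API stats response (cpu_stats, precpu_stats, system_cpu_usage, online_cpus, memory_stats); these can differ between Docker versions, and the sample values are invented. The memory figure is simplified: it does not subtract page cache as the Docker CLI does.

```python
def cpu_percent(stats: dict) -> float:
    """CPU usage as a percentage, computed from the delta between two readings."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    # Assumption: fall back to a single CPU when online_cpus is not reported.
    return cpu_delta / system_delta * cpu.get("online_cpus", 1) * 100.0


def memory_percent(stats: dict) -> float:
    """Memory usage as a percentage of the container's limit (cache not subtracted)."""
    mem = stats["memory_stats"]
    return mem["usage"] / mem["limit"] * 100.0


# Invented sample values, shaped like one stats reading.
sample = {
    "cpu_stats": {"cpu_usage": {"total_usage": 450}, "system_cpu_usage": 2000, "online_cpus": 2},
    "precpu_stats": {"cpu_usage": {"total_usage": 200}, "system_cpu_usage": 1000},
    "memory_stats": {"usage": 256 * 1024 * 1024, "limit": 1024 * 1024 * 1024},
}

print(cpu_percent(sample), memory_percent(sample))
```

Numbers like these describe the container only; they say nothing about whether the service inside it is healthy.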
Running monitoring agents inside the container is not the right way to go about it. You should not be bundling a monitoring agent within your container – it breaks the very rule of containerization of having one service that is well contained. Instead, the monitoring agent must be minimally intrusive and find ways to get application metrics from outside the container.
Once you figure out a way to collect both service and container metrics, you also need a way to obtain aggregated views of a service running across multiple containers, and an aggregated view of the performance of the containers.
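A minimal sketch of such an aggregated view, assuming each collected sample already carries the name of the service it belongs to. The record layout (service, container, cpu_percent) and the values are hypothetical.

```python
from collections import defaultdict


def aggregate_by_service(samples):
    """Roll per-container CPU samples up into one summary per service."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["service"]].append(sample["cpu_percent"])
    return {
        service: {
            "containers": len(values),
            "cpu_total": sum(values),
            "cpu_avg": sum(values) / len(values),
        }
        for service, values in grouped.items()
    }


# Hypothetical per-container samples.
samples = [
    {"service": "kafka", "container": "c1", "cpu_percent": 10.0},
    {"service": "kafka", "container": "c2", "cpu_percent": 30.0},
    {"service": "zookeeper", "container": "c3", "cpu_percent": 5.0},
]

print(aggregate_by_service(samples))
```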
For a useful production setup, your containers are typically orchestrated by one of Kubernetes, Mesos or Docker Swarm. Starting with a specification of how your containers are interlinked, these orchestrators set up the networking configuration so containers can talk to each other across hosts. You specify the number of instances of a container you want in your environment, and the orchestrator takes care of spinning up containers and monitoring their lifecycle. The orchestrators treat the hosts available to them as a large pool of resources and can create, destroy and re-create containers on hosts based on internal optimization algorithms. With orchestrators in control of containers, how do you set up monitoring so that you can still collect application metrics from containers that are in flux? Each time a new container is created, it gets a new IP address. How would you set up availability checks and collect service metrics from specific ports if the IP addresses are not static? As we mentioned before, having monitoring agents inside every launched container is a clumsy solution. Your monitoring agents need to understand container movement and follow containers around so they can collect metrics irrespective of where a container is relaunched.
Consider a case where an application “CreditCheck” runs in container C1 and a Zookeeper in C2. They initially get an IP address assigned. Now, C1 can be destroyed and recreated as C3 with a new IP address. Can your monitoring setup recognize this and move its data collection from 172.17.0.1 to 172.17.0.3?
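One way to think about this is to key collection targets by service rather than by IP address, and to update them as containers start and die. The Python sketch below replays the CreditCheck scenario. The event dictionaries are a simplified, hypothetical shape: Docker does emit container lifecycle events, but an agent would typically have to look up the service name and IP address itself.

```python
from typing import Dict, List, Tuple


class TargetRegistry:
    """Tracks which IP addresses currently serve each service."""

    def __init__(self) -> None:
        self._by_container: Dict[str, Tuple[str, str]] = {}

    def handle_event(self, event: dict) -> None:
        # Register a container when it starts; forget it when it dies.
        if event["action"] == "start":
            self._by_container[event["container"]] = (event["service"], event["ip"])
        elif event["action"] == "die":
            self._by_container.pop(event["container"], None)

    def targets(self, service: str) -> List[str]:
        """Current addresses to collect metrics from for the given service."""
        return sorted(ip for svc, ip in self._by_container.values() if svc == service)


registry = TargetRegistry()
events = [
    {"action": "start", "container": "C1", "service": "CreditCheck", "ip": "172.17.0.1"},
    {"action": "start", "container": "C2", "service": "Zookeeper", "ip": "172.17.0.2"},
    {"action": "die", "container": "C1"},
    {"action": "start", "container": "C3", "service": "CreditCheck", "ip": "172.17.0.3"},
]
for event in events:
    registry.handle_event(event)

print(registry.targets("CreditCheck"))  # ['172.17.0.3']
```

Because the check is attached to the service name, nothing has to be edited by hand when C1 is replaced by C3.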
Cost of never-ending configuration management
The real problem of monitoring applications within a Docker ecosystem boils down to configuration management of your monitoring setup. Configurations cannot be static anymore. You cannot rely on static config or YAML files to represent your dynamic infrastructure. For example, you could have a Zookeeper container and a Kafka container on host 1, and your containerized application (App1) along with a containerized database like Elasticsearch on host 2. Let’s say you set up your monitoring configuration on host 1 to collect Zookeeper and Kafka metrics, and your monitoring configuration on host 2 to collect your application and Elasticsearch metrics. You now have a static configuration. How do you manage your configuration if the Kafka container moves from host 1 to host 3, or if your application container is killed and recreated on host 1? Would you move your monitoring configuration files across hosts when this happens? Ouch! That’s a hard problem. Developers are constantly experimenting with newer services as well, so each time a new service is deployed, you need to go back to your config files, understand the new service and handwrite its configuration. The cost of managing your configurations keeps going up. You need a monitoring tool that treats monitoring and configuration management as first-class citizens and solves both. If not, you’ll end up spending endless hours editing files and handwriting configurations just to see it change again.
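To sketch the alternative, the Python below derives what each host's agent should collect from the current container placements instead of from hand-written per-host files. The function name and the placement tuples are hypothetical; in practice the placements would come from the orchestrator or the Docker daemon.

```python
def collection_plan(placements):
    """Build host -> sorted list of services to collect, from (service, host) pairs."""
    plan = {}
    for service, host in placements:
        plan.setdefault(host, []).append(service)
    return {host: sorted(services) for host, services in plan.items()}


# Initial layout from the example in the text.
before = [
    ("Zookeeper", "host1"),
    ("Kafka", "host1"),
    ("App1", "host2"),
    ("Elasticsearch", "host2"),
]

# Kafka moves from host 1 to host 3; only the discovered placement changes.
after = [
    ("Zookeeper", "host1"),
    ("Kafka", "host3"),
    ("App1", "host2"),
    ("Elasticsearch", "host2"),
]

print(collection_plan(before))
print(collection_plan(after))
```

The plan is recomputed from what is actually running, so a container move does not require anyone to copy configuration files between hosts.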
The Port Mapping Mess
If you want to make your application running inside a container accessible to the outside world, you need to map the listening port on the container to a port on the host. This is because the ecosystem outside the host does not know about networks created by the host internally (for example, the Docker bridge network). Since monitoring agents typically sit on the host (and not inside the container), some monitoring tools also force you to map the container ports where metrics are made available (for example, JMX ports) to host ports so they can collect service metrics. This becomes another messy manual configuration step. In addition, if you’re performing availability checks on the mapped port on the host to see if the port is listening, you will always get a success message, which is incorrect. Let’s see why that is.
When you map a port inside a container to a port on the host, Docker creates a separate process that opens up a proxy port on the host. This process is responsible for forwarding any incoming data on the host to the port inside the container. Even if the process inside the container dies, the Docker proxy process stays alive. So, port checks on the mapped ports always succeed. To perform a more realistic check, you need to look at the port inside the container. Monitoring tools should be smart enough to see the internal network and do availability checks and collect data from the ports on the network inside the container.
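The check itself is an ordinary TCP connect; what matters is the address it is pointed at. Below is a minimal Python sketch of such a check. The function name is our own, and the addresses in the comment are hypothetical; it also assumes the agent can reach the container's network from where it runs.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Hypothetical usage: check the container's own address, not the mapped host port.
#   port_is_open("172.17.0.2", 2181)   # port inside the container's network
#   port_is_open("127.0.0.1", 2181)    # mapped host port; may succeed via the proxy
```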
Auto-scaling and false alerts
The big advantage of containers shows when it comes to auto-scaling. Whether you need to scale up the number of service instances due to higher load or scale down during lighter workloads, Docker enables quick startup and teardown times since images are pre-built. Auto-scaling brings another monitoring challenge: false alerts. When scaling down, instances are destroyed. Would your monitoring system be able to understand this and correlate it with the lighter workload? For example, if the number of Nginx containers goes down from 5 to 2, is it because there is an issue with the Nginx service, or was it done on purpose to account for the lower workload? Static thresholds with a min and a max container count will give you some wiggle room, but you know deep inside that static thresholds are bad and lead to false alerts. If you have availability checks on 5 instances, will alerts go off when a few instances get killed intentionally? If this problem is not taken care of, you end up with the broken window problem. A few false alerts here and a few false alerts there eventually lead you to ignore real alerts.
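One possible way to avoid this class of false alert is to compare the running count against the count the orchestrator currently intends to run, rather than against a fixed number. The Python sketch below illustrates the idea. The function and its inputs are hypothetical, and how you obtain the desired count depends on your orchestrator.

```python
from typing import Optional


def replica_alert(service: str, desired: int, running: int) -> Optional[str]:
    """Return an alert message only when fewer instances run than are intended."""
    if running < desired:
        return f"{service}: {running} of {desired} desired instances running"
    return None


# Intentional scale-down: the orchestrator now wants 2 Nginx containers.
print(replica_alert("nginx", desired=2, running=2))  # None

# Unintended loss: 5 are still desired but only 2 are running.
print(replica_alert("nginx", desired=5, running=2))
```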
Unclear groupings when visualizing Service topology
A number of upcoming monitoring tools today create a visual topology of the services running inside Docker containers. With this, you’ll be able to see how containers are connected to one another. It is important to get clarity on how the topology itself is created and visualized. Most tools rely on the Docker image ID to group things together as a service.
Consider a Zookeeper cluster 1 being used with Kafka and another Zookeeper cluster 2 with HBase. Visually, what you want to see in a topology map is two separate Zookeeper clusters, one connected to Kafka and the other to HBase.
Although there are 2 separate Zookeeper clusters, using the image ID to group containers would just show them as one Zookeeper cluster. The problem gets even worse once you have different environments like staging, dev and prod. You would want the topology to be visualized separately for prod and staging.
However, using the image IDs to group things together would lead to an incorrect topology that does not map to your mental model of how things are actually set up and split.
To match that mental model, a monitoring tool needs to show you:
- All containerized services that are running in the data center and an aggregated view of resource consumption.
- The containers that make up a service and the environments they are running in. That means that automated data collection setups won’t be able to simply depend on image IDs. Environments need to be split up correctly with tags.
- From a service perspective, the hosts on which a service has its containers running.
- From a host perspective, the containers running on each individual host.
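The difference between grouping by image ID and grouping by tags can be shown with a few lines of Python. The container records below are invented for illustration, and the label keys env and cluster are our own naming, not a Docker convention.

```python
from collections import defaultdict

# Four Zookeeper containers built from the same image, in different clusters and environments.
CONTAINERS = [
    {"name": "zk-1", "image_id": "b21362c42439", "labels": {"env": "prod", "cluster": "zk-kafka"}},
    {"name": "zk-2", "image_id": "b21362c42439", "labels": {"env": "prod", "cluster": "zk-kafka"}},
    {"name": "zk-3", "image_id": "b21362c42439", "labels": {"env": "prod", "cluster": "zk-hbase"}},
    {"name": "zk-4", "image_id": "b21362c42439", "labels": {"env": "staging", "cluster": "zk-kafka"}},
]


def group_by(containers, key_fn):
    """Group container names by whatever key key_fn extracts."""
    groups = defaultdict(list)
    for container in containers:
        groups[key_fn(container)].append(container["name"])
    return dict(groups)


by_image = group_by(CONTAINERS, lambda c: c["image_id"])
by_labels = group_by(CONTAINERS, lambda c: (c["labels"]["env"], c["labels"]["cluster"]))

print(len(by_image))   # 1: everything collapses into a single "Zookeeper" group
print(len(by_labels))  # 3: prod/zk-kafka, prod/zk-hbase, staging/zk-kafka
```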
In the next post of this series, we will show how OpsClarity provides monitoring for containerized infrastructure that is completely automated and dynamic, and that does not rely on static configuration files. We will describe how OpsClarity provides logical visualizations and topologies that map to your mental model of how you have deployed your applications, leading to metric aggregations and analysis that are far more accurate.