How OpsClarity Leverages Data Science and AI to Improve Automation and Root-cause Analysis

Data science, artificial intelligence (AI) and machine learning have transformed e-commerce, personalization, digital marketing, search engines, etc. Large-scale analysis of data has become a powerful tool for businesses to create competitive differentiation. Seemingly left out of this, however, is IT operations - the place where all this incredible computation for data science takes [...]

December 13th, 2016 | Data Science, Machine Learning

Optimizing Alerts on Free space on disks using Machine Learning

The available space on the disk (diskfree) has a significant and often catastrophic impact on applications and services running on the system. For this reason, every DevOps engineer knows that it is crucial to carefully monitor disk usage in all critical systems, especially ones that tend to rapidly use up disk space, such as heavily [...]

November 10th, 2016 | Analytics, Data Science, Machine Learning

Leveraging Anomaly Detection to Monitor for Errors

Metrics that represent errors pose a special problem. "Error" is often used in a generic sense: it could imply something very serious, where any non-zero value warrants investigation (e.g., 5xx errors), or it could represent something that has an acceptable baseline value but where an unusual change could indicate serious problems, e.g., 4xx errors, page faults, [...]

October 23rd, 2016 | Analytics, Data Science, Machine Learning

Queue Length in Messaging Systems – Kafka, SQS, etc.

The use of message queues/brokers is ubiquitous in any real-time application. Such intermediate modules (e.g. Apache Kafka, RabbitMQ, AWS SQS, etc.) improve system reliability by decoupling the producers from the consumers, thus freeing them from any synchronization requirements. A primary operational concern with such systems is whether the consumers are keeping up with the producers, [...]

October 21st, 2016 | Analytics, Data Science, Machine Learning

Anomaly Detection and Machine Learning Applied to DevOps and Monitoring

In our previous blog post (Drowning in Alerts: Blame it on Statistical Models for Anomaly Detection), we talked about how standard anomaly detection constructs, such as SMA, exponential smoothing, etc., do not work well when it comes to Ops monitoring. At OpsClarity, we view monitoring as a multi-layered activity. For example, you may discover a [...]

October 21st, 2016 | Data Science, Machine Learning

Drowning in Alerts: Blame it on Statistical Models for Anomaly Detection

With the recent spurt of web-scale applications, the problem of monitoring such systems has attracted a lot of attention. The scale and complexity of such systems, coupled with the non-trivial interactions between constituent components, make it extremely challenging to ascertain the health and stability of the various services and components that make up such a [...]

April 25th, 2016 | Analytics, Data Science

How to understand performance of data processing frameworks

Do you use data processing frameworks such as Apache Storm, Apache Spark, or Samza? Do you know how to adequately monitor them, as well as your applications running on top of them? In this post, I will describe the challenges of monitoring a complex data processing framework, using Storm as an example. Production ready [...]

February 29th, 2016 | Data Processing Frameworks, Data Science

Operational Knowledge Graph: The intelligence behind OpsClarity

The OpsClarity platform has several analytics constructs that are specifically designed to manage the hyper-scale, hyper-change microservices architecture of large, complex and distributed data-intensive applications. The platform was built with the specific goal of significantly improving the troubleshooting workflow for these applications. It was designed from the ground up to handle the massive volume [...]

January 28th, 2016 | Data Processing Frameworks, Data Science, Machine Learning

Why collecting more metrics and creating more graphs is a slippery slope

The rapid adoption of cloud-native and containerized microservices, distributed data processing frameworks, and continuous delivery practices has led to a massive increase in application complexity. This rapid change has very quickly led to obsolescence of current monitoring solutions. Several monitoring tools have taken the approach of efficiently collecting more metrics, while allowing the user to [...]

January 21st, 2016 | Data Science, Web-scale Application Monitoring

Application monitoring in the age of Big Data

Operations (or Ops, as it is colloquially known) has been necessary since the first shared systems came online in the 1960s. While the job title didn't necessarily exist, the same questions needed answers then as they do now. - Is 'it' up and accessible? - Is 'it' responding in a reasonable timeframe? - Is 'it' [...]

January 19th, 2016 | Data Science, Web-scale Application Monitoring