(This is a guest post from Jayant Shekhar, CEO of Sparkflows.io, a company focused on building and deploying machine learning and large-scale applications.)

Leveraging Kafka and Spark Streaming combined with a NoSQL database is an increasingly popular pattern for Near Real-Time (NRT) analytics. NRT is empowering enterprises across verticals such as e-commerce, financial services, advertising technology, retail, IoT, the public sector and more to interact with high-velocity streams of data as soon as the data is created or ingested.

Apache Kafka provides a large-scale distributed system for streaming events. Apache Spark provides programmers with an API centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk.

Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, on a single engine.

Spark Streaming provides what is now commonly called mini-batch processing, and many NRT use cases fall into this category. With Spark Streaming, the batch interval can be set as low as one second. Spark Streaming also supports window operations, which are critical for many use cases: they let us apply a transformation over a sliding window of data. For example, when analyzing logs coming from web servers, we might want to process the events of the last 30 minutes but refresh the results every 30 seconds.
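The window semantics can be sketched in plain Python. This is only an illustration of the mechanics (Spark performs this distributively over DStreams); the function name and the toy batch contents are assumptions:

```python
from collections import deque

def windowed_counts(batches, window_len, slide):
    """Mimic window(windowLength, slideInterval): every `slide` mini-batches,
    emit an aggregate over the most recent `window_len` mini-batches."""
    window = deque(maxlen=window_len)   # holds the most recent mini-batches
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide == 0:              # time to emit a windowed result
            results.append(sum(len(b) for b in window))
    return results

# Four mini-batches of log lines; window covers 3 batches, sliding every batch.
batches = [["a", "b"], ["c"], ["d", "e", "f"], []]
print(windowed_counts(batches, window_len=3, slide=1))  # [2, 3, 6, 4]
```

In the web-log example above, a 30-minute window with a 30-second slide corresponds to `window_len` of 60 batches and `slide` of 1 when the batch interval is 30 seconds.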

Spark Streaming allows reuse of all the constructs available in batch mode, which means the same code can easily be reused across batch and streaming use cases. Spark's primary advantage over other frameworks is that it provides several high-level constructs such as group by, join, distinct and union, allowing for quick and easy development of large-scale systems.
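As a sketch of this reuse, here is an illustrative Python function (the log format and all names are assumptions, not Spark API) that runs unchanged whether it is handed a full batch or one streaming mini-batch:

```python
def top_urls(lines):
    """Count hits per URL and return them sorted by count -- the same logic
    whether `lines` is a full batch or one streaming mini-batch."""
    counts = {}
    for line in lines:
        url = line.split()[1]          # assume "<ip> <url>" log lines
        counts[url] = counts.get(url, 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

batch = ["1.1.1.1 /home", "2.2.2.2 /cart", "3.3.3.3 /home"]
print(top_urls(batch))                 # [('/home', 2), ('/cart', 1)]

# In streaming mode the identical function is applied to each mini-batch,
# e.g. via foreachRDD/transform in Spark Streaming:
for mini_batch in [batch[:2], batch[2:]]:
    top_urls(mini_batch)               # identical code path per interval
```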

Once the data is processed with Spark Streaming, it is common to store the results into a NoSQL database like HBase. HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.

The combination of the above-mentioned technologies, Kafka, Spark Streaming and HBase, allows businesses to build near real-time applications that serve various use cases such as recommendation/personalization, churn prediction, fraud detection and operations analytics. However, developing and monitoring these applications is still relatively complex. The technologies themselves are complex, and few individuals have experience with these nascent data frameworks. Moreover, monitoring these distributed frameworks is not easy: it often requires time-consuming configuration and still does not provide end-to-end visibility into common concerns that apply to the data pipeline.

These pipelines are hard to operationalize because many components and services are involved in building them. Schema management and data type management take a lot of time and effort, and the number of configurations and tuning parameters involved is very large.

Sparkflows eases and accelerates the development of data pipelines. It is a developer-friendly solution that provides an intuitive, intelligent user interface for defining datasets, workflows, schema propagation and dashboards. Sparkflows provides an exhaustive library of machine learning algorithms running on top of Spark ML and also integrates with systems like Kafka, HBase and Solr. It offers an ever-increasing number of building blocks that are key to building data pipelines, including OCR, NLP and reading unstructured data. Sparkflows also provides a number of templates in the form of workflows, for example for recommendation systems and churn prediction, that can be applied for rapid development of these applications.

Sparkflows also makes it easy to operationalize the data pipelines. Workflows are represented as JSON files containing all the details of the nodes and edges of the workflow, while the jar files contain all the logic for the actual execution. The workflow can thus easily be submitted to the cluster with spark-submit or scheduled with whatever scheduling tool is in use.
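Deployment can then be as simple as pointing spark-submit at the executor jar and the workflow's JSON file. The class, jar and file names below are hypothetical placeholders, not actual Sparkflows artifacts; only the spark-submit flags themselves are standard:

```shell
# Hypothetical class/jar/file names; spark-submit flags are standard.
spark-submit \
  --master yarn \
  --class com.example.WorkflowExecutor \
  workflow-executor.jar \
  taxi-streaming-workflow.json
```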


Use-case: Streaming Taxi Data


In the rest of this post, I will describe how to design and implement a streaming taxi-data system with Sparkflows and leverage OpsClarity to monitor this data pipeline at runtime. The incoming data from taxis includes a taxi id, a timestamp and geo-coordinates. The streaming application loads this data into HBase for low-latency queries on individual taxis. It is also required to calculate the total distance traveled by a taxi in the last x minutes/hour. Moreover, rules are applied to the incoming streaming data to generate alerts when limits are breached.


Design and Implementation with Sparkflows


The data is streamed through Kafka from where Spark Streaming consumes it, processes it, applies the rules and loads the results into HBase.

This would normally be hand-coded, making it time-consuming to implement, manage, extend and deploy. Setting up and configuring monitoring can likewise take arduous days or weeks.

With Sparkflows we build the workflow using prebuilt nodes in the workflow editor, and we can test it interactively within the web browser. Once ready, we take the JSON representation of the workflow and deploy it on the Spark cluster. The whole process for this use case takes less than 30 minutes to build and deploy end to end. The execution time of each step is also available, allowing us to target the right areas when optimizing for performance.

Below is the workflow we built with Sparkflows. Schema propagation happens seamlessly through the workflow making it easy to configure each of the nodes.



[Workflow diagram: kafka-spark-workflow]

The input data consists of text lines in the following format:

taxi id, date time, longitude, latitude


The Kafka node pulls in the data from Kafka. FieldSplitter splits the line column of the incoming records into taxi_id, date_time, latitude and longitude columns. The ConcatColumns node concatenates the taxi_id and date_time columns to generate the row key for HBase, and the HBaseLoad node maps and loads the records into HBase.
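The per-record logic of these nodes can be sketched in plain Python. This is only an illustration of the splitting and row-key concatenation described above; Sparkflows' actual column handling (e.g. any delimiter placed between the concatenated fields) may differ:

```python
def to_hbase_row(line):
    """Split a raw text line (FieldSplitter) and build the HBase row key
    (ConcatColumns): taxi_id + date_time, which keeps each taxi's readings
    contiguous and sorted by time."""
    taxi_id, date_time, longitude, latitude = [f.strip() for f in line.split(",")]
    row_key = taxi_id + date_time
    return row_key, {"longitude": longitude, "latitude": latitude}

row_key, columns = to_hbase_row("42, 2016-07-01 10:05:30, -122.41, 37.77")
print(row_key)   # taxi id and timestamp concatenated into one sortable key
```

Prefixing the row key with the taxi id means a scan over one taxi's rows reads a contiguous, time-ordered range, which is what enables the low-latency per-taxi queries mentioned earlier.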




Normally it is a lengthy process to load data into HBase. The visual mapping with Sparkflows takes the complexity away.

In each micro-batch interval we can receive multiple events from a taxi. For this use case, the total distance travelled by each taxi in the interval is calculated and loaded into another HBase table. The front end querying HBase therefore does not have to perform these computations itself, which keeps the overall system scalable. For this purpose, we write a processing node called NodeDistanceComputer and plug it into the system. It is not included in the workflow above and will be covered in the next post. Sparkflows is thus developer friendly both in adding new functionality and in plugging it into the user interface.
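A minimal sketch of what such a distance computation might do per micro-batch, assuming great-circle (haversine) distance between a taxi's consecutive points; the actual NodeDistanceComputer implementation is not shown in this post:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def distance_per_taxi(events):
    """events: (taxi_id, date_time, lon, lat) tuples from one micro-batch.
    Sum the leg distances per taxi over its time-ordered points."""
    by_taxi = {}
    for taxi_id, date_time, lon, lat in sorted(events, key=lambda e: (e[0], e[1])):
        by_taxi.setdefault(taxi_id, []).append((lon, lat))
    # Pair each point with its successor and sum the legs per taxi.
    return {taxi: sum(haversine_km(*p1, *p2)
                      for p1, p2 in zip(points, points[1:]))
            for taxi, points in by_taxi.items()}
```

The per-interval totals returned here are what would be written to the second HBase table, so the front end only reads precomputed values.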

A rules engine over streaming data is also an upcoming feature in Sparkflows. It will read in rules and apply them over one or more streams of data, thus enabling complex event processing.
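Such a rules engine can be sketched as a set of predicates applied to every event in the stream. The rules, thresholds and field names below are hypothetical, chosen only to match the taxi use case:

```python
# Each rule is a predicate plus an alert message; hypothetical thresholds.
RULES = [
    (lambda e: e["speed_kmh"] > 120, "speed limit breached"),
    (lambda e: not (-90 <= e["lat"] <= 90), "invalid latitude"),
]

def apply_rules(events, rules=RULES):
    """Return (event, message) pairs for every rule an event violates."""
    return [(e, msg) for e in events for check, msg in rules if check(e)]

stream = [{"taxi_id": "t1", "speed_kmh": 135, "lat": 37.77},
          {"taxi_id": "t2", "speed_kmh": 60, "lat": 37.78}]
print(apply_rules(stream))   # one alert: t1 breached the speed limit
```

In a streaming deployment, the same rule set would be applied to each mini-batch, with the resulting alerts written out alongside the HBase loads.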


Monitoring data pipelines with OpsClarity


We monitor our data pipelines with OpsClarity. Setting it up was remarkably easy: all we had to do was install OpsClarity agents on each of the machines, and everything else was automated. OpsClarity automatically discovers all the services running on the hosts, creates logical service clusters, configures metric collection, and sets up anomaly detection for high-quality alerting and notification. This process, which typically can take weeks, was done automatically by OpsClarity within a couple of hours of installing the agents. In addition, OpsClarity provided a logical topology of our entire data pipeline.




OpsClarity gave us end-to-end visibility into the pipeline. We were able to get visibility into common concerns that typically affect NRT data pipelines, such as throughput, back-pressure and latency, all in a user-friendly and extremely intuitive topology overlaid with the health of each logical service. Because OpsClarity not only collects raw metrics but also captures an enormous amount of metadata associated with each metric and service, it can apply very specific anomaly detection and alerting rules that allow for early detection and notification of issues affecting the data pipeline.