Hortonworks has announced its plan to include a fully certified version of Storm in its Hadoop data platform in the first half of 2014.
What is Storm and what is its role in the Big Data architecture?
Storm is an open source, language-independent, distributed, real-time computation framework.
It is an example of a Complex Event Processing (CEP) system: a system that tracks and analyzes streams of information and derives conclusions from them in real time.
Storm vs. Hadoop
So is Storm emerging as a competitor to Hadoop in the Big Data computation space?
No! The two solve different problems and actually complement each other. Let’s look at the key differences.
- A Hadoop job runs on static data, but Storm runs on streaming data.
- Hadoop processes data in batches, with latency. Storm processes data in real time. Because of this, Storm is sometimes touted as the “real-time Hadoop”.
- A Hadoop job stops when data processing is complete. A Storm process, however, never stops; it continues to process data as the stream keeps arriving.
- Hadoop offers storage but Storm does not.
So Storm is more like a pipeline for processing data: it connects various computing nodes and delivers output. Because of this characteristic, Storm is used where data arrives dynamically from any source (structured or unstructured) or must be consumed in real time from devices such as sensors, machines, or a stock trading system.
What is common between Storm and Hadoop?
Like Hadoop, Storm is a distributed data processing system. Both can process huge amounts of data in a scalable, fault-tolerant, distributed computing environment.
Both Hadoop and Storm are open source projects, offer a simplified programming model that abstracts away the complexities of distributed computation, and support several programming languages.
The important features of Storm are:
- Free and open source – licensed under Eclipse Public License.
- Simple programming model – Storm abstracts all the complexities of doing real time processing from the developer.
- Runs in any programming language – Clojure, Java, Ruby, and Python are supported by default. Other languages can be supported by implementing Storm’s simple communication protocol.
- Horizontally scalable – runs across cluster of machines in parallel.
- Guaranteed message processing – Storm guarantees that each message will be fully processed.
- Fast – processes streaming data in real time.
- Runs in two modes: local or distributed.
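The guaranteed-message-processing point deserves a closer look: conceptually, the spout holds on to each tuple until downstream processing acknowledges it, and replays the tuple on failure, giving at-least-once semantics. Below is a minimal, purely illustrative Python simulation of that replay idea; the function names and structure are ours, not Storm’s API:

```python
import random

# Illustrative simulation of at-least-once delivery: pending tuples are
# replayed until the downstream "bolt" acknowledges them by succeeding.
def unreliable_bolt(value):
    """Processes a value but randomly fails, like a flaky network hop."""
    if random.random() < 0.3:
        raise RuntimeError("processing failed")
    return value * 2

def process_with_replay(values, max_retries=10):
    results = {}
    pending = {i: v for i, v in enumerate(values)}  # tuple id -> value
    retries = 0
    while pending and retries < max_retries:
        for tid, v in list(pending.items()):
            try:
                results[tid] = unreliable_bolt(v)
                del pending[tid]        # "ack": the spout forgets the tuple
            except RuntimeError:
                pass                    # "fail": tuple stays pending, replayed
        retries += 1
    return results

if __name__ == "__main__":
    print(process_with_replay([1, 2, 3]))
```

In real Storm this bookkeeping is done per tuple tree via explicit ack/fail signals, and replay starts from the spout that emitted the tuple.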
Key components of the Storm architecture are:
- Tuple: A tuple is the basic unit of the Storm data model. It consists of a list of fields. A field in a tuple can be of a primitive type or any object type; to use a user-defined type as a field, a serializer needs to be implemented for it.
- Stream: A stream is an unbounded sequence of tuples. It abstracts the data flow in Storm.
- Spout: A spout is the source of a stream. A spout can generate its own stream or read from a queueing broker or any other external source (for example, the Twitter API).
- Bolt: A bolt is the entity that transforms streams. Bolts can implement actions like filtering, formatting, and aggregation; they can perform map and reduce tasks or communicate with external systems such as databases.
- Topology: A topology is the network of spouts and bolts. It is essentially a graph of stream transformations in which each node is a spout or a bolt. The edges of the graph determine which bolts subscribe to which streams. When a spout or bolt emits a tuple to a stream, it sends that tuple to every bolt subscribed to that stream. Once deployed, a topology runs indefinitely.
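To make the tuple/stream/spout/bolt vocabulary concrete, here is a tiny conceptual simulation in Python. It mimics the ideas only; real topologies are built with Storm’s own APIs, and all names below are ours:

```python
# Minimal conceptual model of a Storm topology: a spout emits tuples into a
# stream, and every bolt subscribed to that stream transforms them in turn.
# Purely illustrative -- this is not Storm's API.

def sentence_spout():
    """Spout: the source of the stream (here, a fixed list of sentences)."""
    for sentence in ["the cow jumped", "the moon rose"]:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: aggregates word counts (a terminal node in the graph)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the graph: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts)
```

The chaining of generators here stands in for the stream subscriptions: each bolt consumes the stream produced upstream, exactly as the edges of a topology graph describe.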
A Storm cluster consists of a master node and several worker nodes, with coordination handled by ZooKeeper.
- Nimbus: Nimbus is the daemon run by the master node. It distributes code across the Storm cluster, assigns tasks, and monitors failures. In that way, it is similar to the JobTracker in Hadoop.
- Supervisor: The Supervisor is the daemon run by each worker node. It listens for work assigned to its node and runs worker processes to get the job done, making it similar to the TaskTracker in Hadoop.
- ZooKeeper: ZooKeeper maintains the coordination between Nimbus and the Supervisors.
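The division of labor can be sketched with a toy scheduler: Nimbus decides which supervisor runs which tasks, and each supervisor runs only what it is assigned. This is an illustrative sketch assuming a simple round-robin policy; Storm’s real scheduler is more sophisticated, and the names below are ours:

```python
# Conceptual sketch of the Nimbus / Supervisor division of labor.
# Illustrative only; the real scheduling lives inside Storm's daemons.

def nimbus_assign(tasks, supervisors):
    """Round-robin task assignment, standing in for Nimbus's scheduler."""
    assignment = {s: [] for s in supervisors}
    for i, task in enumerate(tasks):
        assignment[supervisors[i % len(supervisors)]].append(task)
    return assignment

tasks = ["spout-0", "spout-1", "bolt-0", "bolt-1", "bolt-2"]
assignment = nimbus_assign(tasks, ["supervisor-a", "supervisor-b"])
print(assignment)
```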
Resources & References
- You can start from the Storm project website.
- See documentation and tutorials in the Storm wiki.
- If you feel confident, go ahead and build a cluster following the GitHub documentation. I’ve seen a good post on setting up a multinode Storm cluster here.
- Visit Hortonworks to understand their offering and to find set up information.
Storm addresses the problem of real-time processing of Big Data. By bringing batch-processing Hadoop and real-time Storm into your ecosystem, you can cover both sides of the Big Data problem. Many companies have successfully adopted Storm as a real-time data integration system to address their Big Data challenges.
The new YARN-based Hadoop architecture truly facilitates the marriage between Storm and Hadoop by enabling Storm to run on the same Hadoop platform and use the same computational and storage resources of the Hadoop cluster. It is not surprising that commercial vendors like Hortonworks are trying to integrate Storm deeply into their open source Hadoop platform offerings.
- Storm Technical Preview Available Now! (hortonworks.com)
- Enterprise Hadoop Market in 2013: Reflections and Directions (hortonworks.com)
Tagged: apache storm, ApacheHadoop, ApacheStorm, big data, Big data computation, Complex Event Processing, data architecture, HortonWork, MapReduce, Open source, real time computing, storm, Storm project, stream computing