Data ingest options in Hadoop

What are the ways to ingest data into Hadoop and other Big Data systems? In this post I'll try to consolidate them.

There are several tools and methods available to get data into the Hadoop system. These range from a simple file copy to more sophisticated tools such as Flume and Storm. The options can be overwhelming, and you need the right tool for the job. Rather than describe each tool in detail, I'll list them with their key highlights so that we know the options and can dive deeper into any of them as we choose.

  • File Transfer
    • What is it?

The simplest, "all-or-nothing" approach to getting data into Hadoop. This can be further classified into two categories:

  • Using HDFS file commands: For example, use the hadoop fs -put command for a byte-for-byte copy from a traditional file system into HDFS.
    • Pros:
      • Simple and easy to use.
      • Any type of file can be transferred: text, binary, images, etc.
    • Cons:
      • All-or-nothing batch processing approach.
      • No transformation supported.
      • Always single-threaded.
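As a sketch, a basic file transfer might look like this (the paths are hypothetical, and the commands assume a host with a configured Hadoop client and a running cluster):

```shell
# Create a target directory in HDFS (paths here are hypothetical)
hadoop fs -mkdir -p /data/raw/logs

# Byte-for-byte copy of a local file into HDFS (all-or-nothing)
hadoop fs -put /var/log/app/events.log /data/raw/logs/

# Confirm the file landed
hadoop fs -ls /data/raw/logs
```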
  • Using mountable HDFS: Allows HDFS to be mounted as a standard file system, for example via Fuse-DFS or the NFSv3 gateway. These distributed file system protocols allow access to files on a remote computer in a way similar to how a local file system is accessed. For example, with an NFS gateway for Hadoop, files can be browsed, downloaded, and written to and from HDFS as if they were on a local file system.
    • Pros:
      • Provides the same benefits as the HDFS file commands.
      • Facilitates direct HDFS access for the user, further simplifying data management.
      • Allows collaborative use of other tools that are not Hadoop-aware.
    • Cons:
      • Has the same limitations as the HDFS file commands. Random writes are still not supported; the file system is "write once, read many".
      • Costlier than the HDFS file command option, with a high TCO across multiple servers.
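For example, with an HDFS NFS gateway running, a client host can mount HDFS using a standard NFSv3 mount (the gateway host name and mount point below are illustrative):

```shell
# Mount HDFS exported by the NFS gateway (gateway host name is hypothetical)
sudo mkdir -p /hdfs_nfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /hdfs_nfs

# Tools that are not Hadoop-aware can now browse and copy files here
ls /hdfs_nfs
cp report.csv /hdfs_nfs/data/
```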
  • Sqoop
    • What is it?

Used to import bulk data from a Relational Database Management System (RDBMS) into Hadoop. Internally, Sqoop generates map-only MapReduce jobs, connects to the source database using a JDBC driver, selects the specified portion of data, and writes it into HDFS.

  • Pros:
    • Standard approach to ingest data from RDBMS tables.
    • Relatively easy to use.
    • Uses multiple mappers to parallelize data ingest (4 mappers by default).
    • Flexible and extensible.
    • Can handle incremental data ingest.
  • Cons:
    • Usage limited to RDBMS data ingestion only.
    • It is a generic solution; many vendors offer specialized products that may give better performance.
    • Security concerns with openly shared credentials.
    • Tight coupling between data transfer and the serialization format means some connectors may not support certain data formats.

Note that Sqoop is evolving into Sqoop2, with significant improvements to various features.
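A typical Sqoop import, as a sketch (the connection string, credentials, table, and column names are all hypothetical, and a running Hadoop cluster is assumed):

```shell
# Import the "orders" table from MySQL into HDFS using 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/raw/orders \
  --incremental append \
  --check-column order_id \
  --last-value 0
```

The --incremental append mode, together with --check-column and --last-value, is what enables incremental ingest: on later runs Sqoop imports only the rows whose check column exceeds the recorded last value.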

  • Flume
    • What is it?

A distributed, reliable, and available system for the efficient collection, aggregation, and movement of streaming data. Primarily used to move log data, or any massive quantity of event data from social media, network traffic, or message-queue events.

  • Pros:
    • Easy and reliable near-real-time loading of streaming data into HDFS
    • Recoverable
    • Declarative – no coding skill required
    • Highly customizable.
    • Supported by a number of Hadoop distribution providers
  • Cons:
    • Relatively complex to configure and deploy.
    • Not for real-time heavy lifting; latency can be anywhere from 10-15 minutes to half a day in a typical production system.
    • Not ideal for in-flight processing of streaming data.
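Flume's declarative nature means an agent is defined entirely in a properties file; no code is required. The sketch below (agent, file, and path names are illustrative) tails an application log, buffers events in a durable file channel, and writes them to date-partitioned HDFS directories:

```properties
# flume.conf -- one agent ("a1") with one source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log (simple, but not fully reliable across restarts)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/events.log
a1.sources.r1.channels = c1

# Channel: durable file-backed buffer, so queued events survive an agent crash
a1.channels.c1.type = file

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /data/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 300
```

The agent is then started with flume-ng agent --conf conf --conf-file flume.conf --name a1.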
  • Storm
    • What is it?

Storm is an open-source, language-independent, distributed, real-time computation framework. It is an example of a Complex Event Processing (CEP) system: a CEP system tracks and analyzes streams of information and derives conclusions from them in real time.

  • Pros:
    • Continuous real-time query with low latency
    • Distributed high volume data processing / back end processing for streaming data
    • Reliable and scalable
  • Cons:
    • Storm transforms a stream of messages into new streams, so you may need other tools to store and query Storm's output.
    • Not for static data processing.
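Operationally, a Storm topology (the code that wires spouts, i.e. stream sources, to bolts, i.e. processors) is packaged as a jar and submitted from the command line. The jar, class, and topology names below are hypothetical:

```shell
# Submit a topology to the cluster; the main class defines spouts and bolts
storm jar event-topology.jar com.example.EventTopology events-prod

# List running topologies, or stop one
storm list
storm kill events-prod
```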
  • Vendor provided connectors / integration packages
    • What is it?

This category contains commercial, vendor-supplied data integration tools (sometimes coupled with the vendor's own technologies). Examples: Oracle Big Data Connectors, Informatica PowerCenter, XPlenty's cloud ETL, etc.

  • Pros:
    • Comes as a packaged solution/suite, yielding a broader data management solution.
    • Preferred where a partnership and trust with the vendor are already established.
    • Better support (sometimes)
  • Cons:
    • Costlier options.
    • Sometimes offers only a generic solution; custom coding is still needed for specific use cases.
    • Sometimes tightly coupled to the vendor's technology or database for optimal performance.

