Introduction to Blockchain

The blockchain revolution is here.

And it is disrupting everything.

Are we at the dawn of a new revolution? Is the blockchain revolution really as significant as, or even more significant than, the one brought about by the Internet?

Blockchain technology is being hailed as one of the most revolutionary technological advances of our time.

In an article titled "The Truth About Blockchain," the Harvard Business Review hails this technology as foundational because of its potential to create a new basis for our economic and social systems. The article predicts that adoption will be gradual and steady, not sudden, as waves of technological and institutional change gain momentum.

PwC (PricewaterhouseCoopers) has named blockchain a tech breakthrough megatrend for CIOs, and Gartner has placed it on its list of the top 10 strategic technologies for 2017. After hitting the peak of the Gartner hype cycle in 2017, blockchain technology is predicted by Gartner to advance most quickly in the manufacturing, government, healthcare, and education sectors in the coming years.

(Figure: Gartner hype cycle. Source: Gartner)

In this series of blog posts, we will deal with the core concepts related to blockchain technologies: patterns, cryptocurrencies, and the underlying technologies and architecture.

In this post, we start with a general introduction to blockchain and address some basic questions.

Let's get started.

What is blockchain?

Blockchain is a seamless, secure, decentralized digital ledger that records all transactions, settlements, and ledger updates for multiple parties.

It maintains this continuously growing list of transactions in records called blocks. Blockchain creates a secure way to share these pieces of information and perform transactions in a decentralized manner, without approval from a single central party such as a bank or credit card issuer. Only authorized network members can see the details of their own transactions. Every update to the shared ledger is validated and replicated to every participant's copy of the ledger, which provides transparency, security, and accuracy. Because of this, a blockchain is immutable and auditable, and its past activity can easily be traced.
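
To make the "chain of blocks" idea concrete, here is a minimal, illustrative Python sketch (not a real blockchain implementation): each block stores the hash of the previous block, so altering any historical record changes that block's hash and breaks every later link.

    import hashlib
    import json
    import time

    def block_hash(block):
        # Hash the block's contents (which include the previous block's hash).
        payload = json.dumps(block, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def new_block(transactions, prev_hash):
        return {"timestamp": time.time(),
                "transactions": transactions,
                "prev_hash": prev_hash}

    # Build a tiny chain of three blocks.
    chain = [new_block(["genesis"], prev_hash="0" * 64)]
    chain.append(new_block(["alice pays bob 5"], prev_hash=block_hash(chain[-1])))
    chain.append(new_block(["bob pays carol 2"], prev_hash=block_hash(chain[-1])))

    def chain_is_valid(chain):
        # Every block must reference the hash of the block right before it.
        return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
                   for i in range(1, len(chain)))

    print(chain_is_valid(chain))                          # True
    chain[1]["transactions"] = ["alice pays bob 500"]     # tamper with history
    print(chain_is_valid(chain))                          # False: the broken link is detected

In a real network, many independent nodes hold copies of the chain and reject any copy whose links fail this check.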

How can this change the world?

Blockchain technology enables smart contracts that are embedded in digital code and securely stored in transparent, shared databases, where they are protected from deletion, manipulation, and revision. As a result, every agreement, every process, and every payment can have a digital record and signature that can be accurately identified, validated, stored, and shared.

Third-party intermediaries like lawyers, brokers, and bankers may no longer be needed to authorize transactions. All participating players in any transaction, from any part of the world, can freely and seamlessly transact and interact with one another in an open and secure way. These features drastically cut transaction cost and time; this is the immense potential of blockchain.

What are the main obstacles?

Despite blockchain's potential impact on almost every business, several concerns are slowing its widespread adoption.

Concerns about privacy and security are among the biggest obstacles to blockchain adoption. Blockchain data are publicly visible by design, which creates regulatory concerns for governments and corporations, both of which have specific regulatory requirements to protect access to their data. Currently, there are no laws or regulations that address the new use cases blockchain adoption brings, so new regulations need to be created and existing laws amended, all of which are time-consuming processes.

Like all nascent technologies, blockchain implementations have vulnerabilities. Cyber criminals have been able to hack and steal from multiple cryptocurrency exchanges. Governments of several countries, as well as corporations, have expressed concerns over opportunities for malicious behavior and potential gaps in the security of using and transacting cryptocurrencies. Vendors are working on building strong encryption and security mechanisms, and many blockchain entities and consortiums are evaluating options to ensure the technology can be trusted to protect private information.

There are other important obstacles to blockchain adoption including its sheer complexity, high initial costs, the difficulty of integration with legacy systems, and a talent gap.

When was it introduced?

Blockchain was first described in 2008, when a revolutionary paper on peer-to-peer electronic cash, "Bitcoin: A Peer-to-Peer Electronic Cash System," was published under the pseudonym Satoshi Nakamoto. That paper introduced the term "chain of blocks," which has since evolved into the word blockchain. To this day, no one knows who Satoshi Nakamoto really is.

What is bitcoin?

The first major blockchain technology innovation was bitcoin, a digital currency. Bitcoin is digital cash that is exchanged through the Internet, peer-to-peer via a trustless system.

Every time someone buys bitcoin (or any digital coin) on a decentralized exchange, sells coins, transfers coins, or buys a good or service with virtual coins, a record of that transaction is added to the blockchain via blocks. People compete to mine bitcoins using computers to solve complex math puzzles.
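
In simplified terms, the "complex math puzzle" is a brute-force search: miners look for a number (a nonce) that makes the block's hash start with a required number of zeros. The toy Python sketch below shows the idea; real bitcoin mining hashes the block header with double SHA-256 against a far harder target.

    import hashlib

    def mine(block_data, difficulty=4):
        # Find a nonce so that SHA-256(block_data + nonce) starts with
        # `difficulty` zero characters (a toy stand-in for bitcoin's target).
        target = "0" * difficulty
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce, digest
            nonce += 1

    nonce, digest = mine("block #1: alice pays bob 5")
    print(nonce, digest)   # a nonce found after tens of thousands of tries, on average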

(In upcoming blog posts we will discuss bitcoin mining and how transactions work in more detail.)

Bitcoin has the following major characteristics:

  • Decentralized – No single authority controls bitcoin; it operates in a trustless system (see the section What is a trustless system? below).
  • Controlled supply – The total number of bitcoins that can be generated (mined) is fixed. Unlike traditional currencies such as the dollar or euro, bitcoin cannot be issued in unlimited quantities.
  • Immutable – Once complete, bitcoin transactions cannot be reversed or tampered with.
  • Divisible – The smallest unit of a bitcoin is called a satoshi; it is one hundred-millionth of a bitcoin (0.00000001). This enables micro-transactions that traditional currencies cannot support.
  • Anonymous – Bitcoin can be used to transact anonymously; participating parties do not need to identify themselves when making a transaction.

What is a trustless system?

Traditionally, transactions in conventional currencies require a trusted third party, a central authority, to keep a ledger of who owns how much. Examples of trusted third parties include banks, credit card companies, and brokers. With blockchain, no central authority or third party is needed; each part of the ecosystem can validate what the other parts are telling it without needing to trust anybody. For example, if we create a bitcoin transaction, all nodes in the distributed system that receive it verify that the signatures are valid and discard the transaction if they are not. The bitcoin blockchain is shared among all of the participants, all of whom can independently verify its validity.
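
As a rough illustration of how a node can verify a transaction without trusting anyone, the sketch below uses the third-party Python ecdsa package (an assumption made for illustration; bitcoin's actual transaction format and script rules are more involved). The sender signs the transaction with a private key, and any node can check the signature with the matching public key.

    # pip install ecdsa   (third-party package; secp256k1 is the curve bitcoin uses)
    from ecdsa import SigningKey, SECP256k1, BadSignatureError

    private_key = SigningKey.generate(curve=SECP256k1)   # kept secret by the sender
    public_key = private_key.get_verifying_key()         # shared with everyone

    transaction = b"alice pays bob 5"
    signature = private_key.sign(transaction)

    # Any node can verify independently; no central authority is involved.
    try:
        public_key.verify(signature, transaction)
        print("signature valid: accept the transaction")
    except BadSignatureError:
        print("signature invalid: discard the transaction")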

Bitcoin refers to many things!

Blockchain technologies can be confusing, as the term bitcoin refers to three different things.

  • First, bitcoin refers to the underlying blockchain technology platform.
  • Second, bitcoin refers to the protocol that dictates how assets can be transferred.
  • Third, bitcoin refers to the largest cryptocurrency (so far) in the market.

The following diagram shows these three entities and how they are interconnected.

We will take a deep dive into the underlying technologies and various implementations in upcoming posts.



Books

  • Mastering Bitcoin: Programming the Open Blockchain – Andreas M. Antonopoulos
  • Blockchain: Blueprint for a New Economy – Melanie Swan
  • Mastering Blockchain – Imran Bashir

White Papers

  • Bitcoin: A Peer-to-Peer Electronic Cash System – Satoshi Nakamoto
  • Blockchain: The Fifth Disruptive Computing Paradigm – Keegan F Denery

Internet Sources


Data ingest options in Hadoop

What are the ways to ingest data into Hadoop or other Big Data stores? In this post I'll try to consolidate them.

There are several tools and methods available to get data into the Hadoop system. These range from simple file copies to more sophisticated tools such as Flume and Storm. The options can be overwhelming, and you need the right tool for the job. I won't attempt to cover these tools in detail; rather, I'll lay them out with their key highlights so that we know the options and can take a deep dive into any of them as we choose. The suggested references provide comprehensive and in-depth overviews of these tools.

  • File Transfer
    • What is it?

The simplest, "all-or-nothing" approach to getting data into Hadoop. This can be further classified into two categories:

  • Using HDFS file commands: for example, use the hadoop fs -put command for a byte-by-byte copy from a traditional file system into HDFS (a short example follows the cons below).
    • Pros:
      • Simple and easy to use.
      • Any type of file can be transferred: text, binary, images, etc.
    • Cons:
      • All or nothing batch processing approach.
      • No transformation supported.
      • Always single-threaded.
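
As an illustration, a typical session with the HDFS file commands might look like this (the paths and file names below are just placeholders):

    hadoop fs -mkdir -p /data/raw/sales
    hadoop fs -put /local/exports/sales_2014.csv /data/raw/sales/
    hadoop fs -ls /data/raw/sales
    hadoop fs -cat /data/raw/sales/sales_2014.csv | head
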
  • Using mountable HDFS: allows HDFS to be mounted as a standard file system, for example via Fuse-DFS or an NFSv3 interface. These distributed file system protocols allow access to files on a remote computer in a way similar to how a local file system is accessed. For example, with an NFS gateway for Hadoop, files can be browsed, downloaded, and written to and from HDFS as if they were on a local file system.
  • Pros:
    • Provides the same benefits as the HDFS file commands.
    • Gives the user direct HDFS access, which further simplifies data management.
    • Allows collaborative use of other tools that are not Hadoop-aware.
  • Cons:
    • Has the same limitations as the HDFS file commands; random writes are still not supported because the file system is "write once, read many".
    • Costlier than the HDFS file command option; high TCO when multiple servers are involved.
  • Sqoop
    • What is it?

Sqoop is used to import bulk data from a Relational Database Management System (RDBMS) into Hadoop. Internally, Sqoop generates map-only MapReduce jobs, connects to the source database using a JDBC driver, selects the specified portion of data, and writes the data into HDFS.
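
For illustration, a typical Sqoop import might look like the following (the connection string, credentials, table, and target directory are placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4

The -P flag prompts for the password instead of putting it on the command line, and --num-mappers controls the degree of parallelism mentioned in the pros below.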

  • Pros:
    • Standard approach to ingest data from RDBMS tables.
    • Relatively easy to use.
    • Uses multiple mappers to parallelize data ingest (four mappers by default).
    • Flexible and extensible.
    • Can handle incremental data ingest.
  • Cons:
    • Usage limited to RDBMS data ingestion only.
    • It is a generic solution; many vendors have specialized products that may give better performance.
    • Security concerns with openly shared credentials.
    • Tight coupling between data transfer and the serialization format means some connectors may not support certain data formats.

Note that Sqoop is evolving into Sqoop2, with significant improvements to various features.

  • Flume
    • What is it?

Flume is a distributed, reliable, and available system for the efficient collection, aggregation, and movement of streaming data. It is primarily used to move log data, or any massive quantity of event data such as social media feeds, network traffic, or message queue events.
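
To give a flavor of Flume's declarative configuration, here is a minimal, illustrative agent definition that tails an application log and lands the events in HDFS (the agent, component, and path names are made up):

    # source -> channel -> sink wiring for a single agent named agent1
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # tail an application log file
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app/app.log
    agent1.sources.src1.channels = ch1

    # buffer events in memory
    agent1.channels.ch1.type = memory

    # write events to HDFS
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /data/logs/app
    agent1.sinks.sink1.channel = ch1

An agent like this would then be started with something like: flume-ng agent --conf conf --conf-file example.conf --name agent1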

  • Pros:
    • Easy and reliable way to load streaming data into HDFS in near real time.
    • Recoverable
    • Declarative – no coding skill required
    • Highly customizable.
    • Supported by a number of Hadoop distribution providers
  • Cons:
    • Relatively complex to configure and deploy.
    • Not for real-time heavy lifting. Latency can be anywhere from 10-15 minutes to half a day for a typical production system.
    • Not ideal for in-flight processing of streaming data.
  • Storm
    • What is it?

Storm is an open-source, language-independent, distributed, real-time computation framework. It is an example of a Complex Event Processing (CEP) system; a CEP system tracks and analyzes streams of information and derives conclusions from them in real time.

  • Pros:
    • Continuous real-time query with low latency
    • Distributed high volume data processing / back end processing for streaming data
    • Reliable and scalable
  • Cons:
    • Storm transforms a stream of messages into new streams, so you may need other tools to store and query the output of Storm.
    • Not for static data processing.
  • Vendor provided connectors / integration packages
    • What is it?

This category contains commercial, vendor-supplied data integration tools (sometimes coupled with the vendor's own technologies). Examples: Oracle Big Data Connectors, Informatica PowerCenter, XPlenty's ETL in the cloud, etc.

  • Pros:
    • Comes as a packaged solution/suite, yielding a broader data management solution.
    • Preferred where partnerships and trust with the vendor are already established.
    • Better support (sometimes)
  • Cons:
    • Costlier options.
    • Sometimes offers a generic solution that still needs custom coding for specific use cases.
    • Sometimes tightly coupled to vendor’s technology / database for optimal performance.

Resources / References


How to Prepare for Cloudera's Hadoop Developer Exam (CCD-410)

Cloudera's Hadoop developer certification is the most popular certification in the Big Data and Hadoop community. As I've recently cleared the CCD-410 exam, I want to take the opportunity to share a few points that helped me prepare for the exam and, more importantly, learn Hadoop in a practical way.

Here they are:

  1. Tom White's Hadoop: The Definitive Guide is an invaluable companion for clearing the exam. It may be the only book you need, as it will help you address almost all of the conceptual questions in the exam. Be sure to grab the 3rd edition (the latest to date), which covers YARN.
  2. Don't overlook the other Apache projects in Hadoop's ecosystem, like Hive, Pig, Oozie, Flume, and HBase. There will be questions testing your basic understanding of those topics. Refer to the related chapters in Tom White's book; there are also always very good YouTube videos and tutorials available on the web.
  3. Understand how to use Sqoop. The best way to start may be to create a simple table in MySQL (or any database you choose) and import the data into HDFS as well as in Hive. Understand the different features of the Sqoop tool. Again, Tom White’s book can be used as well as the Apache Sqoop user guide.
  4. Understand Hadoop fs shell commands to manipulate the files in HDFS.
  5. To clear the exam you need to be hands-on with the basics of MapReduce programming, period. You will find a lot of questions in the CCD-410 exam asking about the outcome or possible result set of a given MapReduce code snippet. What you need to know and practice is how to convert common SQL data access patterns into the MapReduce paradigm. There will also be questions testing your familiarity with the key classes used in the driver and the methods they expose (for example, the Job class and how it is used to submit a Hadoop job).

Tip: Create two simple text files with a few records, similar to the standard emp and dept tables. Load the files into HDFS. Then develop and test MapReduce programs that produce output similar to the following queries (a minimal sketch for one of them follows the list):

  • Select empl_no, empl_name from emp;
  • Select distinct dept_name from dept;
  • Select empl_no, empl_name from emp where salary > 75000;
  • Select empl_name, dept_no, salary from emp order by dept_no asc, salary desc;
  • Select dept_no, count(*) from emp group by dept_no having count(*) > 1 order by dept_no desc;
  • Select e.empl_name, d.dept_name from emp e join dept d on e.dept_no = d.dept_no;
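
To give a taste of what that practice looks like, here is a minimal, illustrative map-only job for the third query above ("where salary > 75000"), assuming a comma-delimited emp file laid out as empl_no,empl_name,dept_no,salary:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits "empl_no<TAB>empl_name" for every employee earning more than 75000.
    public class HighSalaryMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: empl_no,empl_name,dept_no,salary
            String[] fields = value.toString().split(",");
            if (fields.length >= 4 && Double.parseDouble(fields[3].trim()) > 75000) {
                context.write(new Text(fields[0] + "\t" + fields[1]), NullWritable.get());
            }
        }
    }

A driver class would then configure and submit it with the Job class (job.setMapperClass(HighSalaryMapper.class), job.setNumReduceTasks(0), and so on), which is exactly the kind of detail the exam likes to probe.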

6. You are expected to understand basic Java programming concepts. This is no sweat for those regularly working in a Java environment, but for the rest of us a basic Java refresher course will be very handy. Pay particular attention to the following topics, which will be very helpful in writing and understanding MapReduce code.

  • Regular expressions
  • String handling in Java
  • Array processing
  • The Collections Framework

7. Finally, don't forget to refer to the Cloudera website for the latest updates, study guides, and sample questions for the specific certification you are targeting.

Note that you can optionally buy a practice test from the Cloudera website. If you are well prepared and want to self-check your exam readiness, you may try it out (disclaimer: I did).

I also recommend that you go through the following article from Mark Gschwind's BI Blog. The article gives you a solid direction to jumpstart your preparation as well as your learning of Hadoop.

All the best in your journey to learn Hadoop and get certified! Please share your experience and comments.


What's new in MongoDB 2.6

MongoDB recently released version 2.6. This is a major release with significant improvements in the performance, manageability, security, and availability areas. As per the vendor, this release also builds the foundation for the much-discussed MongoDB concurrency improvements due in the next (2.8) release.

Here I've mentioned some areas that interested me. Please refer to the complete release notes on the vendor's website for more details.

  • Aggregation-related enhancements – There is no longer a restriction on the size of the result set returned by the aggregation pipeline. Aggregation can now return a cursor or write its output to a collection, and the aggregation framework has started supporting variables (see the shell sketch after this list).
  • Query optimizer improvements – This may be the most noticeable improvement in the 2.6 release. The query planner engine has been completely rewritten to improve performance and scalability. New features such as index intersection allow more than one index to be used to satisfy a query, and index filters and query plan cache methods add more sophistication to the query planning engine.
  • Index-related enhancements – Secondaries can now build indexes in the background, leaving them available for other operations while the index builds; previously this was possible only on primaries. Another improvement is the ability to resume interrupted index builds.
  • Storage-related – This version uses PowerOf2Sizes as the default allocation strategy for all new collections. Though this strategy takes more storage space, it results in lower levels of storage fragmentation and more predictable storage capacity planning.
  • Security-related – MongoDB 2.6 claims several security-related enhancements, including improved SSL support, an improved authorization system with more granular controls, centralized credential storage, and improved user management tools.
  • Search-related – Text search is enabled by default in 2.6. The new $text operator searches the content of fields covered by a text index and can also be used within the aggregation framework.
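
To give these features a concrete shape, here is a small, illustrative mongo shell snippet (the articles collection and its fields are made up): it creates a text index, filters with $text inside the aggregation pipeline, and writes the summarized result to another collection with $out.

    // text index on the body field (text search is on by default in 2.6)
    db.articles.ensureIndex({ body: "text" })

    // count matching articles per author and persist the result
    db.articles.aggregate([
      { $match: { $text: { $search: "mongodb" } } },
      { $group: { _id: "$author", posts: { $sum: 1 } } },
      { $out: "posts_by_author" }
    ])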


Additional Read / Reference

MongoDB Blog








How to start Splunking – Part 2

In the previous post we talked about Splunk, the powerful tool for searching and exploring unstructured data. We also discussed how to get Splunk up and running on our PC in no time. Today, let's explore Splunk's powerful search capability with a few examples.

The first step is to add some data. Loading files into Splunk is easy enough:

  • Click Add Data from the Welcome screen (see the Add Data panel on the right-hand side of the following screenshot).
  • To add a file, click From files and directories on the lower half of the screen that pops up.
  • Click the radio button next to Upload and Index a file.
  • Then click Save.


That’s it. You have added your file into Splunk. For more details and to get the sample files go here.

Note that I've used the sample files available on the Splunk website to show the examples in this blog; you can use the same files to explore the features, or bring your own.

What does Splunk do when you add data?

In Splunk's terminology, the data we add is called raw data. After getting the raw data, Splunk first indexes it. The indexing is done on at least four default fields: source, source type (what kind of data it is), host, and timestamp. After indexing, Splunk divides the data into individual events and orders them by timestamp. The events are arranged, searched, and returned as a result set to the user.

Before we submit our first command there are two important panels to look at:

  • The search bar at the top. We will put our search commands here.
  • The time range picker to the right of the search bar. This lets us adjust the time range: you can see events from all time, the last 15 minutes, the last day, and so on. For real-time streaming data, you can select an interval to view.

Okay, now let's enter a search term in the search bar and then click the search option. I've used a simple search criterion where I'd like to see all events containing the text "error".

Note that the page switches to the search dashboard showing a screen with results and many useful details as shown below.


Now, let's make the search a little more restricted, as I want to see "error" events only for the host "www1". We can use the following command, which adds an AND condition.

 error AND host=www1


 So far so good. Now let’s make it more interesting. Try this:

error AND host="www1" | top uri

Here the command summarizes the events by returning the most frequently occurring URIs from the host "www1" that contain the text "error". The summarized result shows the count, as well as the percentage, for each of the top URIs.


Now if you click on the Visualization tab under the search bar, you will see the graphical representation of the summarized result.


A few important things to note:

  1. Splunk commands are not case-sensitive, but field names are.
  2. We need quotation marks around phrases or field values containing whitespace, commas, pipes, or any other breaking characters.
  3. Use the pipe to pass the results of one operation as input to the next (see the example below).
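
For example, the following search chains two more commands after the keyword filter: stats counts the matching events per host, and sort orders the hosts by that count (host is one of Splunk's default fields; your results will depend on your data):

    error | stats count by host | sort -count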

That’s it for today. Explore the powerful search commands and visualize your data. Happy Splunking!

References / Further Read

How to Start Splunking

As you probably know, Splunk provides real-time operational intelligence on machine data. It takes unstructured data coming from various sources, such as websites, servers, networks, sensors, and mobile devices, and adds it all into searchable, intelligent indexes that enable you to search, analyze, and visualize your data.

A common use case for Splunk is to monitor all your devices to track down any issue or outage. Splunk will continuously search log files from all your devices: servers, firewalls, applications, databases, routers, load balancers, and so on. It then gathers all relevant pieces of information into a central index that you can rapidly search to hunt down any issue.

Splunk has lots of other interesting use cases. The company is focusing on customer enablement by building special use cases addressing its customers' demands. As per the Splunk website, "more than 7,000 enterprises, universities, government agencies and service providers in more than 90 countries use Splunk Enterprise software".

The recent announcement of Splunk's strategic alliance with Tableau will help customers combine the power of Tableau's structured data visualization with Splunk's ability to unlock unstructured data. This is a very important alliance in the advanced BI and data visualization market.

Rather than reading further, I've decided to play around with Splunk's powerful search capabilities.

Here is what I did:

You can download and install Splunk for free for personal or small-scale use. Just sign up for a free Splunk account and pick the version appropriate for your operating system.

Here are the instructions to download and install Splunk. I've installed Splunk on my Mac (OS X version 10.9.1) using the DMG installer.

After the installation, you will see the message shown on the right-hand side. Follow the instructions on the screen to start Splunk.

Quick note on OS X installation: While installing Splunk on OS X, a helper application is loaded first that displays a dialog offering several choices about what you want to do after the install. Choose the "Start and Show Splunk" option. You can run the helper application again later to either show Splunk Web or stop Splunk.

The Splunk web interface is at http://your-host-name:8000.

After the install, follow the link to open the browser window. The first time, it will ask you to log in with the username admin and the password changeme, which you will be prompted to change immediately.

After changing the password, the Splunk web interface will open and you are ready to play.

So here you are! Splunk is now running on your computer, ready to consume and unlock the power of your data.

I plan to load some sample log files to explore the search capabilities in my coming posts.

(Screenshot: Splunk home page)

Are you familiar with R?

R is free, open-source statistical analysis software. It is a full-fledged programming language intended to process large and complex datasets. R compiles and runs on a wide variety of UNIX platforms, Windows, and MacOS. Though mostly used in research, academic, and development environments, R is gradually growing in commercial use with the increasing focus on data-driven businesses and data analytics.
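
If you have never touched R, here is roughly what a first session looks like, using the built-in mtcars data set:

    # summary statistics and a simple linear model on a built-in data set
    data(mtcars)
    summary(mtcars$mpg)

    # model fuel efficiency as a function of weight and horsepower
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    summary(fit)

    # quick scatter plot
    plot(mtcars$wt, mtcars$mpg)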

As R heads into the mainstream, the skill is becoming very popular among data scientists, statisticians, and analytics professionals. As per the Dice tech salary survey, R is one of the top-paying IT skills in 2014.

If this sounds interesting, you may want to invest some time to play around with it. Here are five web resources that you can use.

  1. Try codeschool to jumpstart R without the hassle of any installation or setup.
  2. A very good introductory tutorial from John Cook.
  3. An R video tutorial from Coursera.
  4. The official R-Project website for the latest updates and references.
  5. Getting Started with the R Data Analysis Package – Norm Matloff.

One of the main reasons for R's growing popularity is that it offers benefits to companies trying to reduce their software expenditure with enterprise software vendors such as SAS and SPSS. With the increasing demand, analytics professionals, data miners, and aspiring data scientists can make a very good move by adding the important R skill set to their arsenal.

References / Further read