Monthly Archives: November 2013

Top 10 Big Data and Hadoop learning resources: My favorites!


How do we train ourselves on Big Data and Hadoop technologies without spending a lot of money?

Many of us are asking this very same question because learning how to program and develop for the Big Data platform can lead to very lucrative career opportunities!

Think about the huge skill gap in the marketplace for Big Data programmers and developers.

According to McKinsey & Company, the United States alone is likely to “face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions” by 2018.

There is some top-quality, instructor-led training available in the market. However, if you want to save money and prefer self-study, you still have some excellent options.

Many people even prefer self-directed training for Hadoop and Big Data, because it leaves more room to explore.

Okay, so where should we look?

I’ve listed my ten favorite resources below that you can use for your self-paced training.

The first five are websites. Please spend some time on each of them and choose the one that works for you.

Then I’ve listed five books. The books are not free, but it is worth buying at least one or two, as they will be invaluable throughout your study. If you want to sit for the Hadoop certification or actively work on projects, some of these books will definitely come in handy.

Let’s take a look at the list now:

The websites

  1. Big Data University – This is an online educational site that offers very good free courses for beginners. These courses are self-paced and very well structured. Many of them include hands-on exercises that you can do in the cloud or on your own PC.
  2. Cloudera – Follow Cloudera and you can teach yourself a lot about Big Data. As one of the most popular and respected Big Data service and training providers, Cloudera offers excellent documentation with very good coverage of the Big Data technologies included in CDH (Cloudera’s Distribution Including Apache Hadoop). In addition, Cloudera University offers free video training sessions on the Apache Hadoop ecosystem and Big Data analytics. Use these free training videos to get started or as a refresher.
  3. Hortonworks – Why not try Hortonworks? Go to the documentation page here. I also recommend watching the video tutorials from Hortonworks University to understand the use cases. Note that Cloudera and Hortonworks both offer quick-start virtual machines (VMs) for Hadoop. Using these VMs, you can install and run a single-node Hadoop cluster on your PC in no time! For more details, refer to my earlier post here.
  4. Yahoo Hadoop tutorial – Yahoo provides a very comprehensive free tutorial in a structured format.
  5. Apache Hadoop website – Finally, don’t forget the Apache project website for Hadoop. Though it is not exactly a tutorial, you may need to come back here several times while working with these technologies.

The books

  1. Hadoop: The Definitive Guide – By Tom White (3rd Edition): This may be the best Hadoop book on the market right now. If you want to sit for the Hadoop certification, this book is a must-read.
  2. Hadoop in Action – By Chuck Lam: A very good book to start your journey into Hadoop and MapReduce programming. It’s easy to read, with excellent examples. However, the book is a bit dated now because it only covers Hadoop version 0.20.
  3. Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics – By Bill Franks: This is not a technical book. It is intended more for executive reading, with a nice, broad, high-level take on Big Data and related topics. An excellent read for all of us.
  4. Big Data: Principles and best practices of scalable realtime data systems – By Nathan Marz and James Warren: This book is available only in an early access edition for now. It’s targeted more at solution architects and application owners working to build a Big Data solution.
  5. Hadoop Operations – By Eric Sammer: This is a very good book on day-to-day operations for both Hadoop developers and administrators.

There are definitely lots of great videos on the internet covering almost all components of Big Data technologies. Just do a YouTube search and see! However, today I am limiting myself to the popular websites and books.

One thing I’ve learned in my career: learning never stops! You can jump-start with any of these five websites. For a deep dive, buy one or more of the books as needed. Happy learning!

I’ll be very happy to update the list with your suggestions and recommendations.


Try Big Data at home! HDP 2.0 is now available as a sandbox.

On October 29th, Hortonworks announced the availability of the HDP 2.0 sandbox: a self-contained virtual machine with a preconfigured single-node Hadoop cluster.

What is the HDP sandbox?
It is a personal, portable Hadoop environment with the latest HDP (Hortonworks Data Platform) 2.0 distribution, packaged up in a virtual environment where you can run Hadoop and the related projects (Hive, Pig etc.) straight from your own machine!

You can explore all the enhancements in HDP 2.0, including YARN (MapReduce 2.0) and the latest versions of other community projects such as Hive, HBase and Pig. See the Hortonworks announcement link here. The sandbox also gives you a set of training materials with step-by-step tutorials to get started on Hadoop and Big Data. (For a quick overview of how to get started on Big Data, please refer to my earlier post here.)

So what are the steps? Here they are:

  1. Download and install VMware Player or VirtualBox for your operating system, if you don’t have one already. These are usually free for personal use. I’ve used VirtualBox.
  2. Download the suitable version of the HDP 2.0 virtual appliance.
  3. Import the appliance and run Hadoop.

That’s it! I’ve used all the default configurations and it is working fine so far.
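
If you prefer the command line to the VirtualBox window, steps 2 and 3 can be sketched with VirtualBox's VBoxManage tool, as below. This is only a minimal sketch, not the procedure from the Hortonworks tutorials: the appliance file name and the VM name are hypothetical placeholders, and it assumes VBoxManage is on your PATH. By default it only prints the commands, so nothing is changed until you switch off the dry run.

```python
import shlex
import shutil
import subprocess

# Hypothetical placeholders: use the appliance file you actually downloaded
# and the VM name that VirtualBox reports after the import
# (for example via "VBoxManage list vms").
APPLIANCE_FILE = "hdp-sandbox.ova"
VM_NAME = "Hortonworks Sandbox"


def build_commands(appliance_file, vm_name):
    """Return the VBoxManage commands that import an appliance and start the VM."""
    return [
        ["VBoxManage", "import", appliance_file],
        ["VBoxManage", "startvm", vm_name],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    if shutil.which("VBoxManage") is None:
        print("VBoxManage not found on PATH; showing the commands only.")
    run(build_commands(APPLIANCE_FILE, VM_NAME), dry_run=True)
```

Keeping the command list separate from the code that runs it lets you review exactly what will be executed before you set dry_run to False.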

HDP 2.0 VM running using VirtualBox

Why try Hadoop in the sandbox? There are several advantages:

  1. No need to install Hadoop on your own. Setting up a Hadoop cluster is complex and may be intimidating for beginners.
  2. Typically these “Hadoop-in-a-box” versions come with excellent tutorials to get started.
  3. The latest versions of the other community projects in the ecosystem, such as Hive and Pig, come along in the box, tightly integrated.
  4. No extra physical PC required. Use your favorite machine and operating system.
  5. Keeps your computer clean. VirtualBox will not mess up your computer’s settings or file system.
  6. Once the sandbox is set up, you don’t need an internet connection. Play around with it anytime, anywhere!

To me, it is much easier to get to know the tool first by playing around with it, and then to learn advanced skills such as administration. Getting started with a virtual machine helps you do just that.

Also, you can quickly do a proof of concept (POC) to evaluate your use case before committing to a detailed and time-consuming implementation.

However, Hortonworks is not the only option; you can also try the Cloudera Hadoop sandbox. See the link to get the Cloudera QuickStart VM here. Note that both Cloudera and Hortonworks contribute to and distribute 100 per cent open-source Hadoop platforms.

Are there any constraints on using just the sandbox? A few came to my mind:

  1. Remember, this is just the beginning. Serious developers and administrators who go on to work on real-life projects with 100+ nodes may need more hands-on experience than a single-node sandbox can provide.
  2. The preloaded tutorials are very good for getting started, but they are not detailed. You may enroll in paid training courses if you are interested.
  3. If you set up a Hadoop cluster from scratch, you will be rewarded with a very good learning experience and much more control and flexibility overall.
Hadoop and Big Data tutorials in HDP 2.0 VM

Please share your experience after trying Big Data at home!