Introduction to Hadoop

Hadoop was created under the Apache Software Foundation as an open-source software framework capable of processing large amounts of heterogeneous data sets in a distributed fashion (via MapReduce) across clusters of commodity hardware, on top of a distributed storage framework (HDFS).  Hadoop uses a simplified programming model.  The result is a reliable, shared storage and analysis system.

MapReduce is a software framework that allows developers to write programs that perform complex computations on massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.  MapReduce libraries have been written in many programming languages (most commonly Java), with different levels of optimization.  The framework works by breaking a large, complex computation into multiple tasks, assigning those tasks to individual worker nodes, and coordinating and consolidating the results.  A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
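The student-queue example above can be sketched in plain Python (this is only an illustration of the Map/Reduce pattern, not Hadoop's actual Java API; the function names are made up for this sketch):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(students):
    """Map: emit a (first_name, 1) pair for each student -- the 'sorting into queues' step."""
    return [(name, 1) for name in students]

def reduce_phase(pairs):
    """Reduce: count the pairs for each key -- the 'counting each queue' step."""
    pairs.sort(key=itemgetter(0))  # the shuffle/sort step groups identical keys together
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["Ann", "Bob", "Ann", "Carol", "Bob", "Ann"]))
```

In real Hadoop the map and reduce phases run on many nodes at once, and the framework handles the shuffle/sort between them; here everything runs in one process purely to show the shape of the computation.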

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.  When data is pushed to HDFS, it is automatically split into multiple blocks (128 MB by default), which are stored and replicated across the DataNodes, ensuring high availability and fault tolerance.  See How Does HDFS (Hadoop Distributed File System) Work? and Data Blocks in the Hadoop Distributed File System (HDFS).
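To make the block-splitting concrete, a quick back-of-the-envelope calculation (assuming the 128 MB default block size and the common default replication factor of 3):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # common default replication factor

def hdfs_footprint(file_size_bytes):
    """How many blocks a file splits into, and total bytes stored after replication."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

# A 1 GB file splits into 8 blocks, and with 3x replication
# occupies 3 GB of raw storage across the cluster.
blocks, stored = hdfs_footprint(1024 * 1024 * 1024)
```

Each of those 8 blocks lives on three different DataNodes, so losing any one machine never loses data.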

The NameNode holds information about all the other nodes in the Hadoop cluster, the files present in the cluster, the constituent blocks of those files and their locations in the cluster, and other information needed to operate the cluster.  Each DataNode is responsible for holding the actual data.  The JobTracker keeps track of the individual tasks/jobs assigned to each node and coordinates the exchange of information and results.  Each TaskTracker is responsible for running the tasks/computations assigned to it.
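Conceptually, the NameNode's metadata is a mapping from files to blocks to DataNode locations. A toy sketch (the paths, block IDs, and node names here are invented for illustration; the real NameNode keeps this in memory with a far richer structure):

```python
# Toy model of NameNode-style metadata: file -> block -> list of DataNodes holding a replica.
namenode_metadata = {
    "/logs/2015-01-01.log": {
        "blk_001": ["datanode1", "datanode2", "datanode3"],
        "blk_002": ["datanode2", "datanode3", "datanode4"],
    },
}

def locate_blocks(path):
    """A client asks the NameNode where a file's blocks live, then reads directly from DataNodes."""
    return namenode_metadata[path]
```

This is why the NameNode is so central: clients never read file data through it, but nothing can be found without it.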


A small Hadoop cluster includes a single master node and multiple worker nodes. The master node runs a JobTracker, TaskTracker, NameNode, and DataNode.  A slave or worker node acts as both a DataNode and a TaskTracker.


There are many other tools that work with Hadoop:

Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop.  It is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.  Hive supports queries expressed in a language called HiveQL, and it automatically translates these SQL-like queries into MapReduce jobs executed on the Hadoop cluster.  Hive appeals to data analysts already familiar with SQL.

Pig enables you to write programs in a procedural language called Pig Latin that are compiled into MapReduce programs and run on the cluster.  It also provides fluent controls for managing data flow.  Pig is more of a scripting language, while Hive is more SQL-like.  With Pig you can express complex data transformations on a data set, such as aggregate, join, and sort.  Pig can be extended with User Defined Functions written in Java, which Pig calls directly.
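As a rough analogue of what a Pig Latin GROUP/SUM/ORDER pipeline computes, here is the same data flow in plain Python (this is not Pig syntax; the field names and rows are invented for the sketch):

```python
from collections import defaultdict

def group_and_sum(rows, key, value):
    """Analogue of Pig's GROUP ... GENERATE group, SUM(...): total a value per key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

def order_desc(totals):
    """Analogue of Pig's ORDER ... BY total DESC."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

rows = [
    {"user": "alice", "bytes": 10},
    {"user": "bob",   "bytes": 5},
    {"user": "alice", "bytes": 7},
]
ranked = order_desc(group_and_sum(rows, "user", "bytes"))
```

In Pig, each of those steps would be one line of Pig Latin, and the compiler would turn the whole pipeline into one or more MapReduce jobs running across the cluster.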

While Hadoop is a natural choice for processing unstructured and semi-structured data like logs and files, there may be a need to process structured data stored in relational databases as well.  Sqoop (SQL-to-Hadoop) is a tool that allows you to import structured data from SQL Server and SQL Azure to HDFS and then use it in MapReduce and Hive jobs.  You can also use Sqoop to move data from HDFS to SQL Server.

The main benefits of Hadoop:

  • Provides storage for big data at a reasonable cost, since it is built around commodity hardware
  • Provides a robust environment as it was designed to provide a fault-tolerant environment and high throughput for extremely large datasets
  • Allows for the capture of new or more data such as unstructured, semi-structured, and structured in batch or real-time
  • Does not require a predefined data schema.  The consuming programs will apply structure when necessary
  • Data can be stored longer, so you no longer have to purge older data
  • Provides scalable analytics via distributed storage and distributed processing.  Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs
  • Provides rich analytics via support for languages such as Java, Ruby, Python, and R, plus libraries such as Mahout

Some of the drivers of Hadoop adoption:

  • Growing data storage needs due to explosion of unstructured data
  • Anticipated storage costs
  • Flexibility to experiment with new data sources in all shapes and sizes
  • Scalability: no data left behind
  • Peer pressure

How Hadoop fits in with the Parallel Data Warehouse (PDW) and PolyBase:

[Diagram: Hadoop with the Parallel Data Warehouse (PDW) and PolyBase]

More info:

Big Data Basics – Part 3 – Overview of Hadoop

Video Hadoop Tutorial: Core Apache Hadoop

Hadoop Tutorial: Intro to HDFS

Hadoop: What it is, how it works, and what it can do

Hive Data Warehouse: Lessons Learned

The Hadoop Distributed File System: Architecture and Design

Hadoop Distributed File System (HDFS)

Gartner Survey Highlights Challenges to Hadoop Adoption

About James Serra

James is a big data and data warehousing solution architect at Microsoft. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 25 years of IT experience.
This entry was posted in Hadoop, PDW/APS, SQLServerPedia Syndication.

8 Responses to Introduction to Hadoop

  1. Pingback: What is HDInsight? | James Serra's Blog

  2. Pingback: PolyBase explained | James Serra's Blog

  3. Megan Brooks says:

    There seems to be a problem with the blog post as it appears on SSC, although it looks fine here on your own blog. The SSC version as I am seeing it (on both IE and Chrome) is formatted non-responsively in a very wide format, such that I had to reduce the font size so much that I couldn’t see the letters in order to fit the lines on my screen.

    Your introduction to Hadoop is helpful. I have not needed to deal with “big data” myself as yet, but articles like this one help me to keep track of what’s happening in the field and be better prepared when it does come up.

  4. Gary Clausen says:

    I have played around with MongoDB, CouchBase, and Oracle NoSQL databases as well as Hadoop (Sandbox). In my opinion, Hadoop appears to be the easiest to work with especially when loading data.

  5. Pingback: Parallel Data Warehouse (PDW) benefits made simple - SQL Server - SQL Server - Toad World

  6. Pingback: Parallel Data Warehouse (PDW) benefits made simple | James Serra's Blog

  7. Pingback: Parallel Data Warehouse (PDW) AU1 released - SQL Server - SQL Server - Toad World

  8. Pingback: Hadoop and Microsoft | James Serra's Blog