What is ‘Big Data’?

In simple terms, Big Data is data that can’t comfortably be processed on a single machine. We live in a world increasingly driven by data. Try googling ‘big data’ and you will find some staggering facts like these:

  • The amount of data produced by humankind from the beginning of time until 2003 was about 5 billion gigabytes. The same amount was created every two days in 2011, and every ten minutes in 2013!
  • 90% of the world’s data was generated in the last few years.
  • A report from McKinsey & Co. stated that by 2009, companies with more than 1,000 employees already stored more than 200 terabytes of data about their customers’ lives.

Big Data is often characterized by the three Vs (well, sometimes even more :P):

  • Volume refers to the size of data that you’re dealing with.
  • Variety refers to the fact that the data often comes from many different sources, in many different formats.
  • Velocity refers to the speed at which the data is being generated, and the speed at which it needs to be made available for processing.

How useful is Big Data?

It is very useful…

Think about an e-commerce website. How does Amazon recommend products similar to the ones you have looked at in the past?

Earlier, we could only store records of actual purchases. Nowadays, we can store and process all of our web server log files along with the purchase data in our traditional data warehouse. With this data we can give the customer a much better shopping experience, which should translate directly into bigger profits.

Another example is Facebook. Every day we feed Facebook’s data beast with mounds of information in the form of likes, messages and picture uploads. With data like this, Facebook knows who our friends are, what we look like, where we are, what we are doing, what we like, what we dislike, and so much more. With this information it is able to suggest friends, target us with activities and advertisements, use our past activities to learn our interests, and a lot more.

In fact, Facebook Inc. analytics chief Ken Rudin says, “Big Data is crucial to the company’s very being”.

Challenges with Big Data…

  • Storing huge amounts of data.
  • Storing data in varying formats.
  • Time taken to process the data.

This is where Hadoop comes into the picture.

Meet the Elephant - Hadoop!


Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. With its distributed processing, Hadoop handles large volumes of structured and unstructured data more efficiently than a traditional enterprise data warehouse.

How does Hadoop help?

The key concept in Hadoop is that we split the data up and store it across a collection of machines, known as a cluster. Then, when we want to process the data, we process it where it’s actually stored. Rather than retrieving the data from a central server, we process it in place, on the cluster where it already lives.
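As a rough sketch of what that looks like from a developer’s point of view, the snippet below writes a file into HDFS using Hadoop’s Java FileSystem API. The NameNode address and file path here are made-up placeholders; the point is that HDFS transparently splits the file into blocks and replicates them across the machines of the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Cluster configuration; the NameNode address below is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // We write the file once; HDFS splits it into blocks and replicates
        // those blocks across the DataNodes in the cluster.
        Path path = new Path("/user/demo/weblogs/access.log");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("192.168.0.1 - GET /products/123\n");
        }

        // Reading it back looks like ordinary file I/O, even though the
        // blocks may live on many different machines.
        System.out.println("File exists: " + fs.exists(path));
        fs.close();
    }
}
```

The same thing can also be done from the command line with the standard `hdfs dfs -put` and `hdfs dfs -ls` commands.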

History…

It all started when Google published papers describing their distributed file system (GFS) in 2003 and their processing framework, MapReduce, in 2004. At the same time, computer scientists Doug Cutting and Mike Cafarella were working on an open source web search engine called Nutch. They needed something scalable, because the web meant billions of pages and terabytes, even petabytes, of data that needed to be processed. So they decided to reimplement what Google had described, in open source. After a couple of years Nutch was up and running on 20 to 40 machines. It wasn’t perfect, it wasn’t totally reliable, but it worked. In January 2006, Doug went to work for Yahoo, and it was there that ‘Hadoop’ came into being. After years of development within the open source community, Hadoop 1.0 became publicly available in November 2012 as part of the Apache project sponsored by the Apache Software Foundation.

The Hadoop ecosystem

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called Hadoop MapReduce. Hadoop YARN (Yet Another Resource Negotiator) is the brain of the Hadoop ecosystem: it provides resource management and scheduling for user applications.
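To make the MapReduce side concrete, here is a minimal word-count sketch written against Hadoop’s Java MapReduce API (the classic introductory example, not something specific to this article): the mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs on the nodes where the input blocks live,
    // emitting a (word, 1) pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receives all the counts for a given word and adds them up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

YARN’s job is then to find resources on the cluster to run these map and reduce tasks, ideally on the nodes that already hold the relevant HDFS blocks, which is exactly the ‘process the data where it is stored’ idea described above.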

Today, Hadoop is a collection of related subprojects that fall under the umbrella of infrastructure for distributed computing.

These include:

  • Flume: A tool used to collect and move huge amounts of streaming data into HDFS.
  • HBase: An open source, nonrelational, distributed database.
  • Hive: A data warehouse infrastructure that provides data summarization, querying and analysis.
  • Oozie: A server-based workflow scheduling system to manage Hadoop jobs.
  • Phoenix: An open source, massively parallel processing, relational database engine for Hadoop that is based on Apache HBase.
  • Pig: A high-level platform for creating programs that run on Hadoop.
  • Sqoop: A tool to transfer bulk data between Hadoop and relational databases.
  • Spark: A fast engine for big data processing, capable of streaming and supporting SQL, machine learning and graph processing.
  • Storm: An open source, distributed real-time stream processing system.
  • ZooKeeper: An open source configuration, synchronization and naming registry service for large distributed systems.

Some great resources to get started with Hadoop:

Happy Hadooping!