• What is Hadoop?

    Hadoop is an open-source Apache project started in 2006 by Doug Cutting. It is a distributed, fault-tolerant data storage and batch-processing system for very large datasets. Hadoop grew primarily out of two papers published by Google, the Google File System paper and the Google MapReduce paper, which describe how Google stores and processes its massive data sets. One of the major advantages of Hadoop is linear scalability: you can improve processing performance simply by adding hardware, i.e. more servers and RAM.

     

    Why is Hadoop getting so popular?

    There are 3 primary reasons for the popularity of Hadoop:

    1. Flexibility: Being a file system at its core, Hadoop is extremely flexible; users are not confined to the few algorithms a vendor provides. They can analyze the data using the processors attached directly to the disks that hold the large data sets.
    2. Scalability: Running on HDFS (the Hadoop Distributed File System), Hadoop can distribute large data sets across many servers running in parallel. Hence, Hadoop scales to large data sets simply by adding more servers and memory, in contrast to traditional database management systems.
    3. Economical: Being open-source software that runs on commodity servers, which cost far less than specialized systems, Hadoop is much less expensive than the alternatives.

    For these three reasons, Hadoop is used extensively by Internet giants such as Facebook, Amazon, Yahoo!, eBay, IBM and many other companies.

    What are the things Hadoop is great at?

    1. Multi-petabyte data: If your data runs into petabytes, Hadoop provides reliable storage for such large data sets.
    2. Batch processing: Because the data runs into petabytes, Hadoop is not an interactive system. It is ideal for deep processing, indexing, or hourly batch jobs.
    3. Complex hierarchical data: Because Hadoop is, at its core, a file system, it handles complex hierarchical data with frequently changing schemas well, and you can write applications that work with those changing schemas. Further, it supports both structured and unstructured data in the file system.

     What are some drawbacks of Hadoop?

    1. Append-only file system: You cannot modify existing file contents in place; you can only append data to files.
    2. Not an interactive system: Because Hadoop is a batch system, it takes time to process the data. You cannot expect it to return results in milliseconds.
    3. Only for specialists: Until recently, working with Hadoop required custom applications, custom Java code, and custom APIs. This is changing with the new tools now becoming available.
  • Top Hadoop Tools for Crunching Big Data

    In today’s world, 90% of all data has been created in the last two years. This explosion in data volume, variety, and velocity is termed Big Data, and if you can harness and crunch it, it will revolutionize the way you do business. It wasn’t too long ago that Hadoop came onto the market as a shiny new technology for Big Data. Things changed fast, and Hadoop is now a billion-dollar market, underpinning big data efforts by companies of all stripes.

     

    Hadoop is a highly scalable storage platform; its community is evolving fast and plays a very prominent role in the ecosystem around it. Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. Hadoop tools are designed to handle such complicated tasks and to scale up from single servers to thousands of machines, each offering local computation and storage. Some of the essential Hadoop tools for crunching big data are described below.

     

    Hadoop:

    Apache Hadoop is a free, Java-based programming framework, inspired by Google’s MapReduce paper, that supports the processing of large data sets in a distributed computing environment. Generally, the entire group of map and reduce tools is referred to as “Hadoop”. Hadoop runs applications on clusters with thousands of nodes and petabytes of data. With Hadoop, programmers can write code for data analysis, and Hadoop handles faults and errors on any individual machine.
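
    To make the programming model concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API; the class name and the input/output paths are illustrative, not part of any particular product.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // Map phase: emit (word, 1) for every word in the input split.
            public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String token : value.toString().split("\\s+")) {
                        if (!token.isEmpty()) {
                            word.set(token);
                            context.write(word, ONE);
                        }
                    }
                }
            }

            // Reduce phase: sum the counts for each word.
            public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) {
                        sum += v.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }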

     

    HDFS (Hadoop Distributed File System):

    Hadoop Distributed File System (HDFS) is a distributed file system under the Apache license that offers a basic framework for splitting data collections across multiple nodes. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. In HDFS, large files are broken into blocks, and the blocks of a file are spread (and replicated) across many nodes so that data can be streamed steadily.
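
    As a minimal sketch, the snippet below writes and then reads a small file through the HDFS Java API; the NameNode address and the file path are assumptions for illustration.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Assumed NameNode address; in practice this usually comes from core-site.xml.
                conf.set("fs.defaultFS", "hdfs://namenode:8020");
                FileSystem fs = FileSystem.get(conf);

                // Write a small file; HDFS splits large files into blocks and
                // replicates each block across several DataNodes.
                Path path = new Path("/user/demo/hello.txt");
                try (FSDataOutputStream out = fs.create(path, true)) {
                    out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back as a stream.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }
                fs.close();
            }
        }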

     

    HBase:

    HBase is a column-oriented, open-source, non-relational database management system (essentially a key/value store) that runs on top of HDFS. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map. HBase applications are written in Java. HBase stores data in tables, supports fast lookups, and automatically shards tables across multiple nodes so that MapReduce jobs can run against the data locally.
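
    A minimal sketch of writing and reading a row with the HBase Java client API; the table name "users", the column family "info", and the row key are illustrative and assumed to exist already.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseExample {
            public static void main(String[] args) throws Exception {
                // Connection settings (ZooKeeper quorum, etc.) are read from hbase-site.xml.
                Configuration conf = HBaseConfiguration.create();
                try (Connection connection = ConnectionFactory.createConnection(conf);
                     Table table = connection.getTable(TableName.valueOf("users"))) {

                    // Put: write one cell, addressed by (row key, column family, qualifier).
                    Put put = new Put(Bytes.toBytes("user-42"));
                    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                    table.put(put);

                    // Get: read the cell back by row key.
                    Result result = table.get(new Get(Bytes.toBytes("user-42")));
                    byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                    System.out.println("name = " + Bytes.toString(name));
                }
            }
        }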

     

    Pig:

    Pig Latin is a simple-to-understand data flow language used for analyzing large data sets. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster even if you aren’t familiar with MapReduce. Pig Latin provides high-level abstractions, such as loading, filtering, grouping, and joining, for handling the data, and Pig also allows users to define their own functions.
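
    As an illustration, the sketch below embeds a few Pig Latin statements in Java via PigServer; the input file, field names, and output path are assumptions.

        import org.apache.pig.ExecType;
        import org.apache.pig.PigServer;

        public class PigExample {
            public static void main(String[] args) throws Exception {
                // Each registered query is translated into MapReduce jobs by the Pig interpreter.
                PigServer pig = new PigServer(ExecType.MAPREDUCE);

                // Assumed input: a tab-separated file of (user, hits) records in HDFS.
                pig.registerQuery("visits = LOAD '/user/demo/visits.tsv' AS (user:chararray, hits:int);");
                pig.registerQuery("grouped = GROUP visits BY user;");
                pig.registerQuery("totals = FOREACH grouped GENERATE group AS user, SUM(visits.hits) AS total;");

                // Store the result back into HDFS.
                pig.store("totals", "/user/demo/visit_totals");
                pig.shutdown();
            }
        }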

    Ambari:

    Apache Ambari is the go-to tool for managing the Hortonworks Data Platform. It is an open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. It offers a web-based GUI (graphical user interface) with wizard scripts for setting up clusters with most of the standard components.

    Hive:

    Apache Hive provides a data-warehouse layer on top of Hadoop, streamlining the process of pulling the bits you need out of the underlying files. It supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems. It also provides an SQL-like language called HiveQL (HQL) whose queries are compiled into jobs that dig into the files and extract the required snippets of data.
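
    To illustrate, here is a minimal sketch of running a HiveQL query from Java over JDBC, assuming a HiveServer2 instance and a hypothetical page_views table; the hive-jdbc driver must be on the classpath.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveExample {
            public static void main(String[] args) throws Exception {
                // Register the HiveServer2 JDBC driver (older drivers need this explicitly).
                Class.forName("org.apache.hive.jdbc.HiveDriver");

                // Assumed HiveServer2 address and database.
                String url = "jdbc:hive2://hiveserver:10000/default";
                try (Connection conn = DriverManager.getConnection(url, "hive", "");
                     Statement stmt = conn.createStatement()) {

                    // HiveQL looks like SQL but is compiled into jobs that scan files in HDFS.
                    ResultSet rs = stmt.executeQuery(
                        "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10");
                    while (rs.next()) {
                        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                    }
                }
            }
        }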

    NoSQL:

    The somewhat misleading term “NoSQL” stands for “Not Only SQL” and describes a more cloud-friendly approach than a traditional RDBMS. Some Hadoop clusters integrate with NoSQL data stores that come with their own mechanisms for distributing data across a cluster of nodes. This allows them to store and retrieve data with all the features of the NoSQL database, after which Hadoop can be used to schedule data analysis jobs on the same cluster.

     

    Sqoop:

    Apache Sqoop is specially designed to transfer bulk data efficiently from traditional relational databases into HDFS, Hive, or HBase. Sqoop is a command-line tool that maps between relational tables and the Hadoop storage layer, translating the tables into a configurable combination of HDFS, HBase, or Hive.
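
    As a rough sketch, a Sqoop 1 import can also be driven from Java through Sqoop’s runTool entry point, assuming the Sqoop client libraries are on the classpath; the JDBC URL, credentials, table, and target directory below are purely illustrative.

        import org.apache.sqoop.Sqoop;

        public class SqoopImportExample {
            public static void main(String[] args) {
                // Equivalent to the "sqoop import" command line; every value here is an example.
                String[] importArgs = {
                    "import",
                    "--connect", "jdbc:mysql://dbhost/sales",
                    "--username", "etl",
                    "--password", "secret",
                    "--table", "orders",
                    "--target-dir", "/user/demo/orders",
                    "--num-mappers", "4"
                };
                // Runs the import tool and returns a shell-style exit code.
                int exitCode = Sqoop.runTool(importArgs);
                System.exit(exitCode);
            }
        }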

    Zookeeper:

    Apache ZooKeeper is offered as part of the Hadoop ecosystem as a centralized service that maintains configuration information, provides naming, and provides distributed synchronization across a cluster. It imposes a file-system-like hierarchy on the cluster and stores all of the metadata for the machines, so the work of the various machines can be synchronized.
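
    A minimal sketch of storing and reading a shared configuration value through the ZooKeeper Java API; the ensemble address, znode paths, and the stored value are illustrative.

        import java.util.concurrent.CountDownLatch;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class ZooKeeperExample {
            public static void main(String[] args) throws Exception {
                // Assumed ZooKeeper ensemble address; wait until the session is connected.
                CountDownLatch connected = new CountDownLatch(1);
                ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
                connected.await();

                // Store a small piece of shared configuration in the znode hierarchy.
                String path = "/demo/config";
                if (zk.exists("/demo", false) == null) {
                    zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                }
                if (zk.exists(path, false) == null) {
                    zk.create(path, "batch.size=128".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT);
                }

                // Any machine in the cluster can now read the same value.
                byte[] data = zk.getData(path, false, null);
                System.out.println(new String(data));
                zk.close();
            }
        }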

    Oozie:

    Oozie is a Java web application framework that combines multiple MapReduce jobs into one logical unit of work. Once a project has been broken into multiple Hadoop jobs, Oozie runs them in the defined sequence and manages the workflow through scheduling.
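
    A minimal sketch of submitting a workflow with the Oozie Java client; the Oozie server URL, the user name, and the HDFS application path holding workflow.xml are assumptions, and the workflow definition itself lives in that XML file on HDFS.

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;
        import org.apache.oozie.client.WorkflowJob;

        public class OozieExample {
            public static void main(String[] args) throws Exception {
                // Assumed Oozie server URL.
                OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

                // The workflow definition (workflow.xml) chains the MapReduce actions
                // and must already be deployed to HDFS at the application path.
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflows/etl");
                conf.setProperty("user.name", "demo");

                // Submit and start the workflow, then check its status.
                String jobId = oozie.run(conf);
                System.out.println("Submitted workflow " + jobId);

                WorkflowJob job = oozie.getJobInfo(jobId);
                System.out.println("Status: " + job.getStatus());
            }
        }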

     

    Other tools used alongside Hadoop for crunching data include Flume, Spark, MongoDB, Cassandra, Avro and Mahout.