Hadoop is an open source Apache project started in 2006 by Doug Cutting. It is a distributed, fault-tolerant data storage and batch processing system for very large data sets. Hadoop was built primarily on two papers published by Google, one on the Google File System and one on MapReduce, which describe how Google stores and processes massive data sets. One of Hadoop's major advantages is roughly linear scalability: you can improve processing performance by adding hardware, that is, more servers (and with them more RAM and disk) to the cluster.
Why is Hadoop getting so popular?
There are three primary reasons for Hadoop's popularity:
- Flexibility: With a file system at its core, Hadoop is extremely flexible because users are not confined to the few algorithms a vendor provides. They can analyze the data using processors attached directly to the disks that hold the large data sets.
- Scalability: Running on HDFS (Hadoop Distributed File System), Hadoop can distribute large data sets across many servers working in parallel. Hence, unlike traditional database management systems, Hadoop can scale to larger data sets simply by adding more servers, which bring additional disk, CPU and RAM to process the data.
- Economical: Because it is open source software that runs on shared commodity servers, which cost far less than high-end enterprise systems, Hadoop is less expensive than most alternatives.
For these three reasons, Hadoop is used extensively by Internet giants such as Google, Facebook, Amazon, Yahoo!, eBay and IBM, along with many other companies.
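To make the scalability point concrete, here is a minimal Python sketch of how a large file might be split into blocks and spread across servers. It is not Hadoop's actual placement logic: the function names `place_blocks` and `blocks_per_node` are hypothetical, the round-robin assignment is a simplification, and the 128 MB block size and three replicas per block are assumed values chosen for illustration.

```python
from collections import Counter


def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct nodes in round-robin order (illustrative only)."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    copies = min(replication, len(nodes))  # cannot exceed the node count
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(copies)
        ]
    return placement


def blocks_per_node(placement):
    """Count how many block replicas each node ends up holding."""
    return Counter(node for replicas in placement.values() for node in replicas)


# Assumed example: a 1024 MB file, 128 MB blocks, 3 replicas per block.
small_cluster = [f"node{i}" for i in range(4)]
large_cluster = [f"node{i}" for i in range(8)]

load_small = blocks_per_node(place_blocks(1024, 128, small_cluster))
load_large = blocks_per_node(place_blocks(1024, 128, large_cluster))
```

Under these assumptions, each of the 4 servers holds 6 block replicas, while each of the 8 servers holds only 3. Doubling the servers halves the data each one has to store and process, which is the intuition behind scaling by adding hardware.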
What is Hadoop great at?
- Multi-Petabyte Data: If your data runs into petabytes, Hadoop provides reliable storage for such large data sets.
- Batch Processing: Because the data runs into petabytes, Hadoop is not an interactive system. It is ideal for deep processing, indexing, or jobs that run on a schedule, such as hourly jobs.
- Complex Hierarchical Data: Because Hadoop is a file system at its core, it is well suited to complex hierarchical data whose schemas change often, and one can write applications that interpret the changing schemas when reading the data. It also stores both structured and unstructured data.
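The batch processing style described above follows the MapReduce model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step aggregates each group. The sketch below imitates that flow in plain Python on a single machine. It does not use any Hadoop API, and the names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an in-memory 'data set'."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


word_counts = run_job(["big data big", "data lake"])
```

For the two input lines shown, `word_counts` is `{"big": 2, "data": 2, "lake": 1}`. In a real cluster the map and reduce steps would run in parallel on many servers, each working on the blocks stored locally, which is why the whole job is batch oriented rather than interactive.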
What are some drawbacks of Hadoop?
- Append-only File System: You cannot modify data that has already been written to a file. You can only add data to the file.
- Not an Interactive System: Because Hadoop is a batch system, it takes time to process data. Hence, you cannot expect it to return results in milliseconds.
- Only for Specialists: Until recently, one needed to write custom applications, custom Java code and custom APIs to work with Hadoop. This is changing with the new tools that are now available.
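One common way applications live with the append-only restriction is to append a new version of a record instead of editing the old one, and to resolve the latest version when reading. The sketch below shows that pattern on a local file of JSON lines. It is an assumed workaround for illustration, not a Hadoop API, and the names `append_record` and `latest_view` as well as the `id` field are hypothetical.

```python
import json


def append_record(path, record):
    """Add one record to the end of the file; existing lines are never touched."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def latest_view(path):
    """Read the whole file and keep only the last version seen for each id."""
    view = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            view[record["id"]] = record  # later lines override earlier ones
    return view
```

Calling `append_record` twice with the same `id` leaves both lines in the file, but `latest_view` returns only the later one. The cost is that reads must scan the data, which fits a batch system better than an interactive one.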