• Why is Big Data important to an organization?

    Big Data is widely seen as the next big thing, expected to change how businesses understand their customers, generate strategy, and go to market. While there is certainly a great deal of hype, Big Data and its associated technologies and tools bring real value to most organizations, and in many cases a transformative impact.

    First though, it is important to understand that Big Data does not mean the same thing to every organization, so the benefits of Big Data processing can differ greatly from one organization to another. While Big Data refers to very large datasets generated at high speed, there is no consensus on how big “Big” is. The point at which an organization's existing infrastructure can no longer handle the volume of data being generated and stored is the point at which it needs to start looking at technology built for much higher volumes. Big Data is also not just about volume; it is about variety. The data being collected goes well beyond transactional data and can include customer reviews in text form, Facebook likes, images, and videos. Traditional database systems are not built to handle unstructured data, so newer Big Data technologies are required for these non-traditional formats.

    So how can Big Data tools help organizations? First, the speed at which analysis is performed and insights are generated can increase dramatically, with real-time analysis replacing retrospective analysis. This is enabled by new Big Data technology that allows much faster querying and processing of even extremely large datasets.

    Second, Big Data analysis outcomes are much more powerful because they are generated from a far wider set of information than the traditional data contained in transactional databases. For example, in healthcare, doctors are increasingly able to generate evidence-based treatment programmes that take into account not only a patient's medical history but also data such as daily fitness activity and diet over time.

    Third, Big Data algorithms and analytics deliver greater predictive accuracy because of a fundamental shift in approach enabled by the availability of vast volumes of data. The new approach to predictive analysis draws on Bayesian statistics, which allows analytical algorithms to continually improve their accuracy as new data arrives.
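    The Bayesian idea in this paragraph can be sketched with a minimal beta-binomial model: the estimate of some rate (say, a conversion rate) sharpens as each new batch of observations is folded in. The numbers below are purely illustrative.

```python
def update(alpha, beta, successes, failures):
    """Fold a new batch of outcomes into a Beta(alpha, beta) prior."""
    return alpha + successes, beta + failures

alpha, beta = 1, 1  # uniform prior: no opinion before any data arrives
for successes, failures in [(12, 88), (9, 91), (11, 89)]:  # daily batches
    alpha, beta = update(alpha, beta, successes, failures)
    mean = alpha / (alpha + beta)
    print(f"estimated rate so far: {mean:.3f}")
```

    Each pass through the loop is exactly the "constantly improve accuracy based on newer data" behaviour described above: the posterior after one batch becomes the prior for the next.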

    There are many Big Data success stories and high-impact, Big Data-enabled strategies across organizations in many sectors and industries. While large companies have been quick to embrace Big Data and have invested heavily in programmes and manpower in anticipation of big returns, even smaller organizations should develop a Big Data strategy to survive and thrive in an increasingly data-driven environment.

  • The Human Face of Big Data

    Though the mystery of the missing Malaysia Airlines flight MH370 remains unresolved even after two months, the multi-national search co-ordination committee and its supporting partners are processing massive amounts of satellite, flight-path, and ocean data, commonly referred to as “Big Data”, to find clues that could lead them to the debris. Big Data is one of the most popular buzzwords in the technology industry today, with a promise of transforming our daily lives.

    The rapid evolution of the internet and of social networks spanning hundreds of millions of users has produced an explosion of information that can be analysed for trends and correlations. Many experts believe that before long we will all wear devices capable of capturing and storing every possible human interaction in real time, so that it can be retrieved whenever needed. To some extent we are already witnessing the rise of such devices in Google Glass and in wearable fitness trackers, whose popularity and adoption keep growing. Organizations worldwide have also realized the value of the immense volume of data available, and are trying to manage, analyze, and unleash the power of that data to build strategies and develop a competitive edge. The future of Big Data looks very promising, with potential applications for enterprises and individuals alike.

    Apart from providing enterprise benefits, Big Data is also addressing many of our planet's challenges in smarter ways. It has prompted a new way of thinking that views the entire human ecosystem as a nervous system, with intricate connections spread across it and abounding with information. This machine-enabled connectivity of billions of people not only lets us contribute and consume information but also gives each of us a more central role in the entire information lifecycle. The exabytes of information we generate, coupled with the processing capabilities of emerging technologies such as Hadoop, can lead to insights with a bigger impact on civilization than anything achieved before.

    In the field of utility consumption, computer scientist and entrepreneur Shwetak Patel has developed a way for households to track and monitor their utility consumption and save on their bills. The idea runs smart algorithms on data generated by wireless sensors plugged in around the home, producing saving tips that households can act on daily. Along similar lines, Opower is a publicly held company that partners with utility providers around the world and provides energy-consumption monitoring services to their customers using smart-meter technology. According to an official statement, the average customer using the Opower platform has cut energy usage by more than 2.5 percent.

    Another great example, in the field of medicine, is the use of Big Data in Canada to detect infections in ICU babies by harnessing millions of heartbeat measurements each day, flagging potential threats at least 24 hours in advance. This early detection allows doctors to get a head start on treatment and save many lives.

    In earthquake early warning, Japan invested about half a billion dollars in a hardwired ground sensor system that tracks the wave that precedes a violent earthquake. As a direct result, it managed to stop every bullet train and every factory 43 seconds before the 2011 earthquake hit. These examples are only a few of the many advances Big Data has enabled across fields as diverse as social networks, smart cities, DNA sequencing, medicine, and geophysical and ocean-depth tracking, and it is not hard to imagine a future where Big Data is part of everything.

  • What is Big Data?

    Why is everyone talking so much about Big Data? What is it about this term that has the industry in a frenzy? Are Big Data and analytics one and the same? I too had all these questions some time back. I went out and read a great many books and blogs on the subject, and today I can say that I have some understanding of Big Data, which I will share with you.

    One of the best definitions of Big Data I have found is, quite naturally, on Wikipedia: Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Compared to traditional structured data, which is typically stored in a relational database, big data is characterised by its volume, velocity, and variety. Let's understand more about these three Vs.

    By the way, it was analyst Doug Laney (then at META Group, which later became part of Gartner) who first introduced the concept of the 3Vs in his 2001 research publication, 3D Data Management: Controlling Data Volume, Velocity and Variety. It is worth mentioning that of late a few additional Vs have been doing the rounds: variability (the increase in the range of values typical of a large data set), value (the need to assess the worth of enterprise data), and volatility.


    Volume:

    Big Data first and foremost means huge, gargantuan volumes of data, generated by people, machines, and networks. It is now common for enterprise storage systems to hold terabytes and even petabytes of data. As the data grows in volume, the applications and architecture built to support it need to keep pace. This sheer size is one characteristic of Big Data.


    Velocity:

    We are in the digital age, and ‘recent data’ has a whole new meaning. Everything is real time, and update cycles are reduced to fractions of a second. This high velocity is another characteristic of Big Data: the pace at which data flows in from a variety of sources, continuously and in great amounts.


    Variety:

    When data comes in from a variety of sources, traditional and non-traditional, in both structured and unstructured form, we call it variety. These varied sources create challenges for storage, mining, and analysis, and variety is the third characteristic of Big Data.

    So why is the industry so excited about Big Data? In simple terms, big data can create a significant competitive advantage for companies in every kind of industry. It can also open up new growth opportunities. A whole new industry has even sprung up around those who manage, aggregate, and analyse this Big Data. Best of all, we as consumers also stand to gain: Big Data can improve our daily lives, and will do so even more potently in the future.

    This also makes it easier to understand the difference between Big Data and analytics. Big Data is the raw data we have access to: it is huge, comes from various sources, can be unstructured and non-traditional, and can arrive in real time. The tools and technologies used to analyze this Big Data are what we call analytics. Simple, isn't it? That's Big Data for you in a nutshell.

  • Top Hadoop Tools for Crunching Big Data

    It is often said that 90% of the world's data was created in the last two years. This explosion in data volume, variety, and velocity is termed Big Data, and if you can harness and crunch it, it will revolutionize the way you do business. It wasn't long ago that Hadoop came to market as a shiny new technology for Big Data. Things changed fast, and Hadoop is now a billion-dollar market, underpinning big data efforts by companies of all stripes.


    Hadoop is a highly scalable storage and processing platform with a fast-evolving community and a prominent ecosystem. It enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. Hadoop tools are designed for such complicated tasks and scale from single servers to thousands of machines, each offering local computation and storage. Some of the essential Hadoop tools for crunching big data:



    Apache Hadoop:

    Apache Hadoop is a free, Java-based programming framework, inspired by Google's MapReduce framework, that supports the processing of large data sets in a distributed computing environment. The entire group of map and reduce tools is often simply termed “Hadoop”. Hadoop runs applications on systems with thousands of nodes and can process petabytes of data. With Hadoop, programmers can write code for data analysis, and the framework transparently handles faults and errors on any individual machine.
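    The map-and-reduce pattern that Hadoop distributes across a cluster can be sketched in plain Python; no cluster is involved here, and the sample data is purely illustrative.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for the same key.
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle step: group intermediate pairs by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

print(map_reduce(["big data big deal", "data wins"]))
# {'big': 2, 'data': 2, 'deal': 1, 'wins': 1}
```

    Hadoop's value is not the pattern itself, which is this simple, but running the map, shuffle, and reduce phases in parallel over thousands of machines with fault tolerance.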


    HDFS (Hadoop Distributed File System):

    Hadoop Distributed File System (HDFS) is a distributed file system, under the Apache license, that offers a basic framework for splitting data collections between multiple nodes. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. In HDFS, large files are broken into blocks, and many nodes hold a file's blocks so that data can be streamed steadily and survive individual node failures.
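    The block-splitting idea can be illustrated with a few lines of Python. The 128 MB default shown in the comment is the HDFS default in recent Hadoop versions; the tiny block size in the example exists only to make the behaviour visible.

```python
def split_into_blocks(data: bytes, block_size: int = 128 * 1024 * 1024):
    """Split a byte stream into fixed-size blocks, roughly as HDFS does
    (HDFS's default block size is 128 MB in recent versions)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny block size just to make the splitting visible:
blocks = split_into_blocks(b"0123456789", block_size=4)
print(blocks)  # [b'0123', b'4567', b'89']
```

    In real HDFS each block is then replicated to several DataNodes, which is what gives the file system its fault tolerance.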



    HBase:

    HBase is a key/value, column-oriented, open-source, non-relational database management system that runs on top of HDFS. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map. HBase is written in Java; it stores data in tables, lets applications search that data, and automatically shards tables across multiple nodes so that MapReduce jobs can run against the data locally.
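    The "sparse, multidimensional, sorted map" description can be made concrete with a toy Python model. Cells are addressed by a (row key, column family, qualifier, timestamp) tuple, which is roughly how HBase addresses data; this is a conceptual sketch, not the HBase API, and all names are made up.

```python
# Toy model of an HBase-style table: a sparse dict keyed by
# (row_key, column_family, qualifier, timestamp).
table = {}

def put(row, family, qualifier, timestamp, value):
    table[(row, family, qualifier, timestamp)] = value

put("user#42", "info", "name", 1, "Ada")
put("user#42", "info", "name", 2, "Ada L.")  # newer version of the same cell
put("user#07", "info", "name", 1, "Alan")

# Scans visit keys in sorted order, like an HBase region scan:
for key in sorted(table):
    print(key, "->", table[key])
```

    Note that only cells that were actually written occupy space (the map is sparse), and multiple timestamps can coexist for one cell, mirroring HBase's versioning.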



    Pig:

    Pig Latin is a simple-to-understand data-flow language used for analysing large data sets. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, so you can analyse the data in a Hadoop cluster even if you aren't familiar with MapReduce. Pig Latin provides abstractions for handling the data, and Pig also allows users to define their own functions.


    Apache Ambari:

    Apache Ambari is the go-to tool for managing the Hortonworks Data Platform. It is an open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. It offers a web-based GUI (graphical user interface) with wizard scripts for setting up clusters with most of the standard components.


    Apache Hive:

    Apache Hive brings SQL-style querying to Hadoop. It supports the analysis of large datasets stored in HDFS and compatible file systems, and provides an SQL-like language called HiveQL whose queries are compiled into jobs that extract the required data from those files.


    NoSQL:

    The somewhat misleading term “NoSQL” stands for “Not Only SQL”, a more cloud-friendly approach than the traditional RDBMS. Some Hadoop clusters integrate with NoSQL data stores that come with their own mechanisms for storing data across a cluster of nodes. This lets them store and retrieve data with all the features of the NoSQL database, while Hadoop schedules data-analysis jobs on the same cluster.



    Apache Sqoop:

    Apache Sqoop is designed to transfer bulk data efficiently from traditional databases into Hive or HBase. Sqoop is a command-line tool that maps relational tables onto the data-storage layer, translating them into a configurable combination of HDFS, HBase, or Hive.


    Apache ZooKeeper:

    Apache ZooKeeper is offered as part of the Hadoop ecosystem as a centralized service that maintains configuration information, provides naming, and supplies distributed synchronization across a cluster. It imposes a file-system-like hierarchy on the cluster and stores metadata for the machines, so the work of the various machines can be synchronized.


    Apache Oozie:

    Apache Oozie is a Java web-application framework that combines multiple MapReduce jobs into a logical unit of work. Once a project is broken into multiple Hadoop jobs, Oozie processes them in sequence, managing the workflow by scheduling.


    Other Hadoop tools for crunching the data are Flume, Spark, MongoDB, Cassandra, Avro and Mahout.

  • What Can Big Data Do For A Small Business?

    The goal of any business, big or small, is to improve sales and profitability, and the goal of any big data effort is to improve the business. However complex big data may sound, if used correctly it can give small businesses many insights they can use to make smart, intelligent decisions.

    There are several key advantages for small businesses that use big data. They can use it to identify key customers and improve their service to them. They can understand customer patterns, know when customers are likely to come in, and reward them for repeat visits; this is also known as loyalty analytics. Whether one is in health care or the service industry, there is wide scope for applying big data analytics. All it takes is creatively asking the right kind of questions that the data can answer. Data and analytics tools have long been in use; while the analytical approach remains the same, what Big Data adds is the use of adequate and more sophisticated technology.

    As data grows, so do the IT requirements. The challenge is to bridge the gap between the business need and the IT infrastructure. To use Big Data smartly, the first requirement is to define the scope and objective of the analytics projects at hand. Based on the objective, the relevant data can be obtained. For example, a humongous amount of data can be generated at the customer-transaction level; but if the business objective is to focus on customers in a certain geography, one has to think about how to filter the data to address that business problem.
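    The geography-filtering example can be sketched in a few lines of Python. The records, field names, and region codes below are hypothetical, just to show the shape of the step.

```python
# Hypothetical transaction records; all fields are illustrative.
transactions = [
    {"customer": "C1", "region": "APAC", "amount": 120.0},
    {"customer": "C2", "region": "EMEA", "amount": 80.0},
    {"customer": "C3", "region": "APAC", "amount": 45.5},
]

# Keep only the geography the business question is actually about,
# before any heavier analysis is run.
apac = [t for t in transactions if t["region"] == "APAC"]
total = sum(t["amount"] for t in apac)
print(len(apac), total)  # 2 165.5
```

    The same filter-first discipline scales up: in a real Big Data pipeline this step would run as a query or job close to the data, so that only the relevant slice flows into the analytics.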

    Big Data technology built on Hadoop and MapReduce is great but can be expensive and is often not needed for small or mid-size businesses. Always keep an eye on the business objective before any investment is made purely on price or hype. In most cases, we think about leveraging information-management technologies like data integration and data quality to prepare data for analytics. Although this is certainly an important step, the biggest differentiator will be how business analytics is applied: determining what to do with your organizational data, which data is relevant, and how or whether data should be stored. And while there is a lot of data, finding people with the right skills is a growing challenge; some interaction and training within teams may also help build the skills needed to generate business insights from big data.

    These days an increasing number of small businesses are collecting and crunching volumes of data to lift their sales. There are many insights one can generate using Google Analytics for analysing web traffic, Facebook Insights, SumAll, and similar easily available tools. A CRM tool collects all kinds of data that enables businesses to enhance the user experience; Salesforce is a commonly used CRM platform.

    As big data grows and analytics tools become more affordable, small businesses must leverage the big data movement. It's time to look beyond the common belief that big data can only benefit large businesses. Small businesses that utilize big data will have a stronger understanding of their target markets and will be able to better cater to customers' demands. Big data is a valuable asset for businesses of all sizes.