Big Data and the open source Hadoop platform have gotten a lot of hype lately. In this guest post, Nitin Bandugula discusses how the technology might be used by businesses.
Offloading data warehouse workloads to Hadoop is a powerful and proven strategy for lowering costs, improving operational efficiency, and handling large volumes of data, including new categories of unstructured data.
But beyond these benefits, a new possibility arises: using Hadoop as a data management hub that feeds data into other analytics platforms while also serving as a Big Data analytics platform in its own right. This capability is discussed in the white paper, Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop, by Mike Ferguson of Intelligent Business Strategies.
As a data management hub, Hadoop has the capability to consume, clean, integrate, analyze and provision data to any analytical platform. When all data comes to Hadoop, it can be cleaned and transformed and then sent directly to data warehouses, NoSQL, MPP databases, analytical appliances, data marts, exploratory sandboxes and other destinations.
When considering Hadoop as an enterprise data hub, think of it as a data lake. Not only is data pumped into Hadoop, it’s also pumped out to data warehouses. One of the advantages of this approach is that in the ETL stage, analytics can be done in Hadoop before data is pumped to the warehouse, resulting in significant time savings. More importantly, as data volumes continue to grow rapidly, offloading ELT processing to Hadoop can help enterprises avoid costly data warehouse upgrades by freeing up considerable capacity on data warehouse platforms.
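The offload pattern described above can be sketched in plain Python. This is a conceptual illustration only; in practice the clean and transform steps would run as distributed jobs on the cluster, and the record layout and function names here are invented for the example:

```python
# Conceptual sketch of the ELT offload pattern: raw data lands in Hadoop,
# is cleaned and aggregated there, and only the small refined result is
# loaded into the warehouse. Field names and layout are hypothetical.

def clean(raw_records):
    """Drop malformed rows and normalize fields inside the Hadoop stage."""
    cleaned = []
    for rec in raw_records:
        if rec.get("amount") is None or rec.get("region") is None:
            continue  # discard incomplete rows rather than loading them
        cleaned.append({"region": rec["region"].strip().upper(),
                        "amount": float(rec["amount"])})
    return cleaned

def transform(cleaned):
    """Aggregate in Hadoop so the warehouse receives only a summary."""
    totals = {}
    for rec in cleaned:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals

raw = [
    {"region": " east ", "amount": "100.0"},
    {"region": "WEST", "amount": "50.5"},
    {"region": None, "amount": "10"},   # malformed: filtered out upstream
    {"region": "east", "amount": "25.0"},
]

warehouse_load = transform(clean(raw))
print(warehouse_load)  # {'EAST': 125.0, 'WEST': 50.5}
```

The point of the sketch is the shape of the flow: the heavy, row-by-row work happens before the warehouse is touched, so the warehouse ingests a summary rather than the raw feed.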
On the ETL tools front, to assimilate the insights produced on Hadoop, ETL tools must be able to extract those insights from the Hadoop platform and integrate them with other structured data before sending it to the data warehouse. Furthermore, this integration of insights from structured and unstructured data can happen on the Hadoop platform itself. The end result is that new insight derived from accelerated multi-platform analytics originating on Hadoop can be offered to users accessing a data warehouse.
The next consideration for maximizing the value derived from ETL processing in a Big Data setting is the ability to move data back and forth between Hadoop and other NoSQL and relational analytical platforms. As Ferguson points out in the white paper, with two-way data movement comes the capability to take dimension data into Hadoop and to archive data from data warehouses into Hadoop. It also becomes possible to manage data across all data stores and analytical platforms from Hadoop.
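Both directions of movement can be sketched together. The in-memory dictionaries below are hypothetical stand-ins for a warehouse table and a Hadoop store; they only illustrate the flow, not any particular tool:

```python
from datetime import date

# Hypothetical stand-ins for a warehouse fact table, a warehouse
# dimension table, and a Hadoop-side store.
warehouse_facts = [
    {"order_id": 1, "order_date": date(2010, 3, 1), "amount": 40.0},
    {"order_id": 2, "order_date": date(2014, 6, 1), "amount": 75.0},
]
warehouse_dims = [{"customer_id": 7, "segment": "retail"}]
hadoop_store = {"dims": [], "archive": []}

# Inbound: copy dimension data into Hadoop so it can be joined with
# other data during processing there.
hadoop_store["dims"] = list(warehouse_dims)

# Outbound/archival: move cold facts out of the warehouse into Hadoop,
# freeing warehouse capacity while keeping the history available.
cutoff = date(2012, 1, 1)
hadoop_store["archive"] = [r for r in warehouse_facts
                           if r["order_date"] < cutoff]
warehouse_facts = [r for r in warehouse_facts
                   if r["order_date"] >= cutoff]

print(len(hadoop_store["archive"]), len(warehouse_facts))  # 1 1
```

In a real deployment the same two flows would be handled by bulk transfer tooling rather than application code, but the division of labor is the same: reference data flows in, cold data flows out.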
To support this capability, MapR, a Hadoop vendor, provides an enhanced Hadoop distribution built to handle increased Big Data analytical demands as well as easy offloading of ETL processing from data warehouses. For instance, MapR's NFS support allows data to be moved easily into and out of Hadoop, and its enterprise-grade features ensure long-term, reliable storage. (A more detailed explanation of MapR advantages and Hadoop as a data management hub can be found on pages 10-14 of the white paper.)
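The practical appeal of NFS access is that a mounted cluster looks like an ordinary POSIX filesystem, so standard file tools and APIs are all that is needed to move data in and out. In the sketch below, a temporary directory stands in for the mount point so the example runs anywhere; on an actual MapR cluster the mount would appear under a path the administrator configures:

```python
import os
import shutil
import tempfile

# A temporary directory stands in for the NFS mount point of the cluster,
# so this sketch runs anywhere; the filenames are invented for the example.
local_dir = tempfile.mkdtemp()
cluster_mount = tempfile.mkdtemp()  # stand-in for the mounted cluster

# A local extract file, as an ETL staging job might produce.
src = os.path.join(local_dir, "staging_extract.csv")
with open(src, "w") as f:
    f.write("order_id,amount\n1,40.0\n")

# "Loading into Hadoop" is just a file copy onto the mount.
dst = os.path.join(cluster_mount, "staging_extract.csv")
shutil.copy(src, dst)

# Reading results back out is equally direct.
with open(dst) as f:
    header = f.readline().strip()
print(header)  # order_id,amount
```

Because no special client library is involved, existing scripts, schedulers, and ETL tools that already know how to read and write files can participate in the offload without modification.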
As data volumes from both structured and new unstructured data sources continue to grow at unprecedented rates, staging areas in warehouses designated for ETL data processing have by necessity become very large and very expensive. Utilizing the less costly Hadoop platform as a data management hub to support multi-platform big data analytics is an efficient and cost-effective alternative.
About the author: Nitin Bandugula is Product Marketing Manager at MapR Technologies. He holds a Master's degree in Computer Science from the Illinois Institute of Technology and an MBA from the Johnson School at Cornell University.