What is a Data Lake?
A data lake and a data warehouse are both design patterns, but they pull in opposite directions. Data warehouses structure and package data for quality, consistency, reuse, and high-concurrency performance. Data lakes complement warehouses with a design pattern that focuses on original raw-data fidelity and long-term storage at low cost, while providing a new form of analytical agility.
The Value in Data Lakes
Data lakes meet the need to economically harness and derive value from exploding data volumes. This “dark” data from new sources—web, mobile, connected devices—was often discarded in the past, but it contains valuable insight. Massive volumes, plus new forms of analytics, demand a new way to manage and derive value from data.
A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines.
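As a small illustration of capturing raw data with full fidelity, the sketch below lands hypothetical clickstream events in a raw zone exactly as received, partitioned by ingestion date. The paths, file layout, and event fields are assumptions made for the example, not a prescribed implementation.

import json
from datetime import date, datetime, timezone
from pathlib import Path

# Hypothetical raw zone; in practice this would be a cloud object store or HDFS path
RAW_ZONE = Path("lake/raw/web_events")

def land_raw_events(events: list[dict]) -> Path:
    """Write events exactly as received, partitioned by ingestion date."""
    partition = RAW_ZONE / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"{datetime.now(timezone.utc):%H%M%S%f}.jsonl"
    with target.open("w") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")  # no normalization: keep raw fidelity
    return target

# Hypothetical clickstream events captured as-is for later refinement
land_raw_events([
    {"user": "u123", "action": "page_view", "url": "/pricing"},
    {"user": "u456", "action": "click", "element": "signup"},
])

Downstream facilities such as data marts or recommendation engines can then refine this raw layer on their own schedules without forcing structure on it at capture time.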
Prior to the big data trend, data integration created value by normalizing information into some form of persistence, such as a database. That alone is no longer enough to manage all the data in the enterprise, and attempting to structure everything undermines the value. That is why dark data is rarely captured in a database, yet data scientists often dig through dark data to find a few facts worth repeating.
Data Lake and New Forms of Analytics
Technologies such as Spark and other innovations enable the parallelization of procedural programming languages, and this has enabled an entirely new breed of analytics. These new forms of analytics, such as graph, text, and machine learning algorithms, can be processed efficiently at scale; they produce an answer, compare that answer against the next piece of data, and iterate until a final output is reached.
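The sketch below shows the flavor of such iterative, parallel analytics using PySpark's machine learning library. The data, column names, and application name are hypothetical; in a real lake the DataFrame would be read from lake storage rather than created inline.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("lake-analytics-sketch").getOrCreate()

# In practice: spark.read.parquet("s3a://corp-lake/curated/user_activity/")
df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.9, 0.3), (1.0, 2.8, 1.9)],
    ["label", "clicks_per_day", "minutes_on_site"],
)

features = VectorAssembler(
    inputCols=["clicks_per_day", "minutes_on_site"], outputCol="features"
).transform(df)

# Iterative training: each pass refines the previous answer until the model
# converges or maxIter is reached, with the work distributed across the cluster.
model = LogisticRegression(maxIter=20).fit(features)
print(model.coefficients, model.intercept)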
Data Lake and Corporate Memory Retention
Archiving data that has not been used in a long time can save storage space in the data warehouse. Until the data lake design pattern came along, the only places to keep colder data for occasional access were the high-performing data warehouse or offline tape backup. With virtual query tools, users can easily access cold data in conjunction with the warm and hot data in the data warehouse through a single query, as sketched below.
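A hedged example of such a virtual query, using Spark as the query layer: cold order history archived to lake storage is combined with current orders still served by the warehouse in one logical statement. The table names, paths, and JDBC connection details are assumptions, not references to any specific product.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtual-query-sketch").getOrCreate()

# Cold, rarely touched history archived as Parquet in low-cost lake storage
cold_orders = spark.read.parquet("s3a://corp-lake/archive/orders_2015_2019/")

# Hot, recent data still living in the data warehouse, reached over JDBC
hot_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5432/edw")
    .option("dbtable", "sales.orders_current")
    .option("user", "analyst")
    .option("password", "<redacted>")
    .load()
)

# One logical query spanning both temperature tiers
cold_orders.createOrReplaceTempView("orders_cold")
hot_orders.createOrReplaceTempView("orders_hot")
spark.sql("""
    SELECT customer_id, SUM(amount) AS lifetime_value
    FROM (SELECT customer_id, amount FROM orders_cold
          UNION ALL
          SELECT customer_id, amount FROM orders_hot) AS all_orders
    GROUP BY customer_id
""").show()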
Data Lake and Data Integration
The industry has come full circle on how best to control data transformation costs. The data lake offers greater scalability than traditional ETL (extract, transform, load) servers at a lower cost, forcing companies to rethink their data integration architecture. Organizations employing modern best practices are rebalancing hundreds of data integration jobs across the data lake, data warehouse, and ETL servers, as each has its own set of capabilities and economics; a sketch of one such rebalanced job follows.
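For illustration, the sketch below pushes the transform step of one integration job onto the lake's own compute (Spark here) rather than a dedicated ETL server, writing the curated result back into the lake for downstream marts. The paths, field names, and cleansing rules are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-in-the-lake").getOrCreate()

# Raw events landed earlier in the lake's raw zone
raw = spark.read.json("s3a://corp-lake/raw/web_events/ingest_date=2024-01-15/")

curated = (
    raw.filter(F.col("user").isNotNull())                   # basic cleansing
       .withColumn("event_ts", F.to_timestamp("event_ts"))  # typing
       .dropDuplicates(["user", "event_ts"])                 # deduplication
)

# Land the curated layer back in the lake, partitioned for downstream marts
curated.write.mode("overwrite").partitionBy("action").parquet(
    "s3a://corp-lake/curated/web_events/"
)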
Common Data Lake Pitfalls
On the surface, data lakes appear straightforward, offering a way to manage and exploit massive volumes of structured and unstructured data. But they are not as simple as they seem, and failed data lake projects are not uncommon across many industries and organizations. Early data lake projects faced challenges because best practices had yet to emerge. Now a lack of solid design is the primary reason data lakes don’t deliver their full value.
Data silo and cluster proliferation: There is a notion that data lakes have a low barrier to entry and can be stood up ad hoc in the cloud. This leads to redundant data, inconsistency because no two data lakes reconcile, and synchronization problems.
Conflicting objectives for data access: There is a balancing act between how strict security measures should be and how agile access can be. Plans and procedures need to be in place that align all stakeholders.
Limited commercial-off-the-shelf tools: Many vendors claim to connect to Hadoop or cloud object stores, but the offerings lack deep integration, and most of these products were built for data warehouses, not data lakes.
Lack of end user adoption: Users have the perception—right or wrong—that it’s too complicated to get answers from data lakes because doing so requires advanced coding skills, or they can’t find the needles they need within the data haystacks.
Data Lake Design Pattern
The data lake design pattern offers a set of workloads and expectations that guide a successful implementation. As data lake technology and experience have matured, an architecture and corresponding requirements have evolved to the point that leading vendors agree on best practices for implementation. Technologies are critical, but the design pattern, which is independent of technology, is paramount. A data lake can be built on multiple technologies. While the Hadoop Distributed File System (HDFS) is what most people think of first, it is not required.
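One way to see that technology independence concretely: the same read logic can work whether the lake sits on a local file system, HDFS, or a cloud object store, with only the storage URI changing. The paths below are hypothetical, and each backend still needs to be provisioned and configured.

import pyarrow.dataset as ds

# The same dataset read, pointed at different storage technologies
for uri in (
    "/mnt/lake/curated/web_events",                    # local or network file system
    "hdfs://namenode:8020/lake/curated/web_events",    # Hadoop HDFS
    "s3://corp-lake/curated/web_events",               # cloud object store
):
    try:
        table = ds.dataset(uri, format="parquet").to_table()
        print(uri, "->", table.num_rows, "rows")
    except Exception as exc:  # backend not configured in this environment
        print(uri, "-> not reachable:", exc)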