Data management mistakes can ruin your data lake journey
Data lakes pose technology deployment and data management challenges that can leave analytics users high and dry if the implementation process isn't handled properly.
The concept of the data lake originated with big data's emergence as a core asset for companies and Hadoop's arrival as a platform for storing and managing it. However, blindly plunging into a Hadoop data lake implementation won't necessarily bring your organization into the big data age -- at least, not in a successful way.
That's particularly true in cases where data assets of all shapes and sizes are funneled into a Hadoop environment in an ungoverned manner. A haphazard approach of this sort leads to several challenges and problems that can severely hamper the use of a data lake to support big data analytics applications.
For example, you might not be able to document what data objects are stored in a data lake or their sources and provenance. That makes it difficult for data scientists and other analysts to find relevant data distributed across a Hadoop cluster, and for data managers to track who accesses particular data sets and determine what level of access privileges is needed for them.
Organizing data and "bucketing" similar data objects together to help ease access and analysis is also challenging if you don't have a well-managed process.
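To make those gaps concrete, here is a minimal sketch of the kind of metadata a managed process would capture for each data object: its source, its provenance, the "bucket" it belongs to and a record of who accessed it. It is an in-memory illustration only, not a Hadoop feature or any product's API; the class names, field names, roles and file paths are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata for one data object in the lake (hypothetical field names)."""
    path: str        # where the object lives in the lake
    source: str      # system the data came from
    provenance: str  # how the data was produced or loaded
    bucket: str      # logical grouping of similar data objects
    allowed_roles: set = field(default_factory=set)
    access_log: list = field(default_factory=list)


class DataCatalog:
    """Toy in-memory catalog; a real deployment would persist this metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Document the object at ingestion time, keyed by its path.
        self._entries[entry.path] = entry

    def find_by_bucket(self, bucket):
        # Let analysts locate related data objects by their logical bucket.
        return sorted(e.path for e in self._entries.values() if e.bucket == bucket)

    def record_access(self, path, user, role):
        # Log every access attempt and report whether the role is permitted.
        entry = self._entries[path]
        allowed = role in entry.allowed_roles
        entry.access_log.append((datetime.now(timezone.utc), user, role, allowed))
        return allowed


# Hypothetical usage: register two objects, then look them up and log an access.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    path="/lake/raw/clickstream/part-0001.json",
    source="web servers",
    provenance="nightly batch load",
    bucket="clickstream",
    allowed_roles={"data_scientist"},
))
catalog.register(CatalogEntry(
    path="/lake/raw/orders/part-0001.csv",
    source="order management system",
    provenance="daily export",
    bucket="orders",
    allowed_roles={"finance_analyst"},
))
print(catalog.find_by_bucket("clickstream"))
print(catalog.record_access("/lake/raw/orders/part-0001.csv", "avery", "data_scientist"))
```

Even a record this simple answers the questions raised above: what is stored, where it came from, which similar objects sit alongside it and who has tried to use it.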
None of these issues have to do with the physical architecture of the data lake or the underlying Hadoop environment. Rather, the biggest impediments to a successful data lake implementation result from inadequate planning and oversight of data management.