


Seven steps to a successful data lake implementation

Flooding a Hadoop cluster with data that isn't organized and managed properly can stymie analytics efforts. Take these steps to help make your data lake accessible and usable.

The concept of the data lake originated with big data's emergence as a core asset for companies and Hadoop's arrival as a platform for storing and managing it. However, blindly plunging into a Hadoop data lake implementation won't necessarily bring your organization into the big data age -- at least, not in a successful way.

That's particularly true in cases where data assets of all shapes and sizes are funneled into a Hadoop environment in an ungoverned manner. A haphazard approach of this sort leads to several challenges and problems that can severely hamper the use of a data lake to support big data analytics applications.

For example, you might not be able to document what data objects are stored in a data lake or their sources and provenance. That makes it difficult for data scientists and other analysts to find relevant data distributed across a Hadoop cluster, and for data managers to track who accesses particular data sets and determine what level of access privileges is needed on them.
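To make the documentation problem concrete, a data catalog at its simplest is just a registry of records describing each data set's location, source, provenance and ownership. The sketch below is illustrative only -- the schema, field names and paths are hypothetical assumptions, not anything prescribed by the article or by Hadoop itself:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """One record in a minimal data lake catalog (hypothetical schema)."""
    name: str          # logical name analysts search for
    hdfs_path: str     # where the files live in the cluster
    source: str        # originating system (provenance)
    ingested_on: date  # when the data landed in the lake
    owner: str         # who grants or reviews access
    tags: list = field(default_factory=list)  # for grouping similar data

catalog = {}

def register(entry: CatalogEntry) -> None:
    """Record a data set so it can be found and governed later."""
    catalog[entry.name] = entry

def find_by_tag(tag: str) -> list:
    """Return the names of all data sets carrying a given tag."""
    return [e.name for e in catalog.values() if tag in e.tags]

register(CatalogEntry("web_clicks", "/lake/raw/web/clicks",
                      "clickstream", date(2018, 2, 1),
                      "analytics-team", tags=["web", "events"]))
register(CatalogEntry("crm_accounts", "/lake/raw/crm/accounts",
                      "crm", date(2018, 1, 15),
                      "sales-ops", tags=["crm"]))

print(find_by_tag("web"))  # -> ['web_clicks']
```

In practice this role is filled by dedicated metadata tooling rather than hand-rolled code, but the principle is the same: every object that lands in the lake gets a findable, ownable record.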

Organizing data and "bucketing" similar data objects together to help ease access and analysis is also challenging if you don't have a well-managed process.
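One common way to impose that kind of organization is a predictable directory convention: zones that separate raw input from curated, analysis-ready data, with similar objects bucketed under a shared path and partitioned by date. The zone names and path layout below are hypothetical assumptions for illustration, not a standard:

```python
from datetime import date

def lake_path(zone: str, source: str, dataset: str, d: date) -> str:
    """Build a predictable HDFS path: /lake/zone/source/dataset/date.

    Separating a 'raw' zone (untouched input) from a 'curated' zone
    (cleaned, analysis-ready data) keeps similar objects bucketed
    together, and date partitions make pruning by time period easy.
    """
    allowed_zones = {"raw", "curated"}
    if zone not in allowed_zones:
        raise ValueError(f"unknown zone: {zone}")
    return (f"/lake/{zone}/{source}/{dataset}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}")

print(lake_path("raw", "crm", "accounts", date(2018, 2, 1)))
# -> /lake/raw/crm/accounts/year=2018/month=02/day=01
```

The payoff of a convention like this is that access controls and retention rules can be applied per zone or per source directory instead of object by object.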

None of these issues has to do with the physical architecture of the data lake or the underlying Hadoop environment. Rather, the biggest impediments to a successful data lake implementation stem from inadequate planning and oversight of data management.

This was last published in February 2018