Data Management: Diving into Data Lakes
A roundup of development best practices
Data - let's call it what it is: critical information that your business needs to know, understand, and use. But we can't talk about data today without also delving into its mind-boggling size and diversity, and the challenges that then come with managing it and making it useful.
The volume and speed with which data now enters organizations mean that immediate storage of unstructured data is crucial, as is developing a process for easy retrieval, integration, cleaning, and protection. In light of this, many organizations are turning to data lakes as a landing zone for newly discovered data and trying to build out processes for safely turning raw data into actionable insights.
Below we've pulled together some recent insights and best practices on data lake development, some of which can be applied to any level of data management - big or small - and help your organization become more responsive to data demands.
Data Lake Definition
A data lake is a storage method/repository that holds raw data in its native format until it is needed, including structured, semi-structured, and unstructured data. The structure and requirements of the data are only defined once the data is needed.
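To make the "structure is only defined once the data is needed" idea concrete, here is a minimal Python sketch of applying a schema at read time. The sample records, the field names, and the read_with_schema helper are all hypothetical and purely illustrative; a real lake would typically rely on a query engine or data platform for this step.

```python
import json

# Raw events as they might land in the lake: nothing is enforced at write time.
# (Hypothetical sample records, for illustration only.)
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2024-01-05"}',
    '{"user": "ben", "amount": 5, "note": "gift"}',
    '{"user": "cy"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, fill defaults."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field_name, (cast, default) in schema.items():
            value = record.get(field_name, default)
            # Missing values with no default stay None instead of being cast.
            row[field_name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers read the same raw data, each with its own schema.
BILLING_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}
ACTIVITY_SCHEMA = {"user": (str, None), "ts": (str, None)}

billing_rows = read_with_schema(RAW_LINES, BILLING_SCHEMA)
activity_rows = read_with_schema(RAW_LINES, ACTIVITY_SCHEMA)
```

The raw lines are never rewritten. Each consumer decides which fields and types it cares about when it reads, so a new question later only needs a new schema, not a new ingestion job.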
Data lakes can serve many purposes. Their primary purpose may be to ingest newly discovered and/or generated data quickly so that it can be retrieved for operations and analytics. They also serve as "landing zones" for new data, where it can exist in its raw state and be routinely extracted and processed in different ways as queries and analytics evolve.
Data Lake Development Best Practice 101
We love "The Data Lakes Manifesto," published by Philip Russom of TDWI (Transforming Data with Intelligence), which outlines 10 best practices for data lake design and use, each stated as an actionable recommendation. The list, condensed below, offers useful tips for data lake development and shows how valuable data lakes can be when they are well designed and well managed. The full manifesto is listed in the further reading at the end of this post.
1. Onboard and ingest data quickly with little or no up-front improvement.
2. Control who loads which data into the lake and when or how it is loaded.
3. Persist data in a raw state to preserve its original details and schema.
4. Improve data at read time as lake data is accessed and processed.
5. Capture big data and other new data sources in the data lake.
6. Integrate data of diverse sources, structures, and vintages.
7. Extend and improve enterprise data architectures, both old and new.
8. Make each data lake serve multiple technical and architectural purposes.
9. Enable new self-service data-driven business best practices.
10. Select data management platforms that satisfy data lake requirements.
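As a rough illustration of practices 1 through 3 above (ingest quickly, control who loads what, and persist data in its raw state), here is a small Python sketch. The ALLOWED_LOADERS table, the ingest_raw function, the folder layout, and the sidecar metadata file are our own assumptions for the example, not something prescribed by the manifesto.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical allow-list: which loaders may write into which source area.
ALLOWED_LOADERS = {
    "crm_export": {"etl_service"},
    "web_logs": {"etl_service", "analyst_ana"},
}


def ingest_raw(lake_root, source, loader, filename, payload):
    """Land bytes unchanged in the lake and record who loaded them and when."""
    if loader not in ALLOWED_LOADERS.get(source, set()):
        raise PermissionError(f"{loader} may not load into {source}")

    target_dir = Path(lake_root) / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    # No cleaning or reshaping on the way in: the payload is stored as-is.
    (target_dir / filename).write_bytes(payload)

    metadata = {
        "source": source,
        "loader": loader,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
    # Sidecar file next to the data so the load can be audited later.
    (target_dir / (filename + ".meta.json")).write_text(json.dumps(metadata, indent=2))
    return metadata


# Usage: land one log file in a temporary "lake" and read it back untouched.
with tempfile.TemporaryDirectory() as lake_root:
    receipt = ingest_raw(
        lake_root, "web_logs", "analyst_ana", "2024-01-05.log", b"GET /home 200\n"
    )
    stored = (Path(lake_root) / "raw" / "web_logs" / "2024-01-05.log").read_bytes()
```

Storing a checksum and loader name alongside the untouched bytes keeps ingestion fast while still leaving a trail of who loaded which data and when.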
Data Lake Development and Good Governance
A data lake acts as a landing zone for newly discovered information and, as such, should set the stage for good data management, protection, and compliance protocols right from the start. Core General Data Protection Regulation (GDPR) concerns, such as tracking data lineage, managing data lifecycles, and monitoring personally identifiable information, all come into play in this very first engagement with new data.
Being intentional about instilling good governance principles right off the bat not only helps ensure compliance and security but also turns data storage into something much more useful for the organization. Zaloni, a data lake servicer, recommends that users stop thinking of a data lake as a simple repository and instead treat it as an operationalized platform where protocols are fully developed.
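The governance pieces mentioned in this section (lineage, lifecycle, and monitoring of personally identifiable information) can be captured at the moment data lands. The Python sketch below is one hedged way to picture that: CatalogEntry, flag_pii, the 30-day retention value, and the two regular expressions are all hypothetical, and the patterns are far too simple for real compliance work.

```python
import re
from dataclasses import dataclass, field
from datetime import date, timedelta

# Deliberately simple, illustrative patterns; real PII detection needs more care.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


@dataclass
class CatalogEntry:
    dataset: str
    derived_from: list       # lineage: names of upstream datasets or systems
    ingested_on: date
    retention_days: int      # lifecycle: how long the data may be kept
    pii_fields: dict = field(default_factory=dict)  # field name -> PII kind

    def expires_on(self):
        """Date after which the dataset should be reviewed or deleted."""
        return self.ingested_on + timedelta(days=self.retention_days)


def flag_pii(records):
    """Return {field_name: pii_kind} for fields whose values match a pattern."""
    flagged = {}
    for record in records:
        for name, value in record.items():
            for kind, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    flagged.setdefault(name, kind)
    return flagged


# Hypothetical sample rows from a newly landed dataset.
sample = [
    {"id": "1", "contact": "ana@example.com", "city": "Lisbon"},
    {"id": "2", "contact": "ben@example.org", "city": "Oslo"},
]

entry = CatalogEntry(
    dataset="raw/crm_export/contacts",
    derived_from=["crm_system"],
    ingested_on=date(2024, 1, 5),
    retention_days=30,
    pii_fields=flag_pii(sample),
)
```

Recording this kind of entry on first contact with the data is what moves the lake from a simple repository toward a platform with governance built in.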
Data Lake Development: The Agile Approach
Finally, McKinsey & Co. highlights what it calls an agile approach to data lake development, which helps companies launch analytics programs quickly and establish a data-friendly culture for the long term. An agile approach incorporates feedback throughout the development process while allowing data engineers to tinker with infrastructure, such as internal processes and data governance protocols, as the data lakes are filled. This fosters the creation of something useful, nimble, and responsive to each company's data demands and requirements. McKinsey's article on the approach is listed in the further reading below.
Stay tuned for much more on data lakes, storage, and protection over the course of the next few months as we continue to explore all the biggest 'big data' trends.
For more reading on data lakes, take a look at some of the following articles:
Data Lakes Checklist for Success from Zaloni
The Data Lakes Manifesto from TDWI
A Smarter Way to Jump into Data Lakes from McKinsey & Co.