Data lakes face significant data reliability challenges. Left unaddressed, these challenges can undermine analytics and machine learning initiatives.
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Data Pipeline Key Goals
Making quality data available in a reliable manner is a major determinant of success for data analytics initiatives, be they regular dashboards or reports, or advanced analytics projects drawing on state-of-the-art machine learning techniques. Data engineers tasked with this responsibility need to account for a broad set of dependencies and requirements as they design and build their data pipelines.
Data engineers typically pursue three primary goals as they work to enable the analytics professionals in their organizations:
- Deliver quality data in less time – when it comes to data, quality and timeliness are key. Data with gaps or errors (which can arise for many reasons) is unreliable, can lead to wrong conclusions, and is of limited value to downstream users. Equally, many applications require up-to-date information (who wants to use last night’s closing stock price or weather forecast?) and are of limited value without it.
- Enable faster queries – wanting fast responses to queries is natural enough in today’s “New York minute”, online world. Achieving this is particularly demanding when queries run against very large data sets.
- Simplify data engineering at scale – it is one thing to achieve high reliability and performance in a limited development or test environment. What matters more is the ability to run robust production data pipelines at scale without incurring high operational overhead.
Data Reliability Challenges with Data Lakes
- Failed Writes – if a production job that is writing data experiences a failure (inevitable in large distributed environments), the result can be data corruption through partial or multiple writes. What is needed is a mechanism that ensures a write either takes place completely or not at all (and not multiple times, adding spurious data). Recovering to a clean state after a failed job can impose a considerable burden.
- Schema Mismatch – when ingesting content from multiple sources, as is typical of large, modern big data environments, it can be difficult to ensure that the same data is encoded in the same way, i.e., that the schemas match. A similar challenge arises when the formats of data elements are changed without informing the data engineering team. Both can result in low-quality, inconsistent data that requires cleanup to improve its usability. The ability to observe and enforce schema would mitigate this.
- Lack of Consistency – a complex big data environment often mixes both batch and streaming data. Trying to read data while it is being appended to is a challenge: on the one hand there is a desire to keep ingesting new data, while on the other hand anyone reading the data prefers a consistent view. This is especially an issue when there are multiple readers and writers at work. It is undesirable and impractical, of course, to stop read access while writes complete, or to stop write access while reads are in progress.
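Delta Lake addresses all three challenges with an ordered transaction log (the `_delta_log` directory of JSON commit files). As a rough illustration of the idea only, the sketch below imitates such a log in plain Python: commits become visible atomically via a file rename, appends are rejected if their columns do not match the table schema, and readers see only the commits that existed when their snapshot was taken. `ToyDeltaTable` and its methods are hypothetical names for this sketch, not Delta Lake’s API.

```python
import json
import os
import tempfile


class ToyDeltaTable:
    """A toy, Delta-Lake-style table: an append-only directory of
    numbered JSON commit files. Illustrative only, not real Delta."""

    def __init__(self, path, schema):
        self.log_dir = os.path.join(path, "_log")
        os.makedirs(self.log_dir, exist_ok=True)
        self.schema = schema  # e.g. {"id", "price"}

    def _commit_files(self):
        # The committed history is simply the sorted list of commit files.
        return sorted(f for f in os.listdir(self.log_dir) if f.endswith(".json"))

    def append(self, records):
        # Schema enforcement: reject mismatched records before writing anything.
        for rec in records:
            if set(rec) != self.schema:
                raise ValueError(f"schema mismatch: {set(rec)} != {self.schema}")
        # Atomic commit: write to a temp file, then rename into place. A crash
        # before os.replace leaves no visible commit, so the write is
        # all-or-nothing and never partially visible to readers.
        version = len(self._commit_files())
        target = os.path.join(self.log_dir, f"{version:010d}.json")
        fd, tmp = tempfile.mkstemp(dir=self.log_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)
        os.replace(tmp, target)

    def snapshot(self):
        # Consistent reads: fix the set of commits once; appends that land
        # afterwards are simply not part of this snapshot.
        rows = []
        for name in self._commit_files():
            with open(os.path.join(self.log_dir, name)) as f:
                rows.extend(json.load(f))
        return rows


# Demo: two committed appends, then one rejected append that leaves no trace.
with tempfile.TemporaryDirectory() as d:
    table = ToyDeltaTable(d, schema={"id", "price"})
    table.append([{"id": 1, "price": 10.0}])
    table.append([{"id": 2, "price": 12.5}])
    assert len(table.snapshot()) == 2
    try:
        table.append([{"id": 3, "cost": 9.0}])  # wrong column name
    except ValueError:
        pass
    assert len(table.snapshot()) == 2  # rejected write left no partial data
```

The key design choice mirrored here is that the log, not the data files, is the source of truth: a write “happens” only when its commit file appears, which is what makes failed jobs harmless and concurrent reads consistent. Real Delta Lake adds optimistic concurrency control, checkpoints, and Parquet data files on top of this idea.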
Download the ebook, Building Reliable Data Lakes at Scale with Delta Lake, to gain a deeper understanding of the key data reliability challenges typical data lakes face and how Delta Lake helps address them.