Serving layer
The serving layer consists of various operations that require either sequential or random disk access, depending on how the data is delivered. For instance, if an export of a large dataset is required, the serving layer would typically trigger a batch process to export the required data, which relies largely on sequential data access. But if the data is delivered through data services, the underlying storage must support random data access to ensure that the response-time expectations of the service are met.
Figure 10: Data Access Patterns
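The two access patterns described above can be sketched with a small, self-contained example. The fixed-width record layout, file name, and record count are illustrative assumptions; the point is that a batch export reads the file front to back, while a service-style lookup seeks directly to one record's byte offset.

```python
import os
import tempfile

RECORD_SIZE = 16  # fixed-width records make byte offsets computable

def write_records(path, n):
    # Write n sequential 16-byte records, zero-padded for fixed width.
    with open(path, "wb") as f:
        for i in range(n):
            f.write(f"{i:016d}".encode())

def batch_export(path):
    """Sequential access: scan every record in storage order."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD_SIZE):
            yield chunk.decode()

def point_lookup(path, record_no):
    """Random access: seek straight to one record's offset."""
    with open(path, "rb") as f:
        f.seek(record_no * RECORD_SIZE)
        return f.read(RECORD_SIZE).decode()

path = os.path.join(tempfile.mkdtemp(), "records.dat")
write_records(path, 1000)
print(len(list(batch_export(path))))  # full sequential export: 1000 records
print(point_lookup(path, 42))         # single random read
```

On spinning disks the sequential scan benefits from contiguous reads, while each random `seek` pays a head-movement penalty, which is exactly the trade-off the serving layer must plan around.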
Thus, the data in a Data Lake can be broadly classified into two categories on the basis of access pattern: Indexed Data and Non-Indexed Data.
- Indexed Data: Maintaining indexed data in a Data Lake means maintaining data that can be randomly addressed and accessed. The underlying hardware plays a vital role in supporting this storage and access pattern. As mentioned before, SSD would surely be a good fit here. But other factors, such as cost, failure frequency, availability, storage volume, and power consumption, motivate us to also consider spinning disks, at the expense of a trade-off in I/O rates. One can also adopt tiered storage here, such that indexes for high-I/O, high-transaction data stay on SSD while the rest fall to spinning disk.
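The tiered-storage idea above can be sketched as a simple placement policy. The tier names, index names, and the access-rate threshold are illustrative assumptions, not part of any particular product:

```python
# Route "hot" indexes (high lookup rates) to SSD and the rest to
# spinning disk. The threshold is an assumed tuning parameter.
HOT_THRESHOLD = 1000  # lookups/sec above which an index is "hot"

def choose_tier(lookups_per_sec):
    return "ssd" if lookups_per_sec >= HOT_THRESHOLD else "spinning_disk"

indexes = {
    "user_profile_idx": 4200,   # heavily hit by data services
    "audit_archive_idx": 3,     # rarely queried
}
placement = {name: choose_tier(rate) for name, rate in indexes.items()}
print(placement)
# {'user_profile_idx': 'ssd', 'audit_archive_idx': 'spinning_disk'}
```

In practice the access rates would come from monitoring, and placement would be re-evaluated periodically as workloads shift between hot and cold.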
Today, almost all data indexing frameworks support both SSDs and spinning disks. Some of the leading frameworks in this space, which have been widely used in the context of Big Data technology and Lambda Architecture, are Solr and Elasticsearch. Both are built on the open source Lucene engine, relying on it for core indexing capabilities while adding their own capabilities around data indexing and access. These frameworks are designed to deliver sub-second response times over large volumes of data, which fits the speed layer's need for fast lookups and persistence. Both Elasticsearch and Solr store indexes of the data and can optionally store the data itself for fast lookup.
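The core structure Lucene provides to both Solr and Elasticsearch is an inverted index: a map from each term to the documents that contain it. The toy version below (document ids and contents are illustrative) shows why lookups stay fast as data grows: a query is a dictionary probe plus a set intersection, not a scan of the corpus.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return doc ids containing every query term (AND semantics)."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "data lake serving layer",
    2: "indexed data for fast lookup",
    3: "batch export of raw data",
}
index = build_inverted_index(docs)
print(search(index, "data lookup"))  # -> {2}
```

Real Lucene indexes add term dictionaries, compressed postings lists, and segment merging on top of this idea, but the query path is the same in spirit: jump to the terms, intersect their postings.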
- Non-Indexed Data: Raw data ingested into a data lake is generally stored sequentially, in blocks that form the unit of data for processing. Since non-indexed data is stored sequentially, it is used for batch data processing, and the output data is also stored sequentially. Lookup capability is limited to the key-based access that a few storage formats support; some formats also support a degree of indexing and partitioning so that data can be located as quickly as possible. Because this data is typically consumed by batch processing, the batch process runs close to the data so that data movement is minimized, taking advantage of data locality. This is one of the key reasons MapReduce processing in the Hadoop ecosystem is fast.
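The partitioning idea above can be sketched briefly. The partition key (date), record shape, and values are assumptions for illustration; the point is that a batch job can skip whole blocks (partition pruning) and then read the matching block sequentially:

```python
# Hypothetical raw records, partitioned by an assumed "date" key.
records = [
    {"date": "2024-01-01", "value": 10},
    {"date": "2024-01-01", "value": 20},
    {"date": "2024-01-02", "value": 30},
]

def partition_by(records, key):
    """Group records into partitions keyed by the given field."""
    partitions = {}
    for rec in records:
        partitions.setdefault(rec[key], []).append(rec)
    return partitions

partitions = partition_by(records, "date")

def scan(partitions, date):
    """Partition pruning: read only the matching block, sequentially."""
    return [r["value"] for r in partitions.get(date, [])]

print(scan(partitions, "2024-01-01"))  # -> [10, 20]
```

In a Hadoop-style system the partitions would be directories or file blocks on distributed storage, and the scan would be scheduled on the nodes that already hold those blocks, which is the data-locality point made above.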
- Storage-Based Classification: While the distinctions above relate to access patterns, data stores can also be classified by their storage mechanisms.
Figure 11: Data Stores