Data acquisition layer
In Chapter 2, Comprehensive Concepts of a Data Lake, you got a glimpse of the data acquisition layer. This layer's responsibility is to gather information from various source systems and ingest it into the data lake. The following figure will refresh your memory and give you a good pictorial view of this layer:
Figure 01: Data lake - data acquisition layer
The acquisition layer should be able to handle the following:
- Bulk data: Bulk data arrives in regular batches or micro-batches, as the case may be. Sqoop can handle huge amounts of bulk data and integrates with the data stores of legacy applications residing in a traditional RDBMS. Micro-batch refers to more frequent bulk loads with fewer records to handle in each load. Sqoop is not the right choice here; Apache Flume (discussed in detail in the subsequent chapters, as we do have cases that require it) is a more apt choice.
- High-velocity data: Data varying from a few megabytes to terabytes, arriving in regular batches and micro-batches, needs to be handled by this layer efficiently and without bottlenecks. One aspect is the speed at which the data arrives (micro-batches can arrive more frequently and at irregular times, whereas regular batches arrive at a specified time interval), and another is the volume of data coming into this layer.
- Different formats of data (disparate data): Different file formats (XML, JSON, TEXT, and so on) and different structured and unstructured data formats. Non-relational formats, such as various kinds of binary data, from sources such as IoT sensors, server logs, machine-generated logs, image data, video data, and so on also have to be handled efficiently.
- Structured/unstructured data: The previous point covered this aspect, but it deserves a separate mention because of its significance. The layer also has to cater to semi-structured data, which falls in between structured and unstructured data. Chapter 1, Introduction to Data, covered these data types in a bit more detail, so we won't repeat that here.
- Integration with diverse technologies and systems: With different types of business applications and internet applications available in the enterprise, this layer has to integrate easily with diverse technologies, applications, and data stores, and ingest their data into the data storage layer of our data lake.
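To make two of these requirements more concrete (disparate formats and micro-batches), here is a minimal Python sketch. It does not show how Sqoop or Flume work internally. The function names `parse_payload` and `micro_batches`, the envelope shape with `format` and `body` keys, and the sample inputs are all hypothetical and chosen purely for illustration. The sketch assumes each incoming payload is tagged with its format, normalizes it into a common envelope, and groups the resulting records into small fixed-size batches.

```python
import json
import xml.etree.ElementTree as ET
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def parse_payload(fmt: str, payload: str) -> Dict[str, Any]:
    """Normalize one raw payload into a common envelope (hypothetical shape)."""
    fmt = fmt.lower()
    if fmt == "json":
        body = json.loads(payload)
    elif fmt == "xml":
        root = ET.fromstring(payload)
        # Flatten only the direct children of the root element.
        body = {child.tag: child.text for child in root}
    elif fmt == "text":
        body = {"line": payload.rstrip("\n")}
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return {"format": fmt, "body": body}


def micro_batches(
    records: Iterable[Dict[str, Any]], size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Illustrative inputs only: a sensor reading, a server event, and a log line.
raw_inputs = [
    ("json", '{"sensor": "t1", "value": 21.5}'),
    ("xml", "<event><host>web01</host><status>200</status></event>"),
    ("text", "2017-01-01 12:00:00 INFO started\n"),
]
records = [parse_payload(fmt, payload) for fmt, payload in raw_inputs]
batches = list(micro_batches(records, size=2))
```

In a real acquisition layer, dedicated tools would take over the parsing and batching. The sketch only illustrates why a common record envelope makes it easier to pass disparate inputs on to the data storage layer.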