Batch layer - batch processing of ingested data
The batch layer of a Lambda Architecture processes the ingested data in batches, which makes efficient use of system resources. Because the work is done in batches, long-running operations can also be applied to the data to produce high-quality output, known as modeled data. Converting raw data into modeled data is the primary responsibility of this layer; the modeled data is the data model that the serving layers of the Lambda Architecture can serve. The primary specifications for this layer are as follows:
- The batch layer must be able to apply data cleaning, data processing, and data modeling algorithms to the ingested raw data
- It must have mechanisms in place to replay/rerun the batches for recovery purposes
- The batch layer must be able to support machine learning and data science based processing of the raw ingested data to produce high-quality modeled data
- This layer may also have to perform other operations that improve the overall quality of the modeled data, such as de-duplicating records, detecting erroneous data, and providing a view of the data lineage
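
The specifications above can be illustrated with a small sketch. This is not tied to any particular batch framework; it is a minimal plain-Python example, and the record fields (`event_id`, `user_id`, `amount`), the `ModeledRecord` type, and the `process_batch` function are all hypothetical names chosen for illustration. It shows one batch being cleaned, de-duplicated, and checked for erroneous rows, with each modeled record carrying basic lineage back to its raw batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeledRecord:
    """Hypothetical modeled record produced from one raw row."""
    event_id: str
    user_id: str
    amount: float
    source_batch: str  # lineage: raw batch this record came from
    source_line: int   # lineage: row position inside that batch


def process_batch(batch_id, raw_rows):
    """Convert one raw batch into (modeled records, rejected rows).

    The function only depends on its inputs, so rerunning it on the
    same raw batch (a replay) yields the same result.
    """
    modeled, rejected, seen_ids = [], [], set()
    for line_no, row in enumerate(raw_rows):
        # Data cleaning: normalize identifiers.
        event_id = str(row.get("event_id", "")).strip()
        user_id = str(row.get("user_id", "")).strip()
        if not event_id or not user_id:
            rejected.append((line_no, "missing id"))
            continue
        # Erroneous data detection: amount must be numeric.
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            rejected.append((line_no, "bad amount"))
            continue
        # De-duplication: keep the first row seen for each event_id.
        if event_id in seen_ids:
            continue
        seen_ids.add(event_id)
        modeled.append(ModeledRecord(event_id, user_id, amount, batch_id, line_no))
    return modeled, rejected


if __name__ == "__main__":
    raw = [
        {"event_id": "e1", "user_id": " u1 ", "amount": "10.5"},
        {"event_id": "e1", "user_id": "u1", "amount": "10.5"},  # duplicate
        {"event_id": "e2", "user_id": "u2", "amount": "oops"},  # erroneous
    ]
    modeled, rejected = process_batch("batch-001", raw)
    print(modeled)
    print(rejected)
```

Keeping the transformation a pure function of the raw batch is one simple way to satisfy the replay requirement: as long as the raw data is retained, a failed or incorrect batch can be recomputed from scratch. A real batch layer would typically run this kind of logic on a distributed engine rather than in a single process.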