Machine Learning with Apache Spark Quick Start Guide

Data processing layer

The data processing layer transforms, enriches, and validates the raw data gathered either from the persistent data store or directly from the ingestion layer. It models the data according to downstream business and analytical requirements and prepares it either for persistence in the serving data storage layer or for processing by data intelligence applications. Again, the data processing layer must be capable of processing both batch data and stream-based event data. Examples of open source technologies used to implement the data processing layer include the following:

Apache Hive
Apache Spark, including Spark Streaming (DStreams) and Structured Streaming
Apache Kafka (through its Kafka Streams library)
Apache Storm
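
To make the three responsibilities described above more concrete, the following is a minimal sketch in plain Python rather than in any of the listed technologies. It assumes a hypothetical record shape (`user_id`, `amount`, `country`) and a hypothetical `COUNTRY_NAMES` lookup table, none of which come from the text; the function names `validate`, `transform`, `enrich`, and `process` are likewise illustrative only. The point it tries to show is that the same validation, transformation, and enrichment logic can be applied to a finite batch and to an event-by-event stream.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference data used for enrichment (an assumption for this sketch).
COUNTRY_NAMES = {"GB": "United Kingdom", "FR": "France"}


def validate(record: Dict[str, str]) -> bool:
    """Validation: accept only records with a user id and a numeric amount."""
    if not (record.get("user_id") or "").strip():
        return False
    try:
        float(record.get("amount", ""))
    except (TypeError, ValueError):
        return False
    return True


def transform(record: Dict[str, str]) -> Dict[str, object]:
    """Transformation: cast types and normalise field values."""
    return {
        "user_id": record["user_id"].strip(),
        "amount": float(record["amount"]),
        "country": (record.get("country") or "").strip().upper(),
    }


def enrich(record: Dict[str, object]) -> Dict[str, object]:
    """Enrichment: attach a country name looked up from reference data."""
    enriched = dict(record)
    enriched["country_name"] = COUNTRY_NAMES.get(record["country"], "Unknown")
    return enriched


def process(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Lazily validate, transform, and enrich each record.

    Accepts any iterable, so a list (batch) and a generator (stream)
    go through exactly the same logic.
    """
    for raw in records:
        if validate(raw):
            yield enrich(transform(raw))


# Batch: a finite collection read from a (hypothetical) persistent store.
raw_batch = [
    {"user_id": "u1", "amount": "10.5", "country": "gb"},
    {"user_id": "", "amount": "3.0", "country": "fr"},    # rejected: no user id
    {"user_id": "u2", "amount": "abc", "country": "fr"},  # rejected: bad amount
]
batch_result = list(process(raw_batch))


# Stream: events arriving one at a time from a (hypothetical) ingestion layer.
def event_stream() -> Iterator[Dict[str, str]]:
    yield {"user_id": "u3", "amount": "2", "country": "fr"}
    yield {"user_id": "u4", "amount": "7.25", "country": "de"}


for event in process(event_stream()):
    print(event)
```

In a real deployment, an engine such as Apache Spark would distribute this kind of logic across a cluster and handle fault tolerance; the sketch only illustrates the shape of the work the layer performs.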