Machine Learning with Apache Spark Quick Start Guide

上QQ阅读APP看书，第一时间看更新

Distributed processing

We have seen how distributed data stores such as the HDFS and Apache Cassandra allow us to store and model huge volumes of structured, semi-structured, and unstructured data partitioned over horizontally scalable clusters providing fault tolerance, resilience, high availability, and consistency. But in order to provide actionable insights and to deliver meaningful business value, we now need to be able to process and analyze all that data.

Let's return to the traditional data processing scenario we described near the start of this chapter. Typically, the data transformation and analytical programming code written by an analyst, data engineer or software engineer (for example, in SQL, Python, R or SAS) would rely upon the input data to be physically moved to the remote processing server or machine where the code to be executed resided. This would often take the form of a programmatic query embedded inside the code itself, for example, a SQL statement via an ODBC or JDBC connection, or via flat files such as CSV and XML files moved to the local filesystem. Although this approach works fine for small- to medium-sized datasets, there is a physical limit bounded by the computational resources available to the single remote processing server. Furthermore, the introduction of flat files such as CSV or XML files, introduces an additional, and often unnecessary, intermediate data store that requires management and increases disk I/O.

The major problem with this approach, however, is that the data needs to be moved to the code every time a job is executed. This very quickly becomes impractical when dealing with increased data volumes and frequencies, such as the volumes and frequencies we associate with big data.

We therefore need another data processing and programming paradigm—one where code is moved to the data and that works across a distributed cluster. In other words, we require distributed processing!

The fundamental idea behind distributed processing is the concept of splitting a computational processing task into smaller tasks. These smaller tasks are then distributed across the cluster and process specific partitions of the data. Typically, the computational tasks are co-located on the same nodes as the data itself to increase performance and reduce network I/O. The results from each of the smaller tasks are then aggregated in some manner before the final result is returned.