Machine Learning with Apache Spark Quick Start Guide

上QQ阅读APP看书，第一时间看更新

MapReduce

MapReduce is an example of a distributed data processing paradigm capable of processing big data in parallel across a cluster of nodes. A MapReduce job splits a large dataset into independent chunks and consists of two stages—the first stage is the Map function that creates a map task for each range in the input, outputting a partitioned group of key-value pairs. The output of the map tasks then act as inputs to reduce tasks, whose job it is to combine and condense the relevant partitions in order to solve the analytical problem. Before beginning the map stage, data is often sorted or filtered based on some condition pertinent to the analysis being undertaken. Similarly, the output of the reduce function may be subject to a finalization function to further analyze the data.

Let's consider a simple example to bring this rather abstract definition to life. The example that we will consider is that of a word count. Suppose that we have a text file containing millions of lines of text, and we wish to count the number of occurrences of each unique word in this text file as a whole. Figure 1.8 illustrates how this analysis may be undertaken using the MapReduce paradigm:

Figure 1.8: Word count MapReduce program

In this example, the original text file containing millions of lines of text is split up into its individual lines. Map tasks are applied to ranges of those individual lines, splitting them into individual word tokens, in this case, using a whitespace tokenizer, and thereafter emitting a collection of key-value pairs where the key is the word.

A shuffling process is undertaken that transfers the partitioned key-value pairs emitted by the map tasks to the reduce tasks. Sorting of the key-value pairs, grouped by key, is also undertaken. This helps to identify when a new reduce task should start. To reduce the amount of data transferred from the map tasks to the reduce tasks during shuffling, an optional combiner may be specified that implements a local aggregation function. In this example, a combiner is specified that sums, locally, the number of occurrences of each key or word for each map output.

The reduce tasks then take those partitioned key-value pairs and reduce those values that share the same key, outputting new (but unsorted) key-value pairs unique by key. In this example, the reduce tasks simply sum the number of occurrences of that key. The final output of the MapReduce job in this case is simply a count of the number occurrences of each word across the whole original text file.

In this example, we used a simple text file that had been split up by a newline character that is then mapped to key-value pairs based on a whitespace tokenizer. But the same paradigm can easily be extended to distributed data stores, where large volumes of data have already been partitioned across a cluster, thereby allowing us to perform data processing on a huge scale.