Machine Learning with Apache Spark Quick Start Guide


Artificial intelligence and machine learning

We have discussed how distributed systems can be employed to store, model, and process huge amounts of structured, semi-structured, and unstructured data, while providing horizontal scalability, fault tolerance, resilience, high availability, consistency, and high throughput. However, two other fields of study have risen to prominence, seemingly in conjunction with the rise of big data: artificial intelligence and machine learning.

But why have these fields of study, the underlying mathematical theories of which have been around for decades, and even centuries in some cases, risen to prominence at the same time as big data? The answer to this question lies in understanding the benefits offered by this new breed of technology.

Distributed systems allow us to consolidate, aggregate, transform, process, and analyze vast volumes of previously disparate data. Consolidating these disparate datasets allows us to infer insights and uncover hidden relationships that were previously impossible to find. Furthermore, cluster computing, such as that offered by distributed systems, combines many hardware and software resources into a single, more powerful logical unit that can be assigned to solve complex computational tasks, such as those inherent to artificial intelligence and machine learning. Today, by combining these features, we can efficiently run advanced analytical algorithms and ultimately provide actionable insights at a level and breadth never before seen in many mainstream industries.

Apache Spark's machine learning library, MLlib, and TensorFlow are examples of libraries that have been developed to allow us to quickly and efficiently engineer and execute machine learning algorithms as part of analytical processing pipelines.

In Chapter 3, Artificial Intelligence and Machine Learning, we will discuss some of the high-level concepts behind common artificial intelligence and machine learning algorithms, as well as the logical architecture behind Apache Spark's machine learning library MLlib. Thereafter, in Chapter 4, Supervised Learning Using Apache Spark, through to Chapter 8, Real-Time Machine Learning Using Apache Spark, we will develop advanced analytical models with MLlib using real-world use cases, while exploring their underlying mathematical theory.

To learn more about MLlib and TensorFlow, please visit https://spark.apache.org/mllib/ and https://www.tensorflow.org/ respectively.