Horizontal scaling

First of all, let's return to some of the data-centric problems that we described earlier. Given the explosion in the creation and consumption of data today, we clearly cannot keep adding CPUs, memory, and/or hard drives to a single machine (in other words, vertical scaling). If we did, we would very quickly reach a point where migrating to more powerful hardware delivers diminishing returns while incurring significant costs. Furthermore, the ability to scale would be physically bounded by the biggest machine available to us, thereby limiting the growth potential of an organization.

Horizontal scaling, of which sharding is an example, is the process by which we increase or decrease the computational resources available to us by adding or removing hardware and/or software. Typically, this involves adding (or removing) servers, or nodes, to a cluster of nodes. Crucially, however, the cluster acts as a single logical unit at all times, meaning that it continues to function and process requests even as resources are added to it or taken away. The difference between horizontal and vertical scaling is illustrated in Figure 1.2:

Figure 1.2: Vertical scaling versus horizontal scaling
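To make this concrete, the following is a minimal, hypothetical sketch of hash-based sharding, one common form of horizontal scaling. The ShardedStore class and the node names are illustrative assumptions, not part of Spark or any real library; the point is simply that callers interact with one logical store while the data itself is partitioned across several nodes.

```python
import zlib

# A toy sharded key-value store. Each node holds its own partition
# (shard) of the data, but callers see a single logical store.

class ShardedStore:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        # One dictionary per node stands in for that node's local storage.
        self.shards = {node: {} for node in self.nodes}

    def _node_for(self, key):
        # Deterministically route every key to exactly one node.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key, value):
        self.shards[self._node_for(key)][key] = value

    def get(self, key):
        return self.shards[self._node_for(key)].get(key)

# The caller addresses one logical unit, regardless of how many nodes back it.
store = ShardedStore(["node-1", "node-2", "node-3"])
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # {'name': 'Ada'}
```

Note that this naive modulo routing would remap most keys whenever a node is added or removed; production systems typically use consistent hashing instead, so that resizing the cluster moves only a small fraction of the data between nodes.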