Machine Learning with Apache Spark Quick Start Guide

上QQ阅读APP看书，第一时间看更新

CAP theorem

As discussed previously, distributed data stores allow us to store huge volumes of data while providing the ability to horizontally scale as a single logical unit at all times. Inherent to many distributed data stores are the following features:

Consistency refers to the guarantee that every client has the same view of the data. In practice, this means that a read request to any node in the cluster should return the results of the most recent successful write request. Immediate consistency refers to the guarantee that the most recent successful write request should be immediately available to any client.
Availability refers to the guarantee that the system responds to every request made by a client, whether that request was successful or not. In practice, this means that every client request receives a response regardless of whether individual nodes are non-functional.
Partition tolerance refers to the guarantee of resilience given a failure in inter-node network communication. In other words, in the event that there is a network failure between a particular node and another set of nodes, referred to as a network partition, the system will continue to function. In practice, this means that the system should have the ability to replicate data across the functional parts of the cluster to cater for intermittent network failures and in order to guarantee that data is not lost. Thereafter, the system should heal gracefully once the partition has been resolved.

The CAP theorem simply states that a distributed system cannot simultaneously be immediately consistent, available, and partition-tolerant. A distributed system can simultaneously only ever offer any two of the three. This is illustrated in Figure 1.7:

Figure 1.7: The CAP theorem

CA distributed systems offer immediate consistency and high availability, but are not tolerant to inter-node network failure, meaning that data could be lost. CP distributed systems offer immediate consistency and are resilient to network failure, with no data loss. However, they may not respond in the event of an inter-node network failure. AP distributed systems offer high availability and resilience to network failure with no data loss. However, read requests may not return the most recent data.

Distributed systems, such as Apache Cassandra, allow for the configuration of the level of consistency required. For example, let's assume we have provisioned an Apache Cassandra cluster with a replication factor of 3. In Apache Cassandra, a consistency configuration of ONE means that a write request is considered successful as soon as one copy of the data is persisted, without the need to wait for the other two replicas to be written. In this case, the system is said to be eventually consistent, as the other two replicas will eventually be persisted. A subsequent and immediate read request may either return the latest data if it is processed by the updated replica, or it may return outdated data if it is processed by one of the other two replicas that have yet to be updated (but will eventually be). In this scenario, Cassandra is an AP distributed system exhibiting eventual consistency and the tolerance of all but one of the replicas failing. It also provides the fastest performing system in this context.

A consistency configuration of ALL in Apache Cassandra means that a write request is considered successful only if all replicas are persisted successfully. A subsequent and immediate read request will always return the latest data. In this scenario, Cassandra is a CA distributed system exhibiting immediate consistency, but with no tolerance of failure. It also provides the slowest performing system in this context.

Finally, a consistency configuration of QUORUM in Apache Cassandra means that a write request is considered successful only when a strict majority of replicas are persisted successfully. A subsequent and immediate read request also using QUORUM consistency will wait until data from two replicas (in the case of a replication factor of 3) is received and, by comparing timestamps, will always return the latest data. In this scenario, Cassandra is also a CA distributed system exhibiting immediate consistency, but with the tolerance of a minority of the replicas failing. It also provides a median-performing system in this context.

Ultimately, in the real world, however, data loss is not an option for most business critical systems and a trade-off between performance, consistency and availability is required. Therefore, the choice tends to come down to either CP or AP distributed systems, with the winner driven by business requirements.