Machine Learning with Apache Spark Quick Start Guide
上QQ阅读APP看书,第一时间看更新

Distributed messaging

Continuing our journey through distributed systems, the next category that we will discuss is distributed messaging systems. Typically, real-world IT systems are, in fact, a collection of distinct applications, potentially written in different languages using different underlying technologies and frameworks, that are integrated with one another. In order for messages to be sent between distinct applications, developers could potentially code the consumption logic into each individual application. This is a bad idea however—what happens if the type of message sent by an upstream application changes? In this case, the consumption logic would have to be rewritten, relevant applications updated, and the whole system retested.

Messaging systems, such as Apache ActiveMQ and RabbitMQ, overcome this problem by providing a middleman called a message broker. Figure 1.10 illustrates how message brokers work at a high level:

Figure 1.10: Message broker high-level overview

At a high level, Producers are applications that generate and send messages required for the functionality of the system. The Message Broker receives these messages and stores them inside queue data structures or buffers. Consumer applications, which are applications designed to process messages, subscribe to the message broker. The message broker then delivers these messages to the Consumer applications which consume and process them. Note that a single application can be a producer, consumer, or both.

Distributed messaging systems, which is one use case of Apache Kafka, extend traditional messaging systems by being able to partition and scale horizontally, while offering high throughput, high performance, fault tolerance, and replication, like many other distributed systems. This means that messages are never lost, while offering the ability to load balance requests and provide ordering guarantees. We will discuss Apache Kafka in more detail next, but in the context of a distributed streaming platform for real-time data.