Resulting context
The primary benefit of this solution is that it facilitates our goal of creating an asynchronous, message-driven system, while delegating the heavy lifting of running both an append-only, sharded database and stream processing clusters to the cloud provider. This, in turn, allows the team to focus on architecting an appropriate topology of streams to ensure proper bulkheads. When defining your stream topology, you should consider creating separate streams for different user communities. For example, a front-office stream would ensure customers are not impacted by a failure in the back-office stream used by employees.
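As a minimal sketch of such a topology, the following provisions the front-office and back-office streams as distinct resources using the AWS CDK and AWS Kinesis; the shard counts and retention period are illustrative assumptions, not recommendations.

```typescript
import { App, Duration, Stack } from 'aws-cdk-lib';
import { Stream } from 'aws-cdk-lib/aws-kinesis';

// Hypothetical stack that provisions separate streams as bulkheads, so that a
// failure in the back-office stream cannot impact customers.
class StreamTopologyStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    new Stream(this, 'FrontOfficeStream', {
      streamName: 'front-office-stream', // customer-facing events
      shardCount: 2,
      retentionPeriod: Duration.hours(24),
    });

    new Stream(this, 'BackOfficeStream', {
      streamName: 'back-office-stream', // employee-facing events
      shardCount: 1,
      retentionPeriod: Duration.hours(24),
    });
  }
}

new StreamTopologyStack(new App(), 'stream-topology');
```

Because the streams are separate physical resources, throttling or failure in one bulkhead cannot spill over into the other.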
Event Stream services are typically regional and do not support cross-region replication. This is actually a good thing, as a replicated event would ultimately be processed in multiple regions and most likely produce unintended results. It is the results of event processing that we want to replicate, and this replication is handled by each cloud-native database as appropriate. Instead, we just want to embrace eventual consistency and know that any events currently in the stream of the affected region will eventually be processed. In practice, during an outage these events will likely make their way through the system in fits and spurts as the region recovers, while all new user interactions, and thus new events, will have failed over to another region. A user experience that is naturally designed for eventual consistency would not distinguish between the different causes of inconsistency, such as intermittent latency, a regional outage, or a component failure. We will discuss these topics in Chapter 4, Boundary Patterns.
This leads to some of the interesting challenges of stream processing. Streams typically support at-least-once delivery, which means the smart endpoints need to be idempotent to account for multiple deliveries. Events will inevitably arrive out of order, such as when there is a fault and events are delegated to another component for error handling and eventually resubmitted. When target databases begin to throttle, we need to apply backpressure to reduce wasted processing. These topics, along with the benefits of micro-batches and functional reactive stream processing, are covered in the Stream Circuit Breaker pattern.
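As a concrete illustration of the first point, here is a minimal sketch of an idempotent consumer, assuming each event carries a unique id in its envelope; the in-memory set stands in for a durable deduplication store, such as a conditional write against a database table keyed by event id.

```typescript
// Stand-in for a durable deduplication store keyed by event id.
const processed = new Set<string>();

// Hypothetical business logic for a single event.
async function handleEvent(event: { id: string; type: string }): Promise<void> {
  // apply the event to this component's own state
}

// At-least-once delivery means the same event may arrive more than once, so
// the consumer checks for prior processing before applying the event.
export async function consume(events: { id: string; type: string }[]): Promise<void> {
  for (const event of events) {
    if (processed.has(event.id)) {
      continue; // duplicate delivery; safely ignore
    }
    await handleEvent(event);
    processed.add(event.id); // record success only after processing completes
  }
}
```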
It bears reiterating that while producers are not coupled to specific consumers, the consumers are coupled to the event definitions. Thus, it is important to have stable, published interfaces for the event definitions that are produced and consumed by each component. A standard event envelope format should be extended by all event types to allow for consistent processing of all events by all consumers. In the Trilateral API pattern, we will discuss publishing and testing these interfaces. At a minimum, keep in mind that changes to these interfaces need to be backward compatible in the same ways as synchronous interfaces.
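The following is a minimal sketch of what such an envelope might look like in TypeScript; the field names are illustrative assumptions rather than a published standard.

```typescript
// A standard envelope that every event type extends, so that all consumers can
// process all events consistently.
export interface EventEnvelope {
  id: string;           // unique event id, used for idempotency
  type: string;         // logical event type, such as 'order-submitted'
  timestamp: number;    // epoch milliseconds recorded by the producer
  partitionKey: string; // routes related events to the same shard to preserve order
}

// Each event type extends the envelope with its own payload.
export interface OrderSubmitted extends EventEnvelope {
  type: 'order-submitted';
  order: {
    id: string;
    total: number;
  };
}
```

Adding a new optional field to the envelope or to a payload is a backward compatible change; renaming or removing a field is not.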
It is recommended that you do not equate event types to topics, one for one. In some services, this is not even possible, because a stream is just a single pipe or channel with no concept of topics, while other services do have a distinct concept of a topic. In all the messaging systems I have ever used, I have always multiplexed event types through an architected topology consisting of a controlled number of channels or topics. This kind of inversion makes message-driven systems more manageable. For example, a system can easily have dozens, potentially hundreds, of event types. If we treat each event type as a physical construct that must be provisioned, then this can easily get out of hand. In practice, event types are logical concepts that naturally coalesce into cohesive groups that should flow through a specific channel. So if your cloud provider's streaming service does not have a concept of topics, then that is not a problem. If your streaming service does have topics, then take care to utilize them as a grouping construct. In all cases, we want to provision multiple, distinct, and separate streams at a sufficiently coarse level to ensure that we have proper bulkheads, as mentioned previously. Producers should preferably emit all their events to just one of these streams. Consumers will need to process events from one or more streams, but typically not all of the streams. The data lake, as we will discuss in its own pattern, is one consumer that does consume from all streams.
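To make the inversion concrete, the following is a minimal sketch of multiplexing many event types through a single stream; the event type names and handler bodies are illustrative assumptions.

```typescript
// Every event shares at least a type field; everything else is event specific.
export type StreamEvent = { type: string; [key: string]: unknown };
type Handler = (event: StreamEvent) => Promise<void>;

// Dozens of logical event types can flow through one physical stream; each
// consumer registers handlers only for the types it cares about.
const handlers: Record<string, Handler> = {
  'order-submitted': async (event) => { /* update a materialized view, etc. */ },
  'order-shipped': async (event) => { /* notify the customer, etc. */ },
};

export async function dispatch(events: StreamEvent[]): Promise<void> {
  for (const event of events) {
    const handler = handlers[event.type];
    if (handler) {
      await handler(event); // event types without a handler are simply ignored
    }
  }
}
```

Because consumers simply ignore the event types they do not handle, new event types can be added to a stream without impacting existing consumers.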
One particular way that cloud-streaming services differ from self-managed solutions is in storage capacity. Cloud-streaming services manage the storage for you, and the capacity is a function of the ingestion limits over a retention period of typically one to seven days. This is important to note, because after the specified number of days the events are purged. Self-managed solutions do not impose such retention limits, and the typical practice is to keep events forever, so that the stream database itself becomes the data lake. This has some significant and obvious operational implications and risks. The cloud-streaming service approach is the much more manageable and low-risk solution when combined with the Data Lake pattern discussed later in this chapter. For now, know that the Data Lake pattern consumes all events from all streams and stores them in extremely durable blob storage in perpetuity. There is also a concern with cloud-streaming's fixed retention period in the context of error handling and the potential for losing events when errors are not handled in a timely manner. We will discuss this topic in the Stream Circuit Breaker pattern, but keep in mind that the fail-safe is that all events are stored in the data lake and can be replayed in the case of an unforeseen mishap.
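For illustration only, here is a minimal sketch of such a data lake consumer, assuming a Lambda function subscribed to a Kinesis stream and writing to an S3 bucket; the bucket name and key scheme are placeholders.

```typescript
import { PutObjectCommand, S3Client } from '@aws-sdk/client-s3';
import type { KinesisStreamEvent } from 'aws-lambda';

const s3 = new S3Client({});

// Writes every raw event to blob storage, partitioned by the record's
// partition key and sequence number.
export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const body = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    await s3.send(
      new PutObjectCommand({
        Bucket: 'my-data-lake', // hypothetical, extremely durable blob storage
        Key: `events/${record.kinesis.partitionKey}/${record.kinesis.sequenceNumber}.json`,
        Body: body,
      }),
    );
  }
};
```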
A perceived way that cloud-streaming services differ from self-managed solutions is in the number of supported concurrent reads per stream. Cloud-streaming services will throttle concurrent reads over a specified limit. The same thing happens in a self-managed solution when the cluster is underpowered, but in a more subtle, yet no less significant, way. However, cloud-streaming services are priced and provisioned based on throughput, not on cluster size, so their throttling limits are explicit. Unfortunately, too much emphasis is placed on these throttling metrics. They are an important indicator, with regard to ingress throughput, of whether or not additional shards should be added. However, they are misleading with regard to egress throughput. In fact, read throttling tends to increase when ingress throughput is low and decrease when ingress throughput is high. This is because at low ingress, the readers' polling frequencies tend to line up and produce a sort of tidal wave of requests that exceeds the limits, whereas at higher ingress, the natural processing latency of the individual consumers tends to cause their polling to weave below the limits.
Instead, for each consumer, it is better to monitor its iterator age. This is a measure of how long a consumer leaves events in the stream before processing them. The iterator age is generally more a factor of the combination of batch size, processing latency, and shard count than the result of any latency caused by read throttling. The vast majority of use cases are designed to assume eventual consistency, so the potential for hundreds of milliseconds, or even several seconds, of additional intermittent latency is insignificant. Use cases that are sensitive to any latency should be accounted for when architecting the stream topology and given dedicated streams.
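As one possible approach on AWS, the following CDK sketch raises a CloudWatch alarm on the stream's iterator age metric; the stream name, threshold, and evaluation periods are illustrative assumptions.

```typescript
import { App, Duration, Stack } from 'aws-cdk-lib';
import { Alarm, Metric } from 'aws-cdk-lib/aws-cloudwatch';

// Alarms when the oldest unprocessed event has been sitting in the stream for
// more than a minute across three consecutive evaluation periods.
class IteratorAgeAlarmStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    new Alarm(this, 'IteratorAgeAlarm', {
      metric: new Metric({
        namespace: 'AWS/Kinesis',
        metricName: 'GetRecords.IteratorAgeMilliseconds',
        dimensionsMap: { StreamName: 'front-office-stream' },
        statistic: 'Maximum',
        period: Duration.minutes(1),
      }),
      threshold: 60_000,
      evaluationPeriods: 3,
    });
  }
}

new IteratorAgeAlarmStack(new App(), 'iterator-age-alarm');
```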
Each cloud provider has its own event streaming implementation, for example, AWS Kinesis and Azure Event Hubs. Each is integrated with the provider's function-as-a-service offering, such as AWS Lambda and Azure Functions. More and more value-added cloud services emit their own events and integrate with function-as-a-service. We will discuss how to integrate these services into the event stream in the External Service Gateway pattern in Chapter 4, Boundary Patterns.
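To show the shape of this integration on AWS, here is a minimal sketch of a Lambda listener function subscribed to a Kinesis stream; it assumes the hypothetical dispatch router sketched earlier in this section.

```typescript
import type { KinesisStreamEvent } from 'aws-lambda';
import { dispatch } from './dispatch'; // the hypothetical router sketched earlier

// Lambda invokes the listener with a micro-batch of records; each record's
// payload arrives base64 encoded and is decoded before routing by event type.
export const listener = async (event: KinesisStreamEvent): Promise<void> => {
  const events = event.Records.map((record) =>
    JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8')),
  );
  await dispatch(events);
};
```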