Context, problem, and forces
In our cloud-native system, we have chosen to leverage value-added cloud services to implement our event streaming and our databases. This empowers self-sufficient, full-stack teams to focus their efforts on the requirements of their components and delegate the complexity of operating these services to the cloud provider. We have also architected a topology of multiple event streams to connect our producer components to our consumer components, which provides proper bulkheads so that a disruption in one stream does not impact the entire system. One side effect of using a cloud streaming service is that these services only retain events in the stream for a short period of time, usually one to seven days.
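To make that retention constraint concrete, suppose the streaming service is AWS Kinesis (an assumption for illustration; the pattern is provider-agnostic). The retention window is an explicit, bounded setting, as this minimal boto3 sketch with a hypothetical stream name shows:

```python
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name, for illustration only.
STREAM_NAME = "order-events"

# Kinesis streams default to 24 hours of retention; this raises it to
# the seven-day (168-hour) window mentioned above.
kinesis.increase_stream_retention_period(
    StreamName=STREAM_NAME,
    RetentionPeriodHours=168,
)
```

Even at the top of that window, retention is measured in days, which is what forces the hand-off to more durable storage described below.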
In a reactive, cloud-native system, we are effectively turning the database inside out and leveraging the stream of events emitted by components as the ultimate transaction log and system of record for the entire system. As such, we need to maintain a complete record of all events ever emitted by the components of the system. There can be no loss of fidelity in the event information, the storage must be highly durable with replication for disaster recovery, and the data must be reasonably accessible for analysis and replay.
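To make "no loss of fidelity" concrete, here is a sketch of a hypothetical event envelope (the field names are illustrative, not prescribed by the pattern). The payload is stored exactly as the producer emitted it, and the envelope carries enough metadata to order and replay events later:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass(frozen=True)
class Event:
    """Illustrative full-fidelity event envelope.

    The payload is preserved verbatim, never transformed, so the
    archived history can serve as the system's transaction log of record.
    """
    type: str                  # e.g. "order-submitted"
    partition_key: str         # preserves ordering context for replay
    payload: Dict[str, Any]    # domain data exactly as the producer emitted it
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: int = field(default_factory=lambda: int(time.time() * 1000))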
The components of the system are designed to consume events from the streaming service. Once historical events have aged out of the streams, there must be another mechanism that allows events to be replayed from the data lake rather than from a stream. Furthermore, the hand-off from the streams to the data lake is critical to system integrity and thus cannot drop a single event. A single data lake should also collect the events from all the streams in the topology so that there is only one source of truth for all events in the system.
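One way to implement the hand-off, sketched here under the assumption of an AWS Lambda function consuming a Kinesis stream and writing to an S3-based lake (the bucket name and envelope fields are hypothetical): the function persists every record in the batch before returning, and lets any exception propagate so the streaming service redelivers the batch instead of dropping events.

```python
import base64
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name for the single data lake.
DATA_LAKE_BUCKET = "system-event-lake"

def handler(event, context):
    """Persist every Kinesis record in the batch to the data lake.

    Any exception propagates, so the batch is redelivered and retried
    rather than acknowledged; the hand-off never drops an event.
    """
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        body = json.loads(raw)  # assumes the envelope carries an "id" field
        stream = record["eventSourceARN"].split("/")[-1]
        # A flat, per-stream key; a partitioned layout that aids
        # cataloging is sketched below.
        key = f"{stream}/{record['kinesis']['sequenceNumber']}-{body['id']}.json"
        s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=raw)
```

Replay then becomes a read path over the same bucket: list the keys for the desired stream and time range, and re-publish their contents to a stream or feed them directly to the component being repaired or populated.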
As the source of truth, it is also critical that the data is stored securely and that the data is cataloged, indexed, and searchable. The data lake is the source of knowledge that can be leveraged to replay events to repair components, populate new components, and support data science activities. But with this power comes the responsibility of properly controlling access to this wealth of knowledge and protecting privacy.
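Cataloging and access control both become easier when the lake's layout encodes searchable dimensions. A hypothetical key scheme (the prefixes are illustrative): partitioning objects by stream, event type, and date lets queries prune by prefix, and lets access policies be scoped to exactly the slices a team or tool is permitted to read rather than to the whole lake.

```python
from datetime import datetime, timezone

def lake_key(stream: str, event_type: str, event_id: str,
             timestamp_ms: int) -> str:
    """Build a partitioned object key for the data lake.

    Example: "order-stream/type=order-submitted/dt=2020-05-01/9f3c.json"
    Date and type prefixes keep the lake indexable and searchable, and
    allow prefix-scoped policies to enforce least-privilege access.
    """
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{stream}/type={event_type}/dt={dt:%Y-%m-%d}/{event_id}.json"
```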