Context, problem, and forces
With cloud-native systems, we want to enable everyday companies and empower self-sufficient, full-stack teams to rapidly, continuously, and confidently deliver global-scale systems. Modern high-performance, horizontally scalable, sharded databases are critical to achieving global scalability. These databases come in many variations, each specialized for particular workload characteristics. Cloud-native systems are composed of many bounded, isolated components. Each component has its own workload characteristics, which necessitates polyglot persistence, whereby many different types of databases are employed to support those characteristics.
To be truly isolated, cloud-native components need to control their own persistence. Each bounded, isolated component needs to have its own dedicated databases to ensure proper bulkheads. Proper data-level bulkheads help ensure that user interactions with components are responsive, that components are resilient to failures in upstream and downstream components, and that components can scale independently. Individual components, not just systems, can benefit from polyglot persistence. Thus, components need more than just their own dedicated database; each component may need multiple, dedicated databases. Mature cloud-native systems are also multi-regional and replicate data across regions. Data life cycle management is necessary as well. For example, components that provide online transaction processing (OLTP) features benefit from continuously purging stale data to keep their databases lean. Furthermore, as the data in a system ages and moves through different life cycle stages, it needs to propagate across components and across different types of databases.
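As an illustration of continuous purging, here is a minimal sketch assuming a component that stores its OLTP data in an Amazon DynamoDB table and relies on DynamoDB's TTL feature to delete expired items automatically. The table name (orders) and the expires_at attribute are hypothetical, chosen only for this example.

```python
# A minimal sketch of continuous data purging using DynamoDB's TTL feature.
# The table name ("orders") and attribute name ("expires_at") are illustrative.
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL so DynamoDB automatically deletes items whose "expires_at"
# epoch timestamp has passed, keeping the OLTP table lean.
dynamodb.update_time_to_live(
    TableName="orders",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write an order that becomes eligible for purging after 90 days.
ninety_days = 90 * 24 * 60 * 60
dynamodb.put_item(
    TableName="orders",
    Item={
        "order_id": {"S": "order-123"},
        "status": {"S": "FULFILLED"},
        "expires_at": {"N": str(int(time.time()) + ninety_days)},
    },
)
```

With this approach the component team, not a central operations team, owns the retention policy, and the purge work happens continuously instead of in large batch jobs.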
Achieving the proper level of database isolation, though critical, is extremely complicated. Running a single database in the cloud across multiple availability zones is complex; running a sharded database in the cloud across availability zones is far more complex; and running multiple sharded databases in the cloud can simply require more effort than most companies can afford. Going further, running many sharded databases, multiplied by many components, can seem unattainable. At this point in the typical discussion, there is no point in piling on the critical regional replication dimension. The workforce costs and runtime costs of all these dimensions have the potential to be astronomical. Altogether, this is a significant barrier to entry for proper data-level bulkheads. It tends to force teams into a shared database model, despite the significant and potentially catastrophic pitfalls of the shared model.
What is worse, the shared database model actually appears to be successful at first, at least for the initial components that use these resources. However, as more and more components leverage the shared resources, the responsiveness and scalability of all the components suffer, and the impact of a failure becomes more significant. These shared databases also tend to grow unchecked because component teams are not responsible for the full stack; centralized operations teams own the shared databases. As a result, component teams neglect data life cycle concerns, and the operations teams lack the functional knowledge to archive unneeded data, leading to even worse performance.
Vendor lock-in is a classic database concern. In traditional monolithic applications, object-relational mapping (ORM) tools are leveraged to make applications database agnostic. In cloud-native systems, the startup overhead of these tools can be too much for short-lived component instances. Instead, the tendency is to adopt open source databases that can be deployed across multiple cloud providers. However, the learning curve for operating these databases is significant. Each database product can add a good six months or more to a project before a team is confident in its ability to run these workloads in production. This further increases the tendency toward a shared database model. The effort required to move up the learning curve also tends to lock systems into the database technology decisions made early in the system design, even after there is ample evidence that better alternatives exist.
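One lighter-weight alternative to a full ORM is to hide a component's database behind a thin, hand-rolled repository port, so that swapping databases later means writing a new adapter rather than rewriting the component. The following sketch illustrates the idea; the names (OrderRepository, DynamoDbOrderRepository) and the table layout are hypothetical.

```python
# A minimal sketch of a hand-rolled repository port, as a lightweight
# alternative to a full ORM for short-lived component instances.
# All names here are illustrative, not a prescribed design.
from typing import Optional, Protocol

import boto3


class OrderRepository(Protocol):
    """The port: the only persistence contract the component depends on."""

    def get(self, order_id: str) -> Optional[dict]: ...
    def put(self, order: dict) -> None: ...


class DynamoDbOrderRepository:
    """One adapter per database technology; a later database swap means
    writing a new adapter, not rewriting the component."""

    def __init__(self, table_name: str) -> None:
        self._table = boto3.resource("dynamodb").Table(table_name)

    def get(self, order_id: str) -> Optional[dict]:
        response = self._table.get_item(Key={"order_id": order_id})
        return response.get("Item")

    def put(self, order: dict) -> None:
        self._table.put_item(Item=order)
```

The adapter is a few dozen lines per database, carries no ORM startup cost, and keeps the component's business logic agnostic of the storage choice.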
The lack of distributed transactions based on two-phase commit is another concern. As a matter of course, modern databases favor the BASE model for distributed transactions: Basically Available, Soft state, Eventually consistent. The individual writes to each database follow the traditional ACID model, but a mechanism is needed to achieve eventual consistency across databases.
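One well-known mechanism, sketched below, is the transactional outbox pattern: the state change and its outbound event commit together in a single ACID transaction against the component's own database, and a relay later delivers the event to downstream components, which update their own databases and eventually converge. The table and event names are illustrative, and sqlite3 merely stands in for a component's dedicated database.

```python
# A minimal transactional-outbox sketch: one technique (not the only one)
# for achieving eventual consistency across databases. The local write and
# the outbox record succeed or fail together in one ACID transaction; a
# relay later publishes the pending events to other components.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (event_id INTEGER PRIMARY KEY, payload TEXT)")

# ACID write: the state change and its outbound event commit atomically.
with db:
    db.execute("INSERT INTO orders VALUES (?, ?)", ("order-123", "SUBMITTED"))
    db.execute(
        "INSERT INTO outbox (payload) VALUES (?)",
        (json.dumps({"type": "OrderSubmitted", "order_id": "order-123"}),),
    )

# Relay loop (normally a separate process): publish each pending event,
# then remove it. Downstream components consume the events and update
# their own databases, converging eventually.
pending = db.execute("SELECT event_id, payload FROM outbox").fetchall()
for event_id, payload in pending:
    print("publishing", payload)  # stand-in for a real message bus publish
    db.execute("DELETE FROM outbox WHERE event_id = ?", (event_id,))
db.commit()
```

Each write remains ACID within its own database, while the relay and event stream provide the cross-database consistency mechanism that BASE requires.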