Polyglot Persistence
Traditional databases were designed to scale vertically, so they are unable to meet the demands of global cloud applications. Modern databases are designed to leverage sharding, which allows them to scale horizontally by partitioning the data across multiple nodes. This improves responsiveness because it reduces contention for disk access; it provides resilience because the data is also stored redundantly across several machines; and it allows the database to scale elastically as demand grows by adding partitions. Yet sharding is not sufficient in and of itself. Many specialized databases have been built on top of sharding, each optimized for very specific workloads.
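To make the partitioning idea concrete, the following sketch shows how a shard key can be hashed to route each record to one of several nodes. It is a minimal illustration only; the node names and simple modulo routing are assumptions for the example, not any particular database's algorithm, and real systems typically use consistent hashing or range partitioning to ease re-sharding.

```typescript
import { createHash } from "crypto";

// Hypothetical set of shard nodes; a real database manages these internally.
const nodes = ["shard-0", "shard-1", "shard-2", "shard-3"];

// Route a record to a partition by hashing its shard key.
function routeToShard(shardKey: string): string {
  const digest = createHash("md5").update(shardKey).digest();
  const bucket = digest.readUInt32BE(0) % nodes.length;
  return nodes[bucket];
}

// Reads and writes for the same key always land on the same node,
// so load spreads across partitions instead of contending on one disk.
console.log(routeToShard("customer-1234")); // e.g. "shard-2"
```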
Gone are the days when one size fits all. The least common denominator of traditional general-purpose databases cannot effectively handle the large variety of modern workloads. To meet the performance requirements of cloud-scale workloads, many different kinds of databases have been created, each highly optimized for the read and write characteristics of specific usage scenarios. This specialization has led to the adoption of a Polyglot Persistence approach, where the system is composed of many different types of databases. Each component in the system uses the right storage technology for the job, and often more than one kind of database per component. Some of the different categories of databases include key-value stores, document stores, search engines, graph databases, time series databases, blob or object storage, mobile offline-first databases, columnar or column-oriented data warehouses, and append-only streams. Note that streams themselves are databases. Many newer database products are multi-model, in that they support multiple data models and even the dialects of other popular databases.
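To illustrate why an append-only stream can be treated as a database, the sketch below keeps events in a simple in-memory log and rebuilds a view by replaying them. The event shape and the in-memory array are assumptions made purely for illustration; in practice the log would live in a streaming service and the view in one of the stores listed above.

```typescript
// Hypothetical domain event; the fields are assumptions for this example.
interface ProductEvent {
  type: "product-created" | "product-updated";
  productId: string;
  name?: string;
  price?: number;
  timestamp: number;
}

// An append-only stream: events are only ever added, never updated in place.
const stream: ProductEvent[] = [];

function append(event: ProductEvent): void {
  stream.push(event);
}

// The stream is the source of truth; any view can be rebuilt by replaying it.
function materializeProducts(): Map<string, { name?: string; price?: number }> {
  const view = new Map<string, { name?: string; price?: number }>();
  for (const event of stream) {
    const current = view.get(event.productId) ?? {};
    view.set(event.productId, {
      name: event.name ?? current.name,
      price: event.price ?? current.price,
    });
  }
  return view;
}

append({ type: "product-created", productId: "p1", name: "Lamp", price: 30, timestamp: Date.now() });
append({ type: "product-updated", productId: "p1", price: 25, timestamp: Date.now() });
console.log(materializeProducts().get("p1")); // { name: "Lamp", price: 25 }
```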
The following diagram depicts an example of the persistence layer of a hypothetical e-commerce system. The example consists of six bounded isolated components:
The Authoring component is responsible for managing the authoring of products. The product metadata is stored in a JSON document table and the related multimedia is stored in a blob store. All data change events are published to the event stream for consumption by other components.

The Commerce component is responsible for making products available to customers. The component consumes authoring events and refines the product metadata into its own materialized views. The data is indexed in a search engine, and blob storage is used for multimedia and product details that are served from the edge via the CDN. The component also consumes customer clickstream events and populates a graph database that drives future product recommendations.
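The following sketch shows how the Commerce component might refine an authoring event into its own materialized view and index it for customer-facing search. The event shape, view shape, and the searchIndex and blobStore interfaces are assumptions for illustration, not any specific product's API.

```typescript
// Hypothetical authoring event published by the Authoring component.
interface ProductAuthoredEvent {
  productId: string;
  title: string;
  description: string;
  imageKeys: string[];
}

// The Commerce component's own view, shaped for customer-facing search.
interface CatalogViewEntry {
  id: string;
  title: string;
  searchText: string;
  imageUrls: string[];
}

interface SearchIndex {
  index(entry: CatalogViewEntry): Promise<void>;
}

interface BlobStore {
  publicUrl(key: string): string;
}

// Consume an authoring event and refine it into the Commerce materialized view.
async function onProductAuthored(
  event: ProductAuthoredEvent,
  searchIndex: SearchIndex,
  blobStore: BlobStore
): Promise<void> {
  const entry: CatalogViewEntry = {
    id: event.productId,
    title: event.title,
    searchText: `${event.title} ${event.description}`,
    imageUrls: event.imageKeys.map((key) => blobStore.publicUrl(key)),
  };
  await searchIndex.index(entry);
}
```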
The Account component is responsible for managing a customer's profile, preferences, shopping cart, and order history. To facilitate session consistency, the customer's profile, preferences, cart, and recent orders are stored in an offline-first, mobile database in local storage. This ensures that customers see a consistent view of their own data even when it has not yet become eventually consistent with the rest of the system; in extreme cases, it even allows customers to see their data in offline mode. The offline-first database is synchronized to the cloud, and potentially to other devices, on a periodic basis and on discrete actions, such as checkout. Synchronization events are also published to the event stream. Order status events are consumed from the event stream and synchronized to local storage across devices. The customer's complete order history is stored in a document store.

The Processing component is responsible for processing orders. The component consumes checkout events and stores the orders in a document table for further processing. The order table is essentially a materialized view of the order information joined with the customer's account information at the point in time of the order. The component produces events to reflect the status of orders. The order table is aggressively purged of completed orders, because the Analytics and Archive components track all events for reporting and compliance.
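The sketch below illustrates the Processing component's point-in-time join: the checkout event carries a denormalized snapshot of the account data, so the order document freezes that information as of checkout and later profile changes do not alter past orders. The event and document fields are assumptions for the example.

```typescript
// Hypothetical checkout event consumed by the Processing component.
interface CheckoutEvent {
  orderId: string;
  customerId: string;
  items: { sku: string; quantity: number; price: number }[];
  // Denormalized snapshot of the account at the moment of checkout.
  shippingAddress: string;
  email: string;
}

// Materialized order document stored in the order table.
interface OrderDocument {
  orderId: string;
  customerId: string;
  items: { sku: string; quantity: number; price: number }[];
  shippingAddress: string;
  email: string;
  total: number;
  status: "received" | "shipped" | "completed";
}

// Build the order document from the event; it is a join frozen at checkout time.
function materializeOrder(event: CheckoutEvent): OrderDocument {
  const total = event.items.reduce((sum, item) => sum + item.price * item.quantity, 0);
  return { ...event, total, status: "received" };
}
```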
The Analytics component consumes all events and stores them in their raw format in the data lake in perpetuity. The data is also stored in a search engine, providing a complete index of all the events and their contents and enabling robust time series analysis. Events are also refined into materialized views and stored in a data warehouse.

The Archive component consumes all events for compliance with regulations. The data is stored and versioned in blob storage and aged into cold storage.
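As a small sketch of how the Analytics component might lay raw events out in the data lake, the function below derives an immutable object-store key partitioned by event type and date. The prefix and key format are assumptions chosen for the example; partitioning conventions vary by tooling.

```typescript
// Hypothetical raw event as it arrives from the event stream.
interface RawEvent {
  id: string;
  type: string;
  timestamp: number; // epoch milliseconds
  payload: unknown;
}

// Derive an immutable object-store key so events are never overwritten and can
// be scanned efficiently by type and day.
function dataLakeKey(event: RawEvent): string {
  const date = new Date(event.timestamp).toISOString().slice(0, 10); // YYYY-MM-DD
  return `raw-events/type=${event.type}/dt=${date}/${event.id}.json`;
}

console.log(
  dataLakeKey({ id: "e-123", type: "order-checkout", timestamp: Date.now(), payload: {} })
);
// e.g. "raw-events/type=order-checkout/dt=2024-06-01/e-123.json"
```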
As you can see, each of the database types excels at supporting different application usage scenarios. By selecting the right database types on a component-by-component basis, we greatly simplify the work needed to implement these applications. Yet managing all these different database types can be a burden. We will address this problem in the following section.