Machine Learning with Apache Spark Quick Start Guide

上QQ阅读APP看书，第一时间看更新

Columnar databases

Relational databases traditionally persist each row of data contiguously, meaning that each row will be stored in sequential blocks on disk. This type of database is referred to as row-oriented. For operations involving typical statistical aggregations such as calculating the average of a particular attribute, the effect of row-oriented databases is that every attribute in that row is read during processing, regardless of whether they are relevant to the query or not. In general, row-oriented databases are best suited for transactional workloads, also known as online transaction processing (OLTP), where individual rows are frequently written to and where the emphasis is on processing a large number of relatively simple queries, such as short inserts and updates, quickly. Examples of use cases include retail and financial transactions where database schemas tend to be highly normalized.

On the other hand, columnar databases such as Apache Cassandra and Apache HBase are column-oriented, meaning that each column is persisted in sequential blocks on disk. The effect of column-oriented databases is that individual attributes can be accessed together as a group, rather than individually by row, thereby reducing disk I/O for analytical queries since the amount of data that is loaded from disk is reduced. For example, consider the following table:

In a row-oriented database, the data is persisted to disk as follows:

(1001, USB drive 64 GB, storage, 25.00), (1002, SATA HDD 1 TB, storage, 50.00), (1003, SSD 256 GB, storage, 60.00)

However, in a column-oriented database, the data is persisted to disk as follows:

(1001, 1002, 1003), (USB drive 64 GB, SATA HDD 1 TB, SSD 256 GB), (storage, storage, storage), (25.00, 50.00, 60.00)

In general, column-oriented databases are best suited for analytical workloads, also known as online analytical processing (OLAP), where the emphasis is on processing a low number of complex analytical queries typically involving aggregations. Examples of use cases include data mining and statistical analysis, where database schemas tend to be either denormalized or follow a star or snowflake schema design.