Object Storage Daemons (OSDs)
OSDs provide bulk storage of all user data within Ceph. Strictly speaking, an OSD is the operating system process (ceph-osd) running on a storage host that manages data reads, writes, and integrity. In practice, however, OSD is also used to refer to the underlying collection of data, the object storage device, that a given process manages. As the two are intimately linked, one can quite reasonably think of an OSD as the logical combination of the process and its underlying storage. At times, one may also see OSD used to refer to the entire server/host that houses these processes and data, though it is much better practice to call that server/host an OSD node, one that may house as many as dozens of individual OSDs.
Each OSD within a Ceph cluster stores a subset of data. As we explored in Chapter 1, Introduction to Ceph Storage, Ceph is a distributed system without a centralized access bottleneck. Many traditional storage solutions contain one or two head units that are the only components that users interact with, a chokepoint of both control and data planes, which leads to performance bottlenecks and scaling limitations. Ceph clients, however—virtual machines, applications, and so on—communicate directly with the cluster's OSDs. Create, Read, Update, and Delete (CRUD) operations are sent by clients and performed by the OSD processes that manage the underlying storage.
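To make this concrete, here is a minimal sketch using the python-rados bindings that ship with Ceph. It connects with a local ceph.conf and keyring, opens a hypothetical pool named mypool, and performs the basic CRUD operations; the pool name and configuration path are assumptions, so adjust them for your own cluster.

#!/usr/bin/env python
# Minimal sketch of direct client I/O via librados. The pool name and
# configuration path are assumptions; adjust them for your cluster.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('mypool')              # hypothetical pool name
try:
    ioctx.write_full('greeting', b'hello ceph')   # Create / Update
    print(ioctx.read('greeting'))                 # Read
    ioctx.remove_object('greeting')               # Delete
finally:
    ioctx.close()
    cluster.shutdown()

Note that the client library itself computes where each object lives and contacts the responsible OSDs directly; no proxy or head unit sits in the data path.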
The object storage device that a given OSD manages is usually a single Hard Disk Drive (HDD) or Solid State Drive (SSD). Exotic architectures for the storage underlying a single OSD are uncommon but not unknown; they range from a simple software or HBA RAID volume to LUNs or iSCSI targets on external storage arrays, SanDisk's InfiniFlash, or even a ZFS ZVOL.
Ceph organizes data into units known as Placement Groups (PGs). A PG is the granularity at which many decisions and operations within the cluster take place. A well-utilized Ceph cluster contains millions of low-level objects, a population that is unwieldy to manage individually, so objects are grouped together into PGs, which typically number in the thousands to tens of thousands per cluster. Each PG maintains multiple copies on disjoint OSDs, nodes, racks, or even data centers as a key part of Ceph's commitment to high availability and data durability. PGs are distributed according to defined constraints that avoid creating hotspots and minimize the impact of server and infrastructure failures. By default, Ceph maintains three replicas of data, realized by placing a copy of each PG on three different OSDs located on three different servers. For additional fault tolerance, configuration can be added to ensure that those servers are located within three separate data center racks.
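The toy sketch below illustrates only the grouping idea: object names are hashed into a fixed number of PGs, here a hypothetical pg_num of 128. Ceph's real placement uses the rjenkins hash and the CRUSH algorithm to map each PG onto a set of OSDs subject to the failure-domain constraints just described; this is not that algorithm, merely a way to visualize how millions of objects collapse into a manageable number of PGs.

# Toy illustration only: hash object names into a fixed number of PGs.
# Real Ceph uses the rjenkins hash plus CRUSH to map each PG onto OSDs
# subject to failure-domain rules; this sketch shows just the grouping idea.
import hashlib

PG_NUM = 128   # hypothetical pool pg_num

def pg_for_object(name):
    digest = hashlib.md5(name.encode('utf-8')).hexdigest()
    return int(digest, 16) % PG_NUM

for obj in ('vm-disk-0001', 'vm-disk-0002', 'backup-2017-01-01'):
    print('%s -> PG %d' % (obj, pg_for_object(obj)))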
At any given time, one OSD's copy of a PG is designated the primary; an important point is that all client read and write operations are directed at this primary OSD. The OSDs holding the additional copies of a PG are referred to as secondaries (the term slave is sometimes seen, but secondary is more frequently used). Recent releases of Ceph also offer an alternative to simple replication known as erasure coding; we will explore that in later chapters.
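If you are curious which OSD serves as the primary for a given object, one way to ask is to issue the osd map command to the monitors through the python-rados mon_command wrapper, as sketched below. The pool name is again a hypothetical mypool, and the JSON field names shown reflect recent Ceph releases; the acting set lists the OSDs holding the PG, with the acting primary called out separately.

# Sketch: ask the monitors which PG an object maps to and which OSD is the
# acting primary. Assumes a pool named "mypool"; the object need not exist.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({'prefix': 'osd map',
                      'pool': 'mypool',
                      'object': 'greeting',
                      'format': 'json'})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    info = json.loads(outbuf)
    print('PG:', info['pgid'])
    print('Acting set:', info['acting'])            # OSDs holding this PG
    print('Primary OSD:', info['acting_primary'])   # serves client I/O
finally:
    cluster.shutdown()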
Ceph's OSDs maintain periodic contact with one another, both to ensure consistency and to detect when a peer becomes unavailable. When an OSD's process or host crashes, suffers hardware or network trouble, or otherwise becomes unresponsive, other OSDs in the cluster will report it as down, and recovery is initiated to maintain an adequate number of redundant copies of the data.
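As a final sketch, the same mon_command interface can be used to ask the monitors which OSDs are currently marked up or down, which is how an operator would typically confirm what peer OSDs have reported. This assumes the same ceph.conf and keyring as the earlier examples.

# Sketch: fetch the OSD tree from the monitors and report each OSD's
# up/down state as currently recorded by the cluster.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd tree', 'format': 'json'}), b'')
    tree = json.loads(outbuf)
    for node in tree['nodes']:
        if node['type'] == 'osd':
            print(node['name'], node.get('status', 'unknown'))
finally:
    cluster.shutdown()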