
上QQ阅读APP看书,第一时间看更新
RDDs
RDDs are an immutable and distributed collection of records partitioned across the worker nodes in a Spark cluster. They offer fault tolerance since, in the event of non-functional nodes or damaged partitions, RDD partitions can be recomputed, since all the dependency information needed to replicate each of its partitions is stored by the RDD itself. They also provide consistency, since each partition is immutable. RDDs are commonly used in Spark today in situations where you do not need to impose a schema when processing the data, such as when unstructured data is processed. Operations may be executed on RDDs via a low-level API that provides two broad categories of operation:
- Transformations: These are operations that return another RDD. Narrow transformations, for example, map operations, are operations that can be executed on arbitrary partitions of data without being dependent on other partitions. Wide transformations, for example, sorting, joining, and grouping, are operations that require the data to be repartitioned across the cluster. Certain transformations, such as sorting and grouping, require the data to be redistributed across partitions, a process known as shuffling. Because data needs to be redistributed, wide transformations requiring shuffling are expensive operations and should be minimized in Spark applications where possible.
- Actions: These are computational operations that return a value back to the driver, not another RDD. RDDs are said to be lazily evaluated, meaning that transformations are only computed when an action is called.