
Introduction
Spark provides a unified runtime for big data. HDFS, Hadoop's distributed filesystem, is the most widely used storage platform for Spark because it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS, however; it can work with any Hadoop-supported storage.
Hadoop-supported storage means storage that can work with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing each split further into records; OutputFormat is responsible for writing to storage.
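To make the role of InputFormat concrete, the following is a minimal sketch of loading data through an explicit InputFormat in Spark. It assumes a running spark-shell, where a SparkContext is available as `sc`; the HDFS path is illustrative only:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Load a file via the new Hadoop API, naming the InputFormat explicitly.
// TextInputFormat splits the input into lines, producing
// (byte offset, line contents) pairs.
val lines = sc.newAPIHadoopFile(
  "hdfs://localhost:9000/user/hduser/words", // illustrative path
  classOf[TextInputFormat],  // the InputFormat to use
  classOf[LongWritable],     // key type: byte offset of each line
  classOf[Text]              // value type: the line itself
).map { case (_, text) => text.toString }
```

This is exactly what convenience methods such as `sc.textFile` do under the hood; they simply pick a sensible InputFormat for you.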
We will start with writing to the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to use any InputFormat
interface to load data in Spark. We will also explore loading data stored in Amazon S3, a leading cloud storage platform.
We will then explore loading data from Apache Cassandra, which is a NoSQL database. Finally, we will explore loading data from a relational database.
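As a taste of the simplest case covered first, reading regular text files needs only a one-liner. This sketch again assumes a spark-shell session with `sc` in scope, and both paths are illustrative:

```scala
// Read a text file from the local filesystem (note the file:// scheme)
val localWords = sc.textFile("file:///home/hduser/words")

// Read the same kind of file from HDFS; each element of the RDD is one line
val hdfsWords = sc.textFile("hdfs://localhost:9000/user/hduser/words")
```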