Data Lake for Enterprises

Data import using Sqoop

The Sqoop import tool, when invoked, imports individual tables or all tables from an RDBMS into HDFS using one of the available connector APIs. During import, each row of an RDBMS table becomes a record in HDFS. Depending on the type of data, records are stored either as text files (for text data) or as sequence files or Avro files (for binary data).
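As a hedged illustration of these two modes, the commands below sketch a single-table import and an all-tables import from PostgreSQL into HDFS. The host name, database name, credentials, table name, and target directories are hypothetical placeholders; the exact command used for our use case appears later in this chapter.

# Import a single table; each row becomes a record in HDFS
sqoop import \
  --connect jdbc:postgresql://localhost:5432/sourcedb \
  --username sqoop_user \
  --password sqoop_pass \
  --table customer \
  --target-dir /data/raw/customer

# Import every table of the database in one go
sqoop import-all-tables \
  --connect jdbc:postgresql://localhost:5432/sourcedb \
  --username sqoop_user \
  --password sqoop_pass \
  --warehouse-dir /data/raw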

The following figure (Figure 06, our interpretation of Sqoop import, inspired by the Apache Sqoop blogs) details how the Sqoop import tool works by importing data from PostgreSQL into HDFS:

Figure 06: Working of Sqoop Import

Before the actual import executes, Sqoop analyses the database and gathers the relevant metadata (table and column definitions). This metadata is then used to run the import, either for a single required table or for the whole database, as the case may be.
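This metadata step can be seen in isolation with Sqoop's codegen tool, which inspects the table's columns and SQL types and generates the Java class that the import job later uses to represent and serialize each row. The connection details below are again placeholder assumptions.

# Inspect the table and generate its row class (for example, customer.java)
# without actually moving any data
sqoop codegen \
  --connect jdbc:postgresql://localhost:5432/sourcedb \
  --username sqoop_user \
  --password sqoop_pass \
  --table customer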

Sqoop provides several options that control how the import takes place. The data imported from a table is stored as one or more HDFS files (depending on the size of the data at the source), with column values separated by commas and each row separated by a newline. Sqoop also lets you specify the file format (Avro or text files) at import time.
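To make these options concrete, the sketch below shows one import that stores the data as Avro files and another text-based import that sets the field and line delimiters explicitly and splits the work across four mappers, so the result lands as multiple HDFS files. All names, paths, and the split column are illustrative assumptions.

# Store the imported data as Avro data files
sqoop import \
  --connect jdbc:postgresql://localhost:5432/sourcedb \
  --username sqoop_user --password sqoop_pass \
  --table customer \
  --target-dir /data/raw/customer_avro \
  --as-avrodatafile

# Plain text output with explicit delimiters, written by four parallel mappers
sqoop import \
  --connect jdbc:postgresql://localhost:5432/sourcedb \
  --username sqoop_user --password sqoop_pass \
  --table customer \
  --target-dir /data/raw/customer_text \
  --as-textfile \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n' \
  --split-by customer_id \
  --num-mappers 4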

Later in this chapter, we will walk you through the actual Sqoop command used to get data from PostgreSQL into HDFS. This section only explains what the Sqoop import tool does and how it works under the hood.