Data Lake for Enterprises

Advantages of Sqoop

Below are the advantages of Apache Sqoop, which are also the reasons for choosing this technology in this layer.

  • Allows the transfer of data to and from a variety of structured data stores such as Postgres, Oracle, Teradata, and so on.
  • Since the data is transferred and stored in Hadoop, Sqoop allows us to offload certain processing done in the ETL (Extract, Transform, and Load) process into low-cost, fast, and effective Hadoop processes.
  • Sqoop can execute the data transfer in parallel, which makes execution faster and more cost effective.
  • Helps to integrate with sequential data from the mainframe. This not only limits the usage of the mainframe, but also reduces the high cost of executing certain jobs on mainframe hardware.
  • Data from structured data stores can be Sqooped into Hadoop, which mainly holds unstructured data. This allows us to combine both types of data for various analysis purposes in a faster and more cost-effective manner.
  • Has an extension mechanism through which different connectors can be built and plugged in. It can be used to customize existing connectors and to tweak them according to use case requirements. There are built-in connectors for stores such as MySQL, PostgreSQL, and Oracle, along with a number of other well-known ones. Because Sqoop supports extensions, many companies have written custom connectors that are well supported and enterprise grade. For example, the Oracle connector was developed by Quest Software and the VoltDB connector by VoltDB itself.
  • In addition to JDBC-based connectors (for various RDBMS systems), it also has direct connectors, which use native tools for better performance.
  • Sqoop also supports a variety of file formats such as Avro, Text, and SequenceFile.

Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.

- Wikipedia

SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.

- Hadoop wiki
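
To make the parallel-transfer, direct-connector, and file-format points more concrete, here is a minimal Python sketch that only assembles a `sqoop import` command line; it does not run Sqoop. The helper `build_sqoop_import`, the JDBC URL, the table name, the paths, and the user are all hypothetical names chosen for illustration. The flags themselves (`--num-mappers`, `--split-by`, `--direct`, `--as-textfile`, `--as-avrodatafile`, `--as-sequencefile`) are standard Sqoop 1 import options, though whether `--direct` is available, and with which file formats, depends on the connector in use.

```python
import shlex

# Sqoop 1 import flags for the three file formats named in the text.
FORMAT_FLAGS = {
    "text": "--as-textfile",
    "avro": "--as-avrodatafile",
    "sequence": "--as-sequencefile",
}


def build_sqoop_import(jdbc_url, table, target_dir, username, password_file,
                       split_by, num_mappers=4, file_format="avro",
                       direct=False):
    """Return the argv list for a `sqoop import` run (nothing is executed).

    This helper is hypothetical; it is not part of Sqoop itself.
    """
    if file_format not in FORMAT_FLAGS:
        raise ValueError(f"unsupported file format: {file_format}")
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")

    argv = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        # Parallelism: Sqoop divides the range of the split column
        # among the requested number of map tasks.
        "--split-by", split_by,
        "--num-mappers", str(num_mappers),
        FORMAT_FLAGS[file_format],
    ]
    if direct:
        # Asks for the connector's native-tool path instead of plain JDBC.
        # Assumption: the chosen connector supports direct mode.
        argv.append("--direct")
    return argv


if __name__ == "__main__":
    # All values below are made-up placeholders.
    command = build_sqoop_import(
        jdbc_url="jdbc:postgresql://db.example.com/sales",
        table="orders",
        target_dir="/data/raw/orders",
        username="etl_user",
        password_file="/user/etl/.db-password",
        split_by="order_id",
        num_mappers=8,
        file_format="avro",
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues on a machine where Sqoop is actually installed.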