Sqoop Part.1 – Data Collection Using Sqoop


In this series, we introduce Sqoop and methods for transferring data from an RDBMS to Hadoop using Sqoop. The topic is divided into three parts. First, let’s look at an introduction to Sqoop, a comparison between Sqoop1 and Sqoop2, and the service environment.

What is Sqoop?

Sqoop stands for “SQL to Hadoop.” Apache Sqoop is an open-source tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.

Data transferred by Sqoop can be used with MapReduce or Hive. Since its first release in 2009, Sqoop has become an Apache Top-Level Project and is under constant development. Sqoop now comes in two versions: Sqoop1, a client-only tool, and Sqoop2, which adds a server-side component to Sqoop1.
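
As a quick illustration of such a transfer, a minimal Sqoop1 import might look like the following sketch (the host, database, credentials, table, and paths are hypothetical placeholders, not values from this article):

    # Import the "logs" table from a MySQL database into HDFS.
    # All connection details below are placeholders.
    sqoop import \
      --connect jdbc:mysql://db.example.com:3306/mydb \
      --username report_user -P \
      --table logs \
      --target-dir /user/hadoop/logs \
      --num-mappers 4

The files written to /user/hadoop/logs can then be processed with MapReduce or queried from Hive.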

Comparison between Sqoop1 and Sqoop2

The biggest difference is the addition of a server side, which makes integration with Apache Oozie and connections over an HTTP REST API much more convenient.


Sqoop1

  • Client-side installation
  • The connector must be installed on each local client
  • The JDBC driver must be installed on every connected local client
  • Offers a CLI (Command Line Interface); see the example below
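
As a sketch of what this client-side installation implies, the JDBC driver jar must be present on every machine that runs the Sqoop1 CLI (the jar name and lib path below are typical examples, not guaranteed for every installation):

    # Place the MySQL JDBC driver where this client's Sqoop1 can load it.
    cp mysql-connector-java-5.1.49.jar $SQOOP_HOME/lib/

    # Verify connectivity from this client (connection details are placeholders).
    sqoop list-tables \
      --connect jdbc:mysql://db.example.com:3306/mydb \
      --username report_user -P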

Sqoop2

  • Server-side installation
  • Connectors only need to be installed on the single server to which clients connect
  • i.e., the JDBC driver is installed in one place only
  • Can be accessed through a web UI and REST API in addition to the CLI (see the sketch after this list)
  • Easy to integrate with a workflow manager such as Apache Oozie via the REST API
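
For example, a Sqoop2 client only needs to know where the server is; a minimal session with the sqoop2 shell might look like this (the host name is a placeholder, and the port 12000 and the /sqoop/version REST path are assumptions based on a typical Sqoop2 setup):

    # Point the Sqoop2 shell at the central server; no local connector
    # or JDBC driver is required on this client.
    sqoop2-shell
    sqoop:000> set server --host sqoop2.example.com --port 12000 --webapp sqoop
    sqoop:000> show connector

    # The same server can also be reached over the REST API, e.g.:
    curl http://sqoop2.example.com:12000/sqoop/version

It is this HTTP interface that a workflow manager such as Oozie can call into, which is what makes the integration straightforward.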

The Service Environment

  • Logs are often stored in an RDBMS such as Oracle or MySQL when there is no separate log collection system or data store; because analyzing that much data in place is costly and slow, it needs to be transferred to distributed storage in the same environment where the analysis runs.
  • Not only logs but also metadata are mostly stored in an RDBMS, and this metadata can likewise be transferred to Hadoop, Hive, etc.
  • Conversely, analysis results produced by Hadoop, Hive, etc. can be transferred to a remote RDBMS without writing custom API code (see the export sketch below).
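
As a sketch of that last case, pushing analysis results back to an RDBMS is a single Sqoop1 export command (the table and directory names are placeholders, and the target table must already exist in the database):

    # Export aggregated results from HDFS back into a remote RDBMS table.
    sqoop export \
      --connect jdbc:mysql://db.example.com:3306/reports \
      --username report_user -P \
      --table daily_stats \
      --export-dir /user/hadoop/output/daily_stats

For the metadata case, adding --hive-import to a sqoop import command loads the data directly into a Hive table instead of a plain HDFS directory.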