Understanding Data Import

You can import data from relational databases such as MySql, Postgres, Vertica and Redshift as well as non-relational databases such as MongoDB. QDS uses Sqoop to import data.

Note

You can see the list of supported features in the Sqoop User Guide. Though it is mentioned in the Sqoop User Guide, Data Import job doesn’t support Incremental Imports.

Procedure

  1. Create a Data Store if you haven’t already done so.
  2. Navigate to the Analyze page in the QDS UI and choose Data Import from the drop-down menu at the top of the right pane.
  3. Select your data store, and choose the table, extraction mode, and other options. More details below.
  4. Choose Run from the menu bar at the top right of the page.

Details

There are two extraction modes:

  • Simple
  • Advanced

Simple

In this mode, you can choose the input and output tables, the columns to extract, a where clause, the output format, a partition specification for the Hive table, and other options.

Advanced

In this mode, you can specify a free form query. For example, you can join multiple tables and extract the (denormalized) data. But if you want parallel extracts, a ‘$CONDITIONS’ macro must be part of a where clause in the query (see below).

Controlling Parallelism

Sqoop can parallelize database extracts. By default, it automatically figures out the level of parallelism and performs up to 4 extracts in parallel. But you can specify the level of parallelism explicitly. Most importantly, the level of parallelism can be set to 1; this disables parallel extracts.

In Simple mode, Sqoop tries to parallelize extracts using an integer primary key (if one is available).

In Advanced mode, you have more control over how to parallelize the extracts. You can specify a column to parallelize on (this is the column that must be visible in the scope of the ‘$CONDITIONS’ macro embedded inside the free form query). You must also specify a boundary query (--boundary-query) that returns the minimum and maximum values of the column to parallelize on. Qubole strongly encourages you to read the Sqoop documentation on free-form imports (for example, the relevant chapter in the Apache Sqoop Cookbook) to understand these controls.