Data Import

Import data from relational databases such as MySql, Postgres, Vertica and Redshift as well as non-relational databases such as MongoDB. QDS uses Sqoop to import data. First, create a Data Store and set it as a source.

Controls are grouped into two extraction modes:

  • Simple
  • Advanced

Simple

In this mode, one can specify a table name (present in the DbTap), list of columns, and a where clause.

Advanced

In this mode, a free form query can be specified. For example, the user can even join multiple tables and extract the (denormalized) data. However, if parallel extracts are desired, then a ‘$CONDITIONS’ macro must be part of a where clause in the query (see below).

Controlling Parallelism

Sqoop can parallelize database extracts. By default, it automatically figures out the level of parallelism and perform up to 4 extracts in parallel. However, the user can specify the level of parallelism explicitly. Most importantly, the level of Parallelism can be set to 1 and this disables parallel extracts.

In the Simple mode, Sqoop tries to parallelize extracts using an integer primary key (if one is available). In the Advanced mode, the user has more control over specifying how to parallelize the extracts. A user can specify a column to parallelize on (this is the column that must be visible in the scope of the ‘$CONDITIONS’ macro embedded inside the free form query). Finally, the user must specify another query called the Boundary Query that returns the minimum and the maximum of the column to parallelize on. Qubole strongly encourages reading through the sqoop documentation on Free Form Imports to understand these controls.