Understanding Data Import¶
You can import data from relational databases such as MySql, Postgres, Vertica and Redshift as well as non-relational databases such as MongoDB. QDS uses Sqoop to import data.
- Create a Data Store if you haven’t already done so.
- Navigate to the Analyze page in the QDS UI and choose Data Import from the drop-down menu at the top of the right pane.
- Select your data store, and choose the table, extraction mode, and other options. More details below.
- Choose Run from the menu bar at the top right of the page.
There are two extraction modes:
In this mode, you can choose the input and output tables, the columns to extract, a
the output format, a partition specification for the Hive table, and other options.
In this mode, you can specify a free form query. For example, you can join multiple tables and extract the
(denormalized) data. But if you want parallel extracts, a ‘$CONDITIONS’ macro
must be part of a
where clause in the query (see below).
Sqoop can parallelize database extracts. By default, it automatically figures out the level of parallelism and performs up to 4 extracts in parallel. But you can specify the level of parallelism explicitly. Most importantly, the level of parallelism can be set to 1; this disables parallel extracts.
In Simple mode, Sqoop tries to parallelize extracts using an integer primary key (if one is available).
In Advanced mode, you have more control over how to parallelize the extracts.
You can specify a column to parallelize on (this is the column that
must be visible in the scope of the ‘$CONDITIONS’ macro embedded inside
the free form query). You must also specify a boundary query (
--boundary-query) that returns the minimum and
maximum values of the column to parallelize on. Qubole strongly encourages you to read the Sqoop
documentation on free-form imports (for example, the relevant chapter in the Apache Sqoop Cookbook) to understand