Understanding Data Export

You can export data from Cloud storage and Hive tables to relational databases such as MySQL, Postgres, and Vertica. QDS uses Sqoop to export the data.

Export Data

  1. Create a Data Store if you haven’t already done so.
  2. Navigate to the Analyze page and choose Data Export from the drop-down menu at the top of the right pane.
  3. Select your data store, and choose the export mode, export directory, and other options; see the Details section below.
  4. Choose Run from the menu bar at the top right of the page.

Details

Export Modes:

  • Directory Export: Specify the Cloud storage directory to export to the data store destination. Note that this is a non-recursive export. You can specify the Field Delimiter (the default is ^A).
  • Hive Table Export: Specify the Hive table to export to the data store destination. QDS supports exporting all Hive table formats: text SerDe, ORC (Optimized Row Columnar File), and RC (Record Columnar File).
    • If the Hive table has no partitions, all the data in the Hive table is exported.
    • If the Hive table has partitions, specify a value for each partition column in the Hive Partition Spec field. For example, if the Hive table has partitions date and country with values 2012-01-01 and USA respectively, the partition spec must be date=2012-01-01/country=USA; specifying only date=2012-01-01 or only country=USA produces an error. (See the sketch after this list.)
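
As an illustrative sketch (the table, column, and partition names are hypothetical), a partitioned Hive table and its matching partition spec might look like this:

    -- Hypothetical partitioned Hive table; names are illustrative only.
    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (`date` STRING, country STRING)
    STORED AS ORC;

    -- Matching Hive Partition Spec for exporting a single partition:
    --   date=2012-01-01/country=USA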

Database Insertion Modes

QDS supports append and update modes for all supported databases, as well as upsert mode for MySQL. In update mode, you must also specify the update keys, which are used by the generated update queries.
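
As a rough illustration of how these modes differ (the table and column names are hypothetical, and the exact SQL that QDS generates may differ), the statements issued in each mode look roughly like this:

    -- Append mode: rows are simply inserted.
    INSERT INTO sales (id, region, amount) VALUES (42, 'EMEA', 99.50);

    -- Update mode: the update key (id here) drives the WHERE clause,
    -- so existing rows are updated rather than duplicated.
    UPDATE sales SET region = 'EMEA', amount = 99.50 WHERE id = 42;

    -- Upsert mode (MySQL only): insert the row, or update it if the key already exists.
    INSERT INTO sales (id, region, amount) VALUES (42, 'EMEA', 99.50)
        ON DUPLICATE KEY UPDATE region = 'EMEA', amount = 99.50;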

Using Redshift with AWS S3

For Redshift data stores, QDS uses the COPY command (as recommended by AWS) to export data from S3 to Redshift.

The COPY command does not work if the Redshift database and the S3 location are in different regions. Qubole automatically falls back to normal insert queries if the COPY command fails, but these can be significantly slower. Under these circumstances, Hive Table Export fails because the partition columns cannot be exported via COPY (they are not part of the data), and the fallback to insert-based exports fails because of the schema mismatch.

Note

If your account uses IAM Roles, attach the same Role ARN to the Redshift instance that has been granted to Qubole.
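
For illustration only, a Redshift COPY statement of the kind described above might look like the following; the bucket, table, and Role ARN are placeholders:

    -- Illustrative Redshift COPY; bucket, table, and IAM Role ARN are placeholders.
    COPY sales
    FROM 's3://example-bucket/export/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-qubole-role'
    DELIMITER '|';

The IAM_ROLE clause is where the Role ARN mentioned in the note above comes into play.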

Extending the Data Export Command

QDS allows you to build more complex exports on top of the existing functionality: you can export custom columns, custom SerDes, or multiple partitions. Consider an AWS example in which you want to export only a few columns from a Hive table, or export many partitions in a single command:

  • Select a Workflow in the Compose Tab on the Analyze page.
  • Select Hive command as the first sub-command of the workflow, and write the custom Hive query whose result you want to export. You can write this query either as insert overwrite directory s3://… or as create table as ….
  • Select Data Export as the second sub-command of the workflow. Use Hive Table Export or Directory Export, depending on whether you used create table or insert overwrite directory in the previous step.
  • Optionally, select Hive command as the last sub-command of the workflow to clean up S3 using a drop table command. A sketch of these Hive sub-commands follows.
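
A minimal sketch of the Hive sub-commands, assuming a hypothetical sales table and staging table, might be:

    -- Sub-command 1 (Hive): stage only the columns and partitions you want to export.
    -- Table, column, and partition names are illustrative.
    CREATE TABLE export_staging
    STORED AS TEXTFILE
    AS
    SELECT user_id, country, amount
    FROM sales
    WHERE dt >= '2012-01-01';

    -- Sub-command 2 (Data Export): run a Hive Table Export of export_staging
    -- (or a Directory Export if you used INSERT OVERWRITE DIRECTORY instead).

    -- Sub-command 3 (Hive, optional): clean up the staging data.
    DROP TABLE export_staging;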