Inserting Data
Note
Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.
You may want to write results of a query into another Hive table or to a Cloud location. QDS
Presto supports inserting data into (and overwriting) Hive tables and Cloud directories, and provides an INSERT
command for this purpose. It is currently available only in QDS; Qubole is in the process of contributing it to
open-source Presto.
Keep in mind that Hive is a better option for large-scale ETL workloads that write terabytes of data; Presto's insertion capabilities are better suited for tens of gigabytes.
The INSERT
syntax is very similar to Hive’s INSERT syntax.
Let us use default_qubole_airline_origin_destination
as the source table in the examples that follow; it contains
flight itinerary information.
Cloud Directories
You can write the result of a query directly to Cloud storage in a delimited format; for example:
INSERT INTO directory '<scheme>qubole.com-siva/experiments/quarterly_breakdown'
SELECT origin,
       quarter,
       count(*) AS c
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;
<scheme> is the Cloud-specific URI scheme: s3:// for AWS; wasb[s]://, adl://, or abfs[s]:// for Azure.
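For example, on AWS the same query can be written with the s3:// scheme filled in; only the scheme changes, and the bucket and path below are just the illustrative ones used above:
INSERT INTO directory 's3://qubole.com-siva/experiments/quarterly_breakdown'
SELECT origin,
       quarter,
       count(*) AS c
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;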
Here is a preview of what the result file looks like, viewed with cat -v. Fields in the results are separated by ^A (ASCII code \x01).
"DFW"^A1^A334973
"LAX"^A1^A216789
"OXR"^A1^A456
"HNL"^A1^A78271
"IAD"^A1^A115924
"ALB"^A1^A20779
"ORD"^A1^A414078
Simple Hive Tables
The target Hive table can be delimited, CSV, ORC, or RCFile. Qubole does not support inserting into Hive tables that use custom input formats or SerDes. You can create a target table in delimited format using the following DDL in Hive.
CREATE TABLE quarter_origin (quarter string, origin string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Run desc quarter_origin to confirm that the table is visible to Presto; it can take up to two minutes for Presto to pick up a newly created table in Hive.
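The check itself is just the desc command mentioned above, run as a Presto query:
desc quarter_origin;
Now run the following insert statement as a Presto query.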
INSERT INTO TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;
You can now run queries against quarter_origin
to confirm that the data is in the table.
SELECT *
FROM quarter_origin LIMIT 5;
Similarly, you can overwrite data in the target table by using the following query.
INSERT OVERWRITE TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;
Partitioned Hive Tables
You can also partition the target Hive table; for example (run this in Hive):
CREATE TABLE quarter_origin_p (origin string, count int)
PARTITIONED BY (quarter string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Now you can insert data into this partitioned table in a similar way. The only catch is that the partitioning column must appear at the very end of the select list. In the example below, the column quarter is the partitioning column.
INSERT INTO TABLE quarter_origin_p
SELECT origin,
       count(*),
       quarter
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;
Note that the partitioning column can also be given a constant value. You can use OVERWRITE instead of INTO to erase previous content in partitions.
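For example, the following sketch writes only the first quarter of 2007 and replaces whatever that partition held before; it assumes the quarter values in the source table are stored as the strings '1' through '4':
-- The constant '1' at the end of the select list is the partition value.
INSERT OVERWRITE TABLE quarter_origin_p
SELECT origin,
       count(*),
       '1'
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
  AND quarter='1'
GROUP BY origin;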
Configuring the Concurrent Writer Tasks Per Query
Caution
Use this configuration judiciously to prevent overloading the cluster with excessive resource utilization. It is recommended to set a higher value through session properties only for queries that generate large outputs, such as ETL jobs.
Presto provides a configuration property to define the per-node count of Writer tasks for a query. You can set it at the cluster level and at the session level. By default, INSERT and CREATE TABLE AS SELECT operations create one Writer task per worker node, which can slow down the query when a large amount of data needs to be written. In such cases, you can increase the number of Writer tasks per node with the task_writer_count session property; its value must be a power of 2. This speeds up data writes.
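For example, a write-heavy query such as the earlier INSERT OVERWRITE could be run with a larger per-node Writer task count for just that session; the value 4 below is an arbitrary power of 2 chosen for illustration, not a recommendation:
-- Raise the number of Writer tasks per node for this session only.
SET SESSION task_writer_count=4;
-- Then run the write-heavy query as usual.
INSERT OVERWRITE TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;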
The cluster-level property that you can override is task.writer-count. Its value must also be a power of 2.