Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.
You may want to write results of a query into another Hive table or to a Cloud location. QDS
Presto supports inserting data into (and overwriting) Hive tables and Cloud directories, and provides an
command for this purpose. It is currently available only in QDS; Qubole is in the process of contributing it to
Keep in mind that Hive is a better option for large scale ETL workloads when writing terabytes of data; Presto’s insertion capabilities are better suited for tens of gigabytes.
INSERT syntax is very similar to Hive’s INSERT syntax.
Let us use
default_qubole_airline_origin_destination as the source table in the examples that follow; it contains
flight itinerary information.
You can write the result of a query directly to Cloud storage in a delimited format; for example:
INSERT INTO directory '<scheme>qubole.com-siva/experiments/quarterly_breakdown' SELECT origin, quarter, count(*) AS c FROM default_qubole_airline_origin_destination WHERE YEAR='2007' GROUP BY quarter, origin;
<scheme> is the Cloud-specific URI scheme:
s3:// for AWS;
abfs[s]:// for Azure.
Here is a preview of what the result file looks like using
cat -v. Fields in the results are
"DFW"^A1^A334973 "LAX"^A1^A216789 "OXR"^A1^A456 "HNL"^A1^A78271 "IAD"^A1^A115924 "ALB"^A1^A20779 "ORD"^A1^A414078
Simple Hive Tables¶
The target Hive table can be delimited, CSV, ORC, or RCFile. Qubole does not support inserting into Hive tables using custom input formats and serdes. You can create a target table in delimited format using the following DDL in Hive.
CREATE TABLE quarter_origin (quarter string, origin string, count int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
desc quarter_origin to confirm that the table is familiar to Presto. It can take up to 2 minutes for Presto to
pick up a newly created table in Hive. Now run the following insert statement as a Presto query.
INSERT INTO TABLE quarter_origin SELECT quarter, origin, count(*) FROM default_qubole_airline_origin_destination WHERE YEAR='2007' GROUP BY quarter, origin;
You can now run queries against
quarter_origin to confirm that the data is in the table.
SELECT * FROM quarter_origin LIMIT 5;
Similarly, you can overwrite data in the target table by using the following query.
INSERT OVERWRITE TABLE quarter_origin SELECT quarter, origin, count(*) FROM default_qubole_airline_origin_destination WHERE YEAR='2007' GROUP BY quarter, origin;
Partitioned Hive Tables¶
You can also partition the target Hive table; for example (run this in Hive):
CREATE TABLE quarter_origin_p (origin string, count int) PARTITIONED BY (quarter string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
Now you can insert data into this partitioned table in a similar way. The only catch is that the partitioning column
must appear at the very end of the select list. In the below example, the column
quarter is the partitioning column.
INSERT INTO TABLE quarter_origin_p SELECT origin, count(*), quarter FROM default_qubole_airline_origin_destination WHERE YEAR='2007' GROUP BY quarter, origin;
Note that the partitioning attribute can also be a constant. You can use
overwrite instead of
into to erase
previous content in partitions.
Configuring the Concurrent Writer Tasks Per Query¶
Use this configuration judiciously to prevent overloading the cluster due to excessive resource utilization. So it is recommended to use higher value through session properties for queries which generate bigger outputs. For example, ETL jobs.
Presto provides a configuration property to define the per-node-count of Writer tasks for a query. You can set it at a
cluster level and a session level. By default, when inserting data through
CREATE TABLE AS SELECT
operations, one Writer task per worker node is created which can slow down the query if there there is a lot of data that
needs to be written. In such cases, you can use the
task_writer_count session property but you must set its value in
power of 2 to increase the number of Writer tasks per node. This eventually speeds up the data writes.
The cluster-level property that you can override in the cluster is
task.writer-count. You must set its value in power