Presto Best Practices

This section describes some best practices for Presto queries.

Note

Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

ORC Format

Qubole recommends that you use ORC file format; ORC outperforms text format considerably. For example, suppose you have you have a table nation in delimited form partitioned on column p. You can create the ORC version using this DDL as a Hive query.

DROP table if exists nation_orc;
CREATE table nation_orc like nation;
ALTER table nation_orc set fileformat orc;
ALTER table nation_orc set tblproperties ("orc.compress"="SNAPPY");
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO table nation_orc partition(p) SELECT * FROM nation;

If the nation table is not partitioned, replace the last 3 lines with the following:

INSERT INTO table nation_orc SELECT * FROM nation;

You can run queries against the newly generated table in Presto, and you should see a big difference in performance.

Sorting

ORC format supports skipping reading portions of files if the data is sorted (or mostly sorted) on the filtering columns. For example, if you expect that queries are often going to filter data on column n_name, you can include a SORT BY clause when using Hive to insert data into the ORC table; for example:

INSERT INTO table nation_orc partition(p) SELECT * FROM nation SORT BY n_name;

This helps with queries such as the following:

SELECT count(*) FROM nation_orc WHERE n_name=’AUSTRALIA’;

Specify JOIN Ordering

Presto does automatic JOIN re-ordering only when the feature is enabled. For more information, see Specifying JOIN Reordering. Otherwise, you need to make sure that smaller tables appear on the right side of the JOIN keyword. For example, if table A is larger than table B, write a JOIN query as follows:

SELECT count(*) FROM A JOIN B on (A.a=B.b)

A bad JOIN command can slow down a query as the hash table is created on the bigger table, and if that table does not fit into memory, it can cause out-of-memory (OOM) exceptions.

Specifying JOIN Reordering

Presto supports JOIN Reordering based on table statistics. It enables ability to pick optimal order for joining tables and it only works with INNER JOINS. This configuration is supported only in Presto 0.180 and later versions. Presto 0.208 (beta) has the open-source version of JOIN Reordering.

Note

As a prerequisite before using JOIN Rerodering, ensure that the table statistics must be collected for all tables that are in the query.

Enable the JOIN Reordering feature in Presto 0.180 and 0.193 versions (these properties do not hold good to Presto 0.208 (beta)):

  • As a session-level property by setting qubole_reorder_joins = true.
  • As a Presto override in the Presto cluster by setting qubole-reorder-joins=true. It requires a cluster restart for the configuration to be effective.

Enable the JOIN Reordering feature in Presto 0.208 (beta) version by setting the reordering strategy and the number of reordered joins, which are described here:

  • optimizer.join-reordering-strategy: It accepts a string value and the accepted values are:

    • AUTOMATIC: It enumerates possible orders and uses statistics-based cost estimation for finding out the cost order which is the lesser compared to others. If the statistics are unavailable or if the computing cost fails, then ELIMINATE_CROSS_JOINS is used.
    • ELIMINATE_CROSS_JOINS: It is the default strategy and it reorders JOINs to remove cross JOINS where otherwise this strategy maintains the original query order. It also tries maintaining the original table order.
    • NONE: It indicates that the order the tables listed in the query is maintained.

    The equivalent session property is join_reordering_strategy.

  • optimizer.max-reordered-joins: It is the maximum number of joins that can be reordered at a time when optimizer.join-reordering-strategy is set to a cost-based value. Its default value is 9.

    Warning

    You should be cautious while increasing this property’s value as it can result in performance issues. The number of possible JOIN orders increases with the number of relations.

Enabling Dynamic Filter

Qubole supports the Dynamic Filter feature. It is a join optimization to improve performance of JOIN queries. It has been introduced to optimize Hash JOINs in Presto which can lead to significant speedup in relevant cases. It is not enabled by default. This configuration is supported only in Presto 0.180 and later versions. Enable the Dynamic Filter feature:

  • As a session-level property by using this property, set session dynamic_filtering = true.
  • As a Presto override in the Presto cluster by setting experimental.dynamic-filtering-enabled=true. It requires a cluster restart for the configuration to be effective.

Avoiding Stale Caches

It’s useful to tweak the cache parameters if you expect data to change rapidly. See catalog/hive.properties for more information.

For example, if a Hive table adds a new partition, it takes Presto 20 minutes to discover it. If you plan on changing existing files in the Cloud, you may want to make fileinfo expiration more aggressive. If you expect new files to land in a partition rapidly, you may want to reduce or disable the dirinfo cache.

Compressing Data Writes Through CTAS and INSERT Queries

Data writes can be compressed only when the target format is HiveIgnoreKeyTextOutputFormat. As INSERT OVERWRITE/INTO DIRECTORY uses HiveIgnoreKeyTextOutputFormat, the data written through it can also be compressed by setting the session-level property and codec. All SELECT queries with LIMIT > 1000 are converted into INSERT OVERWRITE/INTO DIRECTORY.

The INSERT OVERWRITE DIRECTORY command accepts a custom delimiter, which must be an ASCII value. You can specify the ASCII values using double quotes, for example, "," or as a binary literal such as X'AA'.

Here is the syntax to specify a custom delimiter.

insert overwrite directory 's3://sample/defloc/presto_query_result/1/' DELIMITED BY <Custom delimiter> SELECT * FROM default.sample_table;

The <Custom delimiter> parameter does not accept multiple characters and non-ASCII characters as the parameter value.

Configuring Data Writes Compression in Presto

To compress data written from CTAS and INSERT queries to cloud directories, set hive.compression-codec in the Override Presto Configuration field under the Clusters > Advanced Configuration UI page. Set the compression codec under catalog/hive.properties as illustrated below.

catalog/hive.properties:
hive.compression-codec=GZIP

QDS supports the following compression codecs:

  • GZIP (default codec)
  • NONE (used when no compression is required)
  • SNAPPY

See Understanding the Presto Engine Configuration for more information on how to override the Presto configuration. For more information on the Hive connector, see Hive Connector.

When the codec is set, data writes from a successful execution of a CTAS/INSERT Presto query are compressed as per the compression-codec set and stored in the cloud. To see the file content, navigate to Explore in the QDS UI and select the file under the My Amazon S3 or My Blob tab.

QDS translates the file into a readable format. See Exploring Data in the Cloud for more information.

Ignoring Corrupt Records in a Presto Query

Presto has added a new Hive connector configuration, hive.skip-corrupt-records to skip corrupt records in input formats other than orc, parquet and rcfile. It is set to false by default on a Presto cluster. Set hive.skip-corrupt-records=true for all queries on a Presto cluster to ignore corrupt records. This configuration is supported only in Presto 0.180 and later versions.

You can also set it as a session property as hive.skip_corrupt_records=true in a session when the active cluster does not have this configuration globally enabled.

Note

The behavior for the corrupted file is non-deterministic, that is Presto might read some part of the file before hitting corrupt data and in such a case, the QDS record-reader returns whatever it read until this point and skips the rest of the file.

Using the Custom Event Listener

Event listeners are invoked on each query creation, completion, and split completion. An event listener enables the development of the query performance and analysis plugins. Qubole Presto uses its own event listener but it supports integration of an external Presto Event Listener. This integration is not supported by default. Create a ticket with Qubole Support to enable this feature of integrating a custom event listener.

At a given point of time, only a single event listener can be active in a Presto cluster.

After the feature is enabled, perform these steps:

  1. Modify a Custom Event Listener to add logs in the JSON format to the AWS S3 location.
  2. Build a JAR file. An example is available in QueryEventListener Jar.
  3. Place the JAR file in the AWS S3 location.
  4. Add this code that is available in the bootstrap template in the Presto cluster’s node bootstrap.
  5. Restart the cluster to make the custom event listener active.