Presto Best Practices

This section describes some best practices for Presto queries and it covers:

ORC Format

Qubole recommends that you use ORC file format; ORC outperforms text format considerably. For example, suppose you have you have a table nation in delimited form partitioned on column p. You can create the ORC version using this DDL as a Hive query.

DROP table if exists nation_orc;
CREATE table nation_orc like nation;
ALTER table nation_orc set fileformat orc;
ALTER table nation_orc set tblproperties ("orc.compress"="SNAPPY");
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO table nation_orc partition(p) SELECT * FROM nation;

If the nation table is not partitioned, replace the last 3 lines with the following:

INSERT INTO table nation_orc SELECT * FROM nation;

You can run queries against the newly generated table in Presto, and you should see a big difference in performance.

Sorting

ORC format supports skipping reading portions of files if the data is sorted (or mostly sorted) on the filtering columns. For example, if you expect that queries are often going to filter data on column n_name, you can include a SORT BY clause when using Hive to insert data into the ORC table; for example:

INSERT INTO table nation_orc partition(p) SELECT * FROM nation SORT BY n_name;

This helps with queries such as the following:

SELECT count(*) FROM nation_orc WHERE n_name=’AUSTRALIA’;

Specify JOIN Ordering

Presto does automatic JOIN re-ordering only when the feature is enabled. For more information, see Specifying JOIN Reordering. Otherwise, you need to make sure that smaller tables appear on the right side of the JOIN keyword. For example, if table A is larger than table B, write a JOIN query as follows:

SELECT count(*) FROM A JOIN B on (A.a=B.b)

A bad JOIN command can slow down a query as the hash table is created on the bigger table, and if that table does not fit into memory, it can cause out-of-memory (OOM) exceptions.

Specifying JOIN Reordering

Presto supports JOIN Reordering based on table statistics. It enables ability to pick optimal order for joining tables and it only works with INNER JOINS. This configuration is supported only in Presto 0.180 and later versions. Presto 0.208 has the open-source version of JOIN Reordering.

Note

As a prerequisite before using JOIN Reordering, ensure that the table statistics must be collected for all tables that are in the query.

Enable the JOIN Reordering feature in Presto 0.180 and 0.193 versions (these properties do not hold good to Presto 0.208):

  • As a session-level property by setting qubole_reorder_joins = true.

  • As a Presto override in the Presto cluster by setting qubole-reorder-joins=true. It requires a cluster restart for the configuration to be effective.

Enable the JOIN Reordering feature in Presto 0.208 version by setting the reordering strategy and the number of reordered joins, which are described here:

  • optimizer.join-reordering-strategy: It accepts a string value and the accepted values are:

    • AUTOMATIC: It enumerates possible orders and uses statistics-based cost estimation for finding out the cost order which is the lesser compared to others. If the statistics are unavailable or if the computing cost fails, then ELIMINATE_CROSS_JOINS is used.

    • ELIMINATE_CROSS_JOINS: It is the default strategy and it reorders JOINs to remove cross JOINS where otherwise this strategy maintains the original query order. It also tries maintaining the original table order.

    • NONE: It indicates that the order the tables listed in the query is maintained.

    The equivalent session property is join_reordering_strategy.

  • optimizer.max-reordered-joins: It is the maximum number of joins that can be reordered at a time when optimizer.join-reordering-strategy is set to a cost-based value. Its default value is 9.

    Warning

    You should be cautious while increasing this property’s value as it can result in performance issues. The number of possible JOIN orders increases with the number of relations.

Cost-based optimization (CBO) for JOIN reordering and JOIN distribution type selection using statistics present in the Hive metastore is enabled by default for Presto version 0.208.

The following values are added to default cluster configuration for Presto version 0.208.

optimizer.join-reordering-strategy=AUTOMATIC
join-distribution-type=AUTOMATIC
join-max-broadcast-table-size=100MB

Enabling Dynamic Filter

Qubole supports the Dynamic Filter feature. It is a join optimization to improve performance of JOIN queries. It has been introduced to optimize Hash JOINs in Presto which can lead to significant speedup in relevant cases.

It is not enabled by default. Enable the Dynamic Filter feature as a session-level property using one of these commands based on the Presto version:

  • Set session dynamic_filtering = true in Presto 0.208 and earlier versions (earliest supported version is 0.180).

  • Set session enable_dynamic_filtering = true in Presto 317.

Enable the Dynamic Filter feature as a Presto override in the Presto cluster using one of these commands based on the Presto version:

  • Set experimental.dynamic-filtering-enabled=true in Presto 0.208 and earlier versions (earliest supported version is 0.180). It requires a cluster restart for the configuration to be effective.

  • Set experimental.enable-dynamic-filtering=true in Presto 317. It requires a cluster restart for the configuration to be effective.

Note

Qubole has introduced a feature to enable dynamic partition pruning for join queries on partitioned columns in Hive tables at account level. It is part of Gradual Rollout.

Qubole has added a configuration property, hive.max-execution-partitions-per-scan to limit the maximum number of partitions that a table scan is allowed to read during a query execution. hive.max-partitions-per-scan limits the the number of partitions per table scan during the planning stage before a query execution begins.

Qubole has extended the dynamic filter optimization to semi-join to take advantage of a selective build side in queries with the IN clause.

Example: SELECT COUNT(*) from store_sales where ss_sold_date_sk IN (SELECT s_closed_date_sk from store);

Reducing Data Scanned on the Probe Side

Dynamic filters are pushed down to ORC and Parquet readers to reduce data scanned on the probe side for partitioned as well as non-partitioned tables. Dynamic filters pushed down to ORC and Parquet readers are more effective in filtering data when it is ordered by JOIN keys.

Example: In the following query, ordering store_sales_sorted by ss_sold_date_sk during the ingestion immensely improves the effectiveness of dynamic filtering.

SELECT COUNT(*) from store_sales_sorted ss, store s where ss.ss_sold_date_sk = s.s_closed_date_sk;

Avoiding Stale Caches

It’s useful to tweak the cache parameters if you expect data to change rapidly. See catalog/hive.properties for more information.

For example, if a Hive table adds a new partition, it takes Presto 20 minutes to discover it. If you plan on changing existing files in the Cloud, you may want to make fileinfo expiration more aggressive. If you expect new files to land in a partition rapidly, you may want to reduce or disable the dirinfo cache.

Compressing Data Writes Through CTAS and INSERT Queries

Data writes can be compressed only when the target format is HiveIgnoreKeyTextOutputFormat. As INSERT OVERWRITE/INTO DIRECTORY uses HiveIgnoreKeyTextOutputFormat, the data written through it can also be compressed by setting the session-level property and codec. All SELECT queries with LIMIT > 1000 are converted into INSERT OVERWRITE/INTO DIRECTORY.

Presto returns the number of files written during a INSERT OVERWRITE DIRECTORY (IOD) query execution in QueryInfo. The Presto client in Qubole Control Plane later uses this information to wait for the returned number of files at the IOD location to be displayed. It fixes the eventual consistency issues while reading query results through the QDS UI.

The INSERT OVERWRITE DIRECTORY command accepts a custom delimiter, which must be an ASCII value. You can specify the ASCII values using double quotes, for example, "," or as a binary literal such as X'AA'.

Here is the syntax to specify a custom delimiter.

insert overwrite directory 's3://sample/defloc/presto_query_result/1/' DELIMITED BY <Custom delimiter> SELECT * FROM default.sample_table;

The <Custom delimiter> parameter does not accept multiple characters and non-ASCII characters as the parameter value.

Configuring Data Writes Compression in Presto

To compress data written from CTAS and INSERT queries to cloud directories, set hive.compression-codec in the Override Presto Configuration field under the Clusters > Advanced Configuration UI page. Set the compression codec under catalog/hive.properties as illustrated below.

catalog/hive.properties:
hive.compression-codec=GZIP

QDS supports the following compression codecs:

  • GZIP (default codec)

  • NONE (used when no compression is required)

  • SNAPPY

See Understanding the Presto Engine Configuration for more information on how to override the Presto configuration. For more information on the Hive connector, see Hive Connector.

When the codec is set, data writes from a successful execution of a CTAS/INSERT Presto query are compressed as per the compression-codec set and stored in the cloud.

To see the file content, navigate to Explore in the QDS UI and select the file under the My Amazon S3 or My Blob tab. QDS translates the file into a readable format. See Exploring Data in the Cloud for more information.

Ignoring Corrupt Records in a Presto Query

Presto has added a new Hive connector configuration, hive.skip-corrupt-records to skip corrupt records in input formats other than orc, parquet and rcfile. It is set to false by default on a Presto cluster. Set hive.skip-corrupt-records=true for all queries on a Presto cluster to ignore corrupt records. This configuration is supported only in Presto 0.180 and later versions.

You can also set it as a session property as hive.skip_corrupt_records=true in a session when the active cluster does not have this configuration globally enabled.

Note

The behavior for the corrupted file is non-deterministic, that is Presto might read some part of the file before hitting corrupt data and in such a case, the QDS record-reader returns whatever it read until this point and skips the rest of the file.

Using the Custom Event Listener

Event listeners are invoked on each query creation, completion, and split completion. An event listener enables the development of the query performance and analysis plugins.

At a given point of time, only a single event listener can be active in a Presto cluster.

Perform these steps to install an event listener in the Presto cluster:

  1. Create an event listener. You can use this Presto event listener as a template.

  2. Build a JAR file and upload it to the cloud object store. For example, let us use s3://presto/plugins/event-listener.jar as the cloud object storage location.

  3. Download event-listener.jar on the Presto cluster using the Presto Server Bootstrap. You can add the Presto bootstrap properties as Presto overrides in the Presto cluster to download the JAR file. For downloading event-listener.jar, pass the following bootstrap properties as Presto overrides through the Override Presto Configuration UI option in the cluster’s Advanced Configuration tab.

    bootstrap.properties:
    
    mkdir /usr/lib/presto/plugin/event-listener
    cd /usr/lib/presto/plugin/event-listener
    hadoop fs -get s3://presto/plugins/event-listener.jar
    
  4. Configure Presto to use the event-listener through the Override Presto Configuration UI option in the cluster’s Advanced Configuration tab as shown below.

    event-listener.properties:
    event-listener.name=my-event-listener
    
  5. Restart the Presto cluster.

Proactively Removing Unhealthy Cluster Nodes

The configuration property for removing unhealthy cluster nodes, ascm.bad-node-removal, is set to true by default to maintain better cluster health. You can disable it by passing ascm.bad-node-removal=false as a Presto cluster override.

By default, this service runs periodically every minute. You can set a different time by changing its value using the ascm.bad-node-removal.interval configuration property.

This feature identifies unhealthy worker nodes based on different triggers and gracefully shuts down such unhealthy nodes. The various triggers are:

  • Worker nodes with a ratio of the open file descriptor count to the maximum file descriptor count higher than the threshold value defaulting to 0.9. You can set the threshold value using the ascm.bad-node-removal.file-descriptor-max-threshold parameter, and its supported value range is 0.0 - 1.0.

  • Worker nodes whose disk space usage ratio is greater than a threshold value defaulting to 0.95. You can set the threshold value using the ascm.bad-node-removal.disk-usage-max-threshold parameter, and its supported value range is 0.0 - 1.0. Pass ascm.bad-node-removal.disk-space-usage-policy=false as a Presto cluster override if you want to disable the disk space usage trigger.