JOIN Optimizations

Presto on Qubole (version 0.208 and later) has the ability to do stats-based determination of the JOIN distribution type (between BROADCAST and PARTITIONED) and JOIN reordering by the following methods:

Using the Hive Metastore
Using Table Size

Using the Hive Metastore

When table statistics are present in the Hive metastore, Presto’s cost-based-optimizer tries to optimize a query plan by choosing the right type of the JOIN implementation on the basis of memory, CPU, and network cost for every JOIN node in the plan. As the schema evolves, statistics must be generated, maintained, and updated for correct estimates. To address the collection and maintenance of the statistics, Qubole provides an Automatic Stats Collection framework.

For using the automatic JOIN type determination, you can set:

join-distribution-type=AUTOMATIC in config.properties at the cluster level
join_distribution_type=AUTOMATIC in session properties

Note

This feature cannot be used if the property distributed-join is already set in the session or cluster config.properties.

For using the automatic JOIN reordering, you can set:

optimizer.join-reordering-strategy=AUTOMATIC in config.properties at the cluster level
join_reordering_strategy=AUTOMATIC in session properties

Using Table Size

Qubole has also introduced the notion of estimating table statistics on the basis of the table’s size on the storage layer. It is useful in cases where statistics for tables are not available. If Presto on Qubole is unable to find table statistics, it can fetch the size of the table on the storage layer and estimate the size and the number of rows in the table. This estimate can currently be used to determine the JOIN distribution type and reordering of tables in a multi-JOIN scenario.

Note

This feature is part of Gradual Rollout and it is only available for Hive tables.

For enabling table size based stats to determine join distribution type, you can set:

At the cluster level under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration in the cluster’s UI.
```
config.properties:
join-distribution-type=AUTOMATIC
enable-file-size-stats-join-type=true
```

In a query, as session properties

join_distribution_type=AUTOMATIC
enable_file_size_stats_join_type=true

For enabling table size-based stats to determine JOIN order, you can set:

At the cluster level under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration in the cluster’s UI.

config.properties:
optimizer.join-reordering-strategy=AUTOMATIC
join-distribution-type=AUTOMATIC
enable-file-size-stats-join-reorder=true

In a query as session properties

join_reordering_strategy=AUTOMATIC
join_distribution_type=AUTOMATIC
enable_file_size_stats_join_reorder=true

Configuring Timeout for Fetching Hive Table Size

hive.table-size-listing-timeout is the property that you can use to set the timeout for listing Hive table sizes. The determination of a table size on the storage layer is performed through a listing on the table’s storage location. The listing operation is bound by a timeout to avoid any significant delays in the query execution time. This configuration controls the timeout for the listing operation.

An example usage of this configuration is adding hive.table-size-listing-timeout=2s to the Hive catalog properties. It would mean that the listing operation on a table’s storage location is bound to complete in 2 seconds. If it does not finish within the timeout, the table is considered to be very large.

Optimizing Null Filters in JOINs

Optimization is added to push null filters to table scans by inferring them from the JOIN criteria of equi-joins and semi-joins in Presto version 317 and later. You can enable it through optimize-nulls-in-joins as a Presto cluster override or optimize_nulls_in_join as a session property. Use this enhancement to reduce the cost of performing JOIN operations when JOIN columns contain a significant number of NULLs.

Metrics for JOINS in a Query

The distribution type of JOINS in a query is visible in the Presto query info under the joinDistributionStats key name.