JOIN Optimizations
Presto on Qubole (version 0.208 and later) has the ability to do stats-based determination of the JOIN distribution type (between BROADCAST and PARTITIONED) and JOIN reordering by the following methods:
Using the Hive Metastore
When table statistics are present in the Hive metastore, Presto’s cost-based-optimizer tries to optimize a query plan by choosing the right type of the JOIN implementation on the basis of memory, CPU, and network cost for every JOIN node in the plan. As the schema evolves, statistics must be generated, maintained, and updated for correct estimates. To address the collection and maintenance of the statistics, Qubole provides an Automatic Stats Collection framework.
For using the automatic JOIN type determination, you can set:
join-distribution-type=AUTOMATIC
inconfig.properties
at the cluster leveljoin_distribution_type=AUTOMATIC
in session properties
Note
This feature cannot be used if the property distributed-join
is already set in the session or cluster
config.properties
.
For using the automatic JOIN reordering, you can set:
optimizer.join-reordering-strategy=AUTOMATIC
inconfig.properties
at the cluster leveljoin_reordering_strategy=AUTOMATIC
in session properties
Using Table Size
Qubole has also introduced the notion of estimating table statistics on the basis of the table’s size on the storage layer. It is useful in cases where statistics for tables are not available. If Presto on Qubole is unable to find table statistics, it can fetch the size of the table on the storage layer and estimate the size and the number of rows in the table. This estimate can currently be used to determine the JOIN distribution type and reordering of tables in a multi-JOIN scenario.
Note
This feature is part of Gradual Rollout and it is only available for Hive tables.
For enabling table size based stats to determine join distribution type, you can set:
At the cluster level under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration in the cluster’s UI.
config.properties: join-distribution-type=AUTOMATIC enable-file-size-stats-join-type=true
In a query, as session properties
join_distribution_type=AUTOMATIC enable_file_size_stats_join_type=true
For enabling table size-based stats to determine JOIN order, you can set:
At the cluster level under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration in the cluster’s UI.
config.properties: optimizer.join-reordering-strategy=AUTOMATIC join-distribution-type=AUTOMATIC enable-file-size-stats-join-reorder=true
In a query as session properties
join_reordering_strategy=AUTOMATIC join_distribution_type=AUTOMATIC enable_file_size_stats_join_reorder=true
Configuring Timeout for Fetching Hive Table Size
hive.table-size-listing-timeout
is the property that you can use to set the timeout for listing Hive table sizes.
The determination of a table size on the storage layer is performed through a listing on the table’s storage location. The
listing operation is bound by a timeout to avoid any significant delays in the query execution time. This configuration
controls the timeout for the listing operation.
An example usage of this configuration is adding hive.table-size-listing-timeout=2s
to the Hive catalog properties.
It would mean that the listing operation on a table’s storage location is bound to complete in 2 seconds. If it does not
finish within the timeout, the table is considered to be very large.
Optimizing Null Filters in JOINs
Optimization is added to push null filters to table scans by inferring them from the JOIN criteria of equi-joins and
semi-joins in Presto version 317 and later. You can enable it through optimize-nulls-in-joins
as a Presto cluster override
or optimize_nulls_in_join
as a session property. Use this enhancement to reduce the cost of performing JOIN operations when
JOIN columns contain a significant number of NULLs.
Metrics for JOINS in a Query
The distribution type of JOINS in a query is visible in the Presto query info under the joinDistributionStats
key name.