Understanding Automatic Statistics Collection in Qubole¶
Presto, Apache Spark, and Apache Hive can generate more efficient query plans with table statistics. For example, Spark, as of version 2.1.1, performs broadcast joins only if the table size is available in the table statistics stored in the Hive Metastore (see spark.sql.autoBroadcastJoinThreshold). Broadcast joins can have a dramatic impact on the run time of everyday SQL queries where small dimension tables are joined frequently. The Big Bench tuning exercise from Intel reported a 4.22x speedup by turning on broadcast joins for specific queries.
In Qubole Data Service (QDS), all query engines use Hive Metastore as the catalog. If the Hive Metastore contains statistics, then all query engines can use them for query planning as explained above. But table statistics collection is not automatic. One of the goals of the Qubole platform is to apply automation to help users achieve the maximum performance and efficiency. This topic describes how Qubole has automated Table Statistics collection in QDS.
Understanding the Pros and Cons of Apache Hive Table Statistics¶
The Apache Hive Statistics wiki page describes the list of statistics that can be computed and stored in the Hive metastore. Statistics collection is not automated and the pros and cons are described in this table.
|User sets hive.stats.autogather=true to gather statistics automatically during INSERT OVERWRITE queries.||Supported in Hive||
It does not catch direct writes to cloud object storage locations for external tables.
User schedules ANALYZE TABLE COMPUTE STATISTICS
User picks up a few tables as wanted to keep stats updated and then uses Qubole Scheduler (or an external cron job) to run the COMPUTE STATISTICS statement on these tables periodically.
In a bid to overcome the disadvantages of table statistics of open source described above, Qubole has decided to provide automated statistics collection into the QDS platform as a service.
Understanding Qubole Statistics Collection Service¶
These are the requirements for using the collection service:
- Use the
ANALYZE COMPUTE STATISTICSstatement in Apache Hive to collect statistics.
ANALYZEstatements for DML and DDL statements that create tables or insert data on any query engine.
ANALYZEstatements must be transparent and not affect the performance of DML statements.
ANALYZE COMPUTE STATISTICScomes in three flavors in Apache Hive.
The three flavors of
ANALYZE COMPUTE STATISTICS are described in this table.
|SQL Statement Example||Metadata Collected||Impact on Cluster|
||Number of files and physical size in bytes||It is very fast and nearly zero impact on the cluster’s running workloads.|
||Number of files, physical size in bytes, and number of rows||It scans the entire table or partition specified and has moderate impact.|
|ANALYZE TABLE [db_name.]tablename
||Number of distinct values, NULL values, and TRUE and FALSE (BOOLEANS case). Minimum/maximum values and average/maximum length of the column.||It has a significant impact.|
Each flavor requires different compute resources. Qubole allows you to decide the appropriate statistics for each table. currently, you can choose the flavor at the account level. In the future, Qubole plans to allow you to configure the flavor at a table or schema level. Create a ticket with Qubole Support to enable this feature on the QDS account.
This architectural diagram shows how
ANALYZE COMPUTE STATISTICS statements are triggered in QDS.
Here is the command’s computation process:
- A user issues a Hive or Spark command.
- If this command is a DML or DDL statement, the metastore is updated.
- A custom MetastoreEventListener is triggered.
- The command is run just like any other Hive command.
Qubole has seen a significant performance boost for analytic queries in Apache Hive and SparkSQL using table statistics and it is working with the Presto community to incorporate statistics in the Presto query planning as well.
Qubole has recently announced the launch of AIR (Alerts, Insights, Recommendations). Automatic Statistics Collection plays a key role in the Qubole’s autonomous data platform. Some examples of use cases within AIR are:
- Viewing table statistics in the preview.
- The QDS platform can give recommendations and insights on better partitioning or data organization strategies to data engineers, architects, or admins based on statistics.