Hive

The new features and enhancements are:

Other enhancements and bug fixes are listed in:

Automatic Statistics Collection on Hive Queries

QHIVE-4562: Qubole has added the following enhancements to Automatic Statistics Collection from Hive queries:

  • The Automatic Statistics Collection feature is now supported in Hive 3.1.1 (beta).
  • The automated statistics query now runs in Hive-on-coordinator mode on the maintenance cluster.
  • You can now use wildcard patterns to filter the tables that are considered for a statistics refresh.
  • The user’s Hive bootstrap is now executed before running Automatic Statistics queries.

QHIVE-4839: Hive statistics auto-gather is available for basic statistics and column statistics, but only Qubole Support can enable it. It collects statistics on INSERT and INSERT OVERWRITE queries. Via Support | Cluster Restart Required

HiveServer2 Enhancements

These are the enhancements:

  • QHIVE-3996: The HiveServer2 query execution latency is reduced by 1-1.5 seconds.

  • QHIVE-4786: You can now configure HS2 clusters to use private IP for communication between the coordinator node and worker nodes.

    To enable:

Multi-instance HiveServer2 Enhancements

These are the enhancements:

Hive ACID Enhancements

Qubole has added enhancements to the Hive ACID feature, which are:

  • QHIVE-4707: Hive Streaming API is now supported with Hive 3.1.1 (beta). There is a limitation in the case of blob stores: there must be one transaction per batch size because blob stores do not support partial writes.

    Note

    Blob stores in this context should not be confused with Microsoft Azure Blob Storage. For information about Hive and blob storage, see the Apache Hive documentation.

  • QHIVE-4840: Qubole has introduced an enhancement that lets you delay the cleanup of obsolete data after compaction. To use this enhancement, set hive.compactor.delayed.cleanup.enabled=true. You can also configure the cleanup delay using the CLEANER_RETENTION_TIME_SECONDS table property. Disabled | Cluster Restart Required
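
The table property mentioned in QHIVE-4840 could be set with a standard ALTER TABLE statement. The following is a minimal sketch, not a confirmed procedure: the table name sales_acid and the value 3600 are hypothetical, and it assumes the property key is used exactly as written above with a value in seconds, as its name suggests. The hive.compactor.delayed.cleanup.enabled setting is not shown because, per the note above, it requires a cluster restart and so is presumably applied at the cluster level rather than in a session.

```sql
-- Hypothetical transactional table, used only for illustration.
-- Assumption: the property key is taken verbatim from the release note
-- and the value is interpreted as a number of seconds.
ALTER TABLE sales_acid
  SET TBLPROPERTIES ('CLEANER_RETENTION_TIME_SECONDS' = '3600');

-- Confirm that the property was recorded on the table.
SHOW TBLPROPERTIES sales_acid;
```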

Faster Downscaling

QHIVE-4740: Qubole supports a custom Tez shuffle handler in Hive 3.1.1 (beta), which can speed up the downscaling of worker nodes in a Hadoop (Hive) cluster. Via Support | Cluster Restart Required

For more information, see the documentation.

Enhancements

  • QHIVE-2601: From Hive 3.1.1 (beta) onwards, Qubole supports merging small files at the end of MapReduce jobs and Tez DAGs.

  • QHIVE-4683: Qubole has added path validation to the ALTER TABLE RECOVER PARTITIONS command. For more information, see the documentation.

  • QHIVE-4829: Qubole has added support for the Surrogate Keys function in Hive 3.1.1 (beta). For more details, see HIVE-20536.

  • QHIVE-4834: Qubole has backported an open-source fix for the vectorized limit operator returning an incorrect number of results when an offset is used. Related open-source Jira: HIVE-22164.

  • QHIVE-4856: In Tez, you can use the hive-exec JAR that is locally available on cluster nodes. This reduces localization overhead and improves efficiency by avoiding additional HDFS operations.

  • QHIVE-4966: To reduce AWS read API calls in Hive 3.1.1 (beta), Qubole has changed the default values of the following configuration properties:

    • mapred.min.split.size=256MB
    • mapred.max.split.size=256MB

  • QHIVE-4873: Qubole has backported open-source fixes for an issue where Hive queries that have a JOIN condition involving date, timestamp, or INTERVAL values fail with a SemanticException.

    Related open-source Hive jira issues:

  • QHIVE-5020: Qubole provides an option to disable running Hive commands on a Presto cluster. Via Support | Cluster Restart Required

  • QTEZ-473: Qubole has optimized the Tez 0.9.1 UI to run faster (OSS TEZ-4085).
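
Several of the Hive 3.1.1 (beta) enhancements above can be tried from a Hive session. The following is a minimal sketch under stated assumptions, not Qubole's documented procedure. The hive.merge.* names are the standard Apache Hive properties for small-file merging; whether Qubole requires them to be set explicitly is an assumption. The 128 MB merge threshold and the table customer_dim are hypothetical. The split-size properties take values in bytes, so 256 MB is written as 268435456; they are set here only to make the byte values explicit, because the new defaults already apply.

```sql
-- QHIVE-2601: merge small output files at the end of jobs.
-- Standard Apache Hive properties (assumed to apply unchanged on Qubole).
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.tezfiles=true;
-- Hypothetical threshold: merge when the average output file is under 128 MB.
SET hive.merge.smallfiles.avgsize=134217728;

-- QHIVE-4966: the new defaults, expressed in bytes (256 * 1024 * 1024).
SET mapred.min.split.size=268435456;
SET mapred.max.split.size=268435456;

-- QHIVE-4829: surrogate keys (HIVE-20536).
-- SURROGATE_KEY() is used as a column default on a transactional table.
-- The table and column names are hypothetical.
CREATE TABLE customer_dim (
  id   BIGINT DEFAULT SURROGATE_KEY(),
  name STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Omit the id column so that the default generates a key for each row.
INSERT INTO customer_dim (name) VALUES ('alice'), ('bob');

SELECT id, name FROM customer_dim;
```

The generated keys are unique but are not guaranteed to be consecutive, so they should not be treated as a gap-free sequence.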

Bug Fixes

  • QHIVE-4807: Fixed an error case in MapJoin conversion when no table is selected as a big table (OSS HIVE-22201).
  • QHIVE-4839: Fixed an issue with Hive statistics auto-gather feature that occurred during a multi INSERT Hive query.
  • QHIVE-4849: Qubole has changed the time zone in the Tez UI to UTC and the time format to D days, H hours. This eliminates differences in the time format and the time zone between ResourceManager and Tez.
  • QHIVE-4885: When an InvalidProtocolBufferException occurs while reading the PostScript of an ORC file, the ORC filename is now printed along with the error so that you can inspect the file and verify that it has a valid PostScript.
  • QHIVE-4925: Qubole has upgraded the commons-lang3 version to 3.4. This fixes an issue which caused Hive queries to fail with the java.lang.NoSuchMethodError: org.apache.commons.lang3.StringUtils.isNoneEmpty error.
  • QHIVE-4978: Fixed an issue where the number of running Auto Statistics commands exceeded the limit set for the account.
  • QHIVE-4996: Fixed an issue where the Auto Statistics command was not triggered by INSERT and INSERT OVERWRITE queries.
  • QHIVE-5010: Fixed an issue where the Auto Statistics command was not triggered if two accounts used the same cluster tag for the maintenance cluster.

For a list of bug fixes between versions R57 and R58, see Changelog for api.qubole.com.