Optimizing Hive Queries

This section describes optimizations related to Hive queries.

From Hive 3.1.1 (beta) onwards, Qubole supports merging small files at the end of MapReduce jobs and Tez DAGs.

AWS S3 Listing Optimization

As part of the split computation, Hive needs to list all files in the table’s S3 location. The implementation in Apache Hadoop for listing files in S3 is very slow. QDS incorporates optimizations to speed it up. The following parameter is enabled by default for an AWS implementation.

set fs.s3.inputpathprocessor=true;

Changing SerDe to DelimitedJSONSerDe for Results with Complex Datatypes

Qubole Hive converts all SELECT queries to INSERT-OVERWRITE-DIRECTORY (IOD) format to save results back to a Cloud location.

When writing data to a directory, Apache Hive uses LazySimpleSerde for serialization (writing) of results/data. But LazySimpleSerde does not honor a <key,value> structure and ignores keys. Honoring the keys plays an important role in displaying data-typed columns.

In Qubole Hive, setting hive.qubole.directory.serde=org.apache.hadoop.hive.serde2.DelimitedJSONSerDe changes the SerDe to DelimitedJSONSerDe, which honors more complex datatypes such as Maps and Arrays. This configuration when set, helps you to view and use query results with complex datatypes correctly.

Handling Direct Writes of INSERT OVERWRITE Query Results

For INSERT OVERWRITE queries, Qubole Hive allows you to directly write the results to S3. While Apache Hive writes data to a temporary location and move them to S3. In S3, moving data is expensive (involves copy and delete operations). So, directly writing the INSERT OVERWRITE query results to S3 is an optimization that Qubole Hive offers you. However, there is an issue that you may face while writing INSERT OVERWRITE query results to S3 that is described below.

While writing to a partitioned and bucketed table using insert overwrite command, there is a good chance that multiple reducers simultaneously write the result file to a specific table’s location in S3. As it is an INSERT OVERWRITE command, the existing files written before the current job in the table’s location are deleted before the reducer tasks write new result files. In this situation, a scenario can occur where the delete request sent to S3 by one reduce task (say R1) gets throttled due to the S3 issue and by this time, another reduce task (say R2) deletes the old files and writes a new file. The delete request sent by reduce task R1 is now processed by S3 and it ends up deleting the file written by reduce task R2. To overcome this issue, Qubole provides an enhancement to avoid files from being deleted by reduce tasks. The enhancement is not enabled on the QDS account by default. To enable it on the account, create a ticket with Qubole Support.

When the enhancement is enabled, a prefix which is unique per job is added to the result files. This ensures that only old files which do not have the latest prefix are deleted. Thus it solves the data loss issue which can happen due to S3 throttling when multiple reducers try to simultaneously write query results.