Considerations for using Hive ACID Data Source for Spark

There are certain operational considerations you should be aware of when using the Hive ACID Data Source for Spark.

Transactional Guarantees

  • Spark transactional read guarantees are at the DataFrame level. When a DataFrame is first created, a snapshot of the table is acquired; all subsequent reads of that DataFrame return the same snapshot (see the sketch after this list).

  • Spark on Qubole does not support transactional guarantees at the SQL-statement level as PrestoSQL and Hive do. Reading the same table twice within a single SQL statement can therefore resolve to two different snapshots of the table, for example in the self-join SELECT * FROM T1 a JOIN T1 b ON a.id1 = b.id2.

  • The guarantees do not fully hold when S3 is used as the storage subsystem, because of its eventually consistent nature. The underlying design essentially obtains the list of valid write IDs from the Hive Metastore and then uses the object store's listing facility to find the valid files to read. Because S3 listing is eventually consistent, it might not return all of those files, so it is recommended to use S3Guard for proper transactional guarantees.
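
The DataFrame-level snapshot behavior can be demonstrated in a short spark-shell session. The following is a minimal sketch, assuming the HiveAcid data source format from the Qubole spark-acid library and a hypothetical ACID table default.t1:

    // A snapshot of default.t1 is acquired when the DataFrame is created.
    val df = spark.read.format("HiveAcid")
      .option("table", "default.t1")
      .load()

    df.count()   // reads the acquired snapshot

    // Rows committed to default.t1 by other writers after the snapshot was
    // acquired are NOT visible through df; every action on df returns the
    // same snapshot.
    df.count()   // same result as above

    // Creating a new DataFrame acquires a fresh snapshot that does reflect
    // newly committed data.
    val df2 = spark.read.format("HiveAcid")
      .option("table", "default.t1")
      .load()
    df2.count()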

Compaction and Cleaner

Spark reads acquire a snapshot at the DataFrame level, and the files in that snapshot need to be protected for as long as the DataFrame is in use. There are two ways to achieve this: acquire a lock at the partition level, or avoid deleting files that are in use. Spark does not acquire locks, because the lifetime of a DataFrame can be very long and a lock would block inserts on the table. To protect the snapshot files from being deleted, the cleaner (run by the Hive Metastore Server) must therefore be disabled while the reads are performed (see the sketch after the list below). Disabling the cleaner has no performance implication for reads or writes; only the storage requirement is higher, because the system may hold on to all in-use snapshot copies.

  • Use ALTER TABLE T1 SET TBLPROPERTIES('NO_CLEANER' = 'true') to disable the cleaner.

    For more information, see Compaction of Hive Transaction Delta Directories.

  • Operations such as ALTER TABLE DROP COLUMN, TRUNCATE TABLE, and DROP TABLE, which delete files directly rather than through the cleaner, must be disallowed while reading from Spark.
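
Putting these points together, a long-running Spark read is typically bracketed by cleaner toggling. The following is a sketch only; it assumes the ALTER TABLE statements are issued from a Hive connection (for example, beeline) rather than on the Spark cluster, that setting NO_CLEANER back to 'false' re-enables the cleaner, and that the table default.t1 is a hypothetical example:

    // 1. In Hive (e.g. beeline), disable the cleaner so that in-use
    //    snapshot files are not deleted:
    //      ALTER TABLE default.t1 SET TBLPROPERTIES('NO_CLEANER' = 'true');

    // 2. In Spark, acquire the snapshot and run the long-lived reads.
    val df = spark.read.format("HiveAcid")
      .option("table", "default.t1")
      .load()
    val rows = df.count()   // safe: the snapshot files are retained

    // 3. Back in Hive, once no DataFrame references the snapshot, re-enable
    //    the cleaner (assuming 'false' restores the default behavior):
    //      ALTER TABLE default.t1 SET TBLPROPERTIES('NO_CLEANER' = 'false');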

Hive in Spark Cluster

To use a Spark cluster with Hive ACID tables, the cluster must have HMS 3.1.1. Consequently, running Hive commands on the Spark cluster is not supported.
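
For illustration, a Spark session on such a cluster might be configured as below. This is a sketch under stated assumptions: the metastore URI is a hypothetical placeholder, and the spark.sql.extensions class name is taken from the Qubole spark-acid library; all ACID table access then goes through the data source rather than through Hive commands.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of a session pointed at an HMS 3.1.1 endpoint; the
    // thrift URI below is a placeholder for the actual metastore host.
    val spark = SparkSession.builder()
      .appName("hive-acid-example")
      .config("spark.hadoop.hive.metastore.uris", "thrift://<hms-host>:9083")
      .config("spark.sql.extensions",
        "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")
      .enableHiveSupport()
      .getOrCreate()

    // ACID tables are accessed through the data source, not Hive commands;
    // default.t1 is a hypothetical table.
    val df = spark.read.format("HiveAcid").option("table", "default.t1").load()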