Introduction
Hive is an Apache open-source project built on top of Hadoop for querying, summarizing, and analyzing large data sets through a SQL-like interface. It brings the familiarity of relational technology to Big Data processing through its Hive Query Language and through structures and operations comparable to those of relational databases, such as tables, JOINs, and partitions.
Apache Hive accepts Hive Query Language statements (similar to SQL) and converts them to Apache Tez jobs. Apache Tez is an application framework that can run complex pipelines of operators to process data. It replaces MapReduce as the execution engine.
Hive’s architecture is optimized for batch processing of large ETL jobs and batch SQL queries on very large data sets. Hive features include:
Metastore: The Hive metastore stores the metadata for Hive tables and partitions in a relational database. The metastore provides client access to the information it contains through the metastore service API.
Hive Client and HiveServer2: Users submit Hive Query Language (HQL) statements to the Hive Client or HiveServer2 (HS2). Each acts as a controller and manages the query lifecycle. After a query completes, the results are returned to the user. HS2 is a long-running daemon that implements many features to speed up the planning and optimization of HQL queries. HS2 also supports sessions, which provide features such as temporary tables that are useful for ETL jobs.
Checkpointing of intermediate results: Apache Hive and Apache Tez checkpoint the intermediate results of some stages and store them in HDFS. Checkpointing allows fast recovery when tasks fail, because Hive can restart tasks from the previous checkpoint.
Speculative Execution: Speculative execution improves query speed by running a duplicate attempt of a task that is lagging because of hardware or networking issues and using the result of whichever attempt finishes first.
Fair Scheduler: Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When a single job is running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job gets.
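The weighting described for the Fair Scheduler comes down to simple arithmetic. The following Python sketch is only an illustration of that arithmetic, not the scheduler's actual implementation. The function name fair_shares, the job names, and the weights are all hypothetical.

```python
from fractions import Fraction


def fair_shares(weights):
    """Return each job's fraction of total compute time, proportional to its weight.

    `weights` maps a (hypothetical) job name to a positive priority weight.
    This is an illustrative calculation only, not Hadoop scheduler code.
    """
    if not weights:
        return {}
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    # Fractions keep the shares exact, so they always sum to exactly 1.
    total = sum(Fraction(w) for w in weights.values())
    return {job: Fraction(w) / total for job, w in weights.items()}


# A single running job uses the entire cluster.
single = fair_shares({"etl_nightly": 1})

# Three jobs with equal weights each get the same share.
equal = fair_shares({"etl_nightly": 1, "adhoc_a": 1, "adhoc_b": 1})

# A job with weight 2 gets twice the share of a job with weight 1.
weighted = fair_shares({"etl_nightly": 2, "adhoc_a": 1, "adhoc_b": 1})

for name, shares in [("single", single), ("equal", equal), ("weighted", weighted)]:
    print(name, {job: str(share) for job, share in shares.items()})
```

In this sketch the shares are recomputed from scratch whenever the set of jobs changes, which mirrors the idea that freed task slots are handed to newly submitted jobs until each job holds its weighted share.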
About Qubole Hive
Qubole’s Hive distribution is derived from Apache Hive versions 0.13, 1.2.0, 2.1.1, 2.3, and 3.1, with a few differences in functionality. Qubole Hive is a self-managing and self-optimizing implementation of Apache Hive.
Qubole Hive:
Runs on your choice of popular public cloud providers
Leverages the QDS platform’s AIR (Alerts, Insights, Recommendations) capabilities to help data teams focus on outcomes instead of the platform. For more information on AIR, see How to Use Auto-Completion and Suggestions and Getting Data Model Insights and Recommendations.
Has agent technology that augments original Hive with a self-managing and self-optimizing platform
Is cloud-optimized for faster workload performance
Is easier to integrate with existing data sources and tools
Provides best-in-class security
Understanding Hive Versions describes the different versions of Hive supported on QDS.