Understanding the Hive Data Model¶
Data in QDS Hive is organized as tables and table partitions. These tables and partitions can either be created from data that you already have in Cloud storage, or can be generated as an output of running Hive queries. QDS uses HiveQL to query this data. For a primer on Hive, see the Apache Hive wiki.
The following topics are covered in this section:
- Types of Hive Tables
- Usage Scenarios
Types of Hive Tables¶
Tables in QDS Hive are backed by data residing either in Cloud storage or in HDFS (Hadoop Distributed File System). The table types available in QDS Hive are:
- External Tables: These tables are assigned an explicit location by the user. When an external table is dropped, Hive does not delete the data in the location that it points to.
- Regular Tables: These tables do not have an explicit location attribute and are assigned by one by Hive directly. Hive assigns a location relative to a default location that is fixed for an account. When a regular table is dropped, the data in the table is deleted.
- Temporary Tables: QDS Hive allows a third form of tables that is deleted automatically once the user’s session is deleted.
- Mongo backed Tables : You can create a Hive table whose underlying data resides in a Mongo database collection. When the table is queried, Qubole dynamically fetches the data from the Mongo database. See Mongo Backed Tables for more information.
The following examples illustrate how the above classes of table storage can be used in QDS.
To go through these examples, use the example dataset derived from the
Wikipedia Page Traffic Statistics dataset that is uploaded to the
paid-qubole bucket (for example,
s3n://paid-qubole/default-datasets/miniwikistats/ on AWS or
wasb://firstname.lastname@example.org/miniwikistats/ in Azure blob storage).
mini-wikistats dataset is smaller than the original: it has data only from the last hour of
each day and only for 7 days. This is a public bucket. The full dataset is
available at Wikipedia Page Traffic Statistics.
The examples below can be made to work against the original,larger dataset; just remove the
Given pre-existing data in a bucket in Cloud storage, we can create an external table over the data to begin analyzing it, as in the example(s) that follow.
CREATE EXTERNAL TABLE miniwikistats (projcode string, pagename string, pageviews int, bytes int) PARTITIONED BY(dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION 's3n://paid-qubole/default-datasets/miniwikistats/';
Azure Blob Example:
CREATE EXTERNAL TABLE miniwikistats (projcode string, pagename string, pageviews int, bytes int) PARTITIONED BY(dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION 'wasb://[email protected]/miniwikistats/';
This table is used as a reference in subsequent examples. It assumes the following:
- The data is entirely in text and contains four fields that are separated using the space character; the rows are separated by the newline character.
- The data lives under
<scheme>is the Cloud-specific URI and path, as in the examples above.
- The dataset is partitioned (by hour for example) so that each hour’s statistics are in a
separate subfolder. For example, the statistics for first hour of 20110101 are in
The CREATE statement above creates a partitioned table, but it does not populate any partitions in it, so the table is empty (even though this Cloud location has data). To populate the partitions in this table, see Partitions: run the first command in that section and then continue with the examples below.
On AWS you can optionally provide the secret and access keys as part of the table location URL, for example:
You might create derived data sets while analyzing the original data sets and might want to keep them for a period. In this case, you can create a regular table in Cloud storage, for instance:
CREATE TABLE q1_miniwikistats AS SELECT projcode, sum(pageviews) FROM miniwikistats WHERE dt >= '20110101' AND dt <= '20110105' GROUP BY projcode;
You may want to force a table to reside in HDFS. Such tables provide faster throughput, but bear in mind that they are automatically deleted at the end of the session.
You can use either TMP or TEMPORARY when creating temporary tables in QDS. See How can I create a table in HDFS? for a discussion of the differences.
CREATE TMP TABLE tmp_stats AS SELECT * FROM miniwikistats WHERE projcode LIKE 'ab%' AND dt > '20110101';
You can look up the location of all tables created in this manner by using
DESCRIBE FORMATTED tmp_stats;
You do not need to configure storage for temporary tables (always in HDFS) or external tables (which you can see by explicitly listing the entire location).
Regular tables are handled as follows:
As part of setting up an account, you can set a default location in your Cloud storage, with the credentials needed to access this location (read/write/list). This location is used to store logs and results, and by default, QDS creates regular tables in a subdirectory of this same location, so you don’t need to supply the credentials again to get access to the tables. If you choose to create external tables in a location that is not accessible via the account’s storage credentials, you can specify the credentials as part of the location URL.
To help you get started, QDS creates some read-only tables for each account. These tables are backed by a public requester-pays bucket on AWS and similar object storage on Azure and Oracle OCI. The tables are as follows:
Large data sets are typically divided into directories. Directories map to partitions in Hive. Currently, partitions in Hive must be populated manually by means of a command such as the following (which uses the table from the example under External Tables):
ALTER TABLE miniwikistats RECOVER PARTITIONS;
Now all the data in the location
<scheme>/miniwikistats/ can be
queried through the miniwikistats table via the Analyze page in the QDS UI. (
<scheme> is the Cloud-specific URI and path,
as in the examples above.)
To see the partitions:
SHOW PARTITIONS miniwikistats;
Partitions are created only for the subdirectories that exist when the command runs. If more directories are added later (for example if the data is being continuously loaded), run the command again.
Qubole’s version of
recover partitions also supports recovering specific partitions using patterns.
ALTER TABLE RECOVER PARTITIONS command removes partitions pointing to a custom location.
MSCK REPAIR TABLE command supports partitions pointing to a custom location.
For example, given these directories in the miniwikistats dataset,
/default-datasets/miniwikistats/20110101-01 /default-datasets/miniwikistats/20110101-02 /default-datasets/miniwikistats/20110102-01 /default-datasets/miniwikistats/20110102-02 /default-datasets/miniwikistats/20110103-01 /default-datasets/miniwikistats/20110103-02 /default-datasets/miniwikistats/20110104-01 /default-datasets/miniwikistats/20110104-02 /default-datasets/miniwikistats/20110105-01 /default-datasets/miniwikistats/20110105-02
you can create a table as follows:
CREATE EXTERNAL TABLE slice_miniwikistats ... LOCATION 's3n://paid-qubole/default-datasets/miniwikistats/'
for Azure Blob:
ALTER TABLE slice_miniwikistats RECOVER PARTITIONS LIKE '20110102*'
recovers two partitions only (20110102-01 and 20110102-02).