Data in QDS Hive is organized as tables and table partitions. These tables and partitions can either be created from data that you already have in Cloud storage, or can be generated as an output of running Hive queries. QDS uses HiveQL to query this data. For a primer on Hive, see the Apache Hivewiki.
The following topics are covered in this section:
- Types of Hive Tables
- Usage Scenarios (AWS):
Types of Hive Tables¶
Tables in QDS Hive are backed by data residing either in Cloud storage or in HDFS (Hadoop Distributed File System). The table types available in QDS Hive are:
- External Tables: These tables are assigned an explicit location by the user. When an external table is dropped, Hive does not delete the data in the location that it points to.
- Regular Tables: These tables do not have an explicit location attribute and are assigned by one by Hive directly. Hive assigns a location relative to a default location that is fixed for an account. When a regular table is dropped, the data in the table is deleted.
- Temporary Tables: QDS Hive allows a third form of tables that is deleted automatically once the user’s session is deleted.
- Mongo backed Tables (currently AWS only): This allows AWS users to create a Hive table whose underlying data resides in a Mongo database collection. When the table is queried, Qubole dynamically fetches the data from the Mongo database. See Mongo Backed Tables for more information.
Usage Scenarios (AWS)¶
The following examples illustrate how the above classes of table storage can be used in QDS on AWS.
To go through these examples, use the example dataset derived from the
Wikipedia Page Traffic Statistics dataset that is uploaded to the
paid-qubole AWS bucket (
mini-wikistats dataset is smaller than the original: it has data only from the last hour of
each day and only for 7 days. This bucket is a public (requester-pays) bucket. The full dataset is
available at Wikipedia Page Traffic Statistics.
The examples below can be made to work against the original and larger dataset; just remove the
A typical user has pre-existing data in some S3 bucket. An external table can be created over such data to begin analyzing it. For instance:
CREATE EXTERNAL TABLE miniwikistats (projcode string, pagename string, pageviews int, bytes int) PARTITIONED BY(dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION 's3n://paid-qubole/default-datasets/miniwikistats/';
This table is used as a reference in subsequent examples. Hence, it is worth diving into the assumptions built into this statement.
- The data is assumed to be entirely in text and contains 4 fields that are separated using space character and the rows are separated by newline character.
- The data lives under the location
- Finally, the dataset is assumed to be partitioned (by hour for example) so that each hour’s statistics is in a
separate subfolder within that location. For example, the statistics for first hour of 20110101 is in the
- Note that the AWS secret and access keys can be provided as part of the table location url, for example:
The statement above creates a partitioned table. However, it does not populate any partitions in it, so the table is empty (even though this S3 location has data). To populate the partitions in this table, see Partitions: run the first command in that section and then continue with the examples below.
The user may create derived data sets while analyzing the original data sets and may want to keep them for a long lived period. In such a case, a regular table can be created, which can also be located in S3, for instance:
CREATE TABLE q1_miniwikistats AS SELECT projcode, sum(pageviews) FROM miniwikistats WHERE dt >= '20110101' AND dt <= '20110105' GROUP BY projcode;
The user may want to force a table to be located in HDFS to take advantage of its higher speed and/or the fact that it is automatically deleted at the end of the session.
CREATE TMP TABLE tmp_stats AS SELECT * FROM miniwikistats WHERE projcode LIKE 'ab%' AND dt > '20110101';
The location of all tables created in this manner can be looked up using the describe command.
DESCRIBE FORMATTED tmp_stats;
Configuring Table Storage¶
Storage for temporary tables in QDS Hive is always in HDFS. Similarly, storage for external tables is relatively simple to understand. The user can explicitly list the entire location. So you have to only worry about configuring the Cloud storage for regular tables.
As part of setting up an account, you can set a default location in your Cloud storage, with the credentials needed to access (read/write/list) this location. QDS creates regular tables in a subdirectory of this location. (Other subdirectories under the same folder are used to store logs and results). Locations for external tables can specify credentials as part of the location URL. However, this is only required if external table locations are not accessible using the storage credentials specified in the account.
Default Tables in AWS¶
To help you get started, for each account, Qubole creates some read-only tables, by default. These tables are backed by a public requester-pays bucket in AWS. The tables are as follows:
Large data sets are typically divided into directories. Directories map to partitions in Hive. Currently, partitions in Hive must be populated manually using the following command (picking up from the previous create external table example):
ALTER TABLE miniwikistats RECOVER PARTITIONS;
Following this command, all the data in the location
s3n://paid-qubole/default-datasets/miniwikistats/ can be
queried through the table miniwikistats via the Analyze page.
The partitions are created only for the subdirectories that exist while running this command. If more directories are added later, for example if the data is being continuously loaded, this command must be run again. To see the partitions of that table, execute the following statement:
SHOW PARTITIONS miniwikistats;
Qubole’s version of
recover partitions also supports recovering specific partitions using patterns. For example,
consider the directory structure of the miniwikistats dataset.
/default-datasets/miniwikistats/20110101-01 /default-datasets/miniwikistats/20110101-02 /default-datasets/miniwikistats/20110102-01 /default-datasets/miniwikistats/20110102-02 /default-datasets/miniwikistats/20110103-01 /default-datasets/miniwikistats/20110103-02 /default-datasets/miniwikistats/20110104-01 /default-datasets/miniwikistats/20110104-02 /default-datasets/miniwikistats/20110105-01 /default-datasets/miniwikistats/20110105-02
A table pointing to the miniwikistats dataset directory
CREATE EXTERNAL TABLE slice_miniwikistats ... LOCATION 's3n://paid-qubole/default-datasets/miniwikistats/'
ALTER TABLE slice_miniwikistats RECOVER PARTITIONS LIKE '20110102*'
Altering table recovers two partitions only (20110102-01 and 20110102-02).