Data Model

Data in QDS HIve is organized as tables and table partitions. These tables and partitions can either be created from data that you already have in Cloud storage, or can be generated as an output of running Hive queries. QDS uses HiveQL to query this data. For a primer on Hive, see the Apache Hivewiki.

The following topics are covered in this section:

Types of Hive Tables

Tables in QDS Hive are backed by data residing either in Cloud storage or in HDFS (Hadoop Distributed File System). The table types available in QDS Hive are:

  • External Tables: These tables are assigned an explicit location by the user. When an external table is dropped, Hive does not delete the data in the location that it points to.
  • Regular Tables: These tables do not have an explicit location attribute and are assigned by one by Hive directly. Hive assigns a location relative to a default location that is fixed for an account. When a regular table is dropped, the data in the table is deleted.
  • Temporary Tables: QDS Hive allows a third form of tables that is deleted automatically once the user’s session is deleted.
  • Mongo backed Tables (not currently supported for Azure): This allows AWS users to create a Hive table whose underlying data resides in a Mongo database collection. When the table is queried, Qubole dynamically fetches the data from the Mongo database. See Mongo Backed Tables for more information.

Usage Scenarios (AWS)

The following examples illustrate how the above classes of table storage can be used in QDS. To go through these examples, use the example dataset derived from the Wikipedia Page Traffic Statistics dataset that is uploaded to the paid-qubole AWS bucket (paid-qubole/default-datasets/miniwikistats/).

The mini-wikistats dataset is smaller than the original: it has data only from the last hour of each day and only for 7 days. This bucket is a public (requester-pays) bucket. The full dataset is available at Wikipedia Page Traffic Statistics.

The examples below can be made to work against the original and larger dataset; just remove the mini prefix.

External Tables

A typical user has pre-existing data in some S3 bucket. An external table can be created over such data to begin analyzing it. For instance:

CREATE EXTERNAL TABLE
miniwikistats (projcode string, pagename string, pageviews int, bytes int)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3n://paid-qubole/default-datasets/miniwikistats/';

This table is used as a reference in subsequent examples. Hence, it is worth diving into the assumptions built into this statement.

  1. The data is assumed to be entirely in text and contains 4 fields that are separated using space character and the rows are separated by newline character.
  2. The data lives under the location s3n://paid-qubole/default-datasets/miniwikistats/.
  3. Finally, the dataset is assumed to be partitioned (by hour for example) so that each hour’s statistics is in a separate subfolder within that location. For example, the statistics for first hour of 20110101 is in the location s3n://paid-qubole/default-datasets/miniwikistats/20110101-01/.
  4. Note that the AWS secret and access keys can be provided as part of the table location url, for example:
s3n://ACCESS_KEY:SECRET_KEY@paid-qubole/default-datasets/miniwikistats/20110101-01/
..note:: The statement above creates a partitioned table. However, it does not populate any partitions in it. Without
populating partitions, the table is empty (even though this S3 location has data). To populate partitions in this table, refer to the section on Partitions. Run the first command mentioned in the Partitions section and continue to the examples mentioned below.

Regular Tables

The user may create derived data sets while analyzing the original data sets and may want to keep them for a long lived period. In such a case, a regular table can be created, which can also be located in S3, for instance:

CREATE TABLE q1_miniwikistats
AS
SELECT projcode, sum(pageviews)
FROM miniwikistats
WHERE dt >= '20110101' AND dt <= '20110105'
GROUP BY projcode;

Temporary Tables

The user may want to force a table to be located in HDFS to take advantage of its higher speed and/or the fact that it is automatically deleted at the end of the session.

CREATE TMP TABLE tmp_stats AS
SELECT * FROM miniwikistats
WHERE projcode LIKE 'ab%' AND dt > '20110101';

The location of all tables created in this manner can be looked up using the describe command.

DESCRIBE FORMATTED tmp_stats;

Configuring Table Storage

Storage for temporary tables in QDS Hive is always in HDFS/Hadoop. Similarly, storage for external tables is relatively simple to understand. The user can explicitly list the entire location. So you have to only worry about configuring the S3 storage for regular tables.

As part of setting up an account, you can set a default location in S3, and credentials used to access (read/write/list) this location. QDS creates regular tables in a subdirectory of this location. (Other subdirectories under the same folder are used to store logs and results). Locations for external tables can specify credentials as part of the location URL. However, this is only required if external table locations are not accessible using the storage credentials specified in the account.

Default Tables

To help you get started, for each account, Qubole creates some read-only tables, by default. These tables are backed by a public requester-pays bucket in Amazon. The tables created right now are as follows:

  • default_qubole_demo_airline_origin_destination: This has a subset of the data described here .
  • default_qubole_memetracker: 96 million memes collected between 2008 and 2009 as described here .

Partitions

Large data sets are typically divided into directories. Directories map to partitions in Hive. Currently, partitions in Hive must be populated manually using the following command (picking up from the previous create external table example):

ALTER TABLE miniwikistats RECOVER PARTITIONS;

Following this command, all the data in the location s3n://paid-qubole/default-datasets/miniwikistats/ can be queried through the table miniwikistats via the Analyze page.

Note

The partitions are created only for the subdirectories that exist while running this command. If more directories are added later, for example if the data is being continuously loaded, this command must be run again. To see the partitions of that table, execute the following statement:

SHOW PARTITIONS miniwikistats;

Qubole’s version of recover partitions also supports recovering specific partitions using patterns. For example, consider the directory structure of the miniwikistats dataset.

/default-datasets/miniwikistats/20110101-01
/default-datasets/miniwikistats/20110101-02
/default-datasets/miniwikistats/20110102-01
/default-datasets/miniwikistats/20110102-02
/default-datasets/miniwikistats/20110103-01
/default-datasets/miniwikistats/20110103-02
/default-datasets/miniwikistats/20110104-01
/default-datasets/miniwikistats/20110104-02
/default-datasets/miniwikistats/20110105-01
/default-datasets/miniwikistats/20110105-02

A table pointing to the miniwikistats dataset directory

CREATE EXTERNAL TABLE slice_miniwikistats ...
LOCATION 's3n://paid-qubole/default-datasets/miniwikistats/'

Then

ALTER TABLE slice_miniwikistats RECOVER PARTITIONS LIKE '20110102*'

Altering table recovers two partitions only (20110102-01 and 20110102-02).