Hive Datasets as Schedule Dependency

This section describes how schedules can be set up to run only if data is available in Apache Hive tables. Typically, schedules that run Hive commands depend on data in Hive tables.

CREATE EXTERNAL TABLE daily_tick_data (
    date2 STRING,
    open FLOAT,
    close FLOAT,
    high FLOAT,
    low FLOAT,
    volume INT,
    average FLOAT)
PARTITIONED BY (
    stock_exchange STRING,
    stock_symbol STRING,
    year STRING,
    date1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '<scheme>/stock_ticker';

date1 is a date in the format YYYY-MM-DD.

<scheme> is the cloud-dependent URI and path: for example, s3n://paid-qubole/default-datasets (AWS) or wasb://default-datasets@paidqubole.blob.core.windows.net (Azure).

The dataset is available from 2012-07-01. For this example, let us assume that the dataset is updated every day at 1 AM UTC and jobs are scheduled every day at 2 AM UTC.

The following query has to be executed every day:

SELECT stock_symbol, MAX(high), MIN(low), SUM(volume)
FROM daily_tick_data
WHERE date1 = '$yesterday$'
GROUP BY stock_symbol;
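
For instance, for the run with nominal time 2013-01-02 (assuming $yesterday$ resolves to the day before the nominal date), the query would expand to:

SELECT stock_symbol, MAX(high), MIN(low), SUM(volume)
FROM daily_tick_data
WHERE date1 = '2013-01-01'
GROUP BY stock_symbol;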

The following sub-topics provide more information:

Partition Column Values

Qubole has to be informed about the new partitions that are added every day.

In the example, the following partitions are added on 2013-01-02:

stock_exchange = nasdaq, stock_symbol = ibm, year = 2013, date1 = 2013-01-01
stock_exchange = nasdaq, stock_symbol = orcl, year = 2013, date1 = 2013-01-01
stock_exchange = nyse, stock_symbol = ibm, year = 2013, date1 = 2013-01-01
stock_exchange = nyse, stock_symbol = orcl, year = 2013, date1 = 2013-01-01
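
For instance, an upstream load might register these partitions with standard Hive DDL (a sketch; the actual mechanism depends on how the data is ingested):

-- Register the four partitions that arrived on 2013-01-02
ALTER TABLE daily_tick_data ADD IF NOT EXISTS
    PARTITION (stock_exchange = 'nasdaq', stock_symbol = 'ibm', year = '2013', date1 = '2013-01-01')
    PARTITION (stock_exchange = 'nasdaq', stock_symbol = 'orcl', year = '2013', date1 = '2013-01-01')
    PARTITION (stock_exchange = 'nyse', stock_symbol = 'ibm', year = '2013', date1 = '2013-01-01')
    PARTITION (stock_exchange = 'nyse', stock_symbol = 'orcl', year = '2013', date1 = '2013-01-01');

The registered partitions can then be verified with SHOW PARTITIONS daily_tick_data.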

For example, the partition columns can have the following values:

stock_exchange   [nasdaq, nyse]
stock_symbol     [ibm, orcl]
year             %Y
date1            %Y-%m-%d

The above information has to be entered while submitting a job either through the UI or API.

The formats of the partition columns year and date1 do not change from one job to another. They are stored in the Hive metastore and do not need to be specified every time.

The format for date partition columns can be entered through the QDS UI or the API. For more information, see Store Table Properties.
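
A minimal sketch of how these formats could be recorded as table properties follows; the property keys shown are illustrative, not the exact keys QDS uses:

-- Property keys below are illustrative; QDS records the formats through its
-- Store Table Properties UI/API rather than requiring hand-set properties.
ALTER TABLE daily_tick_data SET TBLPROPERTIES (
    'year.format' = '%Y',
    'date1.format' = '%Y-%m-%d'
);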

See Configuring Hive Tables Data Dependency for more information on setting Hive table data dependency using the QDS UI.

Dataset Interval

In this example, the job runs every day and the dataset is generated every day. It is possible that the job runs at a frequency different from the interval at which the dataset is generated. For example, the following query is run once every seven days, while the dataset is generated once a day.

SELECT stock_symbol, MAX(high), MIN(low), SUM(volume)
FROM daily_tick_data
WHERE date1 > '$sevendaysago$' AND date1 < '$today$'
GROUP BY stock_symbol;
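
For instance, for the run with nominal time 2013-01-08 (assuming $sevendaysago$ and $today$ resolve to seven days before, and the day of, the nominal date), the query would expand to:

SELECT stock_symbol, MAX(high), MIN(low), SUM(volume)
FROM daily_tick_data
WHERE date1 > '2013-01-01' AND date1 < '2013-01-08'
GROUP BY stock_symbol;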

Qubole needs additional information to schedule this job, as follows:

interval: how often the data is generated.
window_start, window_end: the range of intervals to wait for; each is an integer multiple of the interval.

For the purposes of this example, the values for interval, window_start, and window_end are:

interval: 1 day
window_start: -6 (inclusive of seven days ago)
window_end: 0 (inclusive of today)

As with the date formats of the partition columns, the interval at which the dataset is generated does not change often. interval can also be stored in the Hive metastore and need not be specified every time.
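
Likewise, a minimal sketch of storing the interval as a table property (the key name is illustrative, not the exact key QDS uses):

-- Illustrative property key; in practice QDS stores this via Store Table Properties.
ALTER TABLE daily_tick_data SET TBLPROPERTIES (
    'interval' = '1 day'
);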

AWS users, see Configuring S3/Azure Blob Storage Files Data Dependency for more information.

Initial Instance

Initial instance specifies the first instance of the data that is available. This is useful when a new dataset is introduced: some jobs at the beginning may not have all instances available and should not be generated.

The following sections explain the dependencies on data in AWS S3 files and Hive partitions that the Qubole Scheduler requires for scheduling jobs. Dependencies are the prerequisites that must be met before a job can run.

Understanding AWS S3 Files Dependency

An S3 files dependency means that a job runs only if the data has arrived in the S3 buckets. You can schedule a job to run at a specific date and time, either once or on a repetitive basis, provided the data exists. You can define repeat intervals such as the last 6 hours, last 6 days, last 3 weeks, and last 7 months. For more information, see Configuring S3/Azure Blob Storage Files Data Dependency.

To schedule jobs at periodic intervals, Qubole Scheduler requires the following information:

  • Start day or time (parameter: window_start)
  • End day or time (parameter: window_end)
  • Day or time interval that denotes when and how often data is generated (parameter: interval)
  • Nominal time, which is the logical start time of an instance

The dependency must be defined as: s3://<bucket>/<folderinS3bucket>/<abc>-%Y-%m-%d-%H-%M-%S, for example: s3://abc.com/data/schedule-2014-12-31-00-00-00.

See Time class for more information on date and time placeholders.

The following table shows how the dependency on the previous day’s data resolves with a daily interval.

Sequence ID   Nominal Time          Created At            Dependency
1             2015-01-01 00:00:00   2015-04-22 10:00:00   s3://abc.com/data/schedule-2014-12-31-00…
2             2015-01-02 00:00:00   2015-04-22 10:15:00   s3://abc.com/data/schedule-2015-01-01-00…
3             2015-01-03 00:00:00   2015-04-22 10:30:00   s3://abc.com/data/schedule-2015-01-02-00…

The window_start and window_end parameters are relative to Nominal Time.
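
As an illustration, sequence ID 1 has the nominal time 2015-01-01 00:00:00; with a daily interval and a window of window_start = -1 and window_end = -1 (the previous day’s data, matching the table above), the dependency resolves to s3://abc.com/data/schedule-2014-12-31-00-00-00.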

Nominal Time is the time for which the Schedule Action was processed and Created At is the time at which the Scheduler picked up the schedule. For more information, see Understanding the Qubole Scheduler Concepts.

Interpreting window_start Parameter Values

For a daily interval, the value 0 implies now, -1 implies 1 day ago, and -2 implies 2 days ago.

Similarly, for an hourly/daily/weekly/monthly/yearly interval (frequency), the value 0 denotes now, -1 denotes 1 hour/day/week/month/year ago, -2 denotes 2 hours/days/weeks/months/years ago, and so on.

Interpreting window_end Parameter Values

The Qubole Scheduler supports waiting for data. For example, waiting for 6 weeks of data implies that window_start is -5 and window_end is 0.
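
As an illustration, with a weekly interval and a nominal time of 2015-02-15 00:00:00, window_start = -5 and window_end = 0 cover the six weekly instances 2015-01-11, 2015-01-18, 2015-01-25, 2015-02-01, 2015-02-08, and 2015-02-15.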

Note

When the data arrival interval and the scheduler interval are different, the scheduler follows its own frequency to process the data. For example, if the data arrives hourly and the scheduler runs daily, the scheduler waits for an entire day’s data.

Data and the scheduler can be in two different timezones. For example:

  {
    window_start => -48
    window_end => -24
    frequency => hourly
    time_zone => America/Los_Angeles
  }
  scheduler_frequency => daily
  time_zone => America/New_York

Understanding Hive Partition Dependency

The Qubole Scheduler allows data units to have Hive partitions. Data in Hive tables can be categorized by Hive partitions such as country or date. The Hive query example on this page contains Hive partitions. The scheduler recognizes the Hive partitions from the corresponding table properties in the Hive metastore. See Partitions for more information.

The timezone can be specified as an optional parameter in a Hive query, with daylight saving time set to on or off.

Hive tables can be partitioned by date and country. The dependency is expressed as %Y/%m/%d/["US", "CAN", "IRE"].
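
As an illustration, such a table might be declared as follows (a sketch; the table, column, and partition names are hypothetical):

-- Hypothetical table; partitioned by date components and country to match
-- the dependency pattern %Y/%m/%d/["US", "CAN", "IRE"].
CREATE EXTERNAL TABLE daily_sales (
    order_id STRING,
    amount FLOAT)
PARTITIONED BY (
    year STRING,
    month STRING,
    day STRING,
    country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '<scheme>/daily_sales';

Here the year, month, and day partition columns carry the %Y, %m, and %d formats, while country takes one of the listed values.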