Data Dictionary

This dictionary provides a detailed description of all the datasets and their schema that are shared as part of Qubole Cost Explorer. The following are the supported datasets:

View Sample Queries.

cluster_node_usage_daily

This table captures the Qubole Compute Usage (QCU) and estimated_vm_cost($) values corresponding to the daily usage of each VM/node. The same node can have multiple rows (one per day) if the node runs for multiple days.

Note

The QCU numbers provided are an approximation and may not match the billing data exactly.

Use the env, account_id, event_date, and vm_id fields to uniquely identify a record.

ColumnName

DataType

Partition Column ?

Description

vm_id

String

No

The unique cloud provider generated ID for each node launched by Qubole. You can look it up in the cloud console.

cluster_inst_id

Integer

No

The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.

cluster_id

Integer

No

The Qubole cluster ID.

cluster_type

String

No

Can be any of the following values:

  • Spark

  • Airflow

  • Presto

  • Hive

vm_pricing_type

String

No

Denotes whether the node type is Spot, On-Demand or Spot Block. The AWS values are: Spot/OnDemand.

vm_instance_type

String

No

The VM instance type. For example, AWS: r3.2xlarge, r4.2xlarge, and so on.

availability_zone

String

No

The Availability Zone in which the node was launched.

region

String

No

The Region in which the node/cluster instance was launched. All nodes of a cluster instance are in the same region.

qcu

Decimal (30,8)

No

The Qubole defined compute-unit usage value for a given cluster instance referred to by its cluster instance ID.

estimated_vm_cost

Double

No

The prorated cost incurred by the node (Spot, Spot Block or On-Demand) for that date. For example, if an On-Demand node (costing $1 per hour) ran only for 6 minutes, the value would be $0.10 (6/60 minutes). This is an estimated cost based on the publicly available On-Demand and Spot price. It does not include any RI or other discounts.

vm_run_time

Integer

No

The duration (in seconds) for the node that ran on that date (event_date). For nodes that are spread across multiple days, this column represents the daily value and not the total value. To arrive at the total value, use the SUM() aggregate function.

ondemand_price

Double

No

The On-Demand price of the same node instance type in the same region for that date (event_date). (Price in USD.)

env

String

Yes

The Qubole environment: api.qubole.com, us.qubole.com, in.qubole.com or eu.qubole.com.

account_id

String

Yes

The Qubole account ID.

event_date

String

Yes

The date for which the compute-capacity usage is calculated (in the yyyy-mm-dd format).

clusters_metadata_daily

This table captures metadata information for a cluster instance. It contains the cluster_labels and cloud_tags associated with each cluster_inst for each day. Since a cluster’s label and cloud tag can be changed during the day, a record includes all labels and tags associated with the cluster instance for that day. As a result, there is only one record per day for a cluster instance.

Use the env, account_id, event_date, and cluster_instance_id fields to uniquely identify a record.

ColumnName

DataType

Partition Column ?

Description

cluster_id

Integer

No

The Qubole cluster ID to which the cluster instance and cluster labels belong.

cluster_inst_id

Integer

No

The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.

cluster_labels

Array<String>

No

The cluster labels associated with that cluster instance.

cloud_tags

Map<String, String>

No

The custom VM tags (ec2_tags for AWS) associated with the cluster. These tags are applied to the cluster nodes so that they can be tracked in the cloud billing console. Each tag is stored in the map as a key-value pair.

env

String

Yes

The Qubole environment: api.qubole.com, us.qubole.com, in.qubole.com or eu.qubole.com.

account_id

String

Yes

The Qubole account ID.

event_date

String

Yes

The date for which the state is captured (in the yyyy-mm-dd format).

command_usage table

This table stores the command level usage metrics. For each command, Qubole attributes vcore-seconds, QCU, and estimated cost (in USD) in the command run. In general, there is only one record per command.

For Workflow command types, there can be multiple subcommands triggered internally. As a result, there are multiple records in the table for the same command ID. However, the engine ID is different for each subcommand. The number of records for that command ID are the same as the number of associated subcommands. Usage metrics and status correspond to each subcommand. Therefore, to find a command level usage, you must aggregate metrics across all the records of that command ID.

The command_id is the Qubole generated command identifier. It is Null when commands are directly submitted to the clusters or if Qubole is unable to attribute the usage of a cluster to any particular command.

The engine_identifier is the yarn_application_id (identifier generated by the Yarn application) if the engine type is Tez, MR, or Spark. It is the engine_id (identifier generated by the engine) if the engine type is Presto, or the query is of the DDL type. It is Null if Qubole is unable to attribute the usage of a cluster to any particular command.

If Qubole is unable to attribute the usage of a cluster instance to any command, a record is created for that cluster instance for that day with a null value for both the command and engine IDs. The user identifier in this case is Idle.

The following are possible reasons for not being able to attribute usage:

  • Unsupported cluster version when the cluster has not been restarted for a long time

  • Unsupported custom packages on the cluster

  • Collection of data by Qubole is not allowed

You can use the env, account_id, event_date, command_id, engine, engine_identifier, user_identifier and cluster_inst_id fields to uniquely identify a record. For example,

  • engine in case a command internally uses multiple engines

  • engine_identifier in case a command is executed into multiple subcommands or when a command is directly submitted to clusters (like notebook commands)

  • user_identifier and cluster_inst_id in cases a cluster is run without any command with idle usage

For commands running across the day, the command record is created for the day on which it ended as command metrics are captured only after the command has ended.

The command_text and command_text_type columns are only populated for Presto commands. They are populated as PrestoCommand for Presto. For all other command types, they are populated as Null.

ColumnName

DataType

Partition Column ?

Description

cluster_id

Integer

No

The Qubole cluster ID on which this particular command was executed.

cluster_inst_id

Integer

No

The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.

engine

String

No

The engine on which the command ran. It can be Spark, Tez, MR, or Presto.

user_identifier

String

No

The email ID of the user, the user identifier, or the Qubole user ID. The field is populated as Idle when no commands are running on the cluster, or Qubole is unable to attribute the usage to a particular command.

command_id

BigInt

No

The Qubole generated command ID. This is Null if commands are directly submitted to clusters.

engine_identifier

String

No

The Yarn application ID or the engine generated unique identifier for a command. In cases where a Qubole command (such as a Workflow command) submits multiple subcommands internally, the command ID is the same in this table for each such command. However, the engine ID is different and each subcommand has a separate record in the table.

submit_time

Timestamp

No

The time the command/subcommand was submitted. For example, 2019-10-01 10:41:38.0.

end_time

Timestamp

No

The end time of the command/subcommand. For example, 2019-10-01 10:42:11.0.

duration

Double

No

The run time (in milliseconds) of the command/subcommand.

status

String

No

The status of the command/subcommand. It can be one of the following:

  • SUCCEEDED (when the command was successful)

  • FAILED (when the command has failed)

  • KILLED (when the command was canceled/killed)

  • UNDEFINED (For those queries/jobs of direct commands where YARN was unable to fetch the status)

command_source

String

No

The source from where the command was submitted. It can be any of the following:

  • User interface

  • Template

  • ODBC

  • API

  • Direct command (commands directly submitted to clusters)

  • Smart query

  • Scheduled

  • Quest

command_type

String

No

The type of the command. It can be any of the following:

  • Spark command

  • Presto command

  • DB Export command

  • Hive command

  • Hadoop command

  • Composite command

  • Direct command

  • DB Import command

  • SQL command

  • Shell command

  • Notebook Dashboard command

qcu

Double

No

The Qubole defined compute-unit usage (AWS: QCU, Azure: AVMU) for the command/subcommand.

estimated_cost

Double

No

The VM cost (in USD) used by the command/subcommand .

vcore_seconds

Double

No

The cumulative Vcore-seconds used by the command/subcommand.

command_counters

Map<String, Double>

No

The map containing other metrics related to the command/ subcommand.

command_tags

Array<String>

No

The array of query tags used to submit the command/ subcommand.

command_text

String

No

The query text, code, path of the query text, etc. The command_text_type defines the type of text present in this column.

command_text_type

String

No

The type of command_text:

  • query_statement

  • query_path

env

String

Yes

The Qubole environment: api.qubole.com, us.qubole.com, in.qubole.com or eu.qubole.com.

account_id

Integer

Yes

The Qubole account ID of the customer.

event_date

String

Yes

The date of the completion of the command (in the yyyy-mm-dd format).

Sample Queries

  • Daily cost of clusters associated with a cluster label (uses the cluster_node_usage_daily and clusters_metadata_daily tables)

SELECT a.event_date,a.account_id,a.env,a.cluster_label,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
FROM (select event_date, account_id, cluster_id, env,explode(cluster_labels) as cluster_label
       FROM qubole_bi_us_382.clusters_metadata_daily
             WHERE event_date >='2019-12-15'
             and event_date <='2019-12-20'
             and env = 'us') a
join (SELECT *
       FROM qubole_bi_us_382.cluster_node_usage_daily
       WHERE event_date >='2019-12-15'
       and event_date <='2019-12-20'
       and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cluster_label;
  • Daily cost of clusters associated with a cloud tag (uses the cluster_node_usage_daily and clusters_metadata_daily tables)

SELECT a.event_date,a.account_id,a.env,a.cloud_tags,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
from(select event_date, account_id, cluster_id, env,concat_ws(':',key,value) as cloud_tags
from(select event_date, account_id, cluster_id, env,explode(cloud_tags)
            from qubole_bi_us_382.clusters_metadata_daily
            where event_date >='2019-12-15'
            and event_date <='2019-12-20'
            and env = 'us') t ) a
join (SELECT *
           from qubole_bi_us_382.cluster_node_usage_daily
           where event_date >='2019-12-15'
           and event_date <='2019-12-20'
           and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cloud_tags;