Data Dictionary

This dictionary provides a detailed description of all the datasets and their schema that are shared as part of Qubole Cost Explorer. The following are the supported datasets:

cluster_node_usage_daily
clusters_metadata_daily
command_usage

View Sample Queries.

cluster_node_usage_daily

This table captures the Qubole Compute Usage (QCU) and estimated_vm_cost($) values corresponding to the daily usage of each VM/node. The same node can have multiple rows (one per day) if the node runs for multiple days.

Note

The QCU numbers provided are an approximation and may not match the billing data exactly.

Use the env, account_id, event_date, and vm_id fields to uniquely identify a record.

ColumnName	DataType	Partition Column ?	Description
vm_id	String	No	The unique cloud provider generated ID for each node launched by Qubole. You can look it up in the cloud console.
cluster_inst_id	Integer	No	The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.
cluster_id	Integer	No	The Qubole cluster ID.
cluster_type	String	No	Can be any of the following values: Spark Airflow Presto Hive
vm_pricing_type	String	No	Denotes whether the node type is Spot, On-Demand or Spot Block. The AWS values are: Spot/OnDemand.
vm_instance_type	String	No	The VM instance type. For example, AWS: r3.2xlarge, r4.2xlarge, and so on.
availability_zone	String	No	The Availability Zone in which the node was launched.
region	String	No	The Region in which the node/cluster instance was launched. All nodes of a cluster instance are in the same region.
qcu	Decimal (30,8)	No	The Qubole defined compute-unit usage value for a given cluster instance referred to by its cluster instance ID.
estimated_vm_cost	Double	No	The prorated cost incurred by the node (Spot, Spot Block or On-Demand) for that date. For example, if an On-Demand node (costing $1 per hour) ran only for 6 minutes, the value would be $0.10 (6/60 minutes). This is an estimated cost based on the publicly available On-Demand and Spot price. It does not include any RI or other discounts.
vm_run_time	Integer	No	The duration (in seconds) for the node that ran on that date (event_date). For nodes that are spread across multiple days, this column represents the daily value and not the total value. To arrive at the total value, use the SUM() aggregate function.
ondemand_price	Double	No	The On-Demand price of the same node instance type in the same region for that date (event_date). (Price in USD.)
env	String	Yes	The Qubole environment: `api.qubole.com`, `us.qubole.com`, `in.qubole.com` or `eu.qubole.com`.
account_id	String	Yes	The Qubole account ID.
event_date	String	Yes	The date for which the compute-capacity usage is calculated (in the yyyy-mm-dd format).

clusters_metadata_daily

This table captures metadata information for a cluster instance. It contains the cluster_labels and cloud_tags associated with each cluster_inst for each day. Since a cluster’s label and cloud tag can be changed during the day, a record includes all labels and tags associated with the cluster instance for that day. As a result, there is only one record per day for a cluster instance.

Use the env, account_id, event_date, and cluster_instance_id fields to uniquely identify a record.

ColumnName	DataType	Partition Column ?	Description
cluster_id	Integer	No	The Qubole cluster ID to which the cluster instance and cluster labels belong.
cluster_inst_id	Integer	No	The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.
cluster_labels	Array<String>	No	The cluster labels associated with that cluster instance.
cloud_tags	Map<String, String>	No	The custom VM tags (ec2_tags for AWS) associated with the cluster. These tags are applied to the cluster nodes so that they can be tracked in the cloud billing console. Each tag is stored in the map as a key-value pair.
env	String	Yes	The Qubole environment: `api.qubole.com`, `us.qubole.com`, `in.qubole.com` or `eu.qubole.com`.
account_id	String	Yes	The Qubole account ID.
event_date	String	Yes	The date for which the state is captured (in the yyyy-mm-dd format).

command_usage table

This table stores the command level usage metrics. For each command, Qubole attributes vcore-seconds, QCU, and estimated cost (in USD) in the command run. In general, there is only one record per command.

For Workflow command types, there can be multiple subcommands triggered internally. As a result, there are multiple records in the table for the same command ID. However, the engine ID is different for each subcommand. The number of records for that command ID are the same as the number of associated subcommands. Usage metrics and status correspond to each subcommand. Therefore, to find a command level usage, you must aggregate metrics across all the records of that command ID.

The command_id is the Qubole generated command identifier. It is Null when commands are directly submitted to the clusters or if Qubole is unable to attribute the usage of a cluster to any particular command.

The engine_identifier is the yarn_application_id (identifier generated by the Yarn application) if the engine type is Tez, MR, or Spark. It is the engine_id (identifier generated by the engine) if the engine type is Presto, or the query is of the DDL type. It is Null if Qubole is unable to attribute the usage of a cluster to any particular command.

If Qubole is unable to attribute the usage of a cluster instance to any command, a record is created for that cluster instance for that day with a null value for both the command and engine IDs. The user identifier in this case is Idle.

The following are possible reasons for not being able to attribute usage:

Unsupported cluster version when the cluster has not been restarted for a long time
Unsupported custom packages on the cluster
Collection of data by Qubole is not allowed

You can use the env, account_id, event_date, command_id, engine, engine_identifier, user_identifier and cluster_inst_id fields to uniquely identify a record. For example,

engine in case a command internally uses multiple engines
engine_identifier in case a command is executed into multiple subcommands or when a command is directly submitted to clusters (like notebook commands)
user_identifier and cluster_inst_id in cases a cluster is run without any command with idle usage

For commands running across the day, the command record is created for the day on which it ended as command metrics are captured only after the command has ended.

The command_text and command_text_type columns are only populated for Presto commands. They are populated as PrestoCommand for Presto. For all other command types, they are populated as Null.

ColumnName	DataType	Partition Column ?	Description
cluster_id	Integer	No	The Qubole cluster ID on which this particular command was executed.
cluster_inst_id	Integer	No	The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.
engine	String	No	The engine on which the command ran. It can be Spark, Tez, MR, or Presto.
user_identifier	String	No	The email ID of the user, the user identifier, or the Qubole user ID. The field is populated as Idle when no commands are running on the cluster, or Qubole is unable to attribute the usage to a particular command.
command_id	BigInt	No	The Qubole generated command ID. This is Null if commands are directly submitted to clusters.
engine_identifier	String	No	The Yarn application ID or the engine generated unique identifier for a command. In cases where a Qubole command (such as a Workflow command) submits multiple subcommands internally, the command ID is the same in this table for each such command. However, the engine ID is different and each subcommand has a separate record in the table.
submit_time	Timestamp	No	The time the command/subcommand was submitted. For example, 2019-10-01 10:41:38.0.
end_time	Timestamp	No	The end time of the command/subcommand. For example, 2019-10-01 10:42:11.0.
duration	Double	No	The run time (in milliseconds) of the command/subcommand.
status	String	No	The status of the command/subcommand. It can be one of the following: SUCCEEDED (when the command was successful) FAILED (when the command has failed) KILLED (when the command was canceled/killed) UNDEFINED (For those queries/jobs of direct commands where YARN was unable to fetch the status)
command_source	String	No	The source from where the command was submitted. It can be any of the following: User interface Template ODBC API Direct command (commands directly submitted to clusters) Smart query Scheduled Quest
command_type	String	No	The type of the command. It can be any of the following: Spark command Presto command DB Export command Hive command Hadoop command Composite command Direct command DB Import command SQL command Shell command Notebook Dashboard command
qcu	Double	No	The Qubole defined compute-unit usage (AWS: QCU, Azure: AVMU) for the command/subcommand.
estimated_cost	Double	No	The VM cost (in USD) used by the command/subcommand .
vcore_seconds	Double	No	The cumulative Vcore-seconds used by the command/subcommand.
command_counters	Map<String, Double>	No	The map containing other metrics related to the command/ subcommand.
command_tags	Array<String>	No	The array of query tags used to submit the command/ subcommand.
command_text	String	No	The query text, code, path of the query text, etc. The command_text_type defines the type of text present in this column.
command_text_type	String	No	The type of command_text: query_statement query_path
env	String	Yes	The Qubole environment: `api.qubole.com`, `us.qubole.com`, `in.qubole.com` or `eu.qubole.com`.
account_id	Integer	Yes	The Qubole account ID of the customer.
event_date	String	Yes	The date of the completion of the command (in the yyyy-mm-dd format).

Sample Queries

Daily cost of clusters associated with a cluster label (uses the cluster_node_usage_daily and clusters_metadata_daily tables)

SELECT a.event_date,a.account_id,a.env,a.cluster_label,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
FROM (select event_date, account_id, cluster_id, env,explode(cluster_labels) as cluster_label
       FROM qubole_bi_us_382.clusters_metadata_daily
             WHERE event_date >='2019-12-15'
             and event_date <='2019-12-20'
             and env = 'us') a
join (SELECT *
       FROM qubole_bi_us_382.cluster_node_usage_daily
       WHERE event_date >='2019-12-15'
       and event_date <='2019-12-20'
       and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cluster_label;

Daily cost of clusters associated with a cloud tag (uses the cluster_node_usage_daily and clusters_metadata_daily tables)

SELECT a.event_date,a.account_id,a.env,a.cloud_tags,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
from(select event_date, account_id, cluster_id, env,concat_ws(':',key,value) as cloud_tags
from(select event_date, account_id, cluster_id, env,explode(cloud_tags)
            from qubole_bi_us_382.clusters_metadata_daily
            where event_date >='2019-12-15'
            and event_date <='2019-12-20'
            and env = 'us') t ) a
join (SELECT *
           from qubole_bi_us_382.cluster_node_usage_daily
           where event_date >='2019-12-15'
           and event_date <='2019-12-20'
           and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cloud_tags;