Data Dictionary

This dictionary provides a detailed description of all the datasets and their schema that are shared as part of Qubole Cost Explorer. The following are the supported datasets:

View Sample Queries.

cluster_node_usage_daily

This table captures the Qubole Compute Usage (QCU) and estimated_vm_cost($) values corresponding to the daily usage of each VM/node. The same node can have multiple rows (one per day) if the node runs for multiple days.

Note

The QCU numbers provided are an approximation and may not match the billing data exactly.

Use the env, account_id, event_date, and vm_id fields to uniquely identify a record.

ColumnName DataType Partition Column ? Description
vm_id String No The unique cloud provider generated ID for each node launched by Qubole. You can look it up in the cloud console.
cluster_inst_id Integer No The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.
cluster_id Integer No The Qubole cluster ID.
cluster_type String No

Can be any of the following values:

  • Spark
  • Airflow
  • Presto
  • Hive
vm_pricing_type String No Denotes whether the node type is Spot, On-Demand or Spot Block. The AWS values are: Spot/OnDemand.
vm_instance_type String No The VM instance type. For example, AWS: r3.2xlarge, r4.2xlarge, and so on.
availability_zone String No The Availability Zone in which the node was launched.
region String No The Region in which the node/cluster instance was launched. All nodes of a cluster instance are in the same region.
qcu Decimal (30,8) No The Qubole defined compute-unit usage value for a given cluster instance referred to by its cluster instance ID.
estimated_vm_cost Double No The prorated cost incurred by the node (Spot, Spot Block or On-Demand) for that date. For example, if an On-Demand node (costing $1 per hour) ran only for 6 minutes, the value would be $0.10 (6/60 minutes). This is an estimated cost based on the publicly available On-Demand and Spot price. It does not include any RI or other discounts.
vm_run_time Integer No The duration (in seconds) for the node that ran on that date (event_date). For nodes that are spread across multiple days, this column represents the daily value and not the total value. To arrive at the total value, use the SUM() aggregate function.
ondemand_price Double No The On-Demand price of the same node instance type in the same region for that date (event_date). (Price in USD.)
env String Yes The Qubole environment: api.qubole.com, us.qubole.com, in.qubole.com or eu.qubole.com.
account_id String Yes The Qubole account ID.
event_date String Yes The date for which the compute-capacity usage is calculated (in the yyyy-mm-dd format).

clusters_metadata_daily

This table captures metadata information for a cluster instance. It contains the cluster_labels and cloud_tags associated with each cluster_inst for each day. Since a cluster’s label and cloud tag can be changed during the day, a record includes all labels and tags associated with the cluster instance for that day. As a result, there is only one record per day for a cluster instance.

Use the env, account_id, event_date, and cluster_instance_id fields to uniquely identify a record.

ColumnName DataType Partition Column ? Description
cluster_id Integer No The Qubole cluster ID to which the cluster instance and cluster labels belong.
cluster_inst_id Integer No The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.
cluster_labels Array<String> No The cluster labels associated with that cluster instance.
cloud_tags Map<String, String> No The custom VM tags (ec2_tags for AWS) associated with the cluster. These tags are applied to the cluster nodes so that they can be tracked in the cloud billing console. Each tag is stored in the map as a key-value pair.
env String Yes The Qubole environment: api.qubole.com, us.qubole.com, in.qubole.com or eu.qubole.com.
account_id String Yes The Qubole account ID.
event_date String Yes The date for which the state is captured (in the yyyy-mm-dd format).

command_usage table

This table stores the command level usage metrics. For each command, Qubole attributes vcore-seconds, QCU, and estimated cost (in USD) in the command run. In general, there is only one record per command.

For Workflow command types, there can be multiple subcommands triggered internally. As a result, there are multiple records in the table for the same command ID. However, the engine ID is different for each subcommand. The number of records for that command ID are the same as the number of associated subcommands. Usage metrics and status correspond to each subcommand. Therefore, to find a command level usage, you must aggregate metrics across all the records of that command ID.

The command_id is the Qubole generated command identifier. It is Null when commands are directly submitted to the clusters or if Qubole is unable to attribute the usage of a cluster to any particular command.

The engine_identifier is the yarn_application_id (identifier generated by the Yarn application) if the engine type is Tez, MR, or Spark. It is the engine_id (identifier generated by the engine) if the engine type is Presto, or the query is of the DDL type. It is Null if Qubole is unable to attribute the usage of a cluster to any particular command.

If Qubole is unable to attribute the usage of a cluster instance to any command, a record is created for that cluster instance for that day with a null value for both the command and engine IDs. The user identifier in this case is Idle.

The following are possible reasons for not being able to attribute usage:

  • Unsupported cluster version when the cluster has not been restarted for a long time
  • Unsupported custom packages on the cluster
  • Collection of data by Qubole is not allowed

You can use the env, account_id, event_date, command_id, engine, engine_identifier, user_identifier and cluster_inst_id fields to uniquely identify a record. For example,

  • engine in case a command internally uses multiple engines
  • engine_identifier in case a command is executed into multiple subcommands or when a command is directly submitted to clusters (like notebook commands)
  • user_identifier and cluster_inst_id in cases a cluster is run without any command with idle usage

For commands running across the day, the command record is created for the day on which it ended as command metrics are captured only after the command has ended.

The command_text and command_text_type columns are only populated for Presto commands. They are populated as PrestoCommand for Presto. For all other command types, they are populated as Null.

ColumnName DataType Partition Column ? Description
cluster_id Integer No The Qubole cluster ID on which this particular command was executed.
cluster_inst_id Integer No The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated.
engine String No The engine on which the command ran. It can be Spark, Tez, MR, or Presto.
user_identifier String No The email ID of the user, the user identifier, or the Qubole user ID. The field is populated as Idle when no commands are running on the cluster, or Qubole is unable to attribute the usage to a particular command.
command_id BigInt No The Qubole generated command ID. This is Null if commands are directly submitted to clusters.
engine_identifier String No The Yarn application ID or the engine generated unique identifier for a command. In cases where a Qubole command (such as a Workflow command) submits multiple subcommands internally, the command ID is the same in this table for each such command. However, the engine ID is different and each subcommand has a separate record in the table.
submit_time Timestamp No The time the command/subcommand was submitted. For example, 2019-10-01 10:41:38.0.
end_time Timestamp No The end time of the command/subcommand. For example, 2019-10-01 10:42:11.0.
duration Double No The run time (in milliseconds) of the command/subcommand.
status String No

The status of the command/subcommand. It can be one of the following:

  • SUCCEEDED (when the command was successful)
  • FAILED (when the command has failed)
  • KILLED (when the command was canceled/killed)
  • UNDEFINED (For those queries/jobs of direct commands where YARN was unable to fetch the status)
command_source String No

The source from where the command was submitted. It can be any of the following:

  • User interface
  • Template
  • ODBC
  • API
  • Direct command (commands directly submitted to clusters)
  • Smart query
  • Scheduled
  • Quest
command_type String No

The type of the command. It can be any of the following:

  • Spark command
  • Presto command
  • DB Export command
  • Hive command
  • Hadoop command
  • Composite command
  • Direct command
  • DB Import command
  • SQL command
  • Shell command
  • Notebook Dashboard command
qcu Double No The Qubole defined compute-unit usage (AWS: QCU, Azure: AVMU) for the command/subcommand.
estimated_cost Double No The VM cost (in USD) used by the command/subcommand .
vcore_seconds Double No The cumulative Vcore-seconds used by the command/subcommand.
command_counters Map<String, Double> No The map containing other metrics related to the command/ subcommand.
command_tags Array<String> No The array of query tags used to submit the command/ subcommand.
command_text String No The query text, code, path of the query text, etc. The command_text_type defines the type of text present in this column.
command_text_type String No

The type of command_text:

  • query_statement
  • query_path
env String Yes The Qubole environment: api.qubole.com, us.qubole.com, in.qubole.com or eu.qubole.com.
account_id Integer Yes The Qubole account ID of the customer.
event_date String Yes The date of the completion of the command (in the yyyy-mm-dd format).

Sample Queries

  • Daily cost of clusters associated with a cluster label (uses the cluster_node_usage_daily and clusters_metadata_daily tables)
SELECT a.event_date,a.account_id,a.env,a.cluster_label,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
FROM (select event_date, account_id, cluster_id, env,explode(cluster_labels) as cluster_label
       FROM qubole_bi_us_382.clusters_metadata_daily
             WHERE event_date >='2019-12-15'
             and event_date <='2019-12-20'
             and env = 'us') a
join (SELECT *
       FROM qubole_bi_us_382.cluster_node_usage_daily
       WHERE event_date >='2019-12-15'
       and event_date <='2019-12-20'
       and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cluster_label;
  • Daily cost of clusters associated with a cloud tag (uses the cluster_node_usage_daily and clusters_metadata_daily tables)
SELECT a.event_date,a.account_id,a.env,a.cloud_tags,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
from(select event_date, account_id, cluster_id, env,concat_ws(':',key,value) as cloud_tags
from(select event_date, account_id, cluster_id, env,explode(cloud_tags)
            from qubole_bi_us_382.clusters_metadata_daily
            where event_date >='2019-12-15'
            and event_date <='2019-12-20'
            and env = 'us') t ) a
join (SELECT *
           from qubole_bi_us_382.cluster_node_usage_daily
           where event_date >='2019-12-15'
           and event_date <='2019-12-20'
           and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cloud_tags;