Data Dictionary
This dictionary provides a detailed description of all the datasets and their schema that are shared as part of Qubole Cost Explorer. The following are the supported datasets:
View Sample Queries.
cluster_node_usage_daily
This table captures the Qubole Compute Usage (QCU) and estimated_vm_cost($) values corresponding to the daily usage of each VM/node. The same node can have multiple rows (one per day) if the node runs for multiple days.
Note
The QCU numbers provided are an approximation and may not match the billing data exactly.
Use the env, account_id, event_date, and vm_id fields to uniquely identify a record.
ColumnName |
DataType |
Partition Column ? |
Description |
---|---|---|---|
vm_id |
String |
No |
The unique cloud provider generated ID for each node launched by Qubole. You can look it up in the cloud console. |
cluster_inst_id |
Integer |
No |
The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated. |
cluster_id |
Integer |
No |
The Qubole cluster ID. |
cluster_type |
String |
No |
Can be any of the following values:
|
vm_pricing_type |
String |
No |
Denotes whether the node type is Spot, On-Demand or Spot Block. The AWS values are: Spot/OnDemand. |
vm_instance_type |
String |
No |
The VM instance type. For example, AWS: r3.2xlarge, r4.2xlarge, and so on. |
availability_zone |
String |
No |
The Availability Zone in which the node was launched. |
region |
String |
No |
The Region in which the node/cluster instance was launched. All nodes of a cluster instance are in the same region. |
qcu |
Decimal (30,8) |
No |
The Qubole defined compute-unit usage value for a given cluster instance referred to by its cluster instance ID. |
estimated_vm_cost |
Double |
No |
The prorated cost incurred by the node (Spot, Spot Block or On-Demand) for that date. For example, if an On-Demand node (costing $1 per hour) ran only for 6 minutes, the value would be $0.10 (6/60 minutes). This is an estimated cost based on the publicly available On-Demand and Spot price. It does not include any RI or other discounts. |
vm_run_time |
Integer |
No |
The duration (in seconds) for the node that ran on that date (event_date). For nodes that are spread across multiple days, this column represents the daily value and not the total value. To arrive at the total value, use the SUM() aggregate function. |
ondemand_price |
Double |
No |
The On-Demand price of the same node instance type in the same region for that date (event_date). (Price in USD.) |
env |
String |
Yes |
The Qubole environment: |
account_id |
String |
Yes |
The Qubole account ID. |
event_date |
String |
Yes |
The date for which the compute-capacity usage is calculated (in the yyyy-mm-dd format). |
clusters_metadata_daily
This table captures metadata information for a cluster instance. It contains the cluster_labels and cloud_tags associated with each cluster_inst for each day. Since a cluster’s label and cloud tag can be changed during the day, a record includes all labels and tags associated with the cluster instance for that day. As a result, there is only one record per day for a cluster instance.
Use the env, account_id, event_date, and cluster_instance_id fields to uniquely identify a record.
ColumnName |
DataType |
Partition Column ? |
Description |
---|---|---|---|
cluster_id |
Integer |
No |
The Qubole cluster ID to which the cluster instance and cluster labels belong. |
cluster_inst_id |
Integer |
No |
The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated. |
cluster_labels |
Array<String> |
No |
The cluster labels associated with that cluster instance. |
cloud_tags |
Map<String, String> |
No |
The custom VM tags (ec2_tags for AWS) associated with the cluster. These tags are applied to the cluster nodes so that they can be tracked in the cloud billing console. Each tag is stored in the map as a key-value pair. |
env |
String |
Yes |
The Qubole environment: |
account_id |
String |
Yes |
The Qubole account ID. |
event_date |
String |
Yes |
The date for which the state is captured (in the yyyy-mm-dd format). |
command_usage table
This table stores the command level usage metrics. For each command, Qubole attributes vcore-seconds, QCU, and estimated cost (in USD) in the command run. In general, there is only one record per command.
For Workflow command types, there can be multiple subcommands triggered internally. As a result, there are multiple records in the table for the same command ID. However, the engine ID is different for each subcommand. The number of records for that command ID are the same as the number of associated subcommands. Usage metrics and status correspond to each subcommand. Therefore, to find a command level usage, you must aggregate metrics across all the records of that command ID.
The command_id is the Qubole generated command identifier. It is Null when commands are directly submitted to the clusters or if Qubole is unable to attribute the usage of a cluster to any particular command.
The engine_identifier is the yarn_application_id (identifier generated by the Yarn application) if the engine type is Tez, MR, or Spark. It is the engine_id (identifier generated by the engine) if the engine type is Presto, or the query is of the DDL type. It is Null if Qubole is unable to attribute the usage of a cluster to any particular command.
If Qubole is unable to attribute the usage of a cluster instance to any command, a record is created for that cluster instance for that day with a null value for both the command and engine IDs. The user identifier in this case is Idle.
The following are possible reasons for not being able to attribute usage:
Unsupported cluster version when the cluster has not been restarted for a long time
Unsupported custom packages on the cluster
Collection of data by Qubole is not allowed
You can use the env, account_id, event_date, command_id, engine, engine_identifier, user_identifier and cluster_inst_id fields to uniquely identify a record. For example,
engine in case a command internally uses multiple engines
engine_identifier in case a command is executed into multiple subcommands or when a command is directly submitted to clusters (like notebook commands)
user_identifier and cluster_inst_id in cases a cluster is run without any command with idle usage
For commands running across the day, the command record is created for the day on which it ended as command metrics are captured only after the command has ended.
The command_text and command_text_type columns are only populated for Presto commands. They are populated as PrestoCommand for Presto. For all other command types, they are populated as Null.
ColumnName |
DataType |
Partition Column ? |
Description |
---|---|---|---|
cluster_id |
Integer |
No |
The Qubole cluster ID on which this particular command was executed. |
cluster_inst_id |
Integer |
No |
The Qubole cluster instance ID. A cluster instance is defined as an instance of a cluster from the time it is started to the time it is terminated. |
engine |
String |
No |
The engine on which the command ran. It can be Spark, Tez, MR, or Presto. |
user_identifier |
String |
No |
The email ID of the user, the user identifier, or the Qubole user ID. The field is populated as Idle when no commands are running on the cluster, or Qubole is unable to attribute the usage to a particular command. |
command_id |
BigInt |
No |
The Qubole generated command ID. This is Null if commands are directly submitted to clusters. |
engine_identifier |
String |
No |
The Yarn application ID or the engine generated unique identifier for a command. In cases where a Qubole command (such as a Workflow command) submits multiple subcommands internally, the command ID is the same in this table for each such command. However, the engine ID is different and each subcommand has a separate record in the table. |
submit_time |
Timestamp |
No |
The time the command/subcommand was submitted. For example, 2019-10-01 10:41:38.0. |
end_time |
Timestamp |
No |
The end time of the command/subcommand. For example, 2019-10-01 10:42:11.0. |
duration |
Double |
No |
The run time (in milliseconds) of the command/subcommand. |
status |
String |
No |
The status of the command/subcommand. It can be one of the following:
|
command_source |
String |
No |
The source from where the command was submitted. It can be any of the following:
|
command_type |
String |
No |
The type of the command. It can be any of the following:
|
qcu |
Double |
No |
The Qubole defined compute-unit usage (AWS: QCU, Azure: AVMU) for the command/subcommand. |
estimated_cost |
Double |
No |
The VM cost (in USD) used by the command/subcommand . |
vcore_seconds |
Double |
No |
The cumulative Vcore-seconds used by the command/subcommand. |
command_counters |
Map<String, Double> |
No |
The map containing other metrics related to the command/ subcommand. |
command_tags |
Array<String> |
No |
The array of query tags used to submit the command/ subcommand. |
command_text |
String |
No |
The query text, code, path of the query text, etc. The command_text_type defines the type of text present in this column. |
command_text_type |
String |
No |
The type of command_text:
|
env |
String |
Yes |
The Qubole environment: |
account_id |
Integer |
Yes |
The Qubole account ID of the customer. |
event_date |
String |
Yes |
The date of the completion of the command (in the yyyy-mm-dd format). |
Sample Queries
Daily cost of clusters associated with a cluster label (uses the cluster_node_usage_daily and clusters_metadata_daily tables)
SELECT a.event_date,a.account_id,a.env,a.cluster_label,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
FROM (select event_date, account_id, cluster_id, env,explode(cluster_labels) as cluster_label
FROM qubole_bi_us_382.clusters_metadata_daily
WHERE event_date >='2019-12-15'
and event_date <='2019-12-20'
and env = 'us') a
join (SELECT *
FROM qubole_bi_us_382.cluster_node_usage_daily
WHERE event_date >='2019-12-15'
and event_date <='2019-12-20'
and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cluster_label;
Daily cost of clusters associated with a cloud tag (uses the cluster_node_usage_daily and clusters_metadata_daily tables)
SELECT a.event_date,a.account_id,a.env,a.cloud_tags,sum(qcu) as qcu,
sum(estimated_vm_cost) as `cost($)`
from(select event_date, account_id, cluster_id, env,concat_ws(':',key,value) as cloud_tags
from(select event_date, account_id, cluster_id, env,explode(cloud_tags)
from qubole_bi_us_382.clusters_metadata_daily
where event_date >='2019-12-15'
and event_date <='2019-12-20'
and env = 'us') t ) a
join (SELECT *
from qubole_bi_us_382.cluster_node_usage_daily
where event_date >='2019-12-15'
and event_date <='2019-12-20'
and env = 'us' ) b on a.event_date = b.event_date and a.env = b.env and
a.cluster_id = b.cluster_id
group by a.event_date,a.account_id,a.env,a.cloud_tags;