View Cluster Health APIs

GET /api/v1.3/clusters/(string:id or label)/live_cluster_health

Use this API to view the latest health of a running cluster in a Qubole environment. It is supported from QDS version R57 onwards, in Cluster API v1.3, v2.0, and v2.1.

Required Role

To invoke this API, a user must belong to a group that has permission to read the cluster.

Request API Syntax

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://api.qubole.com/api/v2/clusters/<cluster_id>/live_cluster_health"

In the URL, use the Qubole environment where you have the QDS account. For example, https://api.qubole.com is a Qubole environment.
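The same request can be put together in Python. This is only a minimal sketch using the standard library: the function name build_health_request and its parameters are hypothetical, and the token value is a placeholder. It builds the request without sending it, so the URL and headers can be inspected first.

```python
import urllib.parse
import urllib.request


def build_health_request(environment, cluster, token, api_version="v2"):
    """Build (but do not send) the GET request for live_cluster_health.

    environment: base URL of the Qubole environment, e.g. https://api.qubole.com
    cluster:     cluster ID or label
    token:       value for the X-AUTH-TOKEN header
    api_version: API version segment of the path (assumed default: "v2")
    """
    # Percent-encode the ID or label so that it is safe inside the URL path.
    cluster_part = urllib.parse.quote(str(cluster), safe="")
    url = (
        f"{environment.rstrip('/')}/api/{api_version}"
        f"/clusters/{cluster_part}/live_cluster_health"
    )
    headers = {
        "X-AUTH-TOKEN": token,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder token; a real call needs a valid QDS API token.
request = build_health_request("https://api.qubole.com", 5, "example-token")
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request; that step is left out here because it needs a real account and token.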

Note

The above syntax uses Cluster API v2, and the responses below are for Cluster API version 2.0.

Sample Response

Hive (as an additional cluster with HiveServer2 disabled and HiveServer2 enabled):

{
    "cluster_id": 5,
    "cluster_inst_id": 2,
    "metrics": {
        "captured_at": "2019-10-03T08:22:02Z",
        "engine": {
            "yarn": {
                "memory": "0",
                "containers": {
                    "pending": "0",
                    "failed": "0",
                    "killed": "0"
                }
            }
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {
                    "usage_percent": "3.96",
                    "status": "green"
                }
            },
            "resourcemanager": "UP",
            "namenode": "UP"
        },
        "system": {
            "master": {
                "cpu_usage": "8.61",
                "disk_usage": "70.7",
                "spotloss_count": 0
            }
        }
    }
}

Hive (With HiveServer2 enabled on master):

{
  "cluster_id": 1,
  "cluster_inst_id": 1,
  "metrics": {
      "captured_at": "2019-10-03T08:22:07Z",
      "engine": {
          "yarn": {
              "memory": "0",
              "containers": {
                  "pending": "0",
                  "failed": "0",
                  "killed": "0"
              }
          }
      },
      "daemons": {
          "hive_metastore": {
              "responsiveness_status": "UP",
              "liveliness_status": "UP",
              "heap": {
                  "usage_percent": "3.83",
                  "status": "green"
              }
          },
          "hs2_server": {
              "responsiveness_status": "UP",
              "liveliness_status": "UP",
              "heap": {
                  "usage_percent": "1.88",
                  "status": "green"
              }
          },
          "resourcemanager": "UP",
          "namenode": "UP"
      },
      "system": {
          "master": {
              "cpu_usage": "10.55",
              "disk_usage": "70.7",
              "spotloss_count": 0
          }
      }
  }
}

Presto:

{
    "cluster_id": 213,
    "cluster_inst_id": 738,
    "metrics": {
        "captured_at": "2019-10-03T08:05:09Z",
        "engine": {
            "presto": {
                "status": "UP",
                "heap": {
                    "usage_percent": "0.48",
                    "status": "green"
                }
            }
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {
                    "usage_percent": "15.5",
                    "status": "green"
                }
            },
            "zeppelin": {
                "status": "UP",
                "heap": {
                    "usage_percent": "2.08",
                    "status": "green"
                }
            }
        },
        "system": {
            "master": {
                "cpu_usage": "20.41",
                "disk_usage": "71.5",
                "spotloss_count": 0
            }
        }
    }
}

Spark:

{
    "cluster_id": 473,
    "cluster_inst_id": 737,
    "metrics": {
        "captured_at": "2019-10-03T08:00:15Z",
        "engine": {
            "yarn": {
                "memory": "0",
                "containers": {
                    "pending": "0",
                    "failed": "0",
                    "killed": "0"
                }
            },
            "spark": {}
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {
                    "usage_percent": "10.07",
                    "status": "green"
                }
            },
            "zeppelin": {
                "status": "UP",
                "heap": {
                    "usage_percent": "3.68",
                    "status": "green"
                }
            },
            "resourcemanager": "UP",
            "namenode": "UP"
        },
        "system": {
            "master": {
                "cpu_usage": "0",
                "disk_usage": "73.0",
                "spotloss_count": 0
            }
        }
    }
}

Airflow:

{
   "cluster_id": 473,
   "cluster_inst_id": 737,
   "metrics": {
       "captured_at": "2019-10-03T08:00:15Z",
       "engine": {
           "yarn": {
               "memory": "0",
               "containers": {
                   "pending": "0",
                   "failed": "0",
                   "killed": "0"
               }
           },
           "spark": {}
       },
       "daemons": {
           "hive_metastore": {
               "responsiveness_status": "UP",
               "liveliness_status": "UP",
               "heap": {
                   "usage_percent": "10.07",
                   "status": "green"
               }
           },
           "zeppelin": {
               "status": "UP",
               "heap": {
                   "usage_percent": "3.68",
                   "status": "green"
               }
           },
           "resourcemanager": "UP",
           "namenode": "UP"
       },
       "system": {
           "master": {
               "cpu_usage": "0",
               "disk_usage": "73.0",
               "spotloss_count": 0
           }
       }
   }
}

Note

YARN-based metrics are only available when Ganglia is enabled on the cluster.
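The sample responses mix value shapes: some daemons are plain strings ("UP"), others are objects with status fields and a heap section, and most numbers arrive as strings. The sketch below shows one way a client might flatten such a response, assuming only the field names visible in the samples. The function names are hypothetical, and the embedded payload is a trimmed copy of the first Hive sample. Because YARN metrics depend on Ganglia, the sketch treats them as optional.

```python
import json


def _statuses(entry):
    """Collect every status-like value from one daemon entry."""
    if isinstance(entry, str):  # e.g. "namenode": "UP"
        return [entry]
    keys = ("status", "liveliness_status", "responsiveness_status")
    return [entry[k] for k in keys if k in entry]


def summarize_health(payload):
    """Flatten a live_cluster_health response into a small summary dict."""
    metrics = payload["metrics"]
    daemons = metrics.get("daemons", {})
    master = metrics.get("system", {}).get("master", {})
    # A daemon is flagged when any of its reported statuses is not "UP".
    not_up = sorted(
        name for name, entry in daemons.items()
        if any(state != "UP" for state in _statuses(entry))
    )
    # Heap usage is reported as a string such as "3.96"; convert to float.
    heap = {
        name: float(entry["heap"]["usage_percent"])
        for name, entry in daemons.items()
        if isinstance(entry, dict) and "heap" in entry
    }
    return {
        "captured_at": metrics.get("captured_at"),
        "daemons_not_up": not_up,
        "heap_usage_percent": heap,
        "master_cpu_usage": float(master["cpu_usage"]) if "cpu_usage" in master else None,
        "master_disk_usage": float(master["disk_usage"]) if "disk_usage" in master else None,
        "has_yarn_metrics": "yarn" in metrics.get("engine", {}),
    }


sample = json.loads("""
{
  "cluster_id": 5,
  "metrics": {
    "captured_at": "2019-10-03T08:22:02Z",
    "engine": {"yarn": {"memory": "0"}},
    "daemons": {
      "hive_metastore": {
        "responsiveness_status": "UP",
        "liveliness_status": "UP",
        "heap": {"usage_percent": "3.96", "status": "green"}
      },
      "resourcemanager": "UP",
      "namenode": "UP"
    },
    "system": {"master": {"cpu_usage": "8.61", "disk_usage": "70.7", "spotloss_count": 0}}
  }
}
""")
summary = summarize_health(sample)
print(summary)
```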

Cluster Health Services and Metrics Information:

Metrics/Service                     Available On Cluster Type
----------------------------------  ------------------------------
Binary Metrics (Services)
  Hive Metastore                    All
  Name Node                         Hive, Spark
  Resource Manager                  Hive, Spark
  HS2                               Hive (HS2 enabled on master)
  Zeppelin                          Spark, Presto
  Presto                            Presto

Bar Metrics (Float)
  CPU Usage                         All
  Master Disk Usage                 All
  Spot nodes lost count (Integer)   All

Heap Information (All heap metrics are calculated from jstat command)
  Hive Metastore Heap               All
  HS2 Heap                          Hive (HS2 enabled on master)
  Presto Heap                       Presto
  Zeppelin Heap                     Presto, Spark
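The availability information above can also be written down as data, so that a client can notice when an expected daemon is absent from a response. The following is a rough sketch, not part of the API. The cluster-type labels ("hive", "hive_hs2", "spark", "presto") and the function name are hypothetical, and the daemon keys are taken from the sample responses. The Presto service itself appears under "engine" in the sample, not under "daemons", so it is not listed here.

```python
# Daemon keys expected under metrics.daemons, per cluster type.
# The cluster-type labels are made up for this sketch; the daemon keys
# (hive_metastore, namenode, ...) are the ones seen in the sample responses.
EXPECTED_DAEMONS = {
    "hive": {"hive_metastore", "namenode", "resourcemanager"},
    "hive_hs2": {"hive_metastore", "namenode", "resourcemanager", "hs2_server"},
    "spark": {"hive_metastore", "namenode", "resourcemanager", "zeppelin"},
    "presto": {"hive_metastore", "zeppelin"},
}


def missing_daemons(cluster_type, payload):
    """Return the expected daemon keys that the response does not report."""
    reported = set(payload.get("metrics", {}).get("daemons", {}))
    return sorted(EXPECTED_DAEMONS[cluster_type] - reported)


# A response that reports only the metastore on a Presto cluster.
partial = {"metrics": {"daemons": {"hive_metastore": {"liveliness_status": "UP"}}}}
print(missing_daemons("presto", partial))
```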