View Cluster Health APIs

GET /api/v1.3/clusters/(string:id or label)/live_cluster_health

Use this API to view the latest health of a running cluster in a Qubole environment. It is supported from QDS version R57 onwards, in Cluster API v1.3, v2.0, and v2.1.

Required Role

To invoke this API, a user must belong to a group that has permission to read the cluster.

Request API Syntax

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://api.qubole.com/api/v2/clusters/<cluster_id>/live_cluster_health"

The example uses https://api.qubole.com as the Qubole environment. Replace it with the Qubole environment where you have the QDS account.
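The same request can be assembled in Python. This is a minimal sketch using only the standard library, not an official client: the function name build_health_request and its parameters are hypothetical, and the request is built but not sent.

```python
import urllib.parse
import urllib.request


def build_health_request(environment, cluster, auth_token, api_version="v2"):
    """Build (but do not send) the GET request for live_cluster_health.

    environment: base URL of the Qubole environment, e.g. https://api.qubole.com
    cluster:     cluster ID or label
    api_version: path segment such as "v1.3" or "v2" (default mirrors the curl example)
    """
    # Percent-encode the ID or label so it is safe to place in the URL path.
    cluster_part = urllib.parse.quote(str(cluster), safe="")
    url = (
        f"{environment.rstrip('/')}/api/{api_version}"
        f"/clusters/{cluster_part}/live_cluster_health"
    )
    headers = {
        "X-AUTH-TOKEN": auth_token,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# "<auth-token>" is a placeholder, not a real token.
request = build_health_request("https://api.qubole.com", 5, "<auth-token>")
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and decode the body with json.load.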

Note

The syntax above uses Cluster API v2, and the sample responses below are for Cluster API version 2.0.

Sample Response

Hive (as an additional cluster with HiveServer2 disabled and HiveServer2 enabled):

{
    "cluster_id": 5,
    "cluster_inst_id": 2,
    "metrics": {
        "captured_at": "2019-10-03T08:22:02Z",
        "engine": {
            "yarn": {
                "memory": "0",
                "containers": {
                    "pending": "0",
                    "failed": "0",
                    "killed": "0"
                }
            }
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {
                    "usage_percent": "3.96",
                    "status": "green"
                }
            },
            "resourcemanager": "UP",
            "namenode": "UP"
        },
        "system": {
            "master": {
                "cpu_usage": "8.61",
                "disk_usage": "70.7",
                "spotloss_count": 0
            }
        }
    }
}

Hive (With HiveServer2 enabled on coordinator):

{
  "cluster_id": 1,
  "cluster_inst_id": 1,
  "metrics": {
      "captured_at": "2019-10-03T08:22:07Z",
      "engine": {
          "yarn": {
              "memory": "0",
              "containers": {
                  "pending": "0",
                  "failed": "0",
                  "killed": "0"
              }
          }
      },
      "daemons": {
          "hive_metastore": {
              "responsiveness_status": "UP",
              "liveliness_status": "UP",
              "heap": {
                  "usage_percent": "3.83",
                  "status": "green"
              }
          },
          "hs2_server": {
              "responsiveness_status": "UP",
              "liveliness_status": "UP",
              "heap": {
                  "usage_percent": "1.88",
                  "status": "green"
              }
          },
          "resourcemanager": "UP",
          "namenode": "UP"
      },
      "system": {
          "master": {
              "cpu_usage": "10.55",
              "disk_usage": "70.7",
              "spotloss_count": 0
          }
      }
  }
}
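In the Hive responses above, some daemons report a plain string ("resourcemanager": "UP") while others report an object with responsiveness_status and liveliness_status fields. The sketch below handles both shapes. The helper name daemons_not_up is hypothetical, and treating any value other than "UP" as a problem is an assumption: the samples only show "UP".

```python
def daemons_not_up(response):
    """Return a sorted list of daemons that report any status other than "UP"."""
    problems = []
    daemons = response.get("metrics", {}).get("daemons", {})
    for name, value in daemons.items():
        if isinstance(value, str):
            # Plain form, e.g. "namenode": "UP"
            statuses = [value]
        else:
            # Object form: look only at top-level status fields, so the
            # nested heap status ("green") is not mistaken for a daemon status.
            statuses = [
                v for k, v in value.items()
                if k == "status" or k.endswith("_status")
            ]
        if any(status != "UP" for status in statuses):
            problems.append(name)
    return sorted(problems)


# Hypothetical response fragment; the "DOWN" values are invented for illustration.
example = {
    "metrics": {
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {"usage_percent": "3.83", "status": "green"},
            },
            "hs2_server": {
                "responsiveness_status": "DOWN",
                "liveliness_status": "UP",
                "heap": {"usage_percent": "1.88", "status": "green"},
            },
            "resourcemanager": "UP",
            "namenode": "DOWN",
        }
    }
}
print(daemons_not_up(example))
```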

Presto:

{
    "cluster_id": 213,
    "cluster_inst_id": 738,
    "metrics": {
        "captured_at": "2019-10-03T08:05:09Z",
        "engine": {
            "presto": {
                "status": "UP",
                "heap": {
                    "usage_percent": "0.48",
                    "status": "green"
                }
            }
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {
                    "usage_percent": "15.5",
                    "status": "green"
                }
            },
            "zeppelin": {
                "status": "UP",
                "heap": {
                    "usage_percent": "2.08",
                    "status": "green"
                }
            }
        },
        "system": {
            "master": {
                "cpu_usage": "20.41",
                "disk_usage": "71.5",
                "spotloss_count": 0
            }
        }
    }
}
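Heap figures appear under both "engine" (Presto) and "daemons" (Hive Metastore, Zeppelin), and usage_percent is a string rather than a number. The following sketch gathers them into one mapping. The helper name heap_usage is hypothetical, and the code assumes every heap object carries a usage_percent field, as in the samples.

```python
def heap_usage(response):
    """Map component name -> heap usage percent (float) for components that report a heap."""
    metrics = response.get("metrics", {})
    usage = {}
    for section in ("engine", "daemons"):
        for name, value in metrics.get(section, {}).items():
            # Skip plain-string daemons and components without a heap object.
            if isinstance(value, dict) and "heap" in value:
                usage[name] = float(value["heap"]["usage_percent"])
    return usage


# Fragment of the Presto sample response shown above.
presto_sample = {
    "metrics": {
        "engine": {
            "presto": {"status": "UP", "heap": {"usage_percent": "0.48", "status": "green"}}
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {"usage_percent": "15.5", "status": "green"},
            },
            "zeppelin": {"status": "UP", "heap": {"usage_percent": "2.08", "status": "green"}},
        },
    }
}
print(heap_usage(presto_sample))
```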

Spark:

{
    "cluster_id": 473,
    "cluster_inst_id": 737,
    "metrics": {
        "captured_at": "2019-10-03T08:00:15Z",
        "engine": {
            "yarn": {
                "memory": "0",
                "containers": {
                    "pending": "0",
                    "failed": "0",
                    "killed": "0"
                }
            },
            "spark": {}
        },
        "daemons": {
            "hive_metastore": {
                "responsiveness_status": "UP",
                "liveliness_status": "UP",
                "heap": {
                    "usage_percent": "10.07",
                    "status": "green"
                }
            },
            "zeppelin": {
                "status": "UP",
                "heap": {
                    "usage_percent": "3.68",
                    "status": "green"
                }
            },
            "resourcemanager": "UP",
            "namenode": "UP"
        },
        "system": {
            "master": {
                "cpu_usage": "0",
                "disk_usage": "73.0",
                "spotloss_count": 0
            }
        }
    }
}

Airflow:

{
   "cluster_id": 473,
   "cluster_inst_id": 737,
   "metrics": {
       "captured_at": "2019-10-03T08:00:15Z",
       "engine": {
           "yarn": {
               "memory": "0",
               "containers": {
                   "pending": "0",
                   "failed": "0",
                   "killed": "0"
               }
           },
           "spark": {}
       },
       "daemons": {
           "hive_metastore": {
               "responsiveness_status": "UP",
               "liveliness_status": "UP",
               "heap": {
                   "usage_percent": "10.07",
                   "status": "green"
               }
           },
           "zeppelin": {
               "status": "UP",
               "heap": {
                   "usage_percent": "3.68",
                   "status": "green"
               }
           },
           "resourcemanager": "UP",
           "namenode": "UP"
       },
       "system": {
           "master": {
               "cpu_usage": "0",
               "disk_usage": "73.0",
               "spotloss_count": 0
           }
       }
   }
}

Note

YARN-based metrics are only available when Ganglia is enabled on the cluster.
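Because YARN metrics depend on Ganglia, client code should not assume that the "yarn" object is present. The sketch below returns None when it is missing or empty. How the API signals unavailability (missing key, empty object, or something else) is an assumption here, and yarn_summary is a hypothetical helper name.

```python
def yarn_summary(response):
    """Return YARN memory and container counts as numbers, or None if unavailable."""
    yarn = response.get("metrics", {}).get("engine", {}).get("yarn")
    if not yarn:
        # Covers both a missing "yarn" key and an empty object.
        return None
    containers = yarn.get("containers", {})
    return {
        "memory": float(yarn.get("memory", 0)),
        "pending": int(containers.get("pending", 0)),
        "failed": int(containers.get("failed", 0)),
        "killed": int(containers.get("killed", 0)),
    }


# Fragment matching the YARN section of the Hive and Spark samples.
with_yarn = {
    "metrics": {
        "engine": {
            "yarn": {
                "memory": "0",
                "containers": {"pending": "0", "failed": "0", "killed": "0"},
            }
        }
    }
}
print(yarn_summary(with_yarn))
```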

Cluster Health Services and Metrics Information:

Metrics/Service                      Available On Cluster Type
-----------------------------------  ---------------------------------
Binary Metrics (Services)
  Hive Metastore                     All
  Name Node                          Hive, Spark
  Resource Manager                   Hive, Spark
  HS2                                Hive (HS2 enabled on coordinator)
  Zeppelin                           Spark, Presto
  Presto                             Presto

Bar Metrics (Float)
  CPU Usage                          All
  Coordinator Disk Usage             All
  Spot nodes lost count (Integer)    All

Heap Information (all heap metrics are calculated from the jstat command)
  Hive Metastore Heap                All
  HS2 Heap                           Hive (HS2 enabled on coordinator)
  Presto Heap                        Presto
  Zeppelin Heap                      Presto, Spark
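The service availability in the table can be encoded as a lookup and compared against a response. This is an illustrative sketch only: the JSON key names are taken from the sample responses, while the cluster-type labels ("hive", "hive_hs2", "spark", "presto") and the helper missing_services are hypothetical.

```python
# Binary services from the table, keyed by hypothetical cluster-type labels.
# "presto" is reported under "engine" in the samples; the others under "daemons".
SERVICES_BY_CLUSTER_TYPE = {
    "hive": {"hive_metastore", "namenode", "resourcemanager"},
    "hive_hs2": {"hive_metastore", "namenode", "resourcemanager", "hs2_server"},
    "spark": {"hive_metastore", "namenode", "resourcemanager", "zeppelin"},
    "presto": {"hive_metastore", "zeppelin", "presto"},
}


def missing_services(response, cluster_type):
    """Return the sorted services expected for cluster_type but absent from the response."""
    metrics = response.get("metrics", {})
    reported = set(metrics.get("daemons", {})) | set(metrics.get("engine", {}))
    return sorted(SERVICES_BY_CLUSTER_TYPE[cluster_type] - reported)


# Hypothetical Presto response that omits Zeppelin.
partial = {
    "metrics": {
        "engine": {"presto": {"status": "UP"}},
        "daemons": {"hive_metastore": {"liveliness_status": "UP"}},
    }
}
print(missing_services(partial, "presto"))
```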