Cluster and Node Termination Causes

Clusters can terminate for various reasons, not only through manual termination or a preset automatic termination. Cluster nodes can also be terminated during autoscaling or a health check. The following sections list the different causes of cluster and node termination:

Cluster Termination Causes

These are some of the reasons clusters terminate:

  • INACTIVITY: This is the reason displayed when a cluster is terminated due to inactivity. Inactivity is governed by the Idle Cluster Timeout, which is configurable in hours and/or minutes. If no jobs are running on a cluster, or if a cluster node reaches its hourly boundary, Qubole identifies the cluster as inactive and terminates it. For more information, see Shutting Down an Idle Cluster, Understanding Aggressive Downscaling in Clusters (AWS), and Aggressive Downscaling (Azure).
  • HEALTH_CHECK_FAILED: This is the reason generally displayed when QDS detects an unhealthy cluster. For example, if a running cluster has no nodes, or the ResourceManager is not running, QDS identifies the cluster as unhealthy and terminates it.
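The INACTIVITY rule above can be sketched as a simple timeout comparison. This is an illustrative example only, not QDS internals; the names `is_inactive`, `last_job_finished_at`, and `idle_cluster_timeout` are hypothetical.

```python
from datetime import datetime, timedelta

def is_inactive(last_job_finished_at: datetime,
                now: datetime,
                idle_cluster_timeout: timedelta) -> bool:
    """Return True when the cluster has been idle past its configured
    Idle Cluster Timeout (illustrative logic only)."""
    return now - last_job_finished_at >= idle_cluster_timeout

# A cluster whose last job finished 2.5 hours ago, with a 2-hour timeout,
# would be flagged as inactive and become a candidate for termination.
now = datetime(2024, 1, 1, 12, 0)
last_job = datetime(2024, 1, 1, 9, 30)
print(is_inactive(last_job, now, timedelta(hours=2)))
```

The timeout itself is configured in hours and/or minutes on the cluster, which is why the sketch uses a `timedelta`.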

Node Termination Causes

These are some of the reasons cluster nodes terminate:

  • User Initiated: This is the reason displayed when the user terminates the cluster node.
  • Service Initiated: This is the reason displayed when there is a Spot node loss (initiated by the Cloud Provider).
  • Server.InternalError: This is the reason displayed when the cluster node is terminated due to an internal server error at the Cloud Provider’s end.
  • HEALTH_CHECK_FAILED: This is the reason generally displayed when QDS detects unhealthy cluster nodes.
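The node termination reasons above are a small, fixed set of codes, so client-side tooling often maps them to human-readable explanations. The sketch below shows one way to do that; the dictionary and `describe` helper are illustrative, not part of any QDS API.

```python
# Lookup table for the node termination reason codes described above.
# The strings mirror the documented causes; the helper is hypothetical.
NODE_TERMINATION_REASONS = {
    "User Initiated": "The user terminated the cluster node.",
    "Service Initiated": "Spot node loss initiated by the Cloud Provider.",
    "Server.InternalError": "Internal server error at the Cloud Provider's end.",
    "HEALTH_CHECK_FAILED": "QDS detected an unhealthy cluster node.",
}

def describe(reason: str) -> str:
    """Return a human-readable explanation for a termination reason code."""
    return NODE_TERMINATION_REASONS.get(reason, "Unknown termination reason")

print(describe("Service Initiated"))
```

Using a lookup with an "Unknown" fallback keeps the tooling robust if the Cloud Provider or QDS introduces a new reason code.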