Understanding Autoscaling Logs (AWS)

Here is how to interpret a few YARN autoscaling log messages.

scalingAndUiInfoLog("Rebalancer bailing because of low replacement_count: " + replacementCount);

In the above log message, the replacement count is the number of volatile nodes the rebalancer requests to replace the additional stable nodes.

SCALING_LOG.info("Removing excluded hosts from Yarn and hdfs: " + decommissionedNodes);

The above log message means that the listed nodes, which have already been terminated, are being removed from the HDFS and YARN EXCLUDE files.

SCALING_LOG.info("Max Cullable Data Nodes = " + maxCullable);

The above log message indicates the maximum number of data nodes that can be removed from the cluster. The minimum number of nodes and the nodes required to store HDFS data are never counted as removable.

maxCullable is the maximum number of removable nodes bounded by the HDFS data storage.

SCALING_LOG.info("Updated Max Cullable based on MAPRED_HUSTLER_NODE_MAX_REQUEST: " + maxCullable);

The above log message indicates the maximum number of nodes that Qubole can remove, based on the configured maximum number of cloud instance-termination requests. The default maximum request count is 200.
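The request limit described above can be read as a simple cap on the HDFS-bounded value. The following is a minimal sketch under that assumption, not Qubole's actual implementation: the class name, the method name, and the use of a plain minimum are hypothetical, and only the default of 200 comes from the text.

```java
// Illustrative sketch only; names and the exact rule are assumptions.
final class MaxCullableSketch {
    // Default maximum request count quoted in the text above.
    static final int DEFAULT_MAX_REQUEST = 200;

    // Hypothetical helper: caps the HDFS-bounded maxCullable by the
    // maximum number of termination requests allowed in a single run.
    static int capByMaxRequest(int hdfsBoundedMaxCullable, int maxRequest) {
        return Math.max(0, Math.min(hdfsBoundedMaxCullable, maxRequest));
    }

    public static void main(String[] args) {
        int updated = capByMaxRequest(350, DEFAULT_MAX_REQUEST);
        System.out.println("Updated Max Cullable: " + updated);
    }
}
```

With this reading, a cluster whose HDFS-bounded value is below the limit keeps that value unchanged, and only larger values are cut down to the limit.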

SCALING_LOG.info("totalSpotNM: " + totalSpotNM + ", totalOnDemandNM: " + totalOnDemandNM + ", totalOnDemandNMInGS: " + totalOnDemandNMInGS + ", totalSpotNMInGS: " + totalSpotNMInGS + ", cullOnDemandNM: " + cullableOnDemandNM + ", cullSpotNM: " + cullableSpotNM);

In the above log message:

totalSpotNM/totalOnDemandNM: The total number of Spot/OnDemand nodes that are alive, including nodes in the decommissioning state.
cullSpotNM/cullOnDemandNM (the cullableSpotNM/cullableOnDemandNM variables): The number of Spot/OnDemand nodes that are in the decommissioning state, idle, and within a release interval, which means that Qubole can remove these nodes from YARN.

SCALING_LOG.info("Reducing cullOnDemandNM from: " + revisedCullOnDemandNM + " to maxCullable: " + maxCullable);

In the above log message, maxCullable is the maximum number of removable nodes, bounded by both the HDFS data storage and the maximum API requests configuration (the maximum number of cloud API requests made in a single run). The number of removable OnDemand NodeManagers is therefore reduced so that it also respects these two bounds.

SCALING_LOG.info("recalculated cullOnDemand: " + cullableOnDemandNM);

The above log message shows the maximum removable OnDemand nodes bounded by the Spot ratio to be maintained and the minimum cluster size.

SCALING_LOG.info("recalculated cullSpotNM: " + cullableSpotNM);

The above log message shows the maximum removable Spot nodes bounded by maximum removable nodes (considering the number of OnDemand instances that would be removed).
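The two bounds described above, OnDemand removals limited by maxCullable and Spot removals limited by what remains after the OnDemand removals, can be sketched as follows. This is an illustration under stated assumptions rather than the actual autoscaling code: the class and method names are hypothetical, and the Spot-ratio and minimum-cluster-size adjustments mentioned for the recalculated OnDemand value are deliberately left out because the source does not give their formulas.

```java
// Illustrative sketch only; names are hypothetical and the Spot-ratio
// and minimum-cluster-size adjustments are intentionally omitted.
final class CullBoundsSketch {
    // Mirrors "Reducing cullOnDemandNM from: X to maxCullable: Y".
    static int boundOnDemand(int cullOnDemand, int maxCullable) {
        return Math.min(cullOnDemand, maxCullable);
    }

    // Spot removals may use only the part of maxCullable that the
    // OnDemand removals leave over; the result is never negative.
    static int boundSpot(int cullSpot, int maxCullable, int cullOnDemand) {
        return Math.max(0, Math.min(cullSpot, maxCullable - cullOnDemand));
    }

    public static void main(String[] args) {
        int maxCullable = 5;
        int cullOnDemand = boundOnDemand(3, maxCullable);
        int cullSpot = boundSpot(4, maxCullable, cullOnDemand);
        System.out.println("cullOnDemandNM: " + cullOnDemand + ", cullSpotNM: " + cullSpot);
    }
}
```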

SCALING_LOG.info("nodesToRelease report at time: " + now + ". TotalOnDemandNM: " + totalOnDemandNM + ", totalSpotNM: " + totalSpotNM + ", cullOnDemandNM: " + currOnDemand + ", cullSpotNM: " + currSpot + ", minClusterSize: " + minClusterSize + ", spot percent: " + autoscale_node_spot_percent);

In the above log message:

TotalOnDemandNM/totalSpotNM: The total number of OnDemand/Spot nodes that are active. The count may include nodes in the decommissioning state.
cullOnDemandNM/cullSpotNM: The final number of OnDemand/Spot nodes that will be put in the decommissioning state and eventually removed from the cluster.
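When reviewing many of these reports, it can help to pull the counters out of the line programmatically. The sketch below assumes the line has exactly the layout produced by the log statement above; the class name and the sample line in `main` are hypothetical, and values are kept as strings because the source does not state their numeric format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not part of the product.
final class NodesToReleaseReportSketch {
    // Extracts the "name: value" pairs that follow the timestamp,
    // starting at the first counter the log statement prints.
    static Map<String, String> parse(String logLine) {
        int start = logLine.indexOf("TotalOnDemandNM: ");
        if (start < 0) {
            throw new IllegalArgumentException("Not a nodesToRelease report: " + logLine);
        }
        Map<String, String> counters = new LinkedHashMap<>();
        for (String field : logLine.substring(start).trim().split(", ")) {
            String[] keyValue = field.split(": ", 2);
            if (keyValue.length == 2) {
                counters.put(keyValue[0].trim(), keyValue[1].trim());
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; real timestamps may be formatted differently.
        String line = "nodesToRelease report at time: 1700000000000. TotalOnDemandNM: 5, "
                + "totalSpotNM: 3, cullOnDemandNM: 1, cullSpotNM: 2, "
                + "minClusterSize: 2, spot percent: 50";
        System.out.println(parse(line));
    }
}
```

Anchoring on the first counter name instead of the timestamp keeps the sketch independent of how the time value is rendered.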

scalingAndUiInfoLog("Rebalancer bailing because total_nodes: " + totalNodeCount +  " is less than cluster_nodes: " + numHustlerNodes);

In the above log message:

total_nodes: The total number of nodes reported by YARN, which includes both running and decommissioning nodes.
cluster_nodes: The number of instances that are running as part of the cluster. These instances may or may not be part of YARN.
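The condition behind this message can be sketched as a single comparison. This is a hypothetical illustration based only on the wording of the log message; the class and method names are assumptions, and the real rebalancer may apply further checks.

```java
// Illustrative sketch only; names are hypothetical.
final class RebalancerBailSketch {
    // True when YARN reports fewer nodes than there are cluster
    // instances, for example while some instances have not joined YARN.
    static boolean shouldBail(int totalNodes, int clusterNodes) {
        return totalNodes < clusterNodes;
    }

    public static void main(String[] args) {
        System.out.println("bail: " + shouldBail(8, 10));
    }
}
```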