Pending Jobs

This section discusses scenarios in which Hadoop jobs are stuck or remain in a pending state.

Job is Stuck due to Scheduling on Spot Nodes

By default, Qubole does not schedule ApplicationMasters (AMs) on spot nodes. Such nodes can go away at any time, and losing the AM can cause the entire YARN application to fail. As a consequence, a job can remain pending when the stable nodes have no capacity for its AM, even if spot nodes are available. This default behavior corresponds to yarn.scheduler.qubole.am-on-stable.timeout.ms being set to -1.

However, there may be cases where you want to run AMs on spot nodes, for example, when cost is the main concern and the cluster consists primarily of spot nodes. In that case, you can use this parameter to set a timeout. The ResourceManager (RM) tries to schedule AMs on stable nodes first. If the timeout expires and the RM has still not been able to schedule the AM, it considers the spot nodes as well. So, when you set yarn.scheduler.qubole.am-on-stable.timeout.ms to 0, the RM immediately considers all nodes when trying to schedule the AM.

The value of yarn.scheduler.qubole.am-on-stable.timeout.ms is in milliseconds (ms). Its supported values are described below:

  • -1 (default): RM does not schedule AMs on spot nodes.
  • 0: RM considers spot nodes immediately when scheduling AMs (that is, it waits for 0 ms).
  • Any other value: RM waits for the specified time (in ms) while trying stable nodes, and only then considers scheduling the AM on a spot node.
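
The three cases above can be summarized as a small decision function. The following Python sketch is illustrative only: it is not Qubole or YARN code, the function name and the waited_ms argument are hypothetical, and it assumes the timeout counts as expired once the waiting time reaches the configured value. Values below -1 are not described in this section, so the sketch rejects them.

```python
# Property name as given in this section.
AM_ON_STABLE_TIMEOUT_KEY = "yarn.scheduler.qubole.am-on-stable.timeout.ms"


def spot_nodes_eligible_for_am(timeout_ms: int, waited_ms: int) -> bool:
    """Return True if spot nodes may be considered for the AM.

    timeout_ms: configured value of the property above.
    waited_ms:  hypothetical time the AM request has already waited
                for a stable node, in milliseconds.
    """
    if timeout_ms < -1:
        # Not a documented value; refuse rather than guess its meaning.
        raise ValueError(f"unsupported value for {AM_ON_STABLE_TIMEOUT_KEY}: {timeout_ms}")
    if timeout_ms == -1:
        # Default: AMs are never scheduled on spot nodes.
        return False
    # 0 makes spot nodes eligible immediately; any positive value makes
    # them eligible only after the wait reaches the timeout (assumption).
    return waited_ms >= timeout_ms


if __name__ == "__main__":
    # Walk through the three documented cases with sample waiting times.
    for timeout, waited in [(-1, 600_000), (0, 0), (60_000, 30_000), (60_000, 60_000)]:
        eligible = spot_nodes_eligible_for_am(timeout, waited)
        print(f"timeout={timeout} ms, waited={waited} ms -> spot nodes eligible: {eligible}")
```

With the default of -1 the function returns False no matter how long the request has waited, which matches the situation in which a job stays pending because no stable node can host its AM.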