This section discusses scenarios where Hadoop jobs are stuck or remain in a pending state.
Job is Stuck due to Scheduling on Spot Nodes
By default, Qubole does not schedule ApplicationMasters (AMs) on spot nodes. This is because such nodes can go away at
any time, and losing the AM of a YARN application can be disastrous. This default corresponds to setting
yarn.scheduler.qubole.am-on-stable.timeout.ms to -1.
However, there may be cases where you want to run AMs on spot nodes, for example, when you are highly cost
conscious and the cluster consists primarily of spot nodes. In that case, you can use this parameter to set a timeout.
The ResourceManager (RM) tries to schedule AMs on stable nodes first. If the timeout is hit and the RM has not
been able to schedule the AM, it considers the spot nodes as well. So, when you set
yarn.scheduler.qubole.am-on-stable.timeout.ms to 0, the RM immediately considers all nodes when trying to schedule the AM.
The value of yarn.scheduler.qubole.am-on-stable.timeout.ms is in milliseconds (ms). Its supported values are:
- -1 (default): RM does not schedule AMs on spot nodes.
- 0: RM schedules AMs on spot nodes whenever possible (that is, it does not wait before considering them).
- Any other value: RM waits for the specified time before it can schedule an AM on a spot node.
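The three cases above can be illustrated with a minimal Python sketch. This is not Qubole's scheduler code. The function name `eligible_node_types`, the `waited_ms` argument, and the choice to treat the timeout as reached when the wait is equal to or greater than the configured value are all assumptions made for illustration. Values below -1 are not described in this section, so the sketch rejects them.

```python
from typing import List

# Parameter name taken from the text above.
AM_ON_STABLE_TIMEOUT_KEY = "yarn.scheduler.qubole.am-on-stable.timeout.ms"


def eligible_node_types(timeout_ms: int, waited_ms: int) -> List[str]:
    """Return the node types considered for an AM (hypothetical model).

    timeout_ms: value of yarn.scheduler.qubole.am-on-stable.timeout.ms
    waited_ms:  how long the AM has been waiting for a stable node (assumed input)
    """
    if timeout_ms < -1:
        # Assumption: values below -1 are not documented here, so reject them.
        raise ValueError(f"unsupported value for {AM_ON_STABLE_TIMEOUT_KEY}: {timeout_ms}")
    if timeout_ms == -1:
        # Default: AMs are never scheduled on spot nodes.
        return ["stable"]
    if waited_ms >= timeout_ms:
        # Timeout reached (immediately when timeout_ms is 0): spot nodes are considered too.
        return ["stable", "spot"]
    # Timeout not yet reached: only stable nodes are considered.
    return ["stable"]


if __name__ == "__main__":
    for timeout, waited in [(-1, 600000), (0, 0), (60000, 30000), (60000, 60000)]:
        print(timeout, waited, eligible_node_types(timeout, waited))
```

In this model, a job whose AM stays pending on a cluster made up mostly of spot nodes corresponds to the first case: with the default of -1, spot nodes never become eligible no matter how long the AM waits.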