ApplicationHistoryServer-put-req-failures
This runbook shows steps to take when the failure rate of put requests to the Application History Server (AHS) exceeds 10%.
Alert Name: ApplicationHistoryServer put req failures
Alert Message: “ApplicationHistoryServer put req failure_rate is more than 10%”
Alert Explanation: The alert indicates that more than 10% of put requests to AHS have failed.
Resolution:
- Logs: the timeline server logs are available at the following path:
/media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log
- Restart: run the following commands on the cluster’s master node:
sudo monit summary (to check the status)
sudo monit stop timelineserver (to stop the process)
sudo monit start timelineserver (to start the process)
- Check the timeline server logs to see why requests to AHS are failing. If the logs contain out-of-memory (OOM) exceptions, increase the heap size of AHS and restart it.
- If this alert continues to appear, restart AHS and inform the #escalation-hive, #escalation-hadoop, and #solutions channels.
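
The log check in the steps above can be sketched as a small shell helper. This is only an illustrative sketch: the function name count_oom_lines is hypothetical, and it assumes OOM failures appear in the log as the standard Java string java.lang.OutOfMemoryError. Other failure causes need a manual read of the logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any existing tooling): count log lines
# that mention java.lang.OutOfMemoryError across the given log files.

count_oom_lines() {
  # With no files given, report zero instead of waiting on stdin.
  if [ "$#" -eq 0 ]; then
    echo 0
    return 0
  fi
  # Concatenate all files so grep -c prints one total; unreadable or
  # missing files are skipped silently. "|| true" keeps the exit status
  # zero when grep finds no matches (grep still prints 0).
  cat -- "$@" 2>/dev/null | grep -c 'java.lang.OutOfMemoryError' || true
}

# Example invocation against the log path from this runbook:
#   count_oom_lines /media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log
```

A non-zero count suggests following the heap-size step above before restarting. A count of zero means the failures probably have another cause, so the logs need to be read directly.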