ApplicationHistoryServer-put-req-failures
This runbook shows steps to take when the failure rate of put requests to the Application History Server (AHS) exceeds 10%.
Alert Name: ApplicationHistoryServer put req failures
Alert Message: “ApplicationHistoryServer put req failure_rate is more than 10%”
Alert Explanation: The alert indicates that more than 10% of put requests to AHS have failed.
Resolution:
Logs: The timeline server logs are available at the following path:
/media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log
Restart: Run the following commands on the cluster’s master node:
sudo monit summary (to check the status)
sudo monit stop timelineserver (to stop the process)
sudo monit start timelineserver (to start the process)
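The restart sequence can be wrapped in a small shell function so the steps always run in the same order. This is a minimal sketch, not an official script: the function name, the MONIT override (useful for a dry run with MONIT=echo), and the RESTART_WAIT pause are all assumptions introduced here, and the 5-second default is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: stop, wait, start, then report status for the timeline server.
# MONIT and RESTART_WAIT are hypothetical overrides, not part of monit itself.
restart_timelineserver() {
  local monit="${MONIT:-sudo monit}"

  # Stop the process; give up if monit reports an error.
  $monit stop timelineserver || return 1

  # Pause briefly so the stop can complete before starting again.
  sleep "${RESTART_WAIT:-5}"

  # Start the process again.
  $monit start timelineserver || return 1

  # Show the status of monitored services so the result can be checked.
  $monit summary
}
```

Calling restart_timelineserver on the master node runs the three monit commands in order. Setting MONIT=echo prints the commands that would be issued without running them.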
Check the timeline server logs to see why requests to AHS are failing. If the logs contain out-of-memory (OOM) exceptions, increase the AHS heap size and restart AHS.
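A quick way to check the logs for OOM exceptions is to count matching lines. The sketch below assumes that OOM failures appear in the log as Java "OutOfMemoryError" messages. The helper name count_oom is hypothetical, and other failure causes need a manual read of the logs.

```shell
#!/usr/bin/env bash
# Sketch: count log lines that mention OutOfMemoryError across the given files.
# Assumption: OOM failures are logged as java.lang.OutOfMemoryError.
count_oom() {
  # Concatenate all files and count matching lines; grep -c prints 0 and
  # exits non-zero when nothing matches, so "|| true" keeps the call safe.
  cat -- "$@" 2>/dev/null | grep -c "OutOfMemoryError" || true
}

# Usage on the master node (log path taken from this runbook):
# count_oom /media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log
```

A non-zero count suggests that the heap size is worth reviewing before the restart. A count of zero points to a different cause.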
If this alert continues to appear, restart AHS and inform the #escalation-hive, #escalation-hadoop, and #solutions channels.