ApplicationHistoryServer-put-req-failures
This runbook describes the steps to take when the failure rate of put requests to the Application History Server (AHS) exceeds 10%.
Alert Name: ApplicationHistoryServer put req failures
Alert Message: “ApplicationHistoryServer put req failure_rate is more than 10%”
Alert Explanation: The alert indicates that more than 10% of put requests to AHS have failed.
Resolution:
Logs: The timeline server logs are available at the following path:
/media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log
Restart: Run the following commands on the cluster’s master node:
sudo monit summary (to check the status)
sudo monit stop timelineserver (to stop the process)
sudo monit start timelineserver (to start the process)
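The restart sequence above can be wrapped in a small helper so that the steps always run in the same order. This is only a sketch: the function name restart_timelineserver and the RESTART_WAIT_SECONDS variable are hypothetical, and the pause between stop and start is an assumption rather than something this runbook requires.

```shell
#!/usr/bin/env bash
# Sketch of the restart sequence from this runbook, meant to be run on the master node.
# restart_timelineserver and RESTART_WAIT_SECONDS are hypothetical names.
restart_timelineserver() {
  # Show the current status of all monit-managed services.
  sudo monit summary

  # Stop the timeline server; give up if monit reports an error.
  sudo monit stop timelineserver || return 1

  # Assumed short pause so the process can exit before it is started again.
  sleep "${RESTART_WAIT_SECONDS:-5}"

  # Start the timeline server again, then show the status once more.
  sudo monit start timelineserver || return 1
  sudo monit summary
}
```

Calling restart_timelineserver on the master node runs the commands in order. Compare the two status listings to confirm that the timeline server came back up.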
Check the timeline server logs to see why put requests to AHS are failing. If the logs contain out-of-memory (OOM) exceptions, increase the heap size of AHS and restart it.
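A quick way to check for OOM exceptions is to count the log lines that mention the JVM's OutOfMemoryError. The helper below is a minimal sketch: the function name count_oom_lines is hypothetical, and it assumes that OOM failures appear in the log as the standard java.lang.OutOfMemoryError text.

```shell
#!/usr/bin/env bash
# Sketch: count log lines that mention a JVM out-of-memory error.
# count_oom_lines is a hypothetical helper name. It assumes OOM failures
# are logged with the standard "java.lang.OutOfMemoryError" string.
count_oom_lines() {
  # Concatenate the given log files (unreadable or missing files are skipped),
  # then count lines containing the fixed string. grep -c prints 0 and exits
  # non-zero when nothing matches, so "|| true" keeps the exit status clean.
  cat -- "$@" 2>/dev/null | grep -cF 'java.lang.OutOfMemoryError' || true
}

# Example call using the log path from this runbook:
#   count_oom_lines /media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log
```

A non-zero count suggests that the heap size is worth increasing. A count of zero means the cause is probably elsewhere in the logs.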
If this alert continues to appear, restart AHS and inform the #escalation-hive, #escalation-hadoop, and #solutions channels.