This runbook describes the steps to take when the failure rate of put requests to the Application History Server (AHS) exceeds 10%.
Alert Name: ApplicationHistoryServer put req failures
Alert Message: “ApplicationHistoryServer put req failure_rate is more than 10%”
Alert Explanation: The alert indicates that more than 10% of put requests to AHS have failed.
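To make the threshold concrete, here is a minimal sketch of the comparison the alert describes. The function name and its arguments are hypothetical: this runbook does not say which counters the alert reads, so the failed and total request counts are assumed to be supplied by the caller.

```shell
# Hypothetical helper, not part of AHS or the alerting system.
# Usage: put_failure_rate_exceeds FAILED TOTAL [THRESHOLD_PCT]
# Succeeds (exit status 0) when FAILED/TOTAL is strictly greater than
# THRESHOLD_PCT percent; the threshold defaults to 10.
put_failure_rate_exceeds() {
  pfr_failed="$1"
  pfr_total="$2"
  pfr_threshold="${3:-10}"

  # With no requests at all there is nothing to alert on.
  if [ "$pfr_total" -le 0 ]; then
    return 1
  fi

  # Compare failed/total > threshold/100 using integer arithmetic only.
  [ $(( pfr_failed * 100 )) -gt $(( pfr_total * pfr_threshold )) ]
}
```

For example, 11 failures out of 100 put requests would exceed the threshold, while exactly 10 out of 100 would not, because the alert fires only above 10%.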
- Logs: available in the following directory:
- Restart by running the following commands on the cluster’s master node:
  sudo monit summary to check the status
  sudo monit stop timelineserver to stop the process
  sudo monit start timelineserver to start the process
- Check the timeline server logs to find out why requests to AHS are failing. If the logs contain out-of-memory (OOM) exceptions, increase the heap size of AHS and restart it.
- If this alert continues to appear, restart AHS and inform the #escalation-hive, #escalation-hadoop, and #solutions channels.
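The log check and restart steps above can be sketched as two small shell helpers. This is an illustration under stated assumptions, not an official tool: the function names and the MONIT override variable are hypothetical, the log file path must be supplied by the caller because this runbook does not give it, and the search string assumes that OOM exceptions appear in the log as Java OutOfMemoryError entries. The monit commands themselves are the ones listed above.

```shell
# Command prefix used to call monit. Hypothetical override: set MONIT=echo
# to print the commands instead of running them (a dry run).
MONIT="${MONIT:-sudo monit}"

# Print how many lines of the given log file mention OutOfMemoryError.
# Usage: count_oom_lines LOG_FILE
count_oom_lines() {
  if [ ! -r "$1" ]; then
    echo "cannot read log file: $1" >&2
    return 1
  fi
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  grep -c 'OutOfMemoryError' "$1" || true
}

# Stop and start the timelineserver process, then show the monit summary.
# Each step runs only if the previous one succeeded.
restart_timelineserver() {
  $MONIT stop timelineserver &&
  $MONIT start timelineserver &&
  $MONIT summary
}
```

A possible way to use it on the master node: run count_oom_lines against the timeline server log file, and if the count is non-zero, increase the AHS heap size before calling restart_timelineserver.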