ApplicationHistoryServer-put-req-failures

This runbook describes the steps to take when the failure rate of put requests to the Application History Server (AHS) exceeds 10%.

Alert Name: ApplicationHistoryServer put req failures

Alert Message: “ApplicationHistoryServer put req failure_rate is more than 10%”

Alert Explanation: The alert indicates that more than 10% of put requests to AHS have failed.

Resolution:

  • Logs: The timeline server logs are written to files matching the following path: /media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log

  • Restart: Run the following commands on the cluster’s master node to check the status of AHS and restart it:

    • sudo monit summary (checks the status of the monitored processes)

    • sudo monit stop timelineserver (stops the timeline server process)

    • sudo monit start timelineserver (starts the timeline server process)

  • Check the timeline server logs to find out why requests to AHS are failing. If the logs contain out-of-memory (OOM) errors, increase the heap size of AHS and restart it.

  • If this alert continues to appear, restart AHS and inform the #escalation-hive, #escalation-hadoop, and #solutions channels.
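
The log check and restart steps above can be sketched as a pair of shell helpers. This is a minimal sketch, not existing tooling: the function names count_oom_lines and restart_timelineserver, the AHS_LOG_GLOB and AHS_RESTART_WAIT variables, and the 10-second pause are assumptions introduced for illustration. It also assumes that OOM conditions show up in the log as the standard Java string java.lang.OutOfMemoryError. Only the log path and the monit commands come from this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and variables are hypothetical, not part of any shipped tooling.

# Log path pattern from this runbook; override AHS_LOG_GLOB if the logs live elsewhere.
AHS_LOG_GLOB="${AHS_LOG_GLOB:-/media/ephemeral0/logs/yarn/yarn-yarn-timelineserver-ip-*.log}"

# Print the number of log lines that mention a Java OutOfMemoryError.
count_oom_lines() {
  # AHS_LOG_GLOB is left unquoted on purpose so the shell expands the wildcard.
  # If no file matches, cat fails silently and the count printed is 0.
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
  cat $AHS_LOG_GLOB 2>/dev/null | grep -cF 'java.lang.OutOfMemoryError' || true
}

# Stop and start the timeline server through monit, then show the status.
restart_timelineserver() {
  sudo monit stop timelineserver
  # Assumed pause so the process can exit before it is started again.
  sleep "${AHS_RESTART_WAIT:-10}"
  sudo monit start timelineserver
  sudo monit summary
}
```

On the master node, source the file and run count_oom_lines first. A non-zero count suggests raising the AHS heap size before calling restart_timelineserver; a count of 0 means the logs need to be read for another cause.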