hive-hs2-query-failures

This runbook shows the steps required to address a high percentage of query failures on HiveServer2.

Alert Name: HS2 Query Failures

Alert Message: “More than 40% queries failing on HS2”

Alert Condition: This alert is triggered by the following query: avg(last_1m):(avg:hive.hs2.failed_queries.count{*} by {host} - hour_before(avg:hive.hs2.failed_queries.count{*} by {host})) / (avg:hive.hs2.submitted_queries.count{*} by {host} - hour_before(avg:hive.hs2.submitted_queries.count{*} by {host}) + 1) >= 0.4

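Alert Explanation: The alert indicates that, on a given host, at least 40% of the queries submitted to HiveServer2 over the past hour have failed. The query compares each counter with its value one hour earlier, and the + 1 in the denominator prevents division by zero when no queries were submitted in that hour.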
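To make the arithmetic of the alert condition concrete, the following is a minimal sketch in Python. It is not part of the monitoring setup: the function names and the counter values are hypothetical, and it ignores the per-host grouping and the one-minute averaging that the real query applies.

```python
def failure_ratio(failed_now, failed_hour_ago, submitted_now, submitted_hour_ago):
    """Approximate the ratio computed by the alert query for one host."""
    # Queries that failed / were submitted during the past hour
    failed_delta = failed_now - failed_hour_ago
    submitted_delta = submitted_now - submitted_hour_ago
    # The + 1 mirrors the alert query and avoids division by zero
    return failed_delta / (submitted_delta + 1)


def alert_fires(ratio, threshold=0.4):
    """The alert condition uses >=, so a ratio of exactly 0.4 also fires."""
    return ratio >= threshold


# Hypothetical counter values, chosen only for illustration
ratio = failure_ratio(
    failed_now=530, failed_hour_ago=500,
    submitted_now=1059, submitted_hour_ago=1000,
)
print(ratio, alert_fires(ratio))  # prints "0.5 True"
```

With these made-up values, 30 queries failed out of 59 submitted, so the ratio is 30 / (59 + 1) = 0.5 and the alert fires.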

Resolution:

When this alert appears, perform the following steps:

  • Check the dashboards “HS2 Memory Usage” and “HS2 GC Time.” High memory usage or GC time indicates that HiveServer2 is under heavy load and is spending excessive time in garbage collection. If either value remains high over time, restart HiveServer2.
  • Check the dashboard “Active Queries” to see the load on HiveServer2.
  • Check the query logs for the errors reported by the failing queries.
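When checking the query logs, it can help to group the errors by message so that the most frequent failure stands out. The sketch below is one possible way to do this in Python. It assumes that error lines contain the token " ERROR " followed by the message; the function name and the sample lines are made up for illustration and are not actual HiveServer2 output, so adjust the parsing to the log layout used in your deployment.

```python
from collections import Counter


def top_errors(lines, limit=5):
    """Count error messages and return the most frequent ones."""
    counts = Counter()
    for line in lines:
        # Assumption: the log level appears as the token " ERROR ",
        # and everything after it is the error message.
        _, sep, message = line.partition(" ERROR ")
        if sep:
            counts[message.strip()] += 1
    return counts.most_common(limit)


# Made-up sample lines in an illustrative layout
sample = [
    "2024-01-01T00:00:01 INFO  query started",
    "2024-01-01T00:00:02 ERROR example failure A",
    "2024-01-01T00:00:03 ERROR example failure B",
    "2024-01-01T00:00:04 ERROR example failure A",
]

for message, count in top_errors(sample):
    print(count, message)
```

To run it against a real log, pass an open file object instead of the sample list, for example `top_errors(open(path))`, where `path` is the location of the query log in your environment.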