Network Performance Metrics
HADTWO-1625: To help detect network performance issues on worker nodes, the following two metrics are sent to the Ganglia server:
ping.packet.loss: The percentage (%) of packets lost when a ping command is run from a worker node to the coordinator node.
ping.time: The round-trip time (RTT) taken by a ping command to send 1000 packets to the coordinator node.
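As a rough illustration of how these two values could be derived, the sketch below parses the summary printed by the Linux iputils ping command. The sample output, the host name, and the function name parse_ping_summary are hypothetical. The note does not say whether ping.time is the average RTT or the total elapsed time, so the sketch assumes the average RTT. It is not the collector that QDS actually runs.

```python
import re

# Hypothetical sample in the summary format printed by Linux iputils `ping`.
SAMPLE_PING_OUTPUT = """\
--- coordinator.example.internal ping statistics ---
1000 packets transmitted, 990 received, 1% packet loss, time 999012ms
rtt min/avg/max/mdev = 0.211/0.342/1.905/0.087 ms
"""


def parse_ping_summary(output: str) -> dict:
    """Extract packet loss (%) and average RTT (ms) from a ping summary."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    # The rtt line lists min/avg/max/mdev; the second number is the average.
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", output)
    if loss is None or rtt is None:
        raise ValueError("unrecognized ping output")
    return {
        "ping.packet.loss": float(loss.group(1)),
        # Assumption: ping.time is reported as the average RTT.
        "ping.time": float(rtt.group(2)),
    }


metrics = parse_ping_summary(SAMPLE_PING_OUTPUT)
print(metrics)
```

A sustained non-zero ping.packet.loss, or a ping.time well above the cluster's usual value, would be the kind of signal these metrics are meant to surface.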
HADTWO-728: Qubole skips Spot requests for instance families that have seen Spot node losses at the cluster level within a configurable time interval. The default interval is the last 15 minutes. Gradual Rollout
Spot requests are skipped in these two scenarios:
- If the cluster configuration is heterogeneous, QDS skips the instance families that have seen Spot losses when it creates a Spot Fleet request. If every configured instance family has seen Spot losses, QDS skips launching Spot nodes and falls back to On-Demand nodes if this option is enabled. (The fallback to On-Demand nodes option is enabled by default.)
- If the cluster configuration is homogeneous and the configured instance family has seen Spot losses, QDS skips launching Spot nodes and falls back to On-Demand nodes if this option is enabled. (The fallback to On-Demand nodes option is enabled by default.)
Qubole recommends configuring instances of multiple families to maximize the Spot instances’ availability.
For details and an example, see Skipping Spot Instances with Spot Loss.
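The selection rule described for HADTWO-728 can be sketched as follows. This is a minimal model under stated assumptions, not QDS code: the function names, the family names, and the returned tuples are hypothetical, and the behavior when fallback is disabled (returning "none") is an assumption because the note does not describe it.

```python
from datetime import datetime, timedelta


def families_to_request(configured, last_loss, now, window=timedelta(minutes=15)):
    """Return the configured families with no Spot loss inside the window."""
    return [
        family
        for family in configured
        if family not in last_loss or now - last_loss[family] > window
    ]


def plan_launch(configured, last_loss, now, fallback_to_on_demand=True):
    """Decide between a Spot request and the On-Demand fallback."""
    usable = families_to_request(configured, last_loss, now)
    if usable:
        return ("spot", usable)
    if fallback_to_on_demand:
        return ("on-demand", list(configured))
    # Assumption: with fallback disabled, nothing is launched.
    return ("none", [])


now = datetime(2024, 1, 1, 12, 0)
# Hypothetical loss history: m4 lost a node 5 minutes ago, r4 40 minutes ago.
last_loss = {
    "m4": now - timedelta(minutes=5),
    "r4": now - timedelta(minutes=40),
}

heterogeneous = plan_launch(["m4", "r4", "c4"], last_loss, now)
homogeneous = plan_launch(["m4"], last_loss, now)
print(heterogeneous)
print(homogeneous)
```

In the heterogeneous case only m4 is skipped, which is why configuring several families leaves more Spot capacity available after a loss.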
HADTWO-1162: The default values of the following HDFS options have been modified to speed up decommissioning of unused nodes:
dfs.namenode.replication.max-streams: Its default value is increased from 2 to 3.
dfs.namenode.replication.work.multiplier.per.iteration: Its default value is increased from 2 to 4.
dfs.namenode.decommission.interval: Its default value is reduced from 30 to 20.
dfs.namenode.decommission.nodes.per.interval: Its default value is increased from 5 to 20.
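For reference, the sketch below renders the four new default values as property entries in the standard Hadoop hdfs-site.xml layout. The values come from the list above. The helper name to_hdfs_site is hypothetical, and how a given cluster accepts HDFS overrides is not covered by this note.

```python
import xml.etree.ElementTree as ET

# New default values listed for HADTWO-1162.
NEW_DEFAULTS = {
    "dfs.namenode.replication.max-streams": "3",
    "dfs.namenode.replication.work.multiplier.per.iteration": "4",
    "dfs.namenode.decommission.interval": "20",
    "dfs.namenode.decommission.nodes.per.interval": "20",
}


def to_hdfs_site(props):
    """Render a dict of properties as an hdfs-site.xml style document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


hdfs_site_xml = to_hdfs_site(NEW_DEFAULTS)
print(hdfs_site_xml)
```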
HADTWO-1745: QDS now supports running Hadoop 2 clusters on Java 8. Via Support
HADTWO-1903: For a pure spot node cluster, if the coordinator node goes down due to a spot loss event, the entire cluster is terminated immediately.
- HADTWO-1797: When custom DNS servers are used, applications can sometimes hang or be killed because of timeouts caused by DNS bottlenecks. This fix prevents that by removing the reverse DNS lookup.
- HADTWO-1780: The open-source change HDFS-3384 is ported to resolve the DFSClient bug that threw java.io.EOFException: Premature EOF: no length prefix available.