Troubleshooting AWS Cluster Startup Failures

This topic explains how to troubleshoot cluster startup failures. It covers viewing the cluster start logs, diagnosing and fixing problems, and preventing problems.

Viewing the Cluster Start Logs

To see the cluster start logs of a running cluster, navigate to the Clusters page and click Resources. From the list of resources that appears, click Cluster Start Logs. An example of the cluster resources list is shown here.

../../_images/ClusterResourcesList1.png

When you click Cluster Start Logs, the cluster start logs page opens and shows detailed logs for the cluster instances. An example of the cluster start logs page is shown here.

../../_images/ClusterStartLogs1.png

Diagnosing and Fixing Problems

The table that follows lists some common error messages that may be logged when a cluster fails to start, describes the underlying causes, and provides remedies:

Note

In the case of errors passed directly through from AWS, you may also want to consult the AWS Error codes page (these errors appear in the table as single strings, such as VPCIdNotSpecified).

  • Error message text: You are not authorized to perform this operation.

  • Error message text: Hadoop Bring up failed: invalid literal for int ()

    • Cause: There is an error in the subnet routing; someone in your organization may have edited the routing table for a private subnet and blocked access from Qubole.
    • What to do: Edit the routing table attached to the subnet to allow an SSH connection over TCP port 22 from QDS tunnel servers.
  • Error message text: Cluster start failed. Hadoop bring-up failed.

    • Cause: Possible causes are:
      • The identities of the DNS servers for your cluster are not valid; someone in your organization may have cloned the Virtual Private Cloud (VPC) used by the cluster and changed DHCP options (see Changing DHCP Options).

        • What to do: In that case, configure your AWS DHCP options to include the Amazon Provided DNS (AmazonProvidedDNS).
      • You have created an IAM Role (Instance Profile) for the cluster nodes, but its Trust Relationships policy may not allow EC2 to assume the role.

        • What to do: Confirm that the Trust Relationships policy for the IAM Role contains the following AssumeRole statement:

          {
            "Effect": "Allow",
            "Principal": { "Service": "ec2.amazonaws.com" },
            "Action": "sts:AssumeRole"
          }
          

          If the above policy elements are missing, add them to the IAM Role’s Trust Relationships policy and save the IAM Role.

      • You have specified that the cluster should use a persistent security group and someone has edited the security group manually, introducing an error such that QDS does not have the access it needs.

        • What to do: In that case, set security group rules such that:

          • TCP/UDP/ICMP protocols allow instances within the group to have access to all the ports in the range 0 - 65535; and
          • TCP port 22 is open, to allow access by the Qubole Tier.

          See Custom Security Groups.

  • Error message text: No master node found

    • Cause: The likely cause is that you configured the cluster master node as an instance that has limited storage capacity (such as c3 types), and in QDS you specified an EBS volume count for the master node (or accepted a default) that causes your AWS account to exceed its EBS volume limit. In this case, the master node comes up, but then immediately shuts down (this behavior is controlled by AWS) and hence QDS can’t find the master node.
    • What to do:
      • Either: If the volume count represents the amount of storage the master really needs, ask AWS to increase the EBS service limit.
      • Or: If it’s practical to do so, reduce the volume count so that it does not exceed your service limit. To do this, click the edit icon for this cluster on the Clusters page in the QDS Control Panel, and then change the EBS volume count.
  • Error message text: No default VPC for this user or VPCIdNotSpecified

    • Cause: You have specified a VPC (Virtual Private Cloud) for the cluster in the QDS Control Panel but the VPC does not exist; or you are using an AWS account that requires a VPC, but no default VPC exists and you have not specified any other VPC in the Control Panel.

    • What to do: Specify a valid VPC in the Control Panel. If no VPC exists:

      1. Create a new VPC in AWS.
      2. Edit the cluster information in the QDS Control Panel to reflect the new VPC.

      Make sure you read and follow the VPC documentation listed under Preventing Problems.

  • Error message text: The requested Availability Zone is currently constrained and we are no longer accepting new customer requests for t1/m1/c1/m2/m3 instance types. Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1d, us-east-1a.

    • Cause: AWS is currently not allowing new instances of the type requested in the Availability Zone requested.
    • What to do: In the QDS Control Panel, change the AWS Availability Zone either to No Preference or to one of the zones in the error message. (Choosing No Preference could increase your cost if you are using AWS Reserved Instances, because these are tied to a specific Availability Zone.)
  • Error message text: There are not enough free addresses in subnet <subnet_id> to satisfy the requested number of instances.

    • Cause: The subnet you have defined in AWS has run out of addresses.
    • What to do: Add a new subnet to your VPC with enough free IP addresses (you can’t expand an existing subnet).
  • Error message text: InvalidSubnetID.NotFound

    • Cause: The subnet this cluster is assigned to does not exist; someone in your organization may have deleted it.

    • What to do: Make sure that any subnet associated with this cluster actually exists. (Look in the EC2 Settings section of the Edit Cluster page in the QDS Control Panel.)

      Make sure you read and follow the VPC and subnet documentation listed under Preventing Problems.

  • Error message text: InsufficientCapacityException

    • Cause: AWS was not able to provision the cluster because there weren’t enough instances available of the type you requested, in the AWS Availability Zone you specified.

    • What to do: In the QDS Control Panel, change the AWS Availability Zone to No Preference to allow AWS to select the zone that has the most available capacity. (This will increase your cost if you are using Reserved Instances, because these are tied to a specific Availability Zone.)

      If it’s practical to do so, you could also try reducing the initial size of the cluster (Minimum Worker Count), or changing the instance type. For more information, see Troubleshooting Instance Capacity.

  • Error message text: InsufficientInstanceCapacity

    • Cause: AWS was not able to provision the cluster because there weren’t enough instances available of the type you requested.
    • What to do: Try reducing the initial size of the cluster, or changing the instance type. See also InsufficientCapacityException above.
  • Error message text: InstanceLimitExceeded. Your quota allows for 0 more running instance(s). You requested at least <n>.

    • Cause: You have exceeded your AWS quota for this instance type. (Remember that the quota applies to all of your instances, not just the Qubole instances.)
    • What to do:
  • Error message text: PendingVerification

    • Cause: AWS has not yet verified your account.
    • What to do: If you have only recently created new credentials, wait and try again. Otherwise contact AWS Support.
  • Error message text: Unsupported

    • Cause: Possible causes should be outlined in the accompanying (AWS) error message.

    • What to do: If the message mentions that the Availability Zone is constrained, see the entry about Availability Zones earlier in this list.

      Otherwise, create a Qubole Support ticket.

  • Error message text: Exception: proxy_port or RM / JT curl check failed on master

    !!! 2018-11-03 21:34:10,099 ERROR - An error occurred while running hustler script

    • Cause: The health check on the master node has failed. This error is not limited to cluster startup; it can also be the cause of a Hive, Spark, or Hadoop job failure.
    • What to do: Perform these steps:
      1. Verify that the master node is reachable.
      2. Verify that the ResourceManager (RM) is working on the master node.
      3. Verify that the worker node can connect to the master node/ResourceManager. You can check the NodeManager logs located on S3 at <defloc>/logs/hadoop/<cluster_id>/<instance_id>/<any_node>/yarn/yarn-nodemanager-ip-<ip>.log.
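
The reachability checks in steps 1 and 2 can be approximated with a plain TCP connection test. The following is a minimal sketch, not a Qubole tool: the function names can_connect and check_master are hypothetical, and the ports you pass in are assumptions you must confirm for your own cluster (for example 22 for SSH, and 8088, the stock YARN ResourceManager web UI port, which a cluster may override). A successful connection only shows that something is listening on the port; it does not prove that the ResourceManager is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_master(host: str, ports: dict) -> dict:
    """Map each named port to whether it accepted a TCP connection.

    `ports` is a hypothetical mapping such as
    {"ssh": 22, "resourcemanager_ui": 8088}; adjust it to your cluster.
    """
    return {name: can_connect(host, port) for name, port in ports.items()}
```

Run the check from a machine that is allowed to reach the master node, passing the master node's address and the ports you expect to be open. Any entry that comes back False points to a routing, security group, or service problem worth investigating before looking at the NodeManager logs.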

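The IAM Role check described under the "Cluster start failed. Hadoop bring-up failed." entry can also be done programmatically once you have the role's trust policy as a parsed JSON document. The sketch below is an illustration only: allows_ec2_assume_role is a hypothetical helper, and it assumes the policy follows the usual IAM JSON layout, in which Statement, Action, and Principal.Service may each be a single value or a list.

```python
def _as_list(value):
    """Normalize a value that IAM JSON allows to be either a scalar or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allows_ec2_assume_role(trust_policy: dict) -> bool:
    """Return True if any Allow statement lets ec2.amazonaws.com call sts:AssumeRole."""
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        # Principal can also be a bare string such as "*"; only the dict form
        # carries a "Service" key, so anything else is treated as no services.
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        actions = _as_list(statement.get("Action"))
        if "ec2.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False
```

If the function returns False for your role's trust policy, add the statement shown in that entry and save the IAM Role.
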
Preventing Problems

Here are some guidelines to help you prevent similar problems in the future.

  • Make sure you’ve read and understood the relevant Qubole and Cloud documentation.

  • Make sure your cluster routing table is set up as described under Understanding Cluster Network Security Characteristics (this is the default configuration). In particular, if you make changes, make sure TCP port 22 remains open to allow SSH connections from QDS tunnel servers.

  • Use the default security configuration if possible. If it’s not possible for your organization to use the default configuration, make sure you thoroughly understand the issues described in the links on this page and know how to proceed. If you need a custom security configuration and are not fully confident you can configure it correctly, contact Qubole Support before you start.

  • If you move the cluster from EC2-Classic to a VPC, or vice versa, make sure you delete the old security group. (If the old group exists with the same name, QDS will use it and the cluster will not start.)

  • Be careful if you delete the default VPC for your AWS account. When you create a VPC account, or move a cluster to one, AWS creates a default VPC for you. If you delete that VPC, make sure you specify another, valid, VPC in the QDS Control Panel.

  • Make sure that starting the cluster will not put you over the AWS quota for the instance type you’ve chosen. (Remember that the quota applies to all of your instances, not just the Qubole instances.)

  • Make sure your AWS credentials, and IAM credentials if any, are properly configured.

  • If you use AWS instances with limited local storage (such as c3), be careful about configuring Elastic Block Store (EBS) volumes for the cluster (EBS volume count on the Add New Cluster page of the QDS Control Panel):

    • If you don’t configure enough storage, jobs may run slowly or fail.
    • If you configure too much, you could possibly exceed your AWS account’s volume limit. (You can request an increase if necessary.)
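
The quota and EBS guidelines above amount to simple arithmetic that you can run before starting a cluster. The following is a rough sketch under stated assumptions: preflight_problems is a hypothetical helper, every number is one you look up yourself in your AWS account, the EBS volume count is assumed to apply to each node, and the EBS limit is assumed to be expressed as a number of volumes.

```python
def preflight_problems(
    nodes_requested: int,
    instances_running: int,
    instance_quota: int,
    ebs_volumes_per_node: int,
    ebs_volumes_in_use: int,
    ebs_volume_limit: int,
) -> list:
    """Return a list of human-readable problems; an empty list means the start fits."""
    problems = []

    # The instance quota covers all running instances, not just cluster nodes.
    instances_after = instances_running + nodes_requested
    if instances_after > instance_quota:
        problems.append(
            f"instance quota exceeded: {instances_after} > {instance_quota}"
        )

    # Assumption: each node attaches `ebs_volumes_per_node` volumes.
    volumes_after = ebs_volumes_in_use + nodes_requested * ebs_volumes_per_node
    if volumes_after > ebs_volume_limit:
        problems.append(
            f"EBS volume limit exceeded: {volumes_after} > {ebs_volume_limit}"
        )

    return problems
```

An empty result only means the numbers you supplied fit within the limits you supplied; AWS remains the authority on your actual quotas.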