Understanding Cluster Network Security Characteristics (AWS)

Each cluster is associated with a unique security group, which acts as a virtual firewall that controls the traffic for the nodes within the cluster. Its inbound rules are as follows:

  • The following ports allow inbound traffic from Qubole’s security group, sg-a8c407c0, in the AWS us-east-1 region:

    • Port 22, which is the SSH port.

      Note

      Hadoop 2 requires only port 22 to allow inbound traffic from Qubole’s security group.

    • Port 9000, which is the NameNode port

    • Port 50070, which is the NameNode web port

    • Port 50075, which is the DataNode web port

    • Port 8081, which is the Presto server port

    • Port 8443, which is the HTTPS port for a Presto server

    • Port 8082, which is the Zeppelin server port

    • Port 18080, which is the Spark History Server port

  • Port 22 allows inbound traffic from:

    • Qubole’s security group sg-a8c407c0 in the AWS us-east-1 region and the EC2-classic platform.

    Note

    In all other cases, port 22 allows inbound traffic from CIDR 0.0.0.0/0 (world). These cases include clusters in an AWS VPC (including the default VPC in the us-east-1 region) and clusters in AWS regions other than us-east-1.

    Create a ticket with Qubole Support if you want to restrict access to the SSH port (port 22) to specific IP addresses. For more information, see Creating a Security Group in the VPC.

  • Within a cluster, participating nodes can communicate with each other on all ports.
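The inbound rules listed above can be captured as data, which may help when auditing a cluster's security group. The following is a minimal sketch, not a Qubole API: the helper names (`ports_from_qubole_sg`, `ssh_source`) are hypothetical, and the dictionary shapes only loosely mirror the `IpPermissions` structure used by the AWS EC2 API.

```python
# Illustrative sketch only. Values come from the list above; the function
# names are hypothetical and are not part of any Qubole or AWS library.
QUBOLE_SG = "sg-a8c407c0"  # Qubole's security group in us-east-1

# Port -> service, for inbound traffic from Qubole's security group.
QUBOLE_PORTS = {
    22: "SSH",
    9000: "NameNode",
    50070: "NameNode web",
    50075: "DataNode web",
    8081: "Presto server",
    8443: "Presto server (HTTPS)",
    8082: "Zeppelin server",
    18080: "Spark History Server",
}


def ports_from_qubole_sg(hadoop2=False):
    """Ports that accept traffic from Qubole's security group.

    Hadoop 2 clusters require only port 22.
    """
    return [22] if hadoop2 else sorted(QUBOLE_PORTS)


def ssh_source(region, ec2_classic):
    """Expected source of inbound SSH (port 22) traffic."""
    if region == "us-east-1" and ec2_classic:
        return {"UserIdGroupPairs": [{"GroupId": QUBOLE_SG}]}
    # VPC clusters (including the default VPC) and all other regions.
    return {"IpRanges": [{"CidrIp": "0.0.0.0/0"}]}


if __name__ == "__main__":
    print(ports_from_qubole_sg(hadoop2=True))
    print(ssh_source("us-west-2", ec2_classic=False))
```

Comparing the output of such helpers with the rules actually attached to a cluster's security group is one way to spot unexpected openings, such as world-accessible SSH that you intended to restrict.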

Configuring a Cluster Proxy

Configuring a cluster proxy ensures that the outbound data traffic from clusters reaches the Qubole Control Plane as well as other required services such as EC2 and S3.

As a prerequisite, create a ticket with Qubole Support to enable the proxy configuration for the account. In addition, Qubole requires the following from you:

  • A proxy server URL in this form: <my-squid-proxy.domain>:<port>

  • The domain names, URLs, and IP addresses that must bypass the proxy server when accessed from the cluster nodes. The default value includes 169.254.169.254, 127.0.0.1, localhost, and S3 endpoints.

    Note

    Qubole recommends configuring an S3 VPC endpoint as described in endpoints for Amazon S3. This reduces the load on the proxy server and ensures that traffic from the cluster nodes to S3 does not leave the AWS network.

  • The proxy server protocol to use, if the proxy server does not support both HTTP and HTTPS.
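To illustrate how the three inputs above fit together, here is a minimal sketch that turns them into the conventional `http_proxy`, `https_proxy`, and `no_proxy` environment variables. This is an assumption for illustration: the source does not describe how Qubole applies the proxy settings on cluster nodes, the function name `proxy_environment` is hypothetical, and the S3 endpoints in the default bypass list are left to the caller because the source does not name them.

```python
# Illustrative sketch only; not Qubole's implementation.
# Default bypass entries named in the documentation (S3 endpoints are
# also bypassed by default, but are not listed there, so pass them in).
DEFAULT_BYPASS = ["169.254.169.254", "127.0.0.1", "localhost"]


def proxy_environment(proxy_host_port, extra_bypass=(), scheme="http"):
    """Build proxy environment variables from a <host>:<port> string."""
    host, sep, port = proxy_host_port.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected <host>:<port>, got {proxy_host_port!r}")
    url = f"{scheme}://{host}:{port}"
    # dict.fromkeys removes duplicates while preserving order.
    bypass = list(dict.fromkeys([*DEFAULT_BYPASS, *extra_bypass]))
    return {
        "http_proxy": url,
        "https_proxy": url,
        "no_proxy": ",".join(bypass),
    }


if __name__ == "__main__":
    # Hypothetical proxy host and port.
    env = proxy_environment("my-squid-proxy.example.com:3128")
    for name, value in env.items():
        print(f"{name}={value}")
```

The `scheme` parameter corresponds to the last item in the list: it matters only when the proxy server accepts one protocol but not the other.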

Note

Ensure that you provide a persistent security group for Qubole clusters when you configure the outbound communication from the cluster nodes to pass through an Internet proxy server. You can configure a persistent security group in the Advanced Configuration tab of that cluster’s UI as described in Advanced Configuration: Modifying Security Settings (AWS).

Configuring Outbound Endpoints for Proxy Server

For a proxy server setup, you must allow access to the following endpoints:

  1. Allow *.qubole.com to ensure that the outbound data traffic from Qubole clusters reaches the Qubole Control Plane.
  2. Allow https://***.cloudfront.net/i.
  3. Allow *.amazonaws.com to ensure that the cluster nodes can invoke EC2 API calls.
  4. Allow https://app.datadoghq.com.
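Before deploying a proxy allowlist, it can be useful to check which destinations it would admit. The sketch below is a hedged example using Python's standard `fnmatch` and `urllib.parse` modules; the function names are hypothetical, and the CloudFront entry is omitted because its hostname is elided in the list above.

```python
# Illustrative sketch only: checks hostnames against the wildcard
# patterns from the list above. The CloudFront endpoint is not included
# because its hostname is not given in full.
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

ALLOWED_HOST_PATTERNS = [
    "*.qubole.com",
    "*.amazonaws.com",
    "app.datadoghq.com",
]


def host_of(target):
    """Return the lowercase hostname of a URL or bare host[:port]."""
    parsed = urlsplit(target if "://" in target else "//" + target)
    return (parsed.hostname or "").lower()


def is_allowed(target, patterns=ALLOWED_HOST_PATTERNS):
    """True if the target's hostname matches any allowlist pattern."""
    host = host_of(target)
    return any(fnmatchcase(host, pattern) for pattern in patterns)


if __name__ == "__main__":
    for target in ("api.qubole.com", "https://app.datadoghq.com", "example.com"):
        print(target, is_allowed(target))
```

Note that a pattern such as `*.qubole.com` matches subdomains only, so a lookalike host like `evilqubole.com` is rejected.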

Additional Endpoints for Jupyter and Zeppelin Notebooks

You must also allow access to any Maven coordinates that are defined in the notebooks’ interpreter settings.

Additional Endpoints for Package Management

Allow access to these endpoints for pip/Conda packages:

Allow access to these endpoints for CRAN packages:

Additional Endpoints

If you use a public Git repository as the source of a pip package (pip install), allow access to the Git URLs (GitHub/GitLab/Bitbucket). You must also allow access to any web URLs that are used in node bootstrap scripts.
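One way to find the extra hosts that need to be allowed is to scan a node bootstrap script for URLs. The following is a rough sketch under stated assumptions: the function name `hosts_in_script` is hypothetical, the sample script and its hostnames are made up, and only `http://` and `https://` URLs (including `git+https://` pip sources) are detected, so SSH-style Git remotes would need separate handling.

```python
# Illustrative sketch only: lists the hostnames of http(s) URLs that
# appear in a bootstrap script so they can be added to the allowlist.
import re
from urllib.parse import urlsplit

# Matches http:// or https:// up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s'\"<>]+")


def hosts_in_script(text):
    """Return the sorted, de-duplicated hostnames of URLs found in text."""
    hosts = set()
    for match in URL_RE.finditer(text):
        host = urlsplit(match.group(0)).hostname
        if host:
            hosts.add(host.lower())
    return sorted(hosts)


if __name__ == "__main__":
    # Hypothetical bootstrap script contents.
    sample = """
    pip install git+https://github.com/example/pkg.git
    wget https://downloads.example.org/tool.tar.gz -O /tmp/tool.tar.gz
    """
    print(hosts_in_script(sample))
```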