Understanding QDS Network Perimeter Security¶
Qubole offers features that provide good network perimeter security while performing data analysis. Here is a data flow diagram of QDS.
Whitelisting IP Addresses¶
In general, Qubole endpoints are accessible from anywhere in the world over HTTPS; this applies to browser-based access as well as API-based access. Access can be limited to specific IP addresses. A common arrangement is to put all machines that may access Qubole in a private network or VPN and whitelist the IP address of the NAT gateway, so that only users logged in to the VPN can access Qubole.
Whitelisting one or more IP addresses is a feature that Qubole offers if you want a single IP address or a fixed set of IP addresses to be used for accessing the QDS account.
Create a ticket with Qubole Support to enable this feature for the QDS account.
Once you add IP addresses to the whitelist, logging in to QDS is possible only from the whitelisted IP addresses.
For more information, see Whitelisting IP Addresses.
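The effect of a whitelist can be illustrated with a minimal sketch. This is not Qubole's implementation; the function name `is_whitelisted` and the sample addresses (taken from the reserved documentation ranges) are assumptions made for illustration only. It shows how a client address is compared against a fixed set of allowed addresses or CIDR blocks, such as the address of a NAT gateway.

```python
import ipaddress
from typing import Sequence


def is_whitelisted(client_ip: str, whitelist: Sequence[str]) -> bool:
    """Return True if client_ip matches any whitelisted address or CIDR block."""
    addr = ipaddress.ip_address(client_ip)
    for entry in whitelist:
        # strict=False accepts single addresses ("203.0.113.10") as well as
        # CIDR blocks ("198.51.100.0/24").
        network = ipaddress.ip_network(entry, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False


# Hypothetical whitelist: one NAT gateway address and one office range.
WHITELIST = ["203.0.113.10", "198.51.100.0/24"]

if __name__ == "__main__":
    for ip in ("203.0.113.10", "198.51.100.77", "192.0.2.55"):
        print(ip, is_whitelisted(ip, WHITELIST))
```

Only the first two addresses in the usage loop fall inside the hypothetical whitelist; a request from the third would be rejected under this model.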
Securing with HTTP over SSL¶
Qubole now supports only HTTPS. All HTTP requests are now redirected to HTTPS. This is aimed at better security for Qubole users. It is applicable to all the Clouds that QDS supports.
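Because HTTP requests are redirected, a client may prefer to build HTTPS URLs up front rather than rely on the redirect. The following is a small sketch of that idea; the helper name `force_https` and the host `api.example.com` are hypothetical and do not refer to a real Qubole endpoint.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url: str) -> str:
    """Rewrite an http:// URL to https://, leaving other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # Only the scheme is changed; an explicit port in the URL is kept as-is.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


if __name__ == "__main__":
    # Hypothetical endpoint, used only to show the rewrite.
    print(force_https("http://api.example.com/v1/commands"))
```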
Securing Data Traffic on the Cluster Nodes¶
Each cluster is associated with a unique security group which acts as a virtual firewall that controls the traffic for the nodes within the cluster. A security group can be configured at account and cluster levels. For more information, see:
Encrypting Data At Rest¶
- For AWS, Qubole encrypts data at rest on S3 to prevent unauthorized access to the S3 data. You can enable server-side encryption as described in Enabling Server-side Encryption in QDS (AWS). Enabling Client-side Encryption (AWS) describes how to enable client-side encryption.
- On Azure, data at rest is encrypted by default; see Encryption for Data at Rest on Azure.
Encrypting Data Traffic to AWS S3¶
Qubole supports encrypting data in transit to S3 for the following cluster types:
- Airflow: Encrypting data traffic to S3 is not applicable to Airflow clusters.
- Hadoop 2 (Hive) and Spark: Set fs.s3a.connection.ssl.enabled=true as a Hadoop override to encrypt data in transit to an AWS S3 location.
- Presto: Set the S3 SSL property to true to secure the communication between Amazon S3 and the Presto cluster using SSL.
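As a rough way to check the Hadoop 2 (Hive) and Spark setting above, the sketch below parses a block of override text and reports whether fs.s3a.connection.ssl.enabled is set to true. It assumes the overrides are written one key=value pair per line; the helper names `parse_overrides` and `s3_ssl_enabled` are hypothetical and are not part of any Qubole or Hadoop API.

```python
from typing import Dict


def parse_overrides(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and '#' comments (assumed format)."""
    overrides: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            overrides[key.strip()] = value.strip()
    return overrides


def s3_ssl_enabled(overrides: Dict[str, str]) -> bool:
    """Return True only if the S3A SSL property is explicitly set to true."""
    return overrides.get("fs.s3a.connection.ssl.enabled", "").lower() == "true"


if __name__ == "__main__":
    sample = "fs.s3a.connection.ssl.enabled=true\n"
    print(s3_ssl_enabled(parse_overrides(sample)))
```

The check is deliberately strict: a missing property is reported as not enabled, so the sketch says nothing about what a given Hadoop version does by default.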
Encrypting Data Traffic Among Cluster Nodes¶
Qubole supports encrypting data traffic among cluster nodes for the following cluster types:
- Airflow: As Airflow is a single-node cluster, data traffic among cluster nodes is not applicable to it.
- Hadoop 2 (Hive) and Presto: Encrypting Communication within a Presto Cluster describes how to encrypt data traffic among Hadoop 2 (Hive) cluster nodes or Presto cluster nodes.
- Spark: Encrypting and Authenticating Spark Data in Transit describes how to encrypt the data in transit on Spark cluster nodes.
Isolating from Virtual Networks through AWS Virtual Private Clouds¶
An AWS VPC lets you customize the network configuration, which gives you complete control over the virtual network. You can use IPv4 and IPv6 to securely and easily access resources and applications. Qubole supports configuring clusters in an AWS Virtual Private Cloud (VPC), including VPCs with private and public subnets. Configuring clusters in an AWS VPC with private and public subnets helps secure the data that is processed on the QDS platform. You can also secure data export from and import to the QDS platform in VPCs.
For more information on how to configure a cluster in an AWS VPC, see Configuring a Cluster in a VPC with Public and Private Subnets (AWS). Unless you open the SSH port to the world, you must use tunnels to communicate. For details, see Securing through SSH Tunnelling.
Securing through SSH Tunnelling¶
Enable Qubole tunnel server settings on the cluster when it is in a VPC, unless you want to open the SSH port to the world. Tunnelling with Bastion Nodes for Private Subnets in an AWS VPC lists the IP addresses of the Qubole tunnel servers. Once tunnelling is enabled on the cluster, it is used automatically for data export/import, running commands, and other operations, as before.
It is highly recommended to use a tunnel and not open SSH to the world.
For more information on the ports that allow inbound traffic, see Understanding Cluster Network Security Characteristics (AWS).
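The recommendation not to open SSH to the world can be checked mechanically. The sketch below scans a list of inbound rules and flags any that allow port 22 from every address. The rule shape (dictionaries with `cidr`, `from_port`, and `to_port` keys), the function name `ssh_open_to_world`, and the sample tunnel server address are all assumptions for illustration; they do not mirror the AWS API or the actual Qubole tunnel server addresses.

```python
import ipaddress
from typing import Dict, List

SSH_PORT = 22


def ssh_open_to_world(rules: List[Dict]) -> List[Dict]:
    """Return the inbound rules that allow SSH from any address."""
    offending = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"], strict=False)
        covers_ssh = rule["from_port"] <= SSH_PORT <= rule["to_port"]
        # A prefix length of 0 means 0.0.0.0/0 or ::/0, that is, every address.
        if covers_ssh and network.prefixlen == 0:
            offending.append(rule)
    return offending


if __name__ == "__main__":
    # Hypothetical rules: SSH limited to one tunnel server, HTTPS open to all.
    rules = [
        {"cidr": "203.0.113.5/32", "from_port": 22, "to_port": 22},
        {"cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    ]
    print(ssh_open_to_world(rules))
```

An empty result means that no rule in the list exposes SSH to every address, which is the state that tunnelling is meant to preserve.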