Enabling Data Encryption in QDS (AWS)

QDS supports data encryption to protect data stored on AWS S3 and on ephemeral HDFS. These topics explain how to enable data encryption:

Enabling Client-side Encryption

Qubole supports AWS Key Management Service (KMS) client-side encryption only on the S3a filesystem. It is supported on Hadoop 2 and Spark clusters and is used to encrypt and decrypt data.

Qubole supports AWS KMS client-side encryption at the account and cluster levels. If the service is enabled at the account level, it is enabled on all Hadoop 2 and Spark clusters of that QDS account.

Note

The AWS KMS client-side encryption feature is available for beta access. To enable it on a QDS account or in a specific Hadoop 2/Spark cluster, create a ticket with Qubole Support.

Since AWS KMS is supported only on the S3a filesystem, the account must be configured to use the S3a filesystem. Ensure that the Amazon S3 bucket and the AWS KMS key are in the same AWS region, because a key created in one AWS region is not recognized in other AWS regions.
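
To confirm the region requirement, you can compare the bucket's region with the region embedded in the KMS key's ARN. The following is a minimal sketch using the standard AWS CLI; the bucket name and key ID are placeholders for your own values. The second command returns the key's metadata, whose ARN contains the key's AWS region.

aws s3api get-bucket-location --bucket <your-bucket>
aws kms describe-key --key-id <your-kms-key-id>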

Enabling Server-side Encryption

Qubole leverages Amazon S3’s server-side encryption (SSE). For more information, see the Amazon S3 documentation on server-side encryption.

Server-side Encryption on AWS S3

To enable SSE in S3n filesystems, set the following property:

fs.s3n.sse=AES256

To enable server-side encryption in S3a filesystems, use the fs.s3a.server-side-encryption-algorithm property; these are its supported values:

  • AES256 (for SSE-S3)
  • SSE-KMS
  • SSE-C

Note

When SSE-KMS or SSE-C is enabled in QDS, any command running with these settings may not be able to fetch the result data. As such, these settings must be used only when results are irrelevant (for example, when populating a directory in S3 using a Spark or a Hive job). KMS and Customer Provided Keys Server-side Encryption on AWS S3 provides more information.

SSE can be set at the cluster level, in the Hive bootstrap, and per command. Presto, however, honors the open-source server-side encryption configuration, as described in Server-side Encryption in Presto.

KMS and Customer Provided Keys Server-side Encryption on AWS S3

QDS supports SSE-KMS and SSE-Customer Provided Keys (SSE-C) only on the S3a filesystem. For details on the client-side KMS encryption, see Enabling Client-side Encryption.

Note

When SSE-KMS or SSE-C is enabled in QDS, any command running with these settings may not be able to fetch the result data. As such, these settings must be used only when results are irrelevant (for example, when populating a directory in S3 using a Spark or a Hive job).

Set the following properties to use SSE-KMS or SSE-C encryption (an example follows the list):

  • fs.s3a.server-side-encryption-algorithm: It is not set by default. Set it to one of these supported values:

    • AES256 (for SSE-S3)
    • SSE-KMS
    • SSE-C
  • fs.s3a.server-side-encryption.key: Its value specifies the encryption key to use if fs.s3a.server-side-encryption-algorithm has been set to SSE-KMS or SSE-C. These conditions apply to this property:

    • For SSE-C, the value of this property must be the Base64-encoded key.
    • For SSE-KMS, if you leave this property empty, the default S3 KMS key of your account is used; otherwise, set it to the specific KMS key ID.
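
As a minimal sketch, the two properties set together for SSE-KMS would look like the following; the key ID is a placeholder for a key from your own AWS account:

fs.s3a.server-side-encryption-algorithm=SSE-KMS
fs.s3a.server-side-encryption.key=<your-kms-key-id>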

Server-side Encryption in Clusters

Navigate to the Control Panel page. In the Clusters tab, click Edit to go to the Edit Cluster page, or New to create a new cluster. On the Add/Edit Cluster page, set Override Hadoop Configuration Variables as follows (see the example after this list):

  • fs.s3n.sse=AES256 in S3n filesystems.
  • fs.s3a.server-side-encryption-algorithm=<value> in S3a filesystems. The values can be AES256, SSE-KMS, or SSE-C.
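
For example, to use SSE-C on the S3a filesystem, you might enter the following two lines in the Override Hadoop Configuration Variables field; the key value is a placeholder for your own Base64-encoded key:

fs.s3a.server-side-encryption-algorithm=SSE-C
fs.s3a.server-side-encryption.key=<Base64-encoded-key>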

Note

When SSE-KMS or SSE-C is enabled in QDS, any command running with these settings may not be able to fetch the result data. As such, these settings must be used only when results are irrelevant (for example, when populating a directory in S3 using a Spark or a Hive job).

Server-side Encryption in Hive

Set SSE as a Hive bootstrap setting to affect all Hive commands for a given account. Use this syntax:

  • set fs.s3n.sse=AES256 on S3n filesystems.
  • set fs.s3a.server-side-encryption-algorithm=<value> in S3a filesystems. The values can be AES256, SSE-KMS, or SSE-C.

The same syntax is also applicable to individual Hive commands; in that case, it is set per command, in the same command session as the command.

For example,

CREATE EXTERNAL TABLE New2 (`Col0` STRING, `Col1` STRING, `Col2` STRING)
PARTITIONED BY (`20100102` STRING, `IN` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://ap-dev-qubole/common/hive/30day_1/30daysmall';
set fs.s3n.sse=AES256;
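
The per-command setting for the S3a filesystem follows the same pattern. The following is a hedged sketch in which the output location is a hypothetical bucket and New2 is the table created above; it is a case where results are irrelevant, as the earlier Note requires:

set fs.s3a.server-side-encryption-algorithm=SSE-KMS;
INSERT OVERWRITE DIRECTORY 's3a://example-bucket/encrypted-output/' SELECT * FROM New2;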

Server-side Encryption in Presto

In Presto, set hive.s3.sse.enabled=true as a catalog/hive.properties setting. See catalog/hive.properties for more information.
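
As a sketch, assuming the Presto cluster configuration override format in which catalog/hive.properties entries appear under a section header, the setting would look like this:

catalog/hive.properties:
hive.s3.sse.enabled=true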

Note

Results of SELECT queries with a LIMIT clause are not encrypted, because the LIMIT clause bypasses the map/reduce flow.

Results of SELECT queries without a LIMIT clause are encrypted. In general, standard Hadoop map/reduce output is encrypted, whereas Presto output, which does not use map/reduce, is not encrypted.

Server-side Encryption while using Hadoop DistCp

While using Hadoop DistCp, you can set these parameters for server-side encryption along with the other parameters (see the sketch after this list):

  • s3ServerSideEncryption: It enables encryption of data at the object level as S3 writes it to disk.
  • s3SSEAlgorithm: It specifies the encryption algorithm. If you do not specify it but s3ServerSideEncryption is enabled, the AES256 algorithm is used by default. Valid values are AES256, SSE-KMS, and SSE-C.
  • encryptionKey: If SSE-KMS or SSE-C is specified as the algorithm, this parameter specifies the key used to encrypt the data. If the algorithm is SSE-KMS, the key is not mandatory because the default AWS KMS key is used. If the algorithm is SSE-C, you must specify the key or the job fails.
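
As a related, hedged sketch (using the S3a server-side encryption properties described earlier rather than the parameters above), a plain hadoop distcp invocation can set server-side encryption through -D options; the source path, bucket, and key ID are placeholders:

hadoop distcp \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  -Dfs.s3a.server-side-encryption.key=<your-kms-key-id> \
  hdfs:///user/hive/warehouse/source_table \
  s3a://example-bucket/encrypted-copy/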

Enabling Encryption on Ephemeral Data in QDS Clusters

Ephemeral HDFS is brought up on the EC2 compute nodes of a cluster. To enable encryption on the ephemeral drives through the Cluster REST API, see security_settings.
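
The following is a hedged sketch of that API route, assuming the v1.3 clusters endpoint and the encrypted_ephemerals flag documented under security_settings; the cluster ID and the API token are placeholders:

curl -X PUT \
  -H "X-AUTH-TOKEN: <your-api-token>" \
  -H "Content-Type: application/json" \
  -d '{"security_settings": {"encrypted_ephemerals": true}}' \
  https://api.qubole.com/api/v1.3/clusters/<cluster-id>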

To enable it through the UI, navigate to the Clusters page and click the edit button to go to the Edit Cluster page.

Select Enable Encryption, listed under Security Settings in Advanced Configuration, as shown in the following figure.

[Figure: the Enable Encryption option under Security Settings (EnableEncrypt.png)]

Enable Encryption is an option to encrypt the data at rest on the node’s ephemeral (local) storage. This includes HDFS and any intermediate output generated by Hadoop. The block-device encryption is set up before a node joins the cluster and can increase the cluster’s bring-up time.