Using the AWS File System

Use filesystems for reading from and writing to Amazon S3 as described in these sub-topics:

Using Filesystems to Access AWS Cloud Storage

Qubole supports the native S3n and S3a file system client connectors for accessing AWS S3 cloud storage. Qubole supports multipart upload and multipart move on Amazon S3 by default. Amazon S3 allows a maximum object size of 5 GB in a single HTTP request, so larger objects must be transferred in multiple parts.

Using the Native S3n File System

By default, Qubole clusters (except Presto) use NativeS3FileSystem to access S3 objects. It uses JetS3t to interact with AWS S3 endpoints. The URI scheme is s3n://. To enable Secure Socket Layer on the S3n file system, see Enabling Secure Socket Layer.

Using the S3a File System

The S3aFileSystem is considered to be a successor to the NativeS3FileSystem. It uses the AWS SDK to interact with S3 and therefore supports more S3 endpoints. It also supports Amazon v4 signature-based authentication. Qubole currently supports the S3a file system on all cluster types except Presto. The URI scheme is s3a://.

The S3aFileSystem is the default file system on Hadoop 2 clusters.

To enable S3aFileSystem, add the following configuration as Hadoop override parameters in a REST API call or UI cluster configuration.

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3n.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3n.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
qubole.aws.use.v4.signature=true

Note

In a REST API request to create/edit a cluster, add this configuration in custom_config within hadoop_settings. Create a New Cluster provides more information.
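As an illustration of the note above, the following minimal sketch assembles the S3a overrides into a hadoop_settings fragment. It assumes that custom_config accepts a single string of newline-separated key=value pairs; the helper name build_hadoop_settings is hypothetical, and the rest of the request (URL, authentication, other cluster fields) is deliberately left out. Check Create a New Cluster for the exact payload shape.

```python
import json

# Hadoop overrides copied from the list above.
S3A_OVERRIDES = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3n.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A",
    "fs.AbstractFileSystem.s3n.impl": "org.apache.hadoop.fs.s3a.S3A",
    "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
    "qubole.aws.use.v4.signature": "true",
}


def build_hadoop_settings(overrides):
    """Return a payload fragment with overrides joined as key=value lines.

    Assumption: custom_config is one newline-separated string.
    """
    custom_config = "\n".join(f"{key}={value}" for key, value in overrides.items())
    return {"hadoop_settings": {"custom_config": custom_config}}


fragment = build_hadoop_settings(S3A_OVERRIDES)
print(json.dumps(fragment, indent=2))
```

The fragment would be merged into the body of the cluster create/edit request; only the two keys named in the note are taken from this document.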

Enable the automatic detection of a bucket endpoint in the S3a file system by setting fs.s3a.bucket.endpoint-detection.enable to true. The option is set to false by default.

Alternatively, create a ticket with Qubole Support to have automatic detection of a bucket endpoint enabled in the S3a file system. After this feature is enabled, the S3a file system does not honor endpoint-specific properties such as fs.s3a.endpoint, qubole.s3.standard.endpoint, and fs.s3.awsBucketToRegionMapping.
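Because endpoint-specific properties are silently ignored once endpoint detection is on, it can help to check a set of overrides before applying them. The sketch below is a local sanity check only, not a Qubole API; the function name ignored_endpoint_properties is hypothetical, and the configuration is assumed to be held as a plain dict of strings.

```python
ENDPOINT_DETECTION = "fs.s3a.bucket.endpoint-detection.enable"

# Properties named in this section as not honored when detection is enabled.
ENDPOINT_PROPERTIES = (
    "fs.s3a.endpoint",
    "qubole.s3.standard.endpoint",
    "fs.s3.awsBucketToRegionMapping",
)


def ignored_endpoint_properties(config):
    """List endpoint properties in config that detection would override."""
    # The option is false by default, so a missing key means "disabled".
    enabled = config.get(ENDPOINT_DETECTION, "false").strip().lower() == "true"
    if not enabled:
        return []
    return [name for name in ENDPOINT_PROPERTIES if name in config]


overrides = {
    ENDPOINT_DETECTION: "true",
    "fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com",  # example value only
}
for name in ignored_endpoint_properties(overrides):
    print(f"warning: {name} is not honored while endpoint detection is enabled")
```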

QDS supports the Requester Pays option in the S3a file system. The option is disabled by default. To enable it, set fs.s3a.requester-pays.enabled to true.

Enabling Secure Socket Layer

Secure Socket Layer (SSL) is not enabled by default on the Native S3n and S3a file systems. To enable it on the two file systems, you must:

  • Set fs.s3.https.only=true in the Native S3n file system.
  • Set fs.s3a.connection.ssl.enabled=true in the S3a file system.

In the UI cluster configuration page, add this configuration in Override Hadoop Configuration Variables under Hadoop Cluster Settings. Advanced Configuration: Modifying Hadoop Cluster Settings provides more information.
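The two SSL settings can be kept in a small lookup so that the correct override is emitted for whichever file system a cluster uses. This is a sketch only: the helper name ssl_overrides and the "s3n"/"s3a" keys are illustrative choices, and the output is simply the key=value lines to paste into Override Hadoop Configuration Variables.

```python
# SSL property for each file system, as listed in this section.
SSL_PROPERTIES = {
    "s3n": "fs.s3.https.only",
    "s3a": "fs.s3a.connection.ssl.enabled",
}


def ssl_overrides(file_systems):
    """Return key=value override lines that enable SSL for the given file systems."""
    lines = []
    for name in file_systems:
        if name not in SSL_PROPERTIES:
            raise ValueError(f"no SSL property known for file system: {name}")
        lines.append(f"{SSL_PROPERTIES[name]}=true")
    return lines


print("\n".join(ssl_overrides(["s3n", "s3a"])))
```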

Using the Presto S3 File System

Presto clusters use their own S3 file system, PrestoS3FileSystem, to access Amazon cloud storage. Presto uses it for the URI schemes s3://, s3n://, and s3a://.

Using Multipart Upload, Streaming, and Move

Uploading or moving objects larger than 5 GB to an Amazon S3 location is possible only with multipart operations. Qubole provides options for enabling multipart uploads and multipart moves, and uses the same configuration options for the S3n and S3a file systems. Qubole supports multipart uploads, multipart moves, and multipart streaming on Hadoop 2 and Spark clusters. Presto clusters support only multipart uploads, which are configured through different options in catalog/hive.properties.

Using Multipart Upload

Multipart upload enables you to upload objects larger than 5 GB in several parts.

fs.s3n.multipart.uploads.enabled is the property used to enable and disable multipart upload, which is enabled by default.

In addition, fs.s3n.multipart.uploads.maxpartsize.mb controls the maximum size of each part, in MB. Its default value is 500. (That is, by default, large objects are uploaded in chunks of 500 MB.)
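To get a feel for what the part size means in practice, the sketch below estimates how many parts an object is split into. It assumes that 1 MB is 2^20 bytes and that every part except the last is filled to the configured maximum; the actual splitting done by the file system client may differ. The function name part_count is hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def part_count(object_size_bytes, max_part_size_mb=500):
    """Estimate the number of parts for an object of the given size.

    max_part_size_mb mirrors fs.s3n.multipart.uploads.maxpartsize.mb
    (default 500, per this section).
    """
    if object_size_bytes <= 0 or max_part_size_mb <= 0:
        raise ValueError("sizes must be positive")
    part_bytes = max_part_size_mb * MB
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-object_size_bytes // part_bytes)


# A 12 GB object is 12288 MB, which needs 25 parts of at most 500 MB.
print(part_count(12 * GB))
```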

Using Multipart Streaming

Multipart streaming uploads data chunks concurrently rather than serially, which can speed up writes of large objects to S3. Multipart streaming can be enabled only if multipart uploads are enabled. For more information, see Using Multipart Upload.

The following properties are associated with the multipart streaming:

  • By default, fs.s3n.multipart.uploads.streaming.enabled is disabled. Set it to true for enabling multipart streaming.
  • fs.s3n.multipart.streaming.uploads.maxpartsize.mb is the property to control the part size. Its default value is 5.
  • fs.s3n.multipart.uploads.concurrency.factor is the property to control the number of concurrent parts that can be uploaded. Its default value is 1.
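Since streaming depends on multipart uploads being enabled, a quick local check of the combined settings can catch a configuration that would have no effect. The sketch below applies the defaults stated in this section to a plain dict of overrides; the function name effective_streaming_settings and the dict representation are illustrative assumptions, not part of QDS.

```python
# Defaults as stated in this section.
DEFAULTS = {
    "fs.s3n.multipart.uploads.enabled": "true",
    "fs.s3n.multipart.uploads.streaming.enabled": "false",
    "fs.s3n.multipart.streaming.uploads.maxpartsize.mb": "5",
    "fs.s3n.multipart.uploads.concurrency.factor": "1",
}


def effective_streaming_settings(overrides):
    """Merge overrides onto the defaults and report the streaming setup."""
    merged = {**DEFAULTS, **overrides}
    uploads = merged["fs.s3n.multipart.uploads.enabled"].lower() == "true"
    streaming = merged["fs.s3n.multipart.uploads.streaming.enabled"].lower() == "true"
    return {
        # Streaming only takes effect when multipart uploads are also enabled.
        "streaming_active": uploads and streaming,
        "part_size_mb": int(merged["fs.s3n.multipart.streaming.uploads.maxpartsize.mb"]),
        "concurrent_parts": int(merged["fs.s3n.multipart.uploads.concurrency.factor"]),
    }


print(effective_streaming_settings({
    "fs.s3n.multipart.uploads.streaming.enabled": "true",
    "fs.s3n.multipart.uploads.concurrency.factor": "4",
}))
```

With the defaults left untouched, streaming_active is False because streaming is disabled by default.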

QDS supports BlockOutputStream in the S3a file system. To enable BlockOutputStream, set fs.s3a.fast.upload to true. It is an output stream mechanism in which large files or streams are uploaded as blocks whose size is set by fs.s3a.multipart.size. These blocks can be buffered on disk, in arrays (on JVM heap memory), or in byte buffers. The buffering mechanism is set using the fs.s3a.fast.upload.buffer property. Its valid values are disk, array, and bytebuffer. The default value is disk.

If the buffer type is array or bytebuffer, there is a limit on how many buffers can be queued for upload. The queue size is 15 by default, which keeps the default memory footprint small. The queue size is equal to fs.s3a.max.total.tasks, so it can be set to a higher value for large JVMs.
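When sizing the queue for in-memory buffering, a rough upper bound on queued buffer memory is the queue size multiplied by the block size. The sketch below shows that arithmetic only. It assumes every queued buffer holds a full block of fs.s3a.multipart.size bytes and ignores any other memory the stream uses; the 64 MB block size is a hypothetical value chosen for the example, not a documented default.

```python
MB = 1024 * 1024


def queued_buffer_bytes(multipart_size_bytes, max_total_tasks=15):
    """Rough upper bound on memory held by queued upload buffers.

    multipart_size_bytes mirrors fs.s3a.multipart.size; max_total_tasks
    mirrors fs.s3a.max.total.tasks (15 by default, per this section).
    """
    if multipart_size_bytes <= 0 or max_total_tasks <= 0:
        raise ValueError("values must be positive")
    return multipart_size_bytes * max_total_tasks


# Hypothetical 64 MB blocks with the default queue size of 15.
estimate = queued_buffer_bytes(64 * MB)
print(f"{estimate // MB} MB")
```

An estimate like this can indicate whether a larger fs.s3a.max.total.tasks fits within the JVM heap when the buffer type is array.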

Using Multipart Move

Multipart move is required to move objects larger than 5 GB from one Amazon S3 location to another. The associated property, fs.s3n.multipart.move.enabled, is enabled by default.