Using the AWS File System¶
Use filesystems for reading from and writing to Amazon S3 as described in these sub-topics:

- Using Filesystems to Access AWS Cloud Storage
- Using Multipart Upload and Streaming and Move
Using Filesystems to Access AWS Cloud Storage¶
Qubole supports the native S3n and S3a file system client connectors to access AWS S3 cloud storage. Qubole supports multipart upload and move by default on Amazon S3. Amazon S3 allows an object of at most 5 GB to be uploaded in a single HTTP connection.
Using the Native S3n File System¶
By default, Qubole clusters (except Presto) use NativeS3FileSystem to access S3 objects. It uses JetS3t to
interact with AWS S3 endpoints. The URI scheme is
s3n://. For enabling Secure Socket Layer on the S3n file system,
see Enabling Secure Socket Layer.
Using the S3a File System¶
The S3aFileSystem is the successor to the NativeS3FileSystem. It uses the AWS SDK to interact with S3 and therefore supports more S3 endpoints. It also supports Amazon v4 signature-based authentication.
Qubole currently supports the S3a filesystem on all cluster types except Presto. The URI scheme is
s3a://.
The S3aFileSystem is the default file system on Hadoop 2 clusters.
To enable S3aFileSystem, add the following configuration as Hadoop override parameters in a REST API call or the UI cluster configuration.

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3n.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3n.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
qubole.aws.use.v4.signature=true
In a REST API request to create/edit a cluster, add this configuration in
hadoop_settings. Create a New Cluster provides more information.
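As an illustrative sketch only (the exact field names in the Qubole cluster API may differ; `custom_config` here is an assumption), the overrides above would be embedded in the `hadoop_settings` object of the cluster create or edit request body:

```json
{
  "hadoop_settings": {
    "custom_config": "fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem\nqubole.aws.use.v4.signature=true"
  }
}
```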
Enable the automatic detection of a bucket endpoint in the S3a file system by setting the corresponding override property to
true. The option is set to
false by default.
Alternatively, you can create a ticket with Qubole Support to have Qubole enable
automatic detection of a bucket endpoint in the S3a file system. After this feature is enabled, the S3a filesystem does
not honor endpoint-specific properties.
QDS supports the Requester Pays option in the S3a file system. The option is disabled by default; to enable it, set the corresponding override property.
Enabling Secure Socket Layer¶
Secure Socket Layer (SSL) is not enabled by default on the native S3n and S3a file systems. To enable it on the two file systems, you must set:

- fs.s3.https.only=true in the native S3n file system.
- fs.s3a.connection.ssl.enabled=true in the S3a file system.
In the UI cluster configuration page, add this configuration in Override Hadoop Configuration Variables under Hadoop Cluster Settings. Advanced Configuration: Modifying Hadoop Cluster Settings provides more information.
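For example, to turn on SSL for both filesystems at once, the two overrides listed above can be added together as a single override block (a sketch, using only the properties already named in this section):

```properties
# Enable SSL for the native S3n filesystem
fs.s3.https.only=true
# Enable SSL for the S3a filesystem
fs.s3a.connection.ssl.enabled=true
```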
Using the Presto S3 File System¶
Presto clusters use their own S3 filesystem, PrestoS3FileSystem, to access Amazon cloud storage.
Using Multipart Upload and Streaming and Move¶
Uploading and moving objects larger than 5 GB to an Amazon S3 location is only possible by using multipart uploads. Qubole provides different options for enabling multipart uploads and multipart moves, and uses the same configuration options for the S3n and S3a filesystems. Qubole supports multipart uploads, multipart moves, and multipart streaming on Hadoop 2 and Spark clusters. However, only multipart uploads are supported on Presto clusters, which use different configuration options as mentioned in catalog/hive.properties.
Using Multipart Upload¶
Multipart upload enables you to upload objects larger than 5 GB in several parts.

- fs.s3n.multipart.uploads.enabled enables and disables multipart upload. It is enabled by default.
- fs.s3n.multipart.uploads.maxpartsize.mb controls the object's part size. Its default value is 500; that is, by default, large objects are uploaded in chunks of 500 MB.
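For instance, to upload large objects in 256 MB parts instead of the 500 MB default, the overrides would look like this (a sketch using the two properties above; 256 is an arbitrary example value, not a recommendation):

```properties
# Multipart upload is on by default; shown explicitly for clarity
fs.s3n.multipart.uploads.enabled=true
# Upload large objects in 256 MB chunks instead of the 500 MB default
fs.s3n.multipart.uploads.maxpartsize.mb=256
```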
Using Multipart Streaming¶
Multipart streaming speeds up multipart uploads by uploading data chunks concurrently instead of serially, which speeds up writes of large objects to S3. Multipart streaming can be enabled only if multipart uploads are enabled. For more information, see Using Multipart Upload.
The following properties are associated with multipart streaming:

- fs.s3n.multipart.uploads.streaming.enabled is disabled by default. Set it to true to enable multipart streaming.
- fs.s3n.multipart.streaming.uploads.maxpartsize.mb controls the part size. Its default value is 5.
- fs.s3n.multipart.uploads.concurrency.factor controls the number of concurrent parts that can be uploaded. Its default value is 1.
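Putting these together, a sketch of an override block that streams 100 MB parts with four concurrent uploads (the values are illustrative examples, not tuning recommendations):

```properties
# Streaming requires multipart uploads to be enabled
fs.s3n.multipart.uploads.enabled=true
fs.s3n.multipart.uploads.streaming.enabled=true
# Stream 100 MB parts, four at a time
fs.s3n.multipart.streaming.uploads.maxpartsize.mb=100
fs.s3n.multipart.uploads.concurrency.factor=4
```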
QDS supports BlockOutputStream in the S3a filesystem. To enable BlockOutputStream, set the corresponding override property to
true. It is an output stream mechanism in which large files/streams are uploaded as blocks of the size
fs.s3a.multipart.size. These blocks can be buffered on disk, in an array (on JVM heap memory), or in byte buffers. The
buffering mechanism can be set using the property fs.s3a.fast.upload.buffer. Its valid values are disk, array, and
bytebuffer.
If the block output type is
bytebuffer, then there is a limit on how many buffers can be queued for upload.
The queue size is 15 by default, which keeps the default memory footprint small. The queue size is equal to
fs.s3a.max.total.tasks and can therefore be configured with a higher value for large JVMs.
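A sketch of S3a block-output overrides combining the properties above. The 128M part size and task count are illustrative values, and fs.s3a.fast.upload as the enabling flag is an assumption based on the standard Hadoop property name, since the text does not name the flag:

```properties
# Assumed enabling flag (standard Hadoop S3a property name)
fs.s3a.fast.upload=true
# Buffer upload blocks in byte buffers rather than on disk
fs.s3a.fast.upload.buffer=bytebuffer
# Block size for each uploaded part (illustrative)
fs.s3a.multipart.size=128M
# Allow a deeper upload queue on a large JVM (illustrative)
fs.s3a.max.total.tasks=25
```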
Using Multipart Move¶
Multipart move is required to move objects larger than 5 GB from one Amazon S3 location to another.
fs.s3n.multipart.move.enabled, the property associated with multipart move, is enabled by default.
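Because the property is on by default, an override is only needed to restore it after it has been disabled:

```properties
# Re-enable multipart move (the default state)
fs.s3n.multipart.move.enabled=true
```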