Submit a Hadoop S3DistCp Command

POST /api/v1.2/commands/

Hadoop DistCp is the tool used for copying large amounts of data across clusters. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS).

In the Qubole context, if you are running multiple jobs on the same datasets, you can use S3DistCp to copy large amounts of data from S3 to HDFS; subsequent jobs can then point directly to the data in the HDFS location. You can also use S3DistCp to copy data from HDFS to S3, as shown in the sketch below. For more details on S3DistCp, see the AWS documentation.

Ensure that the output directory does not already exist before running a Hadoop job.
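For example, a minimal request for the reverse direction, copying data from HDFS back to S3, could look like the following; the bucket name and paths are illustrative placeholders only:

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "s3distcp", "sub_command_args": "--src /datasets/output --dest s3://<your-bucket>/output/", "command_type": "HadoopCommand"}' \
"https://api.qubole.com/api/v1.2/commands"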

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type HadoopCommand
sub_command s3distcp
sub_command_args
[hadoop-generic-options] [s3distcp-arg1] [s3distcp-arg2] ...
src

The HDFS or Amazon S3 location of the data to copy.

Important

S3DistCp does not support Amazon S3 bucket names that contain an underscore (_) character.

dest

The HDFS or Amazon S3 destination path for the copied data.

Important

S3DistCp does not support Amazon S3 bucket names that contain an underscore (_) character.

label Specify the cluster label on which this command is to be run.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quotes), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. You can use a tag as a filter value while searching commands. A tag can contain a maximum of 255 characters, and a comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes macros, which are valid assignment statements containing a variable and its expression, in the form macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
srcPattern

A regular expression that filters the copy operation to a data subset at the src. If you specify neither srcPattern nor groupBy, all data from src is copied to dest.

If the regular expression contains special characters such as an asterisk (*), either the regular expression or the entire args string must be enclosed in single quotes (').

groupBy

A regular expression that causes S3DistCp to concatenate files that match the expression. For example, you could use this option to combine log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping.

Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster fails on the S3DistCp step and returns an error.

If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire args string must be enclosed in single quotes (').

When groupBy is specified, only files that match the specified pattern are copied. Do not specify groupBy and srcPattern at the same time. (An illustrative sub_command_args example using groupBy follows this table.)

targetSize

The size, in mebibytes (MiB), of the files to create based on the groupBy option. This value must be an integer. When it is set, S3DistCp attempts to match this size; the actual size of the copied files may be larger or smaller than this value. Jobs are aggregated based on the size of the data file; hence, it is possible that the target file size will match the source data file size.

If the files concatenated by groupBy are larger than the value of targetSize, they are broken up into part files, and named sequentially with a numeric value appended to the end. For example, a file concatenated into file.gz would be broken into parts as: file0.gz, file1.gz, and so on.

outputCodec It specifies the compression codec to use for the copied files. Possible values are gzip, gz, lzo, snappy, and none. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to decompress the files as part of the copy operation. If you choose an output codec, the filename is appended with the appropriate extension (for example, for gz and gzip, the extension is .gz). If you do not specify a value for outputCodec, the files are copied over with no change in their compression.
s3ServerSideEncryption It ensures that the target data is transferred using SSL and automatically encrypted in Amazon S3 using an AWS service-side key. When data is retrieved using S3DistCp, the objects are automatically decrypted. If you try to copy an unencrypted object to an encryption-required Amazon S3 bucket, the operation fails.
deleteOnSuccess If the copy operation is successful, this option makes S3DistCp delete copied files from the source location. It is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you do not want to copy the same files twice.
disableMultipartUpload It disables the use of multipart upload.
encryptionKey If SSE-KMS or SSE-C is specified as the s3SSEAlgorithm, this parameter specifies the key with which the data is encrypted. With SSE-KMS, the key is not mandatory because AWS KMS is used by default. With SSE-C, the key is mandatory; if it is not specified, the job fails.
filesPerMapper It is the number of files placed in each map task.
multipartUploadChunkSize It is the multipart upload part size, in MiB. By default, S3DistCp uses multipart upload when writing to Amazon S3. The default chunk size is 16 MiB.
numberFiles It prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by startingIndex.
startingIndex It is used with numberFiles to specify the first number in the sequence.
outputManifest It creates a text file, compressed with Gzip, that contains a list of all files copied by S3DistCp.
previousManifest It reads a manifest file that was created during a previous call to S3DistCp using the outputManifest. When previousManifest is set, S3DistCp excludes the files listed in the manifest from the copy operation. If outputManifest is specified along with previousManifest, files listed in the previous manifest also appear in the new manifest file, even though the files are not copied.
copyFromManifest It reverses the previousManifest behavior to cause S3DistCp to use the specified manifest file as a list of files to copy, instead of a list of files to exclude from copying.
s3Endpoint It specifies the Amazon S3 endpoint to use when uploading a file. This option sets the endpoint for both the source and destination. If not set, the default endpoint is s3.amazonaws.com. For a list of Amazon S3 endpoints, see Endpoints.
s3SSEAlgorithm It is the algorithm used for encryption. If you do not specify it but s3ServerSideEncryption is enabled, the AES256 algorithm is used by default. Valid values are AES256, SSE-KMS, and SSE-C.
srcS3Endpoint It is the Amazon S3 endpoint to use for the source path.
timeout It is a timeout for command execution, in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
tmpDir It is the location (path) where files are stored temporarily when they are copied from the cloud object storage to the cluster. The default value is hdfs:///tmp.
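The following sketch combines several of the options described above: it concatenates hourly log files into Gzip-compressed files of roughly 128 MiB on HDFS. The bucket name, cluster label, paths, and regular expression are illustrative placeholders only; adapt them to your own data layout.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "s3distcp", "sub_command_args": "--src s3://<your-bucket>/logs/ --dest /datasets/logs --groupBy .*/([0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}).*[.]log --targetSize 128 --outputCodec gz", "command_type": "HadoopCommand", "label": "<cluster-label>"}' \
"https://api.qubole.com/api/v1.2/commands"

Here, files whose names match the parenthesized group (an hourly timestamp in this sketch) are combined into a single output file, the concatenated output is split into parts of about 128 MiB, and each part is written with a .gz extension.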

Example

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "s3distcp", "sub_command_args": "--src s3://paid-qubole/kaggle_data/HeritageHealthPrize/ --dest /datasets/HeritageHealthPrize", "command_type": "HadoopCommand"}' \
"https://api.qubole.com/api/v1.2/commands"

Note

The above syntax uses https://api.qubole.com as the endpoint. Qubole provides other endpoints to access QDS that are described in Supported Qubole Endpoints on Different Cloud Providers.

Remember to change the source folder before running the above command. The source folder must be an S3 location that is accessible using your AWS credentials.
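As an additional sketch, optional request attributes from the Parameters table, such as label, name, tags, and timeout, can be sent in the same request body. The values below are placeholders:

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "s3distcp", "sub_command_args": "--src s3://<your-bucket>/input/ --dest /datasets/input", "command_type": "HadoopCommand", "label": "<cluster-label>", "name": "s3distcp-daily-copy", "tags": ["<tag-value>"], "timeout": 7200}' \
"https://api.qubole.com/api/v1.2/commands"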

Sample Response

{
    "id": 18167,
    "meta_data": {
        "logs_resource": "commands/18167/logs",
        "results_resource": "commands/18167/results"
    },
    "command": {
        "sub_command": "s3distcp",
        "sub_command_args": "--src s3://paid-qubole/kaggle_data/HeritageHealthPrize/ --dest /datasets/HeritageHealthPrize"
    },
    "command_type": "HadoopCommand",
    "created_at": "2013-03-14T09:34:15Z",
    "path": "/tmp/2013-03-14/53/18167",
    "progress": 0,
    "qbol_session_id": 3525,
    "qlog": null,
    "resolved_macros": null,
    "status": "waiting"
}