Enable S3 Listing and Wildcard Optimization

Listing files in an S3 location can be slow, and listing paths that contain wildcards can be slow as well. Each case can be sped up with a configuration option.

Enable S3 Listing Optimization

As part of split computation, Hive must list all files in the table’s S3 location. Hadoop jobs likewise must list all files in an S3 location. The Apache Hadoop implementation for listing files in S3 is very slow. To speed this up, an optimized listing is enabled by default for Hive queries and Hadoop jobs through the setting fs.s3.inputpathprocessor=true.

Enable Wildcard Optimization

There are two wildcard characters: the asterisk (*), which matches zero or more arbitrary characters, and the question mark (?), which matches exactly one arbitrary character.
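
The matching rules for the two wildcard characters can be illustrated with a minimal Python sketch. It is not the Hadoop implementation: the helper names glob_to_regex and matches are hypothetical, and the sketch assumes that a wildcard never matches the path separator (/), so each wildcard stays within one path segment.

```python
import re


def glob_to_regex(pattern):
    """Translate a path pattern containing * and ? into a compiled regex.

    Assumption for this sketch: wildcards do not match "/".
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")  # zero or more characters in one segment
        elif ch == "?":
            parts.append("[^/]")   # exactly one character in one segment
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts))


def matches(pattern, path):
    """Return True if the whole path matches the wildcard pattern."""
    return glob_to_regex(pattern).fullmatch(path) is not None


if __name__ == "__main__":
    pattern = "s3://my-bucket/my-dir/*/file.csv"
    for candidate in (
        "s3://my-bucket/my-dir/2024/file.csv",
        "s3://my-bucket/my-dir/a/b/file.csv",
    ):
        print(candidate, matches(pattern, candidate))
```

Under these assumptions, the asterisk in s3://my-bucket/my-dir/*/file.csv matches any single directory name under my-dir, while a pattern such as part-?.csv matches part-1.csv but not part-10.csv.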

Listing a path that contains an asterisk, for example, s3://my-bucket/my-dir/*/file.csv, can be slow. To speed up the listing, set mapred.job.natives3filesystem.globstatus.use to true.
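
As a rough illustration of how the two properties named on this page might be supplied, the following Python sketch renders them as Hive SET statements and as Hadoop -D generic options. The helper names are hypothetical, and the sketch assumes that the properties are accepted through these standard mechanisms; confirm against your environment where each property should be set.

```python
# Property names and values are taken from this page.
LISTING_PROPERTIES = {
    "fs.s3.inputpathprocessor": "true",
    "mapred.job.natives3filesystem.globstatus.use": "true",
}


def hive_set_statement(name, value):
    """Render one property as a Hive SET statement."""
    return "SET {}={};".format(name, value)


def hadoop_generic_option(name, value):
    """Render one property as a Hadoop -D generic option (two arguments)."""
    return ["-D", "{}={}".format(name, value)]


def render_hive(properties):
    """Render all properties as Hive SET statements, one per line."""
    return "\n".join(hive_set_statement(k, v) for k, v in properties.items())


def render_hadoop(properties):
    """Render all properties as a flat list of command-line arguments."""
    args = []
    for name, value in properties.items():
        args.extend(hadoop_generic_option(name, value))
    return args


if __name__ == "__main__":
    print(render_hive(LISTING_PROPERTIES))
    print(" ".join(render_hadoop(LISTING_PROPERTIES)))
```

The Hive statements would go at the start of a query, and the -D arguments would be passed before the job's own arguments. Because fs.s3.inputpathprocessor is already true by default, only the wildcard property normally needs to be set explicitly.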