Enable S3 Listing and Wildcard Optimization
Listing files in an S3 location can be slow, and a configuration option speeds it up. Similarly, listing paths that contain wildcards can be slow, and a separate configuration option makes it faster.
Enable S3 Listing Optimization
As part of split computation, Hive must list all files in the table’s S3 location. Hadoop jobs must likewise list all
files in an S3 location. The Apache Hadoop implementation for listing files in S3 is very slow. To speed this up, an
optimized listing implementation is enabled by default for Hive queries and Hadoop jobs through the setting
fs.s3.inputpathprocessor=true.
Enable Wildcard Optimization
There are two wildcard characters: the asterisk (*), which matches zero or more arbitrary characters, and the
question mark (?), which matches exactly one arbitrary character.
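The matching rules for the two wildcard characters can be illustrated with a short sketch. It uses Python's standard fnmatch module as a stand-in, not the Hadoop glob implementation, and the file names are hypothetical. The names contain no path separators, because fnmatch and Hadoop globs may treat "/" differently.

```python
from fnmatch import fnmatchcase

# Hypothetical file names within a single directory (no path separators).
names = ["file.csv", "file1.csv", "file12.csv", "data.csv"]

# '*' matches zero or more arbitrary characters.
star_matches = [n for n in names if fnmatchcase(n, "file*.csv")]

# '?' matches exactly one arbitrary character.
question_matches = [n for n in names if fnmatchcase(n, "file?.csv")]

print(star_matches)      # ['file.csv', 'file1.csv', 'file12.csv']
print(question_matches)  # ['file1.csv']
```

Note that file.csv matches the asterisk pattern (zero characters) but not the question-mark pattern, which requires exactly one character.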
Listing a path that contains an asterisk, for example, s3://my-bucket/my-dir/*/file.csv, may be slow. To speed up
the listing process, set mapred.job.natives3filesystem.globstatus.use to true.
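The text does not say where this property should be set. As a minimal sketch, assuming it is applied at the session level in Hive with a SET statement, the hypothetical helper below renders that statement from the property name. The helper name and the choice of a Hive session are assumptions for illustration only.

```python
def hive_set_statement(key: str, enabled: bool) -> str:
    """Render a Hive session-level SET statement for a boolean property.

    Hypothetical helper: whether the property is set per session,
    per job, or cluster-wide depends on the deployment.
    """
    value = "true" if enabled else "false"
    return f"SET {key}={value};"


# Property name taken from the text above.
WILDCARD_OPT = "mapred.job.natives3filesystem.globstatus.use"

statement = hive_set_statement(WILDCARD_OPT, True)
print(statement)  # SET mapred.job.natives3filesystem.globstatus.use=true;
```

The rendered statement would be run before the query that reads the wildcard path.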