Set Data Compression on Hadoop-1 Clusters

Data compression in Hadoop can speed up input/output operations because Hadoop jobs are data-intensive. It saves storage space and speeds up data transfer over the network. However, compressing and decompressing data increases CPU utilization and processing time. Both the decision to compress and the compression format used have a considerable impact on the performance of MapReduce jobs.

The available data compression formats are configured as described below:

  • gzip compression format - The file extension of this compression format is .gz. This format is not splittable. The following configuration is used to set this format:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
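
    Example (sketch)

    The gzip settings can be applied with the same pattern as the Snappy and zlib examples in this document. The following is a minimal sketch rather than a tested recipe: it assumes the manager external table from those examples already exists, and the table name manager_gzip is hypothetical.

    ```sql
    -- Assumption: the manager table from this document's examples already exists.
    -- manager_gzip is a hypothetical name used only for illustration.
    DROP TABLE IF EXISTS manager_gzip;

    -- Enable compressed output and select the gzip codec for this session.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Copy the data; the files written for manager_gzip should carry the .gz extension.
    CREATE TABLE manager_gzip LIKE manager;
    INSERT OVERWRITE TABLE manager_gzip
    SELECT * FROM manager;

    -- Hive decompresses the files transparently when reading.
    SELECT * FROM manager_gzip LIMIT 3;
    ```

    Because gzip is not splittable, each output file written this way is read by a single mapper in later jobs.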
    
  • bzip2 compression format - The file extension of this compression format is .bz2. This format is splittable. The following configuration is used to set this format:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
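
    Example (sketch)

    A bzip2 variant of the same pattern might look like the following. It is a sketch under the same assumptions: the manager table comes from the examples in this document, and manager_bzip2 is a hypothetical table name.

    ```sql
    -- Assumption: the manager table from this document's examples already exists.
    -- manager_bzip2 is a hypothetical name used only for illustration.
    DROP TABLE IF EXISTS manager_bzip2;

    -- Enable compressed output and select the bzip2 codec for this session.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

    -- Copy the data; the files written for manager_bzip2 should carry the .bz2 extension.
    CREATE TABLE manager_bzip2 LIKE manager;
    INSERT OVERWRITE TABLE manager_bzip2
    SELECT * FROM manager;

    SELECT * FROM manager_bzip2 LIMIT 3;
    ```

    Since bzip2 is splittable, a large .bz2 file can be divided among several mappers, which is the usual reason to accept its slower compression compared with gzip.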
    
  • lzo compression format - The file extension of this compression format is .lzo. This format is splittable only if the compressed files are indexed. The following configuration is used to set this format:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
    
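  • snappy compression format - The file extension of this compression format is .snappy. A Snappy-compressed text file is not splittable by itself; Snappy data is splittable only when it is stored inside a container format such as SequenceFile, which is where the BLOCK compression type applies. The following configuration is used to set this format: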

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
    

    Example (AWS)

    DROP TABLE IF EXISTS manager;
    CREATE EXTERNAL TABLE manager (manageid string, yearid string, teamid string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3n://qubole-abc/csv';
    DROP TABLE IF EXISTS manager_snappy;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
    CREATE TABLE manager_snappy like manager;
    INSERT OVERWRITE TABLE manager_snappy
    SELECT * FROM manager;
    SELECT * FROM manager_snappy limit 3;
    
  • zlib/deflate compression format - It is the default data compression format. The file extension of this compression format is .deflate. This format is not splittable. The following configuration is used to set this format:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
    

    Example (AWS)

    DROP TABLE IF EXISTS manager;
    CREATE EXTERNAL TABLE manager (manageid string, yearid string, teamid string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3n://qubole-abc/csv';
    DROP TABLE IF EXISTS manager_zlib_is_default;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
    CREATE TABLE manager_zlib_is_default like manager;
    INSERT OVERWRITE TABLE manager_zlib_is_default
    SELECT * FROM manager;
    SELECT * FROM manager_zlib_is_default limit 3;
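
The SET properties above apply to the current session, so it can help to check what is in effect before running further queries. The following sketch uses the manager_zlib_is_default table from the example above; it is one possible way to inspect and reset the settings, not a required step.

```sql
-- SET with a property name and no value prints the current setting.
SET hive.exec.compress.output;
SET mapred.output.compression.codec;
SET mapred.output.compression.type;

-- Show table metadata, including the Location where the compressed files were written.
DESCRIBE FORMATTED manager_zlib_is_default;

-- Turn output compression off again for later queries in the same session.
SET hive.exec.compress.output=false;
```

Listing the files at the reported location should show the extension of the codec that was active during the INSERT, for example .deflate for DefaultCodec.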