April 18, 2011

Compressing Hadoop Output Using Gzip and Lzo

In most cases, writing output files in a compressed format is faster, simply because less data is written to disk. For this to pay off, the compression algorithm has to be fast enough that the I/O savings outweigh the extra CPU time spent on compression.

To compress regular output formats with Gzip, use:
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
...
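To see where these calls fit, here is a minimal sketch of a complete driver using the new org.apache.hadoop.mapreduce API. The class name is made up for this post, and no mapper or reducer is set, so the default identity classes simply rewrite the text input as gzip-compressed output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GzipOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "gzip output example");
        job.setJarByClass(GzipOutputDriver.class);

        // No mapper/reducer set: the defaults pass records straight through,
        // so the job just copies its input (TextInputFormat is the default).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Write the final output as gzip-compressed text.
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setCompressOutput(job, true);
        TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output then shows up as part-r-00000.gz (and so on) under the output directory.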

For Lzo output compression, download the hadoop-lzo package by @kevinweil. Then the following should work:
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setCompressOutput(job, true);
// LzoCodec here is com.hadoop.compression.lzo.LzoCodec from the hadoop-lzo package.
TextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);
...
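One extra step compared to Gzip: the LZO codecs do not ship with Hadoop, so they have to be registered before the codec factory can find them. This is normally done once in core-site.xml, but the same keys can be set on the job's Configuration. The property names below are taken from the hadoop-lzo setup instructions, so double-check them against the README of the version you download:

Configuration conf = new Configuration();
// Register the hadoop-lzo codecs alongside the built-in ones.
conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.DefaultCodec,"
      + "com.hadoop.compression.lzo.LzoCodec,"
      + "com.hadoop.compression.lzo.LzopCodec");
conf.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");
// ... then build the Job from this conf, as in the driver sketch above.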

In terms of space efficiency, Gzip compresses better; in terms of speed, Lzo is much faster. Lzo output can also be made splittable (the hadoop-lzo package ships an indexer for this), whereas splittable Gzip is not available.
Keep in mind that these two techniques only compress the final output of a Hadoop job. To also compress the intermediate data that flows from mappers to reducers, the relevant parameters should be configured in mapred-site.xml.
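For reference, the two keys involved (under the pre-0.21 "mapred.*" naming in use at the time of writing) are mapred.compress.map.output and mapred.map.output.compression.codec. They can go into mapred-site.xml as a cluster-wide default, or be set per job on the Configuration, roughly like this (GzipCodec is used here because it ships with Hadoop; LzoCodec is a common choice for intermediate data because of its speed):

// Needs org.apache.hadoop.io.compress.CompressionCodec and GzipCodec imported.
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);

Since the intermediate data is spilled to local disk and shuffled over the network, a fast codec usually matters more here than a high compression ratio.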