Recently I have been playing around with HBase for a project that will need to store billions of rows (long scale), with a column count varying from 1 to 1 million per row. My test data (13.3 million rows, 130.8 million columns) took 27 GB of storage without compression; with compression enabled, the same data took only 6.6 GB, a roughly 4x reduction.
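Compression in HBase is configured per column family, not per table. Once LZO is set up as described below, a new table can use it from the start; a minimal sketch in the HBase shell, where 'mytable' and 'cf' are placeholder names:

```
create 'mytable', {NAME => 'cf', COMPRESSION => 'LZO'}
```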
I followed several guides from around the web on how to enable LZO (which cannot be shipped enabled by default, since LZO is GPL-licensed and therefore incompatible with HBase's Apache license), but every one I tried had minor faults in it, probably due to version differences.
Anyhow, this is how I did it (assuming Debian or Ubuntu):
apt-get install liblzo2-dev sun-java6-jdk ant
svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/trunk/ hadoop-gpl-compression
Then build the jar and the native library (the exact ant targets can vary between versions):
cd hadoop-gpl-compression
ant compile-native package
cp build/hadoop-gpl-compression-*.jar $HBASE_HOME/lib/
cp build/native/Linux-amd64-64/lib/* /usr/local/lib/
echo "export HBASE_LIBRARY_PATH=/usr/local/lib/" >> $HBASE_HOME/conf/hbase-env.sh
mkdir -p $HBASE_HOME/build
cp -r build/native $HBASE_HOME/build/native
Then verify that it works with:
./bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/testfile lzo
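If the CompressionTest passes, existing tables can be switched over to LZO as well. Only newly written store files pick up the new compression setting, so a major compaction is needed to rewrite the existing ones. A sketch in the HBase shell ('mytable' and 'cf' are placeholder names; older HBase versions may require disabling the table before altering it):

```
alter 'mytable', {NAME => 'cf', COMPRESSION => 'LZO'}
major_compact 'mytable'
```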