I've been meaning to do some MapReduce work, probably with Hadoop Streaming, which means getting the data into a line-based format. That led to the question of how to encode (and maybe compress) each record, so I decided to benchmark the candidates (a quick sketch of the intended usage follows the list):

  • base64
  • json
  • bz2
  • gzip
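
For context: the plan is to serialize each record onto one newline-free line so Hadoop Streaming can split records on '\n'. Here is a minimal sketch of that packing step with base64; the record contents are made up for illustration:

#!/usr/bin/env python
# Hypothetical packing step for Hadoop Streaming: base64 output contains
# no '\n' or '\t', so each record stays on a single input line.
import base64

record = 'key\tvalue with\nembedded newlines and \x00 bytes'
line = base64.b64encode( record )
print line

# A streaming mapper/reducer reverses this per input line:
assert base64.b64decode( line ) == record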

I already had a rough idea of how this would turn out, but it's worth measuring anyway:

#!/usr/bin/env python
# Times a full encode + decode round trip for each candidate.
# Note: Timer.timeit() runs the statement 1,000,000 times by default.

from timeit import Timer
import json
import base64
import bz2
import zlib

# Short test string (digits, letters, punctuation) with essentially no repetition.
s = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+|'

def do_base64():
    encoded = base64.b64encode( s )
    decoded = base64.b64decode( encoded )

def do_json():
    encoded = json.dumps( s )
    decoded = json.loads( encoded )

def do_bz2():
    encoded = bz2.compress( s )
    decoded = bz2.decompress( encoded )

def do_gzip():
    # zlib stands in for gzip here: both use DEFLATE; the gzip module
    # only adds file-style framing on top of zlib.
    encoded = zlib.compress( s )
    decoded = zlib.decompress( encoded )

if __name__ == '__main__':
    t1 = Timer( "do_base64()" , "from __main__ import do_base64" )
    try:
        print "Encode & Decode By base64: " + str( t1.timeit() )
    except:
        t1.print_exc()

    t2 = Timer( "do_json()" , "from __main__ import do_json" )
    try:
        print "Encode & Decode By json: " + str( t2.timeit() )
    except:
        t2.print_exc()

    t3 = Timer( "do_bz2()" , "from __main__ import do_bz2" )
    try:
        print "Encode & Decode By bz2: " + str( t3.timeit() )
    except:
        t3.print_exc()

    t4 = Timer( "do_gzip()" , "from __main__ import do_gzip" )
    try:
        print "Encode & Decode By gzip: " + str( t4.timeit() )
    except:
        t4.print_exc()

On an AMD X4 955 + 4 GB DDR3-1200 box running Ubuntu 10.04 i386:

$ python t.py
Encode & Decode By base64: 2.40118098259
Encode & Decode By json: 12.9051868916
Encode & Decode By bz2: 105.709769011
Encode & Decode By gzip: 19.3650279045

base64 looks like a pretty good choice after all. My old impression was that base64 inflates the data by about 50%, but the actual overhead is 4 output characters per 3 input bytes, i.e. roughly 33%. Also note that timeit runs each statement 1,000,000 times by default. (Since the test string has essentially no repetition, the comparison is admittedly unfair to the compressors. XD)
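
To put numbers on the size point (and the fairness caveat), here is a quick sketch that compares encoded sizes for the same test string and for a deliberately repetitive input; the repetitive input is my own addition for contrast:

#!/usr/bin/env python
# Compare output sizes: base64 grows everything by ~33%, while the
# compressors only pay off once the input actually repeats.
import base64
import bz2
import zlib

s = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+|'
repetitive = s * 100    # hypothetical repetitive input for contrast

for name, data in ( ( 'unique', s ), ( 'repetitive', repetitive ) ):
    print name, 'input:', len( data )
    print '  base64:', len( base64.b64encode( data ) )
    print '  bz2   :', len( bz2.compress( data ) )
    print '  zlib  :', len( zlib.compress( data ) )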
