From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 16 Aug 2010 00:05:29 +0400 From: Alexey Tourbin To: devel@lists.altlinux.org Message-ID: <20100815200529.GA27746@altlinux.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="cWoXeonUoKmBZSoM" Content-Disposition: inline Subject: [devel] quick survey: Bernstein hash vs Jenkins hash X-BeenThere: devel@lists.altlinux.org X-Mailman-Version: 2.1.12 Precedence: list Reply-To: ALT Linux Team development discussions List-Id: ALT Linux Team development discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Aug 2010 20:05:29 -0000 Archived-At: List-Archive: List-Post: --cWoXeonUoKmBZSoM Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Summary: to implement soname set-versions, we need a hash function for ELF symbols. Despite its theoretical weakness, djb hash seems to perform slightly better than Bob Jenkins' "one-at-a-time" hash. In soname set-versions, each ELF symbol will be represented with its hash value, using typically 20 bits per symbol. Hence we need a hash function which does a good job. Badly. However, since 20 bits is not much more than a completely ridiculous number of bits, basically we don't need a function which is cryptographically strong and which can guard against attacks. We only need a function which plays very well with real-world data, in terms of collisions. If you're unfamiliar with the subject, here's Bob Jenkins' survey: http://burtleburtle.net/bob/hash/doobs.html There's quite a few hash functions which might serve our purpose. To name a few: Jenkins' one-at-a-time hash is used in perl; libstdc++ makes use of FNV hashing; in GNU_HASH, they resort to age-old djb hash. Below I attempt to compare djb hash and Jenkins' hash against each other. There's a subtle point, though: djb does not "mix" bits; to obtain a smaller hash value, one is better off with modulo prime. Jenkins does mix bits, and smaller hash values can be obtained with bitmask. Now note that, to match hash values of different size, we have to use bitmask - modulo prime is simply not an option. However, it seems that djb hash performs quite well even with bitmask. So, I have a program hash-strings.c, and I have some packages under /var/cache/apt/archives. Let's see how 'j' and 'b' 20-bits hashes differ in terms of collisions. $ gcc -Wall -O hash-strings.c -lcrypto -o hash-strings Prepare some data: $ rpmelfsym.pl /var/cache/apt/archives |cut -f4 |./hash-strings b 20 |sort -u >b20 $ rpmelfsym.pl /var/cache/apt/archives |cut -f4 |./hash-strings j 20 |sort -u >j20 $ wc -l b20 j20 205364 b20 205364 j20 410728 total $ If you can't get it, here is what it is: $ head b20 ACCESS_DESCRIPTION_free 00096a3b ACCESS_DESCRIPTION_it 0009f4f6 ACCESS_DESCRIPTION_new 0008a783 ACL_1.0 0006e163 ACL_1.1 0006e164 AESKeyWrap_Decrypt 0009acbb AESKeyWrap_DestroyContext 000b11af AESKeyWrap_Encrypt 00025725 AES_Decrypt 0004c798 AES_DestroyContext 0009b1ec $ Unique hash values: $ sort -u -k2 b20 |wc -l 186769 $ sort -u -k2 j20 |wc -l 186490 $ Total number of collisions: $ sort -k2 b20 |uniq -D -f1 |wc -l 36057 $ sort -k2 j20 |uniq -D -f1 |wc -l 36567 $ Number of collision clusters: $ sort -k2 b20 |uniq -c -f1 |awk '$1>1' |wc -l 17462 $ sort -k2 j20 |uniq -c -f1 |awk '$1>1' |wc -l 17693 $ Number of collisions in a cluster: $ sort -k2 b20 |uniq -c -f1 |awk '$1>1' |sort -n |tail 4 _ZNSs2atEm 000b350b 4 _ZTV13wxRendererGTK 000ddc38 4 _ZTVN9functions11rootsectionE 0005a028 4 ___G__23__23_make_2d_chartable 00004386 4 ___G_c_23_set_2d_val 000afb31 4 _lsa_RemoveAccountRights 00012b0f 4 asn1buf_remains 0002e023 4 cli_rpc_pipe_open_schannel_with_key 00077d4d 4 do_add 00014580 4 pango_font_family_get_name 0008d850 $ sort -k2 j20 |uniq -c -f1 |awk '$1>1' |sort -n |tail 4 _ZNK6OpenSP19ExternalEntityEvent8locationEv 0003b594 4 _ZTI8ppdcFont 000b054d 4 _ZTS13wxDocTemplate 00071287 4 _ZTV16CLimitedInStream 000c02d7 4 ___S_c_23_define_2d_parameterized_2d_decl 000bae87 4 _prelude_async_fork_prepare 0008201f 4 add_setshow_zuinteger_cmd 00012317 4 close_files 0000ceda 5 CONVFMTidx 000aa087 5 _ZN30subselect_single_select_engine7excludeEv 0002810c $ So, readily admitting me being mundane, djb hash is simply better. --cWoXeonUoKmBZSoM Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="hash-strings.c" #include #include #include #include static unsigned int bhash(const char *str) { unsigned int hash = 5381; const unsigned char *p = (const unsigned char *) str; int c; while ((c = *p++)) hash = (hash << 5) + hash + c; return hash; } static unsigned int jhash(const char *str) { unsigned int hash = 0x9e3779b9; const unsigned char *p = (const unsigned char *) str; while (*p) { hash += *p++; hash += (hash << 10); hash ^= (hash >> 6); } hash += (hash << 3); hash ^= (hash >> 11); hash += (hash << 15); return hash; } #include static unsigned int mhash(const char *str) { unsigned char md5[40]; MD5((const unsigned char *)str, strlen(str), md5); return md5[0] | (md5[1] << 8) | (md5[2] << 16) | (md5[3] << 24); } int main(int argc, char *argv[]) { assert(argc == 3); unsigned int (*hashfunc)(const char *str); switch (*argv[1]) { case 'b': hashfunc = bhash; break; case 'j': hashfunc = jhash; break; case 'm': hashfunc = mhash; break; default: assert("hashfunc" == NULL); break; } int bpp = atoi(argv[2]); assert(bpp > 0 && bpp <= 32); char *line = NULL; size_t alloc_size = 0; ssize_t len; while ((len = getline(&line, &alloc_size, stdin)) >= 0) { if (len > 0 && line[len-1] == '\n') line[--len] = '\0'; if (len == 0) continue; unsigned int hash = hashfunc(line); if (bpp < 32) hash &= (1 << bpp) - 1; printf("%s\t%08x\n", line, hash); } return 0; } --cWoXeonUoKmBZSoM--