From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 19 Sep 2005 00:18:34 +0400 From: Alexey Tourbin To: devel@altlinux.ru Message-ID: <20050918201834.GL2358@solemn.turbinal.org> Mail-Followup-To: devel@altlinux.ru References: <20050916093313.GH2358@solemn.turbinal.org> <20050916102437.GB29958@basalt.office.altlinux.org> <20050917151442.GA2358@solemn.turbinal.org> <20050917153330.GA21043@hell.immo.ru> <20050917222328.GD2358@solemn.turbinal.org> <20050917223243.GA24449@basalt.office.altlinux.org> <20050917230044.GE2358@solemn.turbinal.org> <20050917232346.GB24652@basalt.office.altlinux.org> <20050918084618.GH2358@solemn.turbinal.org> <20050918100251.GI2358@solemn.turbinal.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="xYEXny2b0/vlBQCR" Content-Disposition: inline In-Reply-To: <20050918100251.GI2358@solemn.turbinal.org> Subject: [devel] bloom filters X-BeenThere: devel@altlinux.ru X-Mailman-Version: 2.1.5 Precedence: list Reply-To: ALT Devel discussion list List-Id: ALT Devel discussion list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 18 Sep 2005 20:22:28 -0000 Archived-At: List-Archive: List-Post: --xYEXny2b0/vlBQCR Content-Type: multipart/mixed; boundary="FJOUHINv8uyEJFmS" Content-Disposition: inline --FJOUHINv8uyEJFmS Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Sep 18, 2005 at 02:02:51PM +0400, Alexey Tourbin wrote: > =FC=C6=C6=C5=CB=D4=C9=D7=CE=D9=CA reverse lookup =C2=C5=DA =D0=CF=CC=CE= =CF=CA =D4=C1=C2=CC=C9=C3=D9 =CD=CF=D6=CE=CF =D2=C5=C1=CC=C9=DA=CF=D7=C1=D4= =D8 =CE=C1 > =CF=D3=CE=CF=D7=C5 bloom filters. =F1 =D4=CF=CC=D8=CB=CF =D0=CF=CB=C1 = =CE=C5 =DA=CE=C1=C0, =D7 =CB=C1=CB=D5=C0 =D3=D4=CF=D2=CF=CE=D5 =CF=DB=C9=C2= =CB=C1 > =C2=D5=C4=C5=D4. =F3=C5=CA=DE=C1=D3 =D0=CF=D0=D2=CF=C2=D5=C0 =D3=C4=C5= =CC=C1=D4=D8. bloom filter -- =DC=D4=CF =D3=D0=C5=C3=C9=C1=CC=D8=CE=D9=CA =C2=C9=CE=C1=D2= =CE=D9=CA =C8=C5=DB, =CB=CF=D4=CF=D2=D9=CA =D0=CF=DA=D7=CF=CC=D1=C5=D4 =D0=D2=CF=D7=C5=D2=C9=D4=D8 =D0=D2=C9=CE=C1=C4=CC=C5=D6=CE=CF=D3=D4=D8 =DC= =CC=C5=CD=C5=CE=D4=C1 =CB =CD=CE=CF=D6=C5=D3=D4=D7=D5, =CE=C5 =C9=CD=C5=D1 = =D0=D2=C9 =DC=D4=CF=CD (=CE=C1 =D3=D4=C1=C4=C9=C9 =D0=D2=CF=D7=C5=D2=CB=C9) =D3=C1=CD=CF=C7=CF =CD=CE=CF= =D6=C5=D3=D4=D7=C1 =DC=CC=C5=CD=C5=CE=D4=CF=D7. =ED=CE=CF=D6=C5=D3=D4=D7= =CF =DC=CC=C5=CD=C5=CE=D4=CF=D7 =CE=D5=D6=CE=CF =D4=CF=CC=D8=CB=CF =CE=C1 =D3=D4=C1=C4=C9=C9 =D3=CF=DA=C4= =C1=CE=C9=D1 =C8=C5=DB=C1. =F3=D5=DD=C5=D3=D4=D7=D5=C5=D4 =D7=C5=D2=CF=D1=D4=CE=CF=D3=D4=D8 =CF=DB=C9= =C2=CB=C9 =D4=C9=D0=C1 "false positive" -- =D0=D2=CF=C9=DA=D7=CF=CC=D8=CE= =D9=CA =DC=CC=C5=CD=C5=CE=D4 =CF=D0=D2=C5=C4=C5=CC=D1=C5=D4=D3=D1 =CB=C1=CB =D0=D2= =C9=CE=C1=C4=CC=C5=D6=C1=DD=C9=CA =CB =CD=CE=CF=D6=C5=D3=D4=D7=D5, =CF=C4= =CE=C1=CB=CF =D6=C5 =DC=D4=CF=D4 =DC=CC=C5=CD=C5=CE=D4 =CE=C5 =C2=D9=CC =D0=D2=C5=C4=DF=D1=D7=CC=C5=CE =CE= =C1 =D3=D4=C1=C4=C9=C9 =D3=CF=DA=C4=C1=CE=C9=D1 =C8=C5=DB=C1 (=CE=C5 =D7=C8= =CF=C4=C9=CC =D7 =CD=CE=CF=D6=C5=D3=D4=D7=CF =DC=CC=C5=CD=C5=CE=D4=CF=D7). =F0=D2=C9 =D2=C1=D3=C8=CF=C4=C5 =D0=C1=CD=D1= =D4=C9 2 =C2=C1=CA=D4=C1 =CE=C1 =DC=CC=C5=CD=C5=CE=D4 =D7=C5=D2=CF=D1=D4=CE= =CF=D3=D4=D8 false positive =D3=D4=C1=D4=C9=D3=D4=C9=DE=C5=D3=CB=C9 =CD=C5=CE=D8=DB=C5 1%. = =F4=CF =C5=D3=D4=D8 =D7 =D2=D1=C4=C5 =D3=CC=D5=DE=C1=C5=D7 bloom filters =D0=CF=DA=D7=CF=CC=D1=C0=D4 =CD=C9=CE=C9=CD=D5=CD =CE=C1 =D0=CF=D2=D1=C4=CF= =CB =D3=CF=CB=D2=C1=D4=C9=D4=D8 =D7=D2=C5=CD=D1 =D0=D2=CF=D7=C5=D2=CB=C9/= =D2=C1=D3=C8=CF=C4=D9 =D0=C1=CD=D1=D4=C9, =C5=D3=CC=C9 =D3=C1=CD=C1 =CF=DB=C9=C2=CB=C1 =D4=C1=CB=CF=C7=CF =D2=CF=C4= =C1 =C4=CF=D0=D5=D3=D4=C9=CD=C1. =EF=DB=C9=C2=CB=C9 "false negative" (=D4=CF =C5=D3=D4=D8 =CF=D0=D2=C5=C4=C5= =CC=C5=CE=C9=C5 =DC=CC=C5=CD=C5=CE=D4=CF=D7, =C9=DA=CE=C1=DE=C1=CC=D8=CE=CF =D0=D2=C9=CE=C1=C4=CC=C5=D6=C1=DD=C9=C8 =CD=CE=CF=D6=C5=D3=D4=D7=D5, =CB=C1= =CB =CE=C5 =D0=D2=C9=CE=C1=C4=CC=C5=D6=C1=DD=C9=C8 =DC=D4=CF=CD=D5 =CD=CE= =CF=D6=C5=D3=D4=D7=D5) =CE=C5 =D3=D5=DD=C5=D3=D4=D7=D5=C5=D4. Bloom filter =C9=D3=D0=CF=CC=D8=DA=D5=C5=D4=D3=D1, =CE=C1=D0=D2=C9=CD=C5=D2= , =D7 spellchecker'=C1=C8, =CB=CF=C7=C4=C1 =CE=D5=D6=CE=CF =DA=C1=C8=C5=DB=C9=D2=CF=D7=C1=D4=D8 =D7=D3=C5 "=D0=D2=C1=D7=C9=CC=D8=CE=D9= =C5" =D3=CC=CF=D7=C1. =F0=D2=CF=C9=DA=D7=CF=CC=D8=CE=CF=C5 =CE=C5=D0=D2=C1= =D7=C9=CC=D8=CE=CF=C5 =D3=CC=CF=D7=CF =CD=CF=D6=C5=D4 =D3 =CF=DE=C5=CE=D8 =CE=C5=C2=CF=CC=D8=DB=CF=CA =D7=C5=D2= =CF=D1=D4=CE=CF=D3=D4=D8 =CF=D0=D2=C5=C4=C5=CC=C9=D4=D8=D3=D1 =CB=C1=CB =D0= =D2=C1=D7=C9=CC=D8=CE=CF=C5. =F0=CF=C4=D2=CF=C2=CE=C5=C5 =CF=C2 =C1=CC=C7=CF=D2=C9=D4=CD=C5 =C9 =CF=C2= =CF =D7=D3=A3=CD =CF=D3=D4=C1=CC=D8=CE=CF=CD -- =D0=CF =D3=D3=D9=CC=CB=C1= =CD =D7 =C7=D5=C7=CC=C5. =F4=C5=D0=C5=D2=D8 =CF =D2=C5=C1=CC=C9=DA=C1=C3=C9=D1=C8. =EE=CF=D2=CD=C1= =CC=D8=CE=CF=CA =D2=C5=C1=CC=C9=DA=C1=C3=C9=C9 =CE=C5=D4=D5. =E5=D3=D4=D8 = =D0=C5=D2=CC=CF=D7=D9=CA =CD=CF=C4=D5=CC=D8 Bloom::Filter, =CE=CF =CF=CE "=CE=C5 =D4=D1=CE=C5=D4" =C2=CF=CC=D8=DB=CF=C5= =DE=C9=D3=CC=CF =DC=CC=C5=CD=C5=CE=D4=CF=D7 (=CE=C5=D3=CB=CF=CC=D8=CB=CF = =D4=D9=D3=D1=DE =D4=D1=CE=C5=D4 =CE=CF=D2=CD=C1=CC=D8=CE=CF, =CE=CF =CE=D5=D6=CE=CF =D0=CF= =D2=D1=C4=CB=C1 =CD=C9=CC=CC=C9=CF=CE=C1). =EB =D4=CF=CD=D5 =D6=C5 =D4=C1= =CD =D3=C4=C5=CC=C1=CE=CF =C2=C5=DA=C7=D2=C1=CD=CF=D4=CE=CF =D0=CF =DE=C1=D3=D4=C9 =CD=C1=D4=C5=CD=C1= =D4=C9=CB=C9. C/C++ =D2=C5=C1=CC=C9=DA=C1=C3=C9=C0 =D1 =C9=D3=CB=C1=CC, = =CE=CF =CE=C5 =CE=C1=DB=A3=CC. =F0=CF=DC=D4=CF=CD=D5 =D1 =CE=C1=D0=C9=D3=C1=CC =D3=D7=CF=C0 =D5=D0=D2=CF= =DD=C5=CE=CE=D5=C0 =D2=C5=C1=CC=C9=DA=C1=C3=C9=C0. =F2=C1=C2=CF=D4=C1=C5= =D4 =DC=D4=CF =D4=C1=CB: $ gcc -o bloom bloom.c -Wall -lm -lssl $ wc -l /usr/share/dict/words 45427 /usr/share/dict/words $ ./bloom -n 50000 /usr/share/dict/words >words.bf $ ls -sH1 /usr/share/dict/words words.bf 400 /usr/share/dict/words 60 words.bf $ head /usr/share/dict/words ALGOL ANSI ARCO ARPA ARPANET ASCII Aarhus Aaron Ababa Abba $ ./bloom -e ALGOL words.bf; echo $? 0 $ ./bloom -e ANSI words.bf; echo $? 0 $ ./bloom -e ALGOLANSI words.bf; echo $? 1 $ ./bloom -e ANSIALGOL words.bf; echo $? 1 $ =F1 =DA=C1=D7=D4=D2=C1 =C5=C7=CF =CE=C1=D7=C5=D2=CE=CF=C5 =C5=DD=A3 =CE=C1= =D0=C9=CC=D8=CE=C9=CB=CF=CD =C9 =D5=D0=C1=CB=D5=C0. =EF=DB=C9=C2=CB=C9 =D1= =D0=CF=CB=C1 =CE=C5 =C9=D3=CB=C1=CC; =C7=CC=C1=D7=CE=CF=C5, =DE=D4=CF =D2=C1=C2=CF=D4=C1=C5=D4. :) =F4=CF =C5=D3=D4=D8 =CB =DE=C5=CD=D5 =DC=D4=CF =D7=D3=A3: 350-=CD=C5=D4=D2= =CF=D7=D9=CA =C4=C1=CD=D0 ELF-=D3=C9=CD=D7=CF=CC=CF=D7 -- =DC=D4=CF =C5=DD= =A3 =CE=C5 =CB=CF=CE=C5=C3 =D3=D7=C5=D4=C1. =EE=C1 =D3=C1=CD=CF=CD =C4=C5=CC=C5 =D7= =D3=A3 =D0=C1=CB=D5=C5=D4=D3=D1 =C9=DA =D2=C1=D3=DE=C5=D4=C1 2 =C2=C1=CA=D4= =C1 =CE=C1 =D3=C9=CD=D7=CF=CC. --FJOUHINv8uyEJFmS Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="bloom.c" #include #include #include #include #include #include #include typedef struct BloomFilter { size_t n; /* capacity */ double p; /* false positive rate */ size_t m; /* number of bits in vector */ size_t k; /* number of hash functions */ char v[1]; } BF; BF *BF_new(size_t n, double p) { BF *bf; size_t m, k, nb; assert(n > 0 && p > 0 && p < 1); m = n * log(p) / log(0.6185); k = log(p) / log(0.5); assert(m > 0 && k > 0); nb = sizeof(BF) + m / 8; bf = malloc(nb); assert(bf); memset(bf, 0, nb); bf->n = n; bf->p = p; bf->m = m; bf->k = k; return bf; } BF *BF_load(FILE *fp) { size_t n; BF *bf = malloc(sizeof(BF)); assert(bf); n = fread(bf, sizeof(BF), 1, fp); assert(n == 1); assert(bf->n > 0 && bf->p > 0 && bf->p < 1); assert(bf->m > 0 && bf->k > 0); bf = realloc(bf, sizeof(BF) + bf->m / 8); assert(bf); rewind(fp); n = fread(bf, sizeof(BF) + bf->m / 8, 1, fp); assert(n == 1); return bf; } void BF_save(BF *bf, FILE *fp) { size_t nb = sizeof(BF) + bf->m / 8; size_t n = fwrite(bf, nb, 1, fp); assert(n == 1); } void BF_set(BF *bf, size_t n) { assert(bf->m >= n); bf->v[n / 8] |= (1 << (n % 8)); } int BF_isset(BF *bf, size_t n) { assert(bf->m >= n); return bf->v[n / 8] & (1 << (n % 8)); } static size_t rehash(const char digest[], int i) { size_t hash = digest[(i + 1) % 20] + (digest[(i + 2) % 20] << 8) + (digest[(i + 3) % 20] << 16) + (digest[(i + 4) % 20] << 24); hash ^= digest[(i + 6) % 20] + (digest[(i + 7) % 20] << 8) + (digest[(i + 8) % 20] << 16) + (digest[(i + 9) % 20] << 24); return hash; } void BF_add(BF *bf, const char *str, size_t len) { char digest[20]; int i; SHA1(str, len, digest); for (i = 0; i < bf->k; i++) { size_t hash = rehash(digest, i); BF_set(bf, hash % bf->m); } } int BF_exists(BF *bf, const char *str, size_t len) { char digest[20]; int i; SHA1(str, len, digest); for (i = 0; i < bf->k; i++) { size_t hash = rehash(digest, i); int set = BF_isset(bf, hash % bf->m); if (!set) return 0; } return 1; } int main(int argc, char *argv[]) { size_t n = 1024; double p = 0.01; char *e = NULL; int c; while ((c = getopt(argc, argv, "n:p:e:")) != -1) { switch (c) { case 'n': n = strtoul(optarg, NULL, 10); break; case 'p': p = atof(optarg); break; case 'e': e = optarg; break; default: exit(2); } } if (optind + 1 != argc) { fprintf(stderr, "arg count\n"); exit(2); } if (e) { int exists; FILE *fp = fopen(argv[optind], "r"); assert(fp); BF *bf = BF_load(fp); exists = BF_exists(bf, e, strlen(e)); exit(!exists); } else { char line[1024]; BF *bf = BF_new(n, p); FILE *fp = fopen(argv[optind], "r"); assert(fp); while (fgets(line, sizeof(line), fp)) { int len = strlen(line); if (line[len - 1] == '\n') len--; BF_add(bf, line, len); } BF_save(bf, stdout); } return 0; } --FJOUHINv8uyEJFmS-- --xYEXny2b0/vlBQCR Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iD8DBQFDLcuafBKgtDjnu0YRAiCBAJsHA7wdhTpwW3NgAG24Aax6rrBHyQCdGFCB bvTUcUiPeYLX8I+XhIxVkzY= =mg7B -----END PGP SIGNATURE----- --xYEXny2b0/vlBQCR--