From: Alexey Tourbin <alexey.tourbin@gmail.com> To: ALT Linux Team development discussions <devel@lists.altlinux.org> Subject: Re: [devel] hash collision in rpm Date: Wed, 5 May 2021 08:40:27 +0300 Message-ID: <CA+qzen=zsq3au7q_6r4bpigQUeMFFy66jR3x6wo3N=cYNZaODQ@mail.gmail.com> (raw) In-Reply-To: <CA+qzen=J75Ppb0ADn6D4L9Y1g4-8sJ2ZtjfUF9tpq7oc1gxK4A@mail.gmail.com> On Tue, May 4, 2021 at 3:50 PM Alexey Tourbin <alexey.tourbin@gmail.com> wrote: > On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov <legion@altlinux.ru> wrote: > > Если мы выбираем bpp таким образом, чтобы вероятность коллизии была > > 1/1024, то в больших библиотеках (у rust там 16215 symbols) ожидаются > > коллизии, это нормально. > > The number of birthday collisions among n-bit hashes is small only if > we hash up to about sqrt(2^n) items. If we want to keep the number of > birthday collisions under control, then bpp=10+log2(n) is simply a bad > formula. A better one is bpp=max(10,log2(n))+log2(n). I.e. for > n=16215, as per your example, it should be 14+14 rather than 10+14. > > Now, there is a subtle argument why you shouldn't go 14+14, and why > 10+14 is indeed "normal". One view is that we should try to keep the > number of birthday collisions small *per a provided version*. But > this view is naive. Consider what happens when we have one big shared > library and we split it into two smaller libraries. It seems that > smaller libraries can use smaller bpp, e.g. 13+13 each instead of > 14+14 combined. Indeed the number of expected collisions E[X] per each > library is the same. But now we've got two libraries, and the combined > expectation is linear! From the perspective of a client, one OR the > other library is now twice as likely to exhibit a collision, simply > because there are two of them! > So the right view must be under the invariant "it doesn't matter how > the symbols are spread among libraries". That's why "probability per > symbol" makes sense, and "restricting birthday collisions" does not. Let's check that the invariant holds with bpp=10+log2(n). Again, we split a long set-version with bpp=10+14 into two shorter set-versions with bpp=10+13 (the number of symbols being 2^14 resp. 2^13). The number of hash values over the hash space, which is how we define the false positive rate, is the same: 2^14/2^24=2^13/2^23=2^-10. A related notion is the average distance between two neighbour hash values (when we "drop" hash values into the interval [0,2^bpp)). The distance is approximately 2^-10 in both cases. It defines the optimal parameter m for Golomb-Rice code. Roughly speaking, with bpp=10+log2(n), the optimal m is 10. With optimal m, the length of Golomb-Rice code is approximately m+2 bits per integer. So the combined length of the two shorter versions must be about the same as the length of the long version. Finally we need to check the birthday collision rate. With bpp-bit hash values, n of them, the expected number of birthday collisions is (to a good approximation) E[X]=n^2 / 2^(bpp+1) (refer to Cormen et al., Introduction to Algorithms (3rd ed.), p. 133.) Let E_0 be the expected number of collisions for a long version. For a shorter version with (bbp-1)-bit hash values, n/2 of them, we have E_1 = (n/2)^2 / (2^(bpp-1+1) = 1/2 E_0 So the expected number of collisions for short versions is smaller, but we have two of them. By the linearity of expectation, the birthday collision rate is the same. (This generally shows that it is pointless to split a long set-version into a few shorter ones, under any classes of equivalence on symbols such as ELF ABI tags. Counterintuitively, you can't fight birthday collisions by partitioning, unless you also increase the output length. So doing separate set-versions for libfoo.so.0(FOO_1.0) and libfoo.so.0(FOO_1.1) is not necessarily a good idea, we can just as well hash sym@FOO_1.0 and sym@FOO_1.1 and package them into a single version.) * * * By the way, there's a clever way to detect collisions at a higher level, even though low-level collisions are unavoidable, subject to implacable math. We can take advantage of the fact that the build system synchronizes package builds across a few architectures. Currently, if a missing symbol goes undetected due to a hash collision, it does so on all architectures simultaneously. To change that, we need to hash symbols differently, depending on the target architecture (we can pass seed=hash(arch) to the hash function, or use different initialization vectors IV=hash(arch)). The desired outcome is that when a missing symbol goes undetected, it does so, with an overwhelming probability, on only one target. Roughly speaking, if the probability of an undetected missing symbol is 10^-3 on a single target, it must be 10^-6 on two targets, 10^-9 on thee targets, etc.
next prev parent reply other threads:[~2021-05-05 5:40 UTC|newest] Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-05-01 9:04 ` [devel] rust-1:1.50.0-alt1: Sisyphus/x86_64 test rebuild failed Alexey Gladkov 2021-05-04 10:32 ` [devel] hash collision in rpm Anton Farygin 2021-05-04 10:45 ` Alexey Gladkov 2021-05-04 12:50 ` Alexey Tourbin 2021-05-05 5:40 ` Alexey Tourbin [this message] 2021-05-05 13:10 ` Dmitry V. Levin 2021-05-05 14:11 ` Vladimir D. Seleznev
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to='CA+qzen=zsq3au7q_6r4bpigQUeMFFy66jR3x6wo3N=cYNZaODQ@mail.gmail.com' \ --to=alexey.tourbin@gmail.com \ --cc=devel@lists.altlinux.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
ALT Linux Team development discussions This inbox may be cloned and mirrored by anyone: git clone --mirror http://lore.altlinux.org/devel/0 devel/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 devel devel/ http://lore.altlinux.org/devel \ devel@altlinux.org devel@altlinux.ru devel@lists.altlinux.org devel@lists.altlinux.ru devel@linux.iplabs.ru mandrake-russian@linuxteam.iplabs.ru sisyphus@linuxteam.iplabs.ru public-inbox-index devel Example config snippet for mirrors. Newsgroup available over NNTP: nntp://lore.altlinux.org/org.altlinux.lists.devel AGPL code for this site: git clone https://public-inbox.org/public-inbox.git