ALT Linux Team development discussions
 help / color / mirror / Atom feed
From: Alexey Tourbin <>
To: ALT Linux Team development discussions <>
Subject: Re: [devel] hash collision in rpm
Date: Wed, 5 May 2021 08:40:27 +0300
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, May 4, 2021 at 3:50 PM Alexey Tourbin <> wrote:
> On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov <> wrote:
> > Если мы выбираем bpp таким образом, чтобы вероятность коллизии была
> > 1/1024, то в больших библиотеках (у rust там 16215 symbols) ожидаются
> > коллизии, это нормально.
> The number of birthday collisions among n-bit hashes is small only if
> we hash up to about sqrt(2^n) items. If we want to keep the number of
> birthday collisions under control, then bpp=10+log2(n) is simply a bad
> formula. A better one is bpp=max(10,log2(n))+log2(n). I.e. for
> n=16215, as per your example, it should be 14+14 rather than 10+14.
> Now, there is a subtle argument why you shouldn't go 14+14, and why
> 10+14 is indeed "normal". One view is that we should try to keep the
> number of birthday collisions small *per a provided version*.  But
> this view is naive. Consider what happens when we have one big shared
> library and we split it into two smaller libraries. It seems that
> smaller libraries can use smaller bpp, e.g. 13+13 each instead of
> 14+14 combined. Indeed the number of expected collisions E[X] per each
> library is the same. But now we've got two libraries, and the combined
> expectation is linear! From the perspective of a client, one OR the
> other library is now twice as likely to exhibit a collision, simply
> because there are two of them!

> So the right view must be under the invariant "it doesn't matter how
> the symbols are spread among libraries". That's why "probability per
> symbol" makes sense, and "restricting birthday collisions" does not.

Let's check that the invariant holds with bpp=10+log2(n). Again, we
split a long set-version with bpp=10+14 into two shorter set-versions
with bpp=10+13 (the number of symbols being 2^14 resp. 2^13). The
number of hash values over the hash space, which is how we define the
false positive rate, is the same: 2^14/2^24=2^13/2^23=2^-10.

A related notion is the average distance between two neighbour hash
values (when we "drop" hash values into the interval [0,2^bpp)). The
distance is approximately 2^-10 in both cases. It defines the optimal
parameter m for Golomb-Rice code. Roughly speaking, with
bpp=10+log2(n), the optimal m is 10. With optimal m, the length of
Golomb-Rice code is approximately m+2 bits per integer. So the
combined length of the two shorter versions must be about the same as
the length of the long version.

Finally we need to check the birthday collision rate. With bpp-bit
hash values, n of them, the expected number of birthday collisions is
(to a good approximation)

E[X]=n^2 / 2^(bpp+1)
(refer to Cormen et al., Introduction to Algorithms (3rd ed.), p. 133.)

Let E_0 be the expected number of collisions for a long version. For a
shorter version with (bbp-1)-bit hash values, n/2 of them, we have

E_1 = (n/2)^2 / (2^(bpp-1+1) = 1/2 E_0

So the expected number of collisions for short versions is smaller,
but we have two of them. By the linearity of expectation, the birthday
collision rate is the same. (This generally shows that it is pointless
to split a long set-version into a few shorter ones, under any classes
of equivalence on symbols such as ELF ABI tags. Counterintuitively,
you can't fight birthday collisions by partitioning, unless you also
increase the output length. So doing separate set-versions for and is not necessarily a
good idea, we can just as well hash sym@FOO_1.0 and sym@FOO_1.1 and
package them into a single version.)

* * *

By the way, there's a clever way to detect collisions at a higher
level, even though low-level collisions are unavoidable, subject to
implacable math. We can take advantage of the fact that the build
system synchronizes package builds across a few architectures.
Currently, if a missing symbol goes undetected due to a hash
collision, it does so on all architectures simultaneously. To change
that, we need to hash symbols differently, depending on the target
architecture (we can pass seed=hash(arch) to the hash function, or use
different initialization vectors IV=hash(arch)).  The desired outcome
is that when a missing symbol goes undetected, it does so, with an
overwhelming probability, on only one target.  Roughly speaking, if
the probability of an undetected missing symbol is 10^-3 on a single
target, it must be 10^-6 on two targets, 10^-9 on thee targets, etc.

  reply	other threads:[~2021-05-05  5:40 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-01  9:04 ` [devel] rust-1:1.50.0-alt1: Sisyphus/x86_64 test rebuild failed Alexey Gladkov
2021-05-04 10:32   ` [devel] hash collision in rpm Anton Farygin
2021-05-04 10:45     ` Alexey Gladkov
2021-05-04 12:50       ` Alexey Tourbin
2021-05-05  5:40         ` Alexey Tourbin [this message]
2021-05-05 13:10           ` Dmitry V. Levin
2021-05-05 14:11           ` Vladimir D. Seleznev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='' \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

ALT Linux Team development discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 devel devel/ \
	public-inbox-index devel

Example config snippet for mirrors.
Newsgroup available over NNTP:

AGPL code for this site: git clone