ALT Linux Team development discussions
 help / color / mirror / Atom feed
From: Alexey Tourbin <alexey.tourbin@gmail.com>
To: ALT Linux Team development discussions <devel@lists.altlinux.org>
Subject: Re: [devel] hash collision in rpm
Date: Tue, 4 May 2021 15:50:00 +0300
Message-ID: <CA+qzen=J75Ppb0ADn6D4L9Y1g4-8sJ2ZtjfUF9tpq7oc1gxK4A@mail.gmail.com> (raw)
In-Reply-To: <20210504104532.bd2o2qmcgmves4wl@example.org>

On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov <legion@altlinux.ru> wrote:
> On Tue, May 04, 2021 at 01:32:59PM +0300, Anton Farygin wrote:
> > On 01.05.2021 12:04, Alexey Gladkov wrote:
> > > On Sat, May 01, 2021 at 06:44:31AM +0000, ALT beekeeper wrote:
> > > > Package: rust-1:1.50.0-alt1
> > > > Status: Sisyphus/x86_64 test rebuild failed
> > <skip>
> > > Меня немного настораживает такое обилие предупреждений о коллизии. Как бы
> > > не было беды, когда кто-то решит собрать модули rust отдельно.
> > >
> > Я тоже с таким же столкнулся в новом пакете с библиотеками для C++:
> >
> > warning: hash collision: _ZN23IGESData_IGESReaderData12AddStartLineEPKc
> > _ZNK14IGESSolid_Loop4EdgeEi
> >
> > и там такого довольно много.
>
> Я обсудил это с Димой и он объяснил мне, что это не признак ошибки.
> Вот тут в suggest_bpp.c есть комментарий:
>
> http://git.altlinux.org/gears/r/..git?p=rpm-build.git;a=commitdiff;h=0ea2deff
>
> Если мы выбираем bpp таким образом, чтобы вероятность коллизии была
> 1/1024, то в больших библиотеках (у rust там 16215 symbols) ожидаются
> коллизии, это нормально.

The number of birthday collisions among n-bit hashes is small only if
we hash up to about sqrt(2^n) items. If we want to keep the number of
birthday collisions under control, then bpp=10+log2(n) is simply a bad
formula. A better one is bpp=max(10,log2(n))+log2(n). I.e. for
n=16215, as per your example, it should be 14+14 rather than 10+14.

Now, there is a subtle argument why you shouldn't go 14+14, and why
10+14 is indeed "normal". One view is that we should try to keep the
number of birthday collisions small *per a provided version*.  But
this view is naive. Consider what happens when we have one big shared
library and we split it into two smaller libraries. It seems that
smaller libraries can use smaller bpp, e.g. 13+13 each instead of
14+14 combined. Indeed the number of expected collisions E[X] per each
library is the same. But now we've got two libraries, and the combined
expectation is linear! From the perspective of a client, one OR the
other library is now twice as likely to exhibit a collision, simply
because there are two of them!

You could achieve the same effect - doubling E[X] - without splitting
a big library into two smaller libraries. Just use the big library
with bpp=13 instead of 14. :-)

So the right view must be under the invariant "it doesn't matter how
the symbols are spread among libraries". That's why "probability per
symbol" makes sense, and "restricting birthday collisions" does not.

  reply	other threads:[~2021-05-04 12:50 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-01  9:04 ` [devel] rust-1:1.50.0-alt1: Sisyphus/x86_64 test rebuild failed Alexey Gladkov
2021-05-04 10:32   ` [devel] hash collision in rpm Anton Farygin
2021-05-04 10:45     ` Alexey Gladkov
2021-05-04 12:50       ` Alexey Tourbin [this message]
2021-05-05  5:40         ` Alexey Tourbin
2021-05-05 13:10           ` Dmitry V. Levin
2021-05-05 14:11           ` Vladimir D. Seleznev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+qzen=J75Ppb0ADn6D4L9Y1g4-8sJ2ZtjfUF9tpq7oc1gxK4A@mail.gmail.com' \
    --to=alexey.tourbin@gmail.com \
    --cc=devel@lists.altlinux.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

ALT Linux Team development discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror http://lore.altlinux.org/devel/0 devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 devel devel/ http://lore.altlinux.org/devel \
		devel@altlinux.org devel@altlinux.ru devel@lists.altlinux.org devel@lists.altlinux.ru devel@linux.iplabs.ru mandrake-russian@linuxteam.iplabs.ru sisyphus@linuxteam.iplabs.ru
	public-inbox-index devel

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://lore.altlinux.org/org.altlinux.lists.devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git