From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on
sa.local.altlinux.org
X-Spam-Level:
X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,
DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM autolearn=ham autolearn_force=no
version=3.4.1
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
h=mime-version:references:in-reply-to:from:date:message-id:subject:to
:content-transfer-encoding;
bh=xPuESSutYxAOhVhuUV934xENE4NLBxlCSOIOy1NA8Hw=;
b=aOoyc1Z5I5XB1KPgB6tv7yUO2wtjz7fC9w6H4PRV84onvl7fSRaoXYltwBacsjA+Yo
Jt37R+Zr3SCJ9QWVUNZ65D76cTL+DpiT8BiLEWlcOZUncI4xV4NfkT8H0qwLYGsL4A13
6wzuhp53+AuglnZhIyNzgpAoNKYsOSqf3jyP5BWgRoJs5B3rntG9PP7B8sl/p1xTKajZ
1cFCCcmlCNzFD2lpbgk9SWEOzXM97WieJvldzEk+UqAZSh/P+iO0XMKpmlDOfAjkgkMx
aB4/nrUYLTyQXkGWsmQSZ7kAZsI3Zv/bU3/Epl3LSEz8vyYO4W1HzmssBPR3dXoCdPDV
9U9w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20161025;
h=x-gm-message-state:mime-version:references:in-reply-to:from:date
:message-id:subject:to:content-transfer-encoding;
bh=xPuESSutYxAOhVhuUV934xENE4NLBxlCSOIOy1NA8Hw=;
b=hdZhtGv23PudRr8jtUuGCHA6EXIDl6FajcrTwW45Kn2vdnWvWPYeuo2AZGBzjAHpjm
2tLrfaJOb4rujMNvRziqemy2vZZO2c7tq95GztT0ZReSg6zx+2GBzl103iuzoUhcSYt0
pPyEXxlkwSPCIaqZiNeQeBqhXccpGXDLatFm4baHWXf6/SclOUVgNyhHWlcve4akMKnK
Q+RIqRTaEDU2qAkbYIwn90rGOD8D9GHvnwUkABxxqwN2GCtM5Dnd9BmYcMHNDnWsLSAB
BOBj0aaUk7onsVbtI9S3xOx4zu7ATglDMINWR3I8wfUNkSyN6/OAiYdNzsP+b7F49Np4
kiuw==
X-Gm-Message-State: AOAM5316jqdwtBYnG0bQGZ3soCyvx//spApC09mpNDRNMkZizQpfVyqK
ffFlYYFsMlJGypepAWpQRwkgNjR+NQtWmZaGZgYc6QIRUfBBKzKe
X-Google-Smtp-Source: ABdhPJx+6Cnvnze+Cgp0PBlI9HtJhZ2CoIPTrvOZM8G23a37S+sU6rxckC8LTs4ZBS4kXl8eM312NL0OpCps/rDsirw=
X-Received: by 2002:a05:651c:118c:: with SMTP id
w12mr7045406ljo.405.1620193238138;
Tue, 04 May 2021 22:40:38 -0700 (PDT)
MIME-Version: 1.0
References: <20210501064431.C5F099A456D@gyle.altlinux.org>
<20210501090437.bp7fx33thclcgman@example.org>
<079b855b-b0d0-b0e6-2da3-4b31f4b4ab1f@basealt.ru>
<20210504104532.bd2o2qmcgmves4wl@example.org>
In-Reply-To:
From: Alexey Tourbin
Date: Wed, 5 May 2021 08:40:27 +0300
Message-ID:
To: ALT Linux Team development discussions
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [devel] hash collision in rpm
X-BeenThere: devel@lists.altlinux.org
X-Mailman-Version: 2.1.12
Precedence: list
Reply-To: ALT Linux Team development discussions
List-Id: ALT Linux Team development discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 05 May 2021 05:40:41 -0000
Archived-At:
List-Archive:
List-Post:
On Tue, May 4, 2021 at 3:50 PM Alexey Tourbin wr=
ote:
> On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov wrote:
> > =D0=95=D1=81=D0=BB=D0=B8 =D0=BC=D1=8B =D0=B2=D1=8B=D0=B1=D0=B8=D1=80=D0=
=B0=D0=B5=D0=BC bpp =D1=82=D0=B0=D0=BA=D0=B8=D0=BC =D0=BE=D0=B1=D1=80=D0=B0=
=D0=B7=D0=BE=D0=BC, =D1=87=D1=82=D0=BE=D0=B1=D1=8B =D0=B2=D0=B5=D1=80=D0=BE=
=D1=8F=D1=82=D0=BD=D0=BE=D1=81=D1=82=D1=8C =D0=BA=D0=BE=D0=BB=D0=BB=D0=B8=
=D0=B7=D0=B8=D0=B8 =D0=B1=D1=8B=D0=BB=D0=B0
> > 1/1024, =D1=82=D0=BE =D0=B2 =D0=B1=D0=BE=D0=BB=D1=8C=D1=88=D0=B8=D1=85 =
=D0=B1=D0=B8=D0=B1=D0=BB=D0=B8=D0=BE=D1=82=D0=B5=D0=BA=D0=B0=D1=85 (=D1=83 =
rust =D1=82=D0=B0=D0=BC 16215 symbols) =D0=BE=D0=B6=D0=B8=D0=B4=D0=B0=D1=8E=
=D1=82=D1=81=D1=8F
> > =D0=BA=D0=BE=D0=BB=D0=BB=D0=B8=D0=B7=D0=B8=D0=B8, =D1=8D=D1=82=D0=BE =
=D0=BD=D0=BE=D1=80=D0=BC=D0=B0=D0=BB=D1=8C=D0=BD=D0=BE.
>
> The number of birthday collisions among n-bit hashes is small only if
> we hash up to about sqrt(2^n) items. If we want to keep the number of
> birthday collisions under control, then bpp=3D10+log2(n) is simply a bad
> formula. A better one is bpp=3Dmax(10,log2(n))+log2(n). I.e. for
> n=3D16215, as per your example, it should be 14+14 rather than 10+14.
>
> Now, there is a subtle argument why you shouldn't go 14+14, and why
> 10+14 is indeed "normal". One view is that we should try to keep the
> number of birthday collisions small *per a provided version*. But
> this view is naive. Consider what happens when we have one big shared
> library and we split it into two smaller libraries. It seems that
> smaller libraries can use smaller bpp, e.g. 13+13 each instead of
> 14+14 combined. Indeed the number of expected collisions E[X] per each
> library is the same. But now we've got two libraries, and the combined
> expectation is linear! From the perspective of a client, one OR the
> other library is now twice as likely to exhibit a collision, simply
> because there are two of them!
> So the right view must be under the invariant "it doesn't matter how
> the symbols are spread among libraries". That's why "probability per
> symbol" makes sense, and "restricting birthday collisions" does not.
Let's check that the invariant holds with bpp=3D10+log2(n). Again, we
split a long set-version with bpp=3D10+14 into two shorter set-versions
with bpp=3D10+13 (the number of symbols being 2^14 resp. 2^13). The
number of hash values over the hash space, which is how we define the
false positive rate, is the same: 2^14/2^24=3D2^13/2^23=3D2^-10.
A related notion is the average distance between two neighbour hash
values (when we "drop" hash values into the interval [0,2^bpp)). The
distance is approximately 2^-10 in both cases. It defines the optimal
parameter m for Golomb-Rice code. Roughly speaking, with
bpp=3D10+log2(n), the optimal m is 10. With optimal m, the length of
Golomb-Rice code is approximately m+2 bits per integer. So the
combined length of the two shorter versions must be about the same as
the length of the long version.
Finally we need to check the birthday collision rate. With bpp-bit
hash values, n of them, the expected number of birthday collisions is
(to a good approximation)
E[X]=3Dn^2 / 2^(bpp+1)
(refer to Cormen et al., Introduction to Algorithms (3rd ed.), p. 133.)
Let E_0 be the expected number of collisions for a long version. For a
shorter version with (bbp-1)-bit hash values, n/2 of them, we have
E_1 =3D (n/2)^2 / (2^(bpp-1+1) =3D 1/2 E_0
So the expected number of collisions for short versions is smaller,
but we have two of them. By the linearity of expectation, the birthday
collision rate is the same. (This generally shows that it is pointless
to split a long set-version into a few shorter ones, under any classes
of equivalence on symbols such as ELF ABI tags. Counterintuitively,
you can't fight birthday collisions by partitioning, unless you also
increase the output length. So doing separate set-versions for
libfoo.so.0(FOO_1.0) and libfoo.so.0(FOO_1.1) is not necessarily a
good idea, we can just as well hash sym@FOO_1.0 and sym@FOO_1.1 and
package them into a single version.)
* * *
By the way, there's a clever way to detect collisions at a higher
level, even though low-level collisions are unavoidable, subject to
implacable math. We can take advantage of the fact that the build
system synchronizes package builds across a few architectures.
Currently, if a missing symbol goes undetected due to a hash
collision, it does so on all architectures simultaneously. To change
that, we need to hash symbols differently, depending on the target
architecture (we can pass seed=3Dhash(arch) to the hash function, or use
different initialization vectors IV=3Dhash(arch)). The desired outcome
is that when a missing symbol goes undetected, it does so, with an
overwhelming probability, on only one target. Roughly speaking, if
the probability of an undetected missing symbol is 10^-3 on a single
target, it must be 10^-6 on two targets, 10^-9 on thee targets, etc.