From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on sa.local.altlinux.org X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM autolearn=ham autolearn_force=no version=3.4.1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :content-transfer-encoding; bh=xPuESSutYxAOhVhuUV934xENE4NLBxlCSOIOy1NA8Hw=; b=aOoyc1Z5I5XB1KPgB6tv7yUO2wtjz7fC9w6H4PRV84onvl7fSRaoXYltwBacsjA+Yo Jt37R+Zr3SCJ9QWVUNZ65D76cTL+DpiT8BiLEWlcOZUncI4xV4NfkT8H0qwLYGsL4A13 6wzuhp53+AuglnZhIyNzgpAoNKYsOSqf3jyP5BWgRoJs5B3rntG9PP7B8sl/p1xTKajZ 1cFCCcmlCNzFD2lpbgk9SWEOzXM97WieJvldzEk+UqAZSh/P+iO0XMKpmlDOfAjkgkMx aB4/nrUYLTyQXkGWsmQSZ7kAZsI3Zv/bU3/Epl3LSEz8vyYO4W1HzmssBPR3dXoCdPDV 9U9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:content-transfer-encoding; bh=xPuESSutYxAOhVhuUV934xENE4NLBxlCSOIOy1NA8Hw=; b=hdZhtGv23PudRr8jtUuGCHA6EXIDl6FajcrTwW45Kn2vdnWvWPYeuo2AZGBzjAHpjm 2tLrfaJOb4rujMNvRziqemy2vZZO2c7tq95GztT0ZReSg6zx+2GBzl103iuzoUhcSYt0 pPyEXxlkwSPCIaqZiNeQeBqhXccpGXDLatFm4baHWXf6/SclOUVgNyhHWlcve4akMKnK Q+RIqRTaEDU2qAkbYIwn90rGOD8D9GHvnwUkABxxqwN2GCtM5Dnd9BmYcMHNDnWsLSAB BOBj0aaUk7onsVbtI9S3xOx4zu7ATglDMINWR3I8wfUNkSyN6/OAiYdNzsP+b7F49Np4 kiuw== X-Gm-Message-State: AOAM5316jqdwtBYnG0bQGZ3soCyvx//spApC09mpNDRNMkZizQpfVyqK ffFlYYFsMlJGypepAWpQRwkgNjR+NQtWmZaGZgYc6QIRUfBBKzKe X-Google-Smtp-Source: ABdhPJx+6Cnvnze+Cgp0PBlI9HtJhZ2CoIPTrvOZM8G23a37S+sU6rxckC8LTs4ZBS4kXl8eM312NL0OpCps/rDsirw= X-Received: by 2002:a05:651c:118c:: with SMTP id w12mr7045406ljo.405.1620193238138; Tue, 04 May 2021 22:40:38 -0700 (PDT) MIME-Version: 1.0 References: <20210501064431.C5F099A456D@gyle.altlinux.org> <20210501090437.bp7fx33thclcgman@example.org> <079b855b-b0d0-b0e6-2da3-4b31f4b4ab1f@basealt.ru> <20210504104532.bd2o2qmcgmves4wl@example.org> In-Reply-To: From: Alexey Tourbin Date: Wed, 5 May 2021 08:40:27 +0300 Message-ID: To: ALT Linux Team development discussions Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Subject: Re: [devel] hash collision in rpm X-BeenThere: devel@lists.altlinux.org X-Mailman-Version: 2.1.12 Precedence: list Reply-To: ALT Linux Team development discussions List-Id: ALT Linux Team development discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 05 May 2021 05:40:41 -0000 Archived-At: List-Archive: List-Post: On Tue, May 4, 2021 at 3:50 PM Alexey Tourbin wr= ote: > On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov wrote: > > =D0=95=D1=81=D0=BB=D0=B8 =D0=BC=D1=8B =D0=B2=D1=8B=D0=B1=D0=B8=D1=80=D0= =B0=D0=B5=D0=BC bpp =D1=82=D0=B0=D0=BA=D0=B8=D0=BC =D0=BE=D0=B1=D1=80=D0=B0= =D0=B7=D0=BE=D0=BC, =D1=87=D1=82=D0=BE=D0=B1=D1=8B =D0=B2=D0=B5=D1=80=D0=BE= =D1=8F=D1=82=D0=BD=D0=BE=D1=81=D1=82=D1=8C =D0=BA=D0=BE=D0=BB=D0=BB=D0=B8= =D0=B7=D0=B8=D0=B8 =D0=B1=D1=8B=D0=BB=D0=B0 > > 1/1024, =D1=82=D0=BE =D0=B2 =D0=B1=D0=BE=D0=BB=D1=8C=D1=88=D0=B8=D1=85 = =D0=B1=D0=B8=D0=B1=D0=BB=D0=B8=D0=BE=D1=82=D0=B5=D0=BA=D0=B0=D1=85 (=D1=83 = rust =D1=82=D0=B0=D0=BC 16215 symbols) =D0=BE=D0=B6=D0=B8=D0=B4=D0=B0=D1=8E= =D1=82=D1=81=D1=8F > > =D0=BA=D0=BE=D0=BB=D0=BB=D0=B8=D0=B7=D0=B8=D0=B8, =D1=8D=D1=82=D0=BE = =D0=BD=D0=BE=D1=80=D0=BC=D0=B0=D0=BB=D1=8C=D0=BD=D0=BE. > > The number of birthday collisions among n-bit hashes is small only if > we hash up to about sqrt(2^n) items. If we want to keep the number of > birthday collisions under control, then bpp=3D10+log2(n) is simply a bad > formula. A better one is bpp=3Dmax(10,log2(n))+log2(n). I.e. for > n=3D16215, as per your example, it should be 14+14 rather than 10+14. > > Now, there is a subtle argument why you shouldn't go 14+14, and why > 10+14 is indeed "normal". One view is that we should try to keep the > number of birthday collisions small *per a provided version*. But > this view is naive. Consider what happens when we have one big shared > library and we split it into two smaller libraries. It seems that > smaller libraries can use smaller bpp, e.g. 13+13 each instead of > 14+14 combined. Indeed the number of expected collisions E[X] per each > library is the same. But now we've got two libraries, and the combined > expectation is linear! From the perspective of a client, one OR the > other library is now twice as likely to exhibit a collision, simply > because there are two of them! > So the right view must be under the invariant "it doesn't matter how > the symbols are spread among libraries". That's why "probability per > symbol" makes sense, and "restricting birthday collisions" does not. Let's check that the invariant holds with bpp=3D10+log2(n). Again, we split a long set-version with bpp=3D10+14 into two shorter set-versions with bpp=3D10+13 (the number of symbols being 2^14 resp. 2^13). The number of hash values over the hash space, which is how we define the false positive rate, is the same: 2^14/2^24=3D2^13/2^23=3D2^-10. A related notion is the average distance between two neighbour hash values (when we "drop" hash values into the interval [0,2^bpp)). The distance is approximately 2^-10 in both cases. It defines the optimal parameter m for Golomb-Rice code. Roughly speaking, with bpp=3D10+log2(n), the optimal m is 10. With optimal m, the length of Golomb-Rice code is approximately m+2 bits per integer. So the combined length of the two shorter versions must be about the same as the length of the long version. Finally we need to check the birthday collision rate. With bpp-bit hash values, n of them, the expected number of birthday collisions is (to a good approximation) E[X]=3Dn^2 / 2^(bpp+1) (refer to Cormen et al., Introduction to Algorithms (3rd ed.), p. 133.) Let E_0 be the expected number of collisions for a long version. For a shorter version with (bbp-1)-bit hash values, n/2 of them, we have E_1 =3D (n/2)^2 / (2^(bpp-1+1) =3D 1/2 E_0 So the expected number of collisions for short versions is smaller, but we have two of them. By the linearity of expectation, the birthday collision rate is the same. (This generally shows that it is pointless to split a long set-version into a few shorter ones, under any classes of equivalence on symbols such as ELF ABI tags. Counterintuitively, you can't fight birthday collisions by partitioning, unless you also increase the output length. So doing separate set-versions for libfoo.so.0(FOO_1.0) and libfoo.so.0(FOO_1.1) is not necessarily a good idea, we can just as well hash sym@FOO_1.0 and sym@FOO_1.1 and package them into a single version.) * * * By the way, there's a clever way to detect collisions at a higher level, even though low-level collisions are unavoidable, subject to implacable math. We can take advantage of the fact that the build system synchronizes package builds across a few architectures. Currently, if a missing symbol goes undetected due to a hash collision, it does so on all architectures simultaneously. To change that, we need to hash symbols differently, depending on the target architecture (we can pass seed=3Dhash(arch) to the hash function, or use different initialization vectors IV=3Dhash(arch)). The desired outcome is that when a missing symbol goes undetected, it does so, with an overwhelming probability, on only one target. Roughly speaking, if the probability of an undetected missing symbol is 10^-3 on a single target, it must be 10^-6 on two targets, 10^-9 on thee targets, etc.