Date: Tue, 14 Apr 2020 19:20:00 +0300
From: "Vladimir D. Seleznev"
To: ALT Linux Team development discussions
Subject: Re: [devel] RFC: girar: optimize rebuild
Message-ID: <20200414162000.GA618226@portlab>
In-Reply-To: <20200414175713.7355b93735c94869697c5610@altlinux.org>
References: <20200410231044.1436970-1-vseleznv@altlinux.org>
 <20200411133631.daac861f97979c67511cf3ef@altlinux.org>
 <20200411233143.GC4490@altlinux.org>
 <20200414175713.7355b93735c94869697c5610@altlinux.org>

On Tue, Apr 14, 2020 at 05:57:13PM +0300, Andrey Savchenko wrote:
> On Sun, 12 Apr 2020 02:31:43 +0300 Alexey V. Vissarionov wrote:
> > On 2020-04-11 13:36:31 +0300, Andrey Savchenko wrote:
> >
> > >> The first part of rebuilt packages optimization for girar.
> > >> It introduces pkg_identity() and simple optimization of the
> > >> rebuilt sourcerpm.
> > >> pkg_identity() takes RPM package and returns a value called
> > >> package identity, a hash of subset of RPM package header.
> > >> That subset is the entire header without some nonessential
> > >> artifacts like buildhost, buildtime, header hashsum, etc.
> > > I see two problems with proposed approach:
> > > 1) It assumes there will be no pkg_identity hash collisions.
> > > This is wrong. They may occur sooner or later and the code
> > > *must* correctly deal with such collisions.
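
For illustration only, here is a minimal Python sketch of the identity idea
from the quoted proposal, together with one possible way to handle the
collision concern: treat equal identities as a hint and confirm by comparing
the hashed data itself. It is not girar code. The dict-based header, the tag
names in VOLATILE_TAGS, and the helper names canonical_subset() and
same_package() are assumptions made for this example; only pkg_identity() is
named in the proposal.

```python
import hashlib

# Hypothetical list of volatile tags. The proposal names buildhost, buildtime
# and the header hashsum and then says "etc.", so the real list may differ.
VOLATILE_TAGS = {"BUILDHOST", "BUILDTIME", "SHA1HEADER"}


def canonical_subset(header: dict) -> bytes:
    """Serialize the header without volatile tags, in a stable tag order."""
    items = sorted((k, v) for k, v in header.items() if k not in VOLATILE_TAGS)
    return repr(items).encode("utf-8")


def pkg_identity(header: dict) -> str:
    """Digest of the canonical subset (sha256 is only a placeholder here)."""
    return hashlib.sha256(canonical_subset(header)).hexdigest()


def same_package(a: dict, b: dict) -> bool:
    """Equal identities are only a hint; confirm by comparing the subsets."""
    if pkg_identity(a) != pkg_identity(b):
        return False
    # A digest collision would pass the check above but fail this one.
    return canonical_subset(a) == canonical_subset(b)
```

With the byte-for-byte confirmation in place, the choice of hash affects speed
only, not correctness, which is the trade-off discussed in the rest of this
message.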
> >
> > The solution is well known: prefix the hash with a time_t value
> > to let it grow monotonically while still being strictly dependent
> > on sensitive data.
>
> Yes, this is a good idea.

I don't understand this idea.

> > Should we face a hash collision, we could check whether the
> > timestamps differ significantly.
> >
> > > 2) The hash function choice - sha256 - is very unfortunate:
> > > it has longer digest than sha1, but otherwise is vulnerable
> > > to the same attack; so right now it is still marginally secure,
> > > but it will not last long.
> > We don't really need any cryptographic-grade hash function here:
> > all we need is just a checksum with a good distribution to detect
> > whether something had changed - obviously enough, nobody would
> > try to build and exploit collisions here. That said, we can use
> > almost any polynomial.
>
> Still it may be a security issue. Consider what will happen if
> wrong source rpm will be used: new modifications including security
> fixes may be silently omitted from a branch.

Nothing bad will happen. I think you misunderstand the task: it is about
neither new modifications nor new releases. It is only about package
rebuilds, and a rebuild uses no new sources.

> > > Moreover sha256 is quite slow.
> >
> > SHA2 is implemented in the hardware in some modern CPUs, so it's
> > quite fast there.
>
> Only in some and only for amd64 arch. But our main build infrastructure
> also uses ppc64le and aarch64, so it is very important to be
> efficient, especially on aarch64 which is a bottleneck for most
> tasks. And consider that we have secondary build systems for other
> arches like mips, riscv, e2k.
>
> Talk is cheap, so let's see some numbers.
>
> 0) dd if=/dev/urandom of=/tmp/test.file bs=1M count=2048
>
> 1) Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
> $ time sha256sum -b /tmp/test.file
> 8.67user 0.27system 0:08.94elapsed 99%CPU (0avgtext+0avgdata 1944maxresident)k
> 8.70user 0.25system 0:08.96elapsed 99%CPU (0avgtext+0avgdata 2148maxresident)k
> 8.65user 0.28system 0:08.93elapsed 99%CPU (0avgtext+0avgdata 2064maxresident)k
>
> $ time b2sum -b /tmp/test.file
> 2.48user 0.32system 0:02.81elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
> 2.46user 0.30system 0:02.76elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
> 2.47user 0.29system 0:02.77elapsed 99%CPU (0avgtext+0avgdata 2068maxresident)k
>
> 2) E8C (1300 MHz, MBE8C-PC v.2)
> $ time sha256sum -b /tmp/test.file
> 11.69user 0.93system 0:12.64elapsed 99%CPU (0avgtext+0avgdata 3784maxresident)k
> 11.78user 0.85system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3836maxresident)k
> 11.72user 0.90system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3956maxresident)k
>
> $ time b2sum -b /tmp/test.file
> 6.90user 1.37system 0:08.27elapsed 99%CPU (0avgtext+0avgdata 3896maxresident)k
> 6.76user 1.10system 0:07.87elapsed 99%CPU (0avgtext+0avgdata 3844maxresident)k
> 6.93user 0.95system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 3872maxresident)k
>
> I see no reason for using slower and less secure sha256 algorithm.

We can use a faster algorithm. Again, this is not about security.

-- 
WBR,
Vladimir D. Seleznev
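
To put the quoted timings in perspective, the following Python sketch converts
them into throughput and relative speedup. The elapsed times are the first
run of each command as quoted above, and the 2048 MiB size comes from the dd
invocation; the names ELAPSED, throughput() and speedup() are made up for this
example and nothing here was measured anew.

```python
SIZE_MIB = 2048  # from: dd bs=1M count=2048

# Elapsed seconds of the first run of each command, copied from the quote.
ELAPSED = {
    ("i7-4790", "sha256sum"): 8.94,
    ("i7-4790", "b2sum"): 2.81,
    ("E8C", "sha256sum"): 12.64,
    ("E8C", "b2sum"): 8.27,
}


def throughput(host: str, tool: str) -> float:
    """Hashing throughput in MiB/s for one host and tool."""
    return SIZE_MIB / ELAPSED[(host, tool)]


def speedup(host: str) -> float:
    """How many times faster b2sum ran than sha256sum on the given host."""
    return ELAPSED[(host, "sha256sum")] / ELAPSED[(host, "b2sum")]


if __name__ == "__main__":
    for host in ("i7-4790", "E8C"):
        for tool in ("sha256sum", "b2sum"):
            print(f"{host:8} {tool:10} {throughput(host, tool):7.1f} MiB/s")
        print(f"{host:8} b2sum speedup: {speedup(host):.2f}x")
```

On these figures b2sum is roughly three times faster than sha256sum on the
i7-4790 and roughly one and a half times faster on the E8C. Note that the
test hashes a 2 GiB file, while pkg_identity() hashes only a subset of an RPM
header, so the absolute cost per package is far smaller in either case.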