From: Alexey Tourbin
Date: Fri, 17 Apr 2020 00:51:51 +0300
To: ALT Linux Team development discussions
Subject: Re: [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo

On Tue, Apr 14, 2020 at 7:42 PM Vladimir D. Seleznev wrote:
> > Then suppose I build a gearified package from Sisyphus for p9. But
> > your code only handles GB_REPO_DIR, not the NEIGHBOUR_REPO_DIR the
> > package comes from. To be clear, that information is lost: when you
> > request to build a signed tag from /gears, it does not imply that
> > there is a corresponding .src.rpm in any REPO_DIR.
>
> That part is for the future. I wrote some code that checks the
> uprepos, but I didn't like it. The correct way is to check the
> uprepos archives as well.

I gave it some more thought.

First, the way you're trying to hash all the unknown tags is
interesting, but I wouldn't do that. You may want to hash everything if
you don't understand the internal structure and prefer to treat the
input as a black box. On the contrary, we understand the internals of a
package well. What's then a minimal subset of tags to identify a
package? For a source package, it's FileDigests + BuildRequires +
BuildConflicts. That's all! They determine the outcome of a build, and
we may reasonably postulate that the rest of the tags should not
influence that outcome. You don't even have to hash NEVR, because the
specfile is already in FileDigests. (But you should probably hash
FileFlags, because they point out which file is the specfile. You should
also hash FileModes, because some sources may be executable.
But that's about all.)

Second, referring to the discussion about hash functions, the hash
function you're using isn't all that important. That's because you're
hashing MD5 sums in FileDigests, and those are the weakest link and
(theoretically) the main cause of any collision. The speed isn't
important either, because you're hashing relatively short inputs.

So what's the right set of tags for a binary package, and what is its
identity? (I'm not sure identity is the right word, I would rather call
it ID. Identity is who you are and what you believe in, for example a
black person who votes for Obama.) I've already hinted that identity
can be defined via substitution: if you replace a package with a
different package that has the same identity, there should be no
functional difference, and furthermore no difference "for all intents
and purposes", except for a few observable differences which we deem
immaterial and permit explicitly, such as FileMtimes.

So obviously you need to hash at least FileDigests and
Requires/Provides/Obsoletes/Conflicts. This should satisfy the
definition of ID for rpm (the dependencies are satisfied in the same
way, and file conflicts are resolved in the same way, so rpm can't tell
the difference if we make a substitution).

It isn't clear whether you should hash informational sections such as
%description. It can be argued that under the same NEVR, the
description shouldn't change anyway. Is it possible that nothing
changes in a package but the description? Would we still want to
update/replace the package then?

Finally, your identity hash need not be fixed once and forever. It is
used only for internal bookkeeping, so once in a while you are allowed
to change the hash and rebuild the identity-addressable storage. You
should have a script for that in girar/admin. It may take an hour or so
to complete, but that's not too bad.

> > There is already a problem with cross-repo copying: if done in
> > earnest, both repos need to be locked.
> > And of course this is deadlock-prone. You can do better without any
> > locking if you identify every package in all repos with your new
> > identity hash. This can be done relatively easily, since you
> > already have that big content-addressable storage. You can hardlink
> > it into a shadow identity-addressable storage. Once you've done
> > that, you obtain the global / beatific vision: given a package, you
> > instantly know if you have already seen something like this. (On
> > second thought: you don't need locking because the -f test is
> > atomic and files cannot be removed from the storage, but there will
> > still be race conditions. It's not too bad in practice. Further,
> > those race conditions can be detected at the task-commit stage.)
>
> I like the idea, but there are some issues with this solution: there
> *are* collisions. I explain this below, but this idea will work
> perfectly with source rpms.
>
> The problem is that if we want to handle binary rpms as well, there
> will be a kind of collision by design. For example, package foo has
> two subpackages: foo-data and libfoo. After a rebuild of foo,
> foo-data has the same identity as the previous foo-data build, but
> libfoo now has a different one. According to the plan, the whole
> rebuild has significant changes and all binary packages should be
> substituted with new ones. And now we have two foo-data packages with
> the same identity value, but they belong to different builds.
>
> > There is one specific problem with the outlined approach: the notion
> > of identity is flawed, because the disttag may or may not matter.
> > Sometimes you cannot substitute a package for another package with
> > the same identity but a different disttag. Specifically this is the
> > case with strict dependencies between subpackages. You cannot
> > substitute a subpackage unless you also substitute all the other
> > subpackages.
>
> Yes, that is correct, I considered this.

So for src.rpm packages, it's a solved problem.
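
To make the source-package case concrete, here is a rough sketch of
such an ID, hashing only the tags named earlier in this message. It is
an illustration, not the girar code: source_id(), the `header` dict
(tag name mapped to an already extracted list of values, with
dependencies pre-rendered as strings) and the choice of SHA-256 are all
assumptions. Sorting the dependency lists is likewise my own choice, on
the grounds that their order should not affect the build.

```python
import hashlib

# Tags that feed the source-package ID, as listed in the discussion.
PER_FILE_TAGS = ("FileDigests", "FileFlags", "FileModes")   # parallel arrays
DEPENDENCY_TAGS = ("BuildRequires", "BuildConflicts")       # order-insensitive


def source_id(header):
    """Return a hex ID for a source package.

    `header` is a hypothetical dict: tag name -> list of values.
    Tags outside the two tuples above (Name, BuildTime, ...) are ignored.
    """
    h = hashlib.sha256()
    for tag in PER_FILE_TAGS + DEPENDENCY_TAGS:
        values = [str(v) for v in header.get(tag, [])]
        if tag in DEPENDENCY_TAGS:
            values.sort()  # assumption: dependency order is immaterial
        # Length-prefix every field so that different inputs cannot
        # serialize to the same byte stream.
        h.update(f"{tag}:{len(values)}\n".encode())
        for v in values:
            data = v.encode()
            h.update(f"{len(data)}:".encode())
            h.update(data)
            h.update(b"\n")
    return h.hexdigest()
```

With this, two headers that differ only in BuildTime or in the order of
BuildRequires get the same ID, while flipping a file mode changes it.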
For binary packages, the identity should specifically exclude disttag.
It will no longer satisfy the definition of ID for rpm (substitution
will break for subpackages with strict dependencies). Therefore for
binary packages, we need to track (ID, disttag, filehash) tuples. This
is a one-to-many relation: for each ID, there may be a few disttags.

So for binary packages we need a separate identity-addressable storage
which maps ID to (disttag, filehash) pairs (while for source packages,
a hardlink maps ID to filehash). If implemented naively, this will
create many small files, one file per ID, most files with just one
line. In a more practical implementation, you should probably group all
those small files by package name. So you'll have:

    $ cat id2f/libfoo
    $ cat id2f/foo-data

Note that for libfoo, the IDs are different, but with foo-data the IDs
are the same. This indicates that the contents of libfoo have changed
after a rebuild, while the contents of foo-data have not.

Suppose you have such a store, and foo.src.rpm is getting rebuilt again
(or copied to p9). You can then check the store and see if the outcome
can be replaced with either libfoo-filehash1 + foo-data-filehash1 (with
disttag1) or libfoo-filehash2 + foo-data-filehash2 (with disttag2), but
not in other combinations. You'll need an elaborate algorithm which
coordinates substitutions across architectures, but this seems doable.
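
The lookup for a single architecture could look roughly like the sketch
below. It assumes that each id2f/<name> file holds one
whitespace-separated "ID disttag filehash" record per line; that
layout, and the names parse_id2f() and find_substitutions(), are
hypothetical. It does not attempt the cross-architecture coordination.

```python
def parse_id2f(text):
    """Parse one id2f/<name> file into (ID, disttag, filehash) records.

    Assumed layout: one 'ID disttag filehash' record per line.
    """
    records = []
    for line in text.splitlines():
        if line.strip():
            pkg_id, disttag, filehash = line.split()
            records.append((pkg_id, disttag, filehash))
    return records


def find_substitutions(store, rebuilt):
    """Find the earlier builds that can replace a rebuild as a whole.

    store:   {package name: [(ID, disttag, filehash), ...]}
    rebuilt: {package name: ID} for every subpackage of the new build
    Returns {disttag: {package name: filehash}}, keeping only disttags
    under which *every* rebuilt subpackage has a record with the same ID.
    """
    candidates = None
    per_name = {}
    for name, pkg_id in rebuilt.items():
        matches = {d: f for (i, d, f) in store.get(name, []) if i == pkg_id}
        per_name[name] = matches
        if candidates is None:
            candidates = set(matches)
        else:
            candidates &= set(matches)
    return {
        d: {name: per_name[name][d] for name in rebuilt}
        for d in sorted(candidates or ())
    }
```

If the rebuilt libfoo matches the ID recorded under disttag2 and the
rebuilt foo-data matches its (unchanged) ID, the only candidate is the
disttag2 pair; mixing libfoo from one disttag with foo-data from another
is never offered, which is what keeps strict dependencies between
subpackages intact.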