[devel] RFC: girar: optimize rebuild

ALT Linux Team development discussions
 help / color / mirror / Atom feed

* [devel] RFC: girar: optimize rebuild
@ 2020-04-10 23:10 Vladimir D. Seleznev
  2020-04-10 23:10 ` [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity() Vladimir D. Seleznev
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-10 23:10 UTC (permalink / raw)
  To: devel; +Cc: vseleznv

Hi!

The first part of rebuilt packages optimization for girar. It introduces
pkg_identity() and simple optimization of the rebuilt sourcerpm.

pkg_identity() takes RPM package and returns a value called package identity,
a hash of subset of RPM package header. That subset is the entire header
without some nonessential artifacts like buildhost, buildtime, header hashsum,
etc.

The two package builds of the same NEVR might have equal or different
package identities. The equal identities mean that build results of these
packages are equal too, that allows build optimization. The practical
example of simple rebuilt sourcerpm optimization also introduced.

The future work can be about optimization of "copied" to another branch
sourcerpm with retrieved from archive sourcerpm, and binary packages
optimization (this case has an issue when binary subpackages are mixed
archs, i.e. arch and noarch, this probably could work only with single-arch
builds).

Please review and discuss.

--
   WBR,
   Vladimir D. Seleznev

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity()
  2020-04-10 23:10 [devel] RFC: girar: optimize rebuild Vladimir D. Seleznev
@ 2020-04-10 23:10 ` Vladimir D. Seleznev
  2020-04-13 18:01   ` Dmitry V. Levin
  2020-04-10 23:10 ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Vladimir D. Seleznev
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-10 23:10 UTC (permalink / raw)
  To: devel; +Cc: vseleznv

---
 gb/gb-sh-functions | 112 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

diff --git a/gb/gb-sh-functions b/gb/gb-sh-functions
index cd9039a..d7b1980 100644
--- a/gb/gb-sh-functions
+++ b/gb/gb-sh-functions
@@ -283,4 +283,115 @@ rpm_changes_since()
 
 	gb-x-changelog-complement ${tmpdir}/changelog_old ${tmpdir}/changelog_new
 }
+
+pkg_identity()
+{
+	local keep_branch= pkg=
+	pkg="${1-}"; shift
+	if [ "$pkg" = "--with-branch" ]; then
+		pkg="${1-}"; shift
+		keep_branch=1
+	fi
+
+	# List of rpm tags that should be filtered
+	#
+	# RPM tags that contain insufficient information about package
+	# contents and relationship, and do not affect package functionality
+	# should be filtered.
+	#
+	# The main criterias for tags to be filtered:
+	#
+	# - Tag contains random or not reproducible value that is assigning
+	#   during the build, and this value does not affect package
+	#   functionality;
+	# - Tag contains metadata about build host properties;
+	# - Tag contains metadata of package headers, including its signatures;
+	# - Tag is only related to package database;
+	# - Other reasons that are considered worthy.
+	cat >"$tmpdir"/filtertags <<EOF
+ARCH
+ARCHIVESIZE
+AUTOINSTALLED
+BUILDARCHS
+BUILDHOST
+BUILDTIME
+COOKIE
+DBINSTANCE
+DISTRIBUTION
+DISTTAG
+DISTURL
+DSAHEADER
+FILESTATES
+HDRID
+HEADERCOLOR
+HEADERI18NTABLE
+HEADERIMAGE
+HEADERIMMUTABLE
+HEADERREGIONS
+HEADERSIGNATURES
+IDENTITY
+INSTALLCOLOR
+INSTALLTID
+INSTALLTIME
+INSTFILENAMES
+INSTPREFIXES
+LONGARCHIVESIZE
+LONGSIGSIZE
+LONGSIZE
+PKGID
+RPMVERSION
+RSAHEADER
+SHA1HEADER
+SIGGPG
+SIGLEMD5_1
+SIGMD5
+SIGPGP
+SIGSIZE
+SIZE
+SOURCEPKGID
+EOF
+
+	local rpm_version="$(rpm --version)"
+	# tag extensions do not exists in the rpm 4.0.4
+	if [ "$rpm_version" != "RPM version 4.0.4" ]; then
+	    rpm --verbose --querytags |
+	    while read tagname tagval tagtype ext; do
+		    if [ "$ext" = "ext" ]; then
+			    # filter all tag extensions too
+			    echo "$tagname" >> "$tmpdir"/filtertags
+		    fi
+	    done
+	fi
+
+	sort -o "$tmpdir"/filtertags{,}
+	rpmquery --querytags |sort >"$tmpdir"/querytags
+	join -v1 "$tmpdir/querytags" "$tmpdir/filtertags" >"$tmpdir"/tags
+
+	# erase extra apt indices tags
+	sed -i '/APTINDEX/d' "$tmpdir"/tags
+
+	# construct query format string in form "[tag:%{tag:shescape}\n]"
+	local qf="$(sed -E 's/^(.+)$/[\1:%{\1:shescape}\\n]/' "$tmpdir"/tags)"
+
+	# we want to filter disttag part of provides and requires ...
+	local disttag= var_disttag=
+	disttag="$(rpmquery --qf '%{disttag}' -p "$pkg")"
+	quote_sed_regexp_variable var_disttag "$disttag"
+
+	# ... except cases when we want to keep branch name
+	local branch_name=
+	if [ -n "$keep_branch" ]; then
+		branch_name=":${disttag%%+*}"
+	fi
+
+	(rpmquery --qf "${qf}\n" -p "$pkg" || touch "$tmpdir"/FAIL) |
+		sed "s/^\(\(PROVIDEVERSION\|REQUIREVERSION\):.*\):$var_disttag$/\1$branch_name/g" |
+		sed '/^$/d' |
+		sha256sum - |
+		cut -d" " -f1
+
+	if [ -f "$tmpdir"/FAIL ]; then
+		stamp_echo >&2 "Cannot calculate package identity of $pkg"
+		return 1
+	fi
+}
-- 
2.25.2



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo
  2020-04-10 23:10 [devel] RFC: girar: optimize rebuild Vladimir D. Seleznev
  2020-04-10 23:10 ` [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity() Vladimir D. Seleznev
@ 2020-04-10 23:10 ` Vladimir D. Seleznev
  2020-04-11 11:29   ` Alexey Tourbin
  2020-04-11 10:36 ` [devel] RFC: girar: optimize rebuild Andrey Savchenko
  2020-04-11 11:04 ` Gleb Fotengauer-Malinovskiy
  3 siblings, 1 reply; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-10 23:10 UTC (permalink / raw)
  To: devel; +Cc: vseleznv

---
 gb/gb-task-check-build-i | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/gb/gb-task-check-build-i b/gb/gb-task-check-build-i
index 9017e12..38a6ab3 100755
--- a/gb/gb-task-check-build-i
+++ b/gb/gb-task-check-build-i
@@ -249,10 +249,26 @@ for arch in $GB_ARCH; do
 	fi
 done
 
+osrpm_identity=
+osrpm="$GB_REPO_DIR/files/SRPMS/$srpmsu"
+if [ -f "$osrpm" ]; then
+	echo >&2 "$I: Found $srpmsu in the repo, this means the package was rebuilt"
+	osrpm_identity="$(pkg_identity "$osrpm")"
+fi
+
 for arch in $GB_ARCH; do
 	[ -d "$arch/srpm" -o ! -s "$arch/excluded" ] || continue
 	f="$arch/srpm/$srpmsu"
 	[ -f "$f" ] || continue
+	srpm_identity="$(pkg_identity "$f")"
+	echo >&2 "$I: $arch $srpmsu identity = $srpm_identity"
+	# non-empty $osrpm_identity means the NEVR was rebuilt
+	# optimize rebuilt sourcerpm if identities of original and rebuilt sourcerpms are equal
+	if [ -n "$osrpm_identity" ] &&
+		   [ "$osrpm_identity" = "$srpm_identity" ]; then
+		echo >&2 "$I: $arch: optimize rebuilt $srpmsu cause its identity is equal to $srpmsu in the repo"
+		install -p "$osrpm" "$f"
+	fi
 	built_pkgname="$(rpmquery --qf '%{name}' -p -- "$f")"
 	echo "$built_pkgname" > pkgname
 	break
-- 
2.25.2



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-10 23:10 [devel] RFC: girar: optimize rebuild Vladimir D. Seleznev
  2020-04-10 23:10 ` [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity() Vladimir D. Seleznev
  2020-04-10 23:10 ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Vladimir D. Seleznev
@ 2020-04-11 10:36 ` Andrey Savchenko
  2020-04-11 15:33   ` Vladimir D. Seleznev
  2020-04-11 23:31   ` Alexey V. Vissarionov
  2020-04-11 11:04 ` Gleb Fotengauer-Malinovskiy
  3 siblings, 2 replies; 22+ messages in thread
From: Andrey Savchenko @ 2020-04-11 10:36 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 2186 bytes --]

On Sat, 11 Apr 2020 02:10:42 +0300 Vladimir D. Seleznev wrote:
> 
> Hi!
> 
> The first part of rebuilt packages optimization for girar. It introduces
> pkg_identity() and simple optimization of the rebuilt sourcerpm.
> 
> pkg_identity() takes RPM package and returns a value called package identity,
> a hash of subset of RPM package header. That subset is the entire header
> without some nonessential artifacts like buildhost, buildtime, header hashsum,
> etc.
> 
> The two package builds of the same NEVR might have equal or different
> package identities. The equal identities mean that build results of these
> packages are equal too, that allows build optimization. The practical
> example of simple rebuilt sourcerpm optimization also introduced.
> 
> The future work can be about optimization of "copied" to another branch
> sourcerpm with retrieved from archive sourcerpm, and binary packages
> optimization (this case has an issue when binary subpackages are mixed
> archs, i.e. arch and noarch, this probably could work only with single-arch
> builds).
> 
> Please review and discuss.

I see two problems with proposed approach:

1) It assumes there will be not pkg_identity hash collisions. This
is wrong. They may occur sooner or later and the code *must*
correctly deal with such collisions. Remember what happened to
subversion when collision occurred in a repository, while git was
resilient.

The way proposal is now the identity hash collision will lead to
undergraded repository at best and broken at worst.

I see no easy way to fix this problem, but it must be either fixed
or proposed optimization rejected.

2) The hash function choise — sha256 — is very unfortunate: it has
longer digest than sha1, but otherwise is vulnerable to the same
attack; so right now it is still marginally secure, but it will not
last long. Moreover sha256 is quite slow.

It is better to use newer generation of hash functions, e.g.
blake2b based on the chacha stream cipher. It is more future proof
and faster at the same time. You can just use the b2sum
implementation from the GNU coreutils.

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-10 23:10 [devel] RFC: girar: optimize rebuild Vladimir D. Seleznev
                   ` (2 preceding siblings ...)
  2020-04-11 10:36 ` [devel] RFC: girar: optimize rebuild Andrey Savchenko
@ 2020-04-11 11:04 ` Gleb Fotengauer-Malinovskiy
  2020-04-11 15:21   ` Vladimir D. Seleznev
  3 siblings, 1 reply; 22+ messages in thread
From: Gleb Fotengauer-Malinovskiy @ 2020-04-11 11:04 UTC (permalink / raw)
  To: ALT Linux Team development discussions; +Cc: vseleznv

[-- Attachment #1: Type: text/plain, Size: 1913 bytes --]

Hi,

On Sat, Apr 11, 2020 at 02:10:42AM +0300, Vladimir D. Seleznev wrote:
> 
> Hi!
> 
> The first part of rebuilt packages optimization for girar. It introduces
> pkg_identity() and simple optimization of the rebuilt sourcerpm.

Why do we rebuild source rpm at all when we already have one?  I mean,
when we use hasher with --query-repackage this new rebuilt source rpm is
no better then original one.

I think we can always save the original source rpm when we rebuild
a package or copy it from branch to branch (like we actually do for
packages originally built from src.rpm-s).

> pkg_identity() takes RPM package and returns a value called package identity,
> a hash of subset of RPM package header. That subset is the entire header
> without some nonessential artifacts like buildhost, buildtime, header hashsum,
> etc.
> 
> The two package builds of the same NEVR might have equal or different
> package identities. The equal identities mean that build results of these
> packages are equal too, that allows build optimization. The practical
> example of simple rebuilt sourcerpm optimization also introduced.

Did you consider adding all this identity logic on the rpm's side (as a
standalone helper may be)?  I personally don't like the whole idea
of tracking rpm tags status on girar side.  Also, this helper may be
useful outside of girar.

> The future work can be about optimization of "copied" to another branch
> sourcerpm with retrieved from archive sourcerpm, and binary packages
> optimization (this case has an issue when binary subpackages are mixed
> archs, i.e. arch and noarch, this probably could work only with single-arch
> builds).

Looks like a good plan.  I think optimization of binary packages is more
important then optimization which looks for archived packages.
We may want to take binary packages from archive too anyway.

-- 
glebfm

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo
  2020-04-10 23:10 ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Vladimir D. Seleznev
@ 2020-04-11 11:29   ` Alexey Tourbin
  2020-04-14 16:42     ` Vladimir D. Seleznev
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Tourbin @ 2020-04-11 11:29 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Sat, Apr 11, 2020 at 2:11 AM Vladimir D. Seleznev
<vseleznv@altlinux.org> wrote:
> +osrpm_identity=
> +osrpm="$GB_REPO_DIR/files/SRPMS/$srpmsu"
> +if [ -f "$osrpm" ]; then
> +       echo >&2 "$I: Found $srpmsu in the repo, this means the package was rebuilt"
> +       osrpm_identity="$(pkg_identity "$osrpm")"
> +fi
> +
>  for arch in $GB_ARCH; do
>         [ -d "$arch/srpm" -o ! -s "$arch/excluded" ] || continue
>         f="$arch/srpm/$srpmsu"
>         [ -f "$f" ] || continue
> +       srpm_identity="$(pkg_identity "$f")"
> +       echo >&2 "$I: $arch $srpmsu identity = $srpm_identity"
> +       # non-empty $osrpm_identity means the NEVR was rebuilt
> +       # optimize rebuilt sourcerpm if identities of original and rebuilt sourcerpms are equal
> +       if [ -n "$osrpm_identity" ] &&
> +                  [ "$osrpm_identity" = "$srpm_identity" ]; then
> +               echo >&2 "$I: $arch: optimize rebuilt $srpmsu cause its identity is equal to $srpmsu in the repo"
> +               install -p "$osrpm" "$f"
> +       fi
>         built_pkgname="$(rpmquery --qf '%{name}' -p -- "$f")"
>         echo "$built_pkgname" > pkgname
>         break

So how does it work in practice? Suppose I first uploaded a .src.rpm
package. Do we store the original src.rpm, the one with the uploader's
signature? When it gets rebuilt, this should not affect the original
.src.rpm (as if it was uploaded again). No special handling is
required in this case.

Then suppose I build a gearifeid package from Sisyphus for p9. But
your code only handles GB_REPO_DIR, not the NEIGHBOUR_REPO_DIR the
package comes from. To be clear, that information is lost: when you
request to build a signed tag from /gears, it does not imply that
there is a corresponding .src.rpm in any REPO_DIR.

There is already a problem with cross-repo copying: if done in
earnest, both repos need to be locked. And of course this is
deadlock-prone. You can do better without any locking if you identify
every package in all repos with your new identity hash. This can be
done relatively easy, since you already have that big
content-addressable storage. You can hardlink it into a shadow
identity-addressable storage. Once you've done that, you obtain the
global / beatific vision: given a package, you instantly know if you
have already seen something like this. (On the second thought: you
don't need locking because the -f test is atomic and files cannot be
removed from the storage, but there will still be race conditions.
It's not too bad in practice. Further those race conditions can be
detected at the task-commit stage.)

There is one specific problem with the outlined approach: the notion
of identity is flawed, because the disttag may or may not matter.
Sometimes you cannot substitute a package for another package with the
same identity but a different disttag. Specifically this is the case
with strict dependencies between subpackages. You cannot substitute a
subpackage unless you also substitute all the other subpackages. This
is further complicated by noarch subpackages: you need to coordinate
substitution across architectures.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-11 11:04 ` Gleb Fotengauer-Malinovskiy
@ 2020-04-11 15:21   ` Vladimir D. Seleznev
  2020-04-11 16:41     ` Gleb Fotengauer-Malinovskiy
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-11 15:21 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Sat, Apr 11, 2020 at 02:04:25PM +0300, Gleb Fotengauer-Malinovskiy wrote:
> Hi,
> 
> On Sat, Apr 11, 2020 at 02:10:42AM +0300, Vladimir D. Seleznev wrote:
> > 
> > Hi!
> > 
> > The first part of rebuilt packages optimization for girar. It introduces
> > pkg_identity() and simple optimization of the rebuilt sourcerpm.
> 
> Why do we rebuild source rpm at all when we already have one?  I mean,
> when we use hasher with --query-repackage this new rebuilt source rpm is
> no better then original one.
> 
> I think we can always save the original source rpm when we rebuild
> a package or copy it from branch to branch (like we actually do for
> packages originally built from src.rpm-s).

I'm sorry, I was not clear. Sure when a package is built from the
sourcerpm, no optimization is required in this case as girar saves only
original sourcerpm. The different things happen when package is built
from the gear. In the case when package is rebuilt from the gear, girar
produce new source and binary rpms, and when the rebuilt task is done it
saves all these new source and binary rpms. The proposed optimization is
aimed for that case.

> > pkg_identity() takes RPM package and returns a value called package identity,
> > a hash of subset of RPM package header. That subset is the entire header
> > without some nonessential artifacts like buildhost, buildtime, header hashsum,
> > etc.
> > 
> > The two package builds of the same NEVR might have equal or different
> > package identities. The equal identities mean that build results of these
> > packages are equal too, that allows build optimization. The practical
> > example of simple rebuilt sourcerpm optimization also introduced.
> 
> Did you consider adding all this identity logic on the rpm's side (as a
> standalone helper may be)?  I personally don't like the whole idea
> of tracking rpm tags status on girar side.  Also, this helper may be
> useful outside of girar.

I did, but it's a bit complicated. RPM community likes the idea, but
there is no consensus about how it should work. Sure each project can
realize it by its own specific way.

So, whether we should calculate the package identity in the girar side
or the rpm side? If it should be on rpm side, should it support rpm
4.0.4?

> > The future work can be about optimization of "copied" to another branch
> > sourcerpm with retrieved from archive sourcerpm, and binary packages
> > optimization (this case has an issue when binary subpackages are mixed
> > archs, i.e. arch and noarch, this probably could work only with single-arch
> > builds).
> 
> Looks like a good plan.  I think optimization of binary packages is more
> important then optimization which looks for archived packages.
> We may want to take binary packages from archive too anyway.

Ok.

-- 
   WBR,
   Vladimir D. Seleznev


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-11 10:36 ` [devel] RFC: girar: optimize rebuild Andrey Savchenko
@ 2020-04-11 15:33   ` Vladimir D. Seleznev
  2020-04-11 23:31   ` Alexey V. Vissarionov
  1 sibling, 0 replies; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-11 15:33 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Sat, Apr 11, 2020 at 01:36:31PM +0300, Andrey Savchenko wrote:
> On Sat, 11 Apr 2020 02:10:42 +0300 Vladimir D. Seleznev wrote:
> > 
> > Hi!
> > 
> > The first part of rebuilt packages optimization for girar. It introduces
> > pkg_identity() and simple optimization of the rebuilt sourcerpm.
> > 
> > pkg_identity() takes RPM package and returns a value called package identity,
> > a hash of subset of RPM package header. That subset is the entire header
> > without some nonessential artifacts like buildhost, buildtime, header hashsum,
> > etc.
> > 
> > The two package builds of the same NEVR might have equal or different
> > package identities. The equal identities mean that build results of these
> > packages are equal too, that allows build optimization. The practical
> > example of simple rebuilt sourcerpm optimization also introduced.
> > 
> > The future work can be about optimization of "copied" to another branch
> > sourcerpm with retrieved from archive sourcerpm, and binary packages
> > optimization (this case has an issue when binary subpackages are mixed
> > archs, i.e. arch and noarch, this probably could work only with single-arch
> > builds).
> > 
> > Please review and discuss.
> 
> I see two problems with proposed approach:
> 
> 1) It assumes there will be not pkg_identity hash collisions. This
> is wrong. They may occur sooner or later and the code *must*
> correctly deal with such collisions. Remember what happened to
> subversion when collision occurred in a repository, while git was
> resilient.

Any hashsum function has collisions by definition. The only way to avoid
them is not to use hashsums.

> The way proposal is now the identity hash collision will lead to
> undergraded repository at best and broken at worst.

No, it will not, cause any issues that this collision might bring up
will be caught by later build checks.

> I see no easy way to fix this problem, but it must be either fixed
> or proposed optimization rejected.
> 
> 2) The hash function choise — sha256 — is very unfortunate: it has
> longer digest than sha1, but otherwise is vulnerable to the same
> attack; so right now it is still marginally secure, but it will not
> last long. Moreover sha256 is quite slow.

The good news: it is not about security.

> It is better to use newer generation of hash functions, e.g.
> blake2b based on the chacha stream cipher. It is more future proof
> and faster at the same time. You can just use the b2sum
> implementation from the GNU coreutils.

-- 
   WBR,
   Vladimir D. Seleznev


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-11 15:21   ` Vladimir D. Seleznev
@ 2020-04-11 16:41     ` Gleb Fotengauer-Malinovskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Gleb Fotengauer-Malinovskiy @ 2020-04-11 16:41 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 1828 bytes --]

On Sat, Apr 11, 2020 at 06:21:01PM +0300, Vladimir D. Seleznev wrote:
> On Sat, Apr 11, 2020 at 02:04:25PM +0300, Gleb Fotengauer-Malinovskiy wrote:
> > Hi,
> > 
> > On Sat, Apr 11, 2020 at 02:10:42AM +0300, Vladimir D. Seleznev wrote:
> > > 
> > > Hi!
> > > 
> > > The first part of rebuilt packages optimization for girar. It introduces
> > > pkg_identity() and simple optimization of the rebuilt sourcerpm.
> > 
> > Why do we rebuild source rpm at all when we already have one?  I mean,
> > when we use hasher with --query-repackage this new rebuilt source rpm is
> > no better then original one.
> > 
> > I think we can always save the original source rpm when we rebuild
> > a package or copy it from branch to branch (like we actually do for
> > packages originally built from src.rpm-s).
> 
> I'm sorry, I was not clear.

No, I think you were clear enough.  Unlike me, obviously. :)

> Sure when a package is built from the sourcerpm, no optimization is
> required in this case as girar saves only original sourcerpm.

Yes.

> The different things happen when package is built
> from the gear. In the case when package is rebuilt from the gear, girar
> produce new source and binary rpms, and when the rebuilt task is done it
> saves all these new source and binary rpms. The proposed optimization is
> aimed for that case.

Yes, I was talking about this same case.
I think we don't need to bother with identity of src.rpm-s at all.
If we build two src.rpm-s from two identical pkg.tar-s we still get
equivalent src.rpm-s because in --query-repackage mode hasher uses src.rpm
as a source archive (buildtime is the only read rpm tag).

Moreover, if we do not store rebuilt src.rpm-s we do not need to rebuild
it at all, we can use old src.rpm for rebuild and copy.

-- 
glebfm

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-11 10:36 ` [devel] RFC: girar: optimize rebuild Andrey Savchenko
  2020-04-11 15:33   ` Vladimir D. Seleznev
@ 2020-04-11 23:31   ` Alexey V. Vissarionov
  2020-04-14 14:57     ` Andrey Savchenko
  1 sibling, 1 reply; 22+ messages in thread
From: Alexey V. Vissarionov @ 2020-04-11 23:31 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On 2020-04-11 13:36:31 +0300, Andrey Savchenko wrote:

 >> The first part of rebuilt packages optimization for girar.
 >> It introduces pkg_identity() and simple optimization of the
 >> rebuilt sourcerpm.
 >> pkg_identity() takes RPM package and returns a value called
 >> package identity, a hash of subset of RPM package header.
 >> That subset is the entire header without some nonessential
 >> artifacts like buildhost, buildtime, header hashsum, etc.
 > I see two problems with proposed approach:
 > 1) It assumes there will be not pkg_identity hash collisions.
 > This is wrong. They may occur sooner or later and the code
 > *must* correctly deal with such collisions.

The solution is well known: prefix the hash with a time_t value
to let it grow monotonously while still being strictly dependent
on sensitive data.

Whether we'd face a hash collision, we could check whether the
timestamps differ significantly.

 > 2) The hash function choise — sha256 — is very unfortunate:
 > it has longer digest than sha1, but otherwise is vulnerable
 > to the same attack; so right now it is still marginally secure,
 > but it will not last long.
We don't really need any cryptographic-grade hash function here:
all we need is just a checksum with a good distribution to detect
whether something had changed - obviously enough, nobody would
try to build and exploit collisions here. Said that, we can use
almost any polynomial.

 > Moreover sha256 is quite slow.

SHA2 is implemented in the hardware in some modern CPUs, so it's
quite fast there.

 > It is better to use newer generation of hash functions, e.g.
 > blake2b based on the chacha stream cipher.

Both are still quite marginal... but, once again, we should not
care of that too much - any hash function would do the job. Even
if we'd switch to another polynomial in the future.


-- 
Alexey V. Vissarionov
gremlin ПРИ altlinux ТЧК org; +vii-cmiii-ccxxix-lxxix-xlii
GPG: 0D92F19E1C0DC36E27F61A29CD17E2B43D879005 @ hkp://keys.gnupg.net


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity()
  2020-04-10 23:10 ` [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity() Vladimir D. Seleznev
@ 2020-04-13 18:01   ` Dmitry V. Levin
  2020-04-13 19:32     ` Vladimir D. Seleznev
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry V. Levin @ 2020-04-13 18:01 UTC (permalink / raw)
  To: Vladimir D. Seleznev; +Cc: ALT Devel discussion list

On Sat, Apr 11, 2020 at 02:10:43AM +0300, Vladimir D. Seleznev wrote:
> ---
>  gb/gb-sh-functions | 112 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 112 insertions(+)
> 
> diff --git a/gb/gb-sh-functions b/gb/gb-sh-functions
> index cd9039a..d7b1980 100644
> --- a/gb/gb-sh-functions
> +++ b/gb/gb-sh-functions
> @@ -283,4 +283,115 @@ rpm_changes_since()
>  
>  	gb-x-changelog-complement ${tmpdir}/changelog_old ${tmpdir}/changelog_new
>  }
> +
> +pkg_identity()
> +{
> +	local keep_branch= pkg=
> +	pkg="${1-}"; shift
> +	if [ "$pkg" = "--with-branch" ]; then
> +		pkg="${1-}"; shift
> +		keep_branch=1
> +	fi
> +
> +	# List of rpm tags that should be filtered
> +	#
> +	# RPM tags that contain insufficient information about package
> +	# contents and relationship, and do not affect package functionality
> +	# should be filtered.
> +	#
> +	# The main criterias for tags to be filtered:
> +	#
> +	# - Tag contains random or not reproducible value that is assigning
> +	#   during the build, and this value does not affect package
> +	#   functionality;
> +	# - Tag contains metadata about build host properties;
> +	# - Tag contains metadata of package headers, including its signatures;
> +	# - Tag is only related to package database;
> +	# - Other reasons that are considered worthy.
> +	cat >"$tmpdir"/filtertags <<EOF
> +ARCH
> +ARCHIVESIZE
> +AUTOINSTALLED
> +BUILDARCHS
> +BUILDHOST
> +BUILDTIME
> +COOKIE
> +DBINSTANCE
> +DISTRIBUTION
> +DISTTAG
> +DISTURL
> +DSAHEADER
> +FILESTATES
> +HDRID
> +HEADERCOLOR
> +HEADERI18NTABLE
> +HEADERIMAGE
> +HEADERIMMUTABLE
> +HEADERREGIONS
> +HEADERSIGNATURES
> +IDENTITY
> +INSTALLCOLOR
> +INSTALLTID
> +INSTALLTIME
> +INSTFILENAMES
> +INSTPREFIXES
> +LONGARCHIVESIZE
> +LONGSIGSIZE
> +LONGSIZE
> +PKGID
> +RPMVERSION
> +RSAHEADER
> +SHA1HEADER
> +SIGGPG
> +SIGLEMD5_1
> +SIGMD5
> +SIGPGP
> +SIGSIZE
> +SIZE
> +SOURCEPKGID
> +EOF

First of all, I don't think this tools belongs to girar.

> +	local rpm_version="$(rpm --version)"
> +	# tag extensions do not exists in the rpm 4.0.4
> +	if [ "$rpm_version" != "RPM version 4.0.4" ]; then
> +	    rpm --verbose --querytags |
> +	    while read tagname tagval tagtype ext; do
> +		    if [ "$ext" = "ext" ]; then
> +			    # filter all tag extensions too
> +			    echo "$tagname" >> "$tmpdir"/filtertags
> +		    fi
> +	    done
> +	fi

Second, I don't see why do you need this rpm version check.
Can't you just do this regardless of rpm version?

> +	sort -o "$tmpdir"/filtertags{,}
> +	rpmquery --querytags |sort >"$tmpdir"/querytags

We don't assume that pipefail is enabled, so we prefer
	rpmquery --querytags "$tmpdir"/querytags
	sort -u -o "$tmpdir"/querytags{,}

> +	join -v1 "$tmpdir/querytags" "$tmpdir/filtertags" >"$tmpdir"/tags
> +
> +	# erase extra apt indices tags
> +	sed -i '/APTINDEX/d' "$tmpdir"/tags

Have you ever seen a package with these tags in its header?

> +
> +	# construct query format string in form "[tag:%{tag:shescape}\n]"
> +	local qf="$(sed -E 's/^(.+)$/[\1:%{\1:shescape}\\n]/' "$tmpdir"/tags)"

This construction ignores sed exit status.


-- 
ldv


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity()
  2020-04-13 18:01   ` Dmitry V. Levin
@ 2020-04-13 19:32     ` Vladimir D. Seleznev
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-13 19:32 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Mon, Apr 13, 2020 at 09:01:43PM +0300, Dmitry V. Levin wrote:
> On Sat, Apr 11, 2020 at 02:10:43AM +0300, Vladimir D. Seleznev wrote:
> > ---
> >  gb/gb-sh-functions | 112 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 112 insertions(+)
> > 
> > diff --git a/gb/gb-sh-functions b/gb/gb-sh-functions
> > index cd9039a..d7b1980 100644
> > --- a/gb/gb-sh-functions
> > +++ b/gb/gb-sh-functions
> > @@ -283,4 +283,115 @@ rpm_changes_since()
> >  
> >  	gb-x-changelog-complement ${tmpdir}/changelog_old ${tmpdir}/changelog_new
> >  }
> > +
> > +pkg_identity()
> > +{
> > +	local keep_branch= pkg=
> > +	pkg="${1-}"; shift
> > +	if [ "$pkg" = "--with-branch" ]; then
> > +		pkg="${1-}"; shift
> > +		keep_branch=1
> > +	fi
> > +
> > +	# List of rpm tags that should be filtered
> > +	#
> > +	# RPM tags that contain insufficient information about package
> > +	# contents and relationship, and do not affect package functionality
> > +	# should be filtered.
> > +	#
> > +	# The main criterias for tags to be filtered:
> > +	#
> > +	# - Tag contains random or not reproducible value that is assigning
> > +	#   during the build, and this value does not affect package
> > +	#   functionality;
> > +	# - Tag contains metadata about build host properties;
> > +	# - Tag contains metadata of package headers, including its signatures;
> > +	# - Tag is only related to package database;
> > +	# - Other reasons that are considered worthy.
> > +	cat >"$tmpdir"/filtertags <<EOF
> > +ARCH
> > +ARCHIVESIZE
> > +AUTOINSTALLED
> > +BUILDARCHS
> > +BUILDHOST
> > +BUILDTIME
> > +COOKIE
> > +DBINSTANCE
> > +DISTRIBUTION
> > +DISTTAG
> > +DISTURL
> > +DSAHEADER
> > +FILESTATES
> > +HDRID
> > +HEADERCOLOR
> > +HEADERI18NTABLE
> > +HEADERIMAGE
> > +HEADERIMMUTABLE
> > +HEADERREGIONS
> > +HEADERSIGNATURES
> > +IDENTITY
> > +INSTALLCOLOR
> > +INSTALLTID
> > +INSTALLTIME
> > +INSTFILENAMES
> > +INSTPREFIXES
> > +LONGARCHIVESIZE
> > +LONGSIGSIZE
> > +LONGSIZE
> > +PKGID
> > +RPMVERSION
> > +RSAHEADER
> > +SHA1HEADER
> > +SIGGPG
> > +SIGLEMD5_1
> > +SIGMD5
> > +SIGPGP
> > +SIGSIZE
> > +SIZE
> > +SOURCEPKGID
> > +EOF
> 
> First of all, I don't think this tools belongs to girar.

Ok, it will be a separate tool. Should it be implemented as some
functionality of rpm, or as a side project?

> > +	local rpm_version="$(rpm --version)"
> > +	# tag extensions do not exists in the rpm 4.0.4
> > +	if [ "$rpm_version" != "RPM version 4.0.4" ]; then
> > +	    rpm --verbose --querytags |
> > +	    while read tagname tagval tagtype ext; do
> > +		    if [ "$ext" = "ext" ]; then
> > +			    # filter all tag extensions too
> > +			    echo "$tagname" >> "$tmpdir"/filtertags
> > +		    fi
> > +	    done
> > +	fi
> 
> Second, I don't see why do you need this rpm version check.
> Can't you just do this regardless of rpm version?

A new version of rpm can brings new tag extensions, so I can maintain
list of tag extensions by myself, or I can get the list from rpm
output. The second option is way more convenient.

> > +	sort -o "$tmpdir"/filtertags{,}
> > +	rpmquery --querytags |sort >"$tmpdir"/querytags

Ok.

> We don't assume that pipefail is enabled, so we prefer
> 	rpmquery --querytags "$tmpdir"/querytags
> 	sort -u -o "$tmpdir"/querytags{,}
> 
> > +	join -v1 "$tmpdir/querytags" "$tmpdir/filtertags" >"$tmpdir"/tags
> > +
> > +	# erase extra apt indices tags
> > +	sed -i '/APTINDEX/d' "$tmpdir"/tags
> 
> Have you ever seen a package with these tags in its header?

I haven't, but just for sure... But if actually they would appear in
package header we shouldn't just ignore them. I'll wipe these lines out.

> > +
> > +	# construct query format string in form "[tag:%{tag:shescape}\n]"
> > +	local qf="$(sed -E 's/^(.+)$/[\1:%{\1:shescape}\\n]/' "$tmpdir"/tags)"
> 
> This construction ignores sed exit status.

Is

    local qf="$(sed -E 's/^(.+)$/[\1:%{\1:shescape}\\n]/' "$tmpdir"/tags |touch $tmpdir/FAIL)"

    if [ -f"$tmpdir/FAIL" ]; then
    ...

OK?

-- 
   WBR,
   Vladimir D. Seleznev


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-11 23:31   ` Alexey V. Vissarionov
@ 2020-04-14 14:57     ` Andrey Savchenko
  2020-04-14 16:20       ` Vladimir D. Seleznev
  0 siblings, 1 reply; 22+ messages in thread
From: Andrey Savchenko @ 2020-04-14 14:57 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 3631 bytes --]

On Sun, 12 Apr 2020 02:31:43 +0300 Alexey V. Vissarionov wrote:
> On 2020-04-11 13:36:31 +0300, Andrey Savchenko wrote:
> 
>  >> The first part of rebuilt packages optimization for girar.
>  >> It introduces pkg_identity() and simple optimization of the
>  >> rebuilt sourcerpm.
>  >> pkg_identity() takes RPM package and returns a value called
>  >> package identity, a hash of subset of RPM package header.
>  >> That subset is the entire header without some nonessential
>  >> artifacts like buildhost, buildtime, header hashsum, etc.
>  > I see two problems with proposed approach:
>  > 1) It assumes there will be not pkg_identity hash collisions.
>  > This is wrong. They may occur sooner or later and the code
>  > *must* correctly deal with such collisions.
> 
> The solution is well known: prefix the hash with a time_t value
> to let it grow monotonously while still being strictly dependent
> on sensitive data.

Yes, this is a good idea.
 
> Whether we'd face a hash collision, we could check whether the
> timestamps differ significantly.
> 
>  > 2) The hash function choise — sha256 — is very unfortunate:
>  > it has longer digest than sha1, but otherwise is vulnerable
>  > to the same attack; so right now it is still marginally secure,
>  > but it will not last long.
> We don't really need any cryptographic-grade hash function here:
> all we need is just a checksum with a good distribution to detect
> whether something had changed - obviously enough, nobody would
> try to build and exploit collisions here. Said that, we can use
> almost any polynomial.

Still it may be a security issue. Consider what will happen if
wrong source rpm will be used: new modifications including security
fixes may be silently omitted from a branch.

>  > Moreover sha256 is quite slow.
> 
> SHA2 is implemented in the hardware in some modern CPUs, so it's
> quite fast there.

Only in some and only for amd64 arch. But our man build infrastructure
also uses ppc64le and aarch64, so it is very important to be
efficient, especially on aarch64 which is a bottleneck for most
tasks. And consider that we have secondary build systems for other
arches like mips, riscv, e2k.

A talk is cheap, so let's see some some numbers.

0) dd if=/dev/urandom of=/tmp/test.file bs=1M count=2048

1) Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
$ time sha256sum -b /tmp/test.file
8.67user 0.27system 0:08.94elapsed 99%CPU (0avgtext+0avgdata 1944maxresident)k
8.70user 0.25system 0:08.96elapsed 99%CPU (0avgtext+0avgdata 2148maxresident)k
8.65user 0.28system 0:08.93elapsed 99%CPU (0avgtext+0avgdata 2064maxresident)k

$ time b2sum -b /tmp/test.file
2.48user 0.32system 0:02.81elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
2.46user 0.30system 0:02.76elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
2.47user 0.29system 0:02.77elapsed 99%CPU (0avgtext+0avgdata 2068maxresident)k

2) E8C (1300 MHz,  MBE8C-PC v.2)
$ time sha256sum -b /tmp/test.file
11.69user 0.93system 0:12.64elapsed 99%CPU (0avgtext+0avgdata 3784maxresident)k
11.78user 0.85system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3836maxresident)k
11.72user 0.90system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3956maxresident)k

$ time b2sum -b /tmp/test.file
6.90user 1.37system 0:08.27elapsed 99%CPU (0avgtext+0avgdata 3896maxresident)k
6.76user 1.10system 0:07.87elapsed 99%CPU (0avgtext+0avgdata 3844maxresident)k
6.93user 0.95system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 3872maxresident)k

I see no reason for using slower and less secure sha256 algorithm.

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] RFC: girar: optimize rebuild
  2020-04-14 14:57     ` Andrey Savchenko
@ 2020-04-14 16:20       ` Vladimir D. Seleznev
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-14 16:20 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Tue, Apr 14, 2020 at 05:57:13PM +0300, Andrey Savchenko wrote:
> On Sun, 12 Apr 2020 02:31:43 +0300 Alexey V. Vissarionov wrote:
> > On 2020-04-11 13:36:31 +0300, Andrey Savchenko wrote:
> > 
> >  >> The first part of rebuilt packages optimization for girar.
> >  >> It introduces pkg_identity() and simple optimization of the
> >  >> rebuilt sourcerpm.
> >  >> pkg_identity() takes RPM package and returns a value called
> >  >> package identity, a hash of subset of RPM package header.
> >  >> That subset is the entire header without some nonessential
> >  >> artifacts like buildhost, buildtime, header hashsum, etc.
> >  > I see two problems with proposed approach:
> >  > 1) It assumes there will be not pkg_identity hash collisions.
> >  > This is wrong. They may occur sooner or later and the code
> >  > *must* correctly deal with such collisions.
> > 
> > The solution is well known: prefix the hash with a time_t value
> > to let it grow monotonously while still being strictly dependent
> > on sensitive data.
> 
> Yes, this is a good idea.

I don't get the idea.

> > Whether we'd face a hash collision, we could check whether the
> > timestamps differ significantly.
> > 
> >  > 2) The hash function choise — sha256 — is very unfortunate:
> >  > it has longer digest than sha1, but otherwise is vulnerable
> >  > to the same attack; so right now it is still marginally secure,
> >  > but it will not last long.
> > We don't really need any cryptographic-grade hash function here:
> > all we need is just a checksum with a good distribution to detect
> > whether something had changed - obviously enough, nobody would
> > try to build and exploit collisions here. Said that, we can use
> > almost any polynomial.
> 
> Still it may be a security issue. Consider what will happen if
> wrong source rpm will be used: new modifications including security
> fixes may be silently omitted from a branch.

Nothing bad will happen. I see you don't understand the task: it's not
about neither the new modifications or new releases. It's only about
package rebuild. It uses no new sources.

> >  > Moreover sha256 is quite slow.
> > 
> > SHA2 is implemented in the hardware in some modern CPUs, so it's
> > quite fast there.
> 
> Only in some and only for amd64 arch. But our man build infrastructure
> also uses ppc64le and aarch64, so it is very important to be
> efficient, especially on aarch64 which is a bottleneck for most
> tasks. And consider that we have secondary build systems for other
> arches like mips, riscv, e2k.
> 
> A talk is cheap, so let's see some some numbers.
> 
> 0) dd if=/dev/urandom of=/tmp/test.file bs=1M count=2048
> 
> 1) Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
> $ time sha256sum -b /tmp/test.file
> 8.67user 0.27system 0:08.94elapsed 99%CPU (0avgtext+0avgdata 1944maxresident)k
> 8.70user 0.25system 0:08.96elapsed 99%CPU (0avgtext+0avgdata 2148maxresident)k
> 8.65user 0.28system 0:08.93elapsed 99%CPU (0avgtext+0avgdata 2064maxresident)k
> 
> $ time b2sum -b /tmp/test.file
> 2.48user 0.32system 0:02.81elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
> 2.46user 0.30system 0:02.76elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
> 2.47user 0.29system 0:02.77elapsed 99%CPU (0avgtext+0avgdata 2068maxresident)k
> 
> 2) E8C (1300 MHz,  MBE8C-PC v.2)
> $ time sha256sum -b /tmp/test.file
> 11.69user 0.93system 0:12.64elapsed 99%CPU (0avgtext+0avgdata 3784maxresident)k
> 11.78user 0.85system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3836maxresident)k
> 11.72user 0.90system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3956maxresident)k
> 
> $ time b2sum -b /tmp/test.file
> 6.90user 1.37system 0:08.27elapsed 99%CPU (0avgtext+0avgdata 3896maxresident)k
> 6.76user 1.10system 0:07.87elapsed 99%CPU (0avgtext+0avgdata 3844maxresident)k
> 6.93user 0.95system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 3872maxresident)k
> 
> I see no reason for using slower and less secure sha256 algorithm.

We can use more faster algorithm. Again, it is not about security.

-- 
   WBR,
   Vladimir D. Seleznev


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo
  2020-04-11 11:29   ` Alexey Tourbin
@ 2020-04-14 16:42     ` Vladimir D. Seleznev
  2020-04-16 21:51       ` Alexey Tourbin
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-14 16:42 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Sat, Apr 11, 2020 at 02:29:55PM +0300, Alexey Tourbin wrote:
> On Sat, Apr 11, 2020 at 2:11 AM Vladimir D. Seleznev
> <vseleznv@altlinux.org> wrote:
> > +osrpm_identity=
> > +osrpm="$GB_REPO_DIR/files/SRPMS/$srpmsu"
> > +if [ -f "$osrpm" ]; then
> > +       echo >&2 "$I: Found $srpmsu in the repo, this means the package was rebuilt"
> > +       osrpm_identity="$(pkg_identity "$osrpm")"
> > +fi
> > +
> >  for arch in $GB_ARCH; do
> >         [ -d "$arch/srpm" -o ! -s "$arch/excluded" ] || continue
> >         f="$arch/srpm/$srpmsu"
> >         [ -f "$f" ] || continue
> > +       srpm_identity="$(pkg_identity "$f")"
> > +       echo >&2 "$I: $arch $srpmsu identity = $srpm_identity"
> > +       # non-empty $osrpm_identity means the NEVR was rebuilt
> > +       # optimize rebuilt sourcerpm if identities of original and rebuilt sourcerpms are equal
> > +       if [ -n "$osrpm_identity" ] &&
> > +                  [ "$osrpm_identity" = "$srpm_identity" ]; then
> > +               echo >&2 "$I: $arch: optimize rebuilt $srpmsu cause its identity is equal to $srpmsu in the repo"
> > +               install -p "$osrpm" "$f"
> > +       fi
> >         built_pkgname="$(rpmquery --qf '%{name}' -p -- "$f")"
> >         echo "$built_pkgname" > pkgname
> >         break
> 
> So how does it work in practice? Suppose I first uploaded a .src.rpm
> package. Do we store the original src.rpm, the one with the uploader's
> signature? When it gets rebuilt, this should not affect the original
> .src.rpm (as if it was uploaded again). No special handling is
> required in this case.

Yes. It all was about the package build from the gear repo to not
multiply generated sourcerpms.

> Then suppose I build a gearifeid package from Sisyphus for p9. But
> your code only handles GB_REPO_DIR, not the NEIGHBOUR_REPO_DIR the
> package comes from. To be clear, that information is lost: when you
> request to build a signed tag from /gears, it does not imply that
> there is a corresponding .src.rpm in any REPO_DIR.

It's future part. I wrote some code that check the uprepos, but I didn't
like it. The correct way is checking uprepos archives as well.

> There is already a problem with cross-repo copying: if done in
> earnest, both repos need to be locked. And of course this is
> deadlock-prone. You can do better without any locking if you identify
> every package in all repos with your new identity hash. This can be
> done relatively easy, since you already have that big
> content-addressable storage. You can hardlink it into a shadow
> identity-addressable storage. Once you've done that, you obtain the
> global / beatific vision: given a package, you instantly know if you
> have already seen something like this. (On the second thought: you
> don't need locking because the -f test is atomic and files cannot be
> removed from the storage, but there will still be race conditions.
> It's not too bad in practice. Further those race conditions can be
> detected at the task-commit stage.)

I like the idea, but there are some issues with this solution: these
*are* collisions. I explain this below, but this idea will work
perfectly with sourcerpms.

The problem is that if we want to hande binary rpms as well, there will
be kind of collisions by design. For example, package foo has two
subpackages: foo-data and libfoo. After foo rebuild foo-data has the
same identity as previous foo-data build, but libfoo has the different
now. According the plan, the whole rebuild has significant changes and
all binary packages should be substituted with new one. And now we have
two foo-data packages with the same identity value, but they are belong
to different builds.

> There is one specific problem with the outlined approach: the notion
> of identity is flawed, because the disttag may or may not matter.
> Sometimes you cannot substitute a package for another package with the
> same identity but a different disttag. Specifically this is the case
> with strict dependencies between subpackages. You cannot substitute a
> subpackage unless you also substitute all the other subpackages.

Yes, that is correct, I considered this.


> This is further complicated by noarch subpackages: you need to
> coordinate substitution across architectures.

This is more complicated with mix-arch builds.

-- 
   WBR,
   Vladimir D. Seleznev


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo
  2020-04-14 16:42     ` Vladimir D. Seleznev
@ 2020-04-16 21:51       ` Alexey Tourbin
  2020-04-17 13:54         ` Dmitry V. Levin
  2020-04-20  8:36         ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Alexey Tourbin
  0 siblings, 2 replies; 22+ messages in thread
From: Alexey Tourbin @ 2020-04-16 21:51 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Tue, Apr 14, 2020 at 7:42 PM Vladimir D. Seleznev
<vseleznv@altlinux.org> wrote:
> > Then suppose I build a gearifeid package from Sisyphus for p9. But
> > your code only handles GB_REPO_DIR, not the NEIGHBOUR_REPO_DIR the
> > package comes from. To be clear, that information is lost: when you
> > request to build a signed tag from /gears, it does not imply that
> > there is a corresponding .src.rpm in any REPO_DIR.
>
> It's future part. I wrote some code that check the uprepos, but I didn't
> like it. The correct way is checking uprepos archives as well.

I gave it some more thought.

First, the way you're trying to hash all the unknown tags is
interesting, but I wouldn't do that. You may want to hash everything
if you don't understand the internal structure and prefer to treat the
input as a black box.  On the contrary, we understand the internals of
a package well. What's then a minimal subset of tags to identify a
package? For a source package, it's FileDigests + BuildRequires +
BuildConflicts. That's all! They determine the outcome of a build, and
we may reasonably postulate that the rest of the tags should not
influence that outcome. You don't even have to hash NEVR, because the
specfile is already in FileDigests. (But you should probably hash
FileFlags, because they point out which file is the specfile. You
should also hash FileModes, because some sources may be executable.
But that's about all.)

Second, referring to the discussion about hash functions, the hash
function you're using isn't all that important.  That's because you're
hashing MD5 sums in FileDigests, and those are the weakest link and
(theoretically) the main cause of any collision. The speed isn't
important either, because you're hashing relatively short inputs.

So what's the right set of tags for a binary package, and what is its
identity? (I'm not sure identity is the right word, I would rather
call it ID. Identity is who you are and what you believe in, for
example a black person who votes for Obama.)  I've already hinted that
identity can be defined via substitution: if you replace a package
with a different package but the same identity, there should be no
functional difference, and furthermore no difference "for all intents
and purposes", except for a few observable differences which we deem
immaterial and permit explicitly, such as FileMtimes. So obviously you
need to hash at least FileDigests and
Requires/Provides/Obsoletes/Conflicts. This should satisfy the
definition of ID for rpm (the dependencies are satisfied in the same
way, and file conflicts are the resolved in the same way, so rpm can't
tell the difference if we make a substitution.)

It isn't clear whether you should hash informational sections such as
%description. It can be argued that under the same NEVR, the
description shouldn't change anyway. Is it possible that nothing
changes in a package but the description? Would we still want to
update/replace the package then?

Finally, your identity hash need not to be fixed once and forever. It
is used only for internal bookkeeping, so once in a while you are
allowed to change the hash and rebuild the identity-addressable
storage. You should have a script for that in girar/admin. It may take
an hour or so to complete, but that's not too bad.

> > There is already a problem with cross-repo copying: if done in
> > earnest, both repos need to be locked. And of course this is
> > deadlock-prone. You can do better without any locking if you identify
> > every package in all repos with your new identity hash. This can be
> > done relatively easy, since you already have that big
> > content-addressable storage. You can hardlink it into a shadow
> > identity-addressable storage. Once you've done that, you obtain the
> > global / beatific vision: given a package, you instantly know if you
> > have already seen something like this. (On the second thought: you
> > don't need locking because the -f test is atomic and files cannot be
> > removed from the storage, but there will still be race conditions.
> > It's not too bad in practice. Further those race conditions can be
> > detected at the task-commit stage.)
>
> I like the idea, but there are some issues with this solution: these
> *are* collisions. I explain this below, but this idea will work
> perfectly with sourcerpms.
>
> The problem is that if we want to hande binary rpms as well, there will
> be kind of collisions by design. For example, package foo has two
> subpackages: foo-data and libfoo. After foo rebuild foo-data has the
> same identity as previous foo-data build, but libfoo has the different
> now. According the plan, the whole rebuild has significant changes and
> all binary packages should be substituted with new one. And now we have
> two foo-data packages with the same identity value, but they are belong
> to different builds.
>
> > There is one specific problem with the outlined approach: the notion
> > of identity is flawed, because the disttag may or may not matter.
> > Sometimes you cannot substitute a package for another package with the
> > same identity but a different disttag. Specifically this is the case
> > with strict dependencies between subpackages. You cannot substitute a
> > subpackage unless you also substitute all the other subpackages.
>
> Yes, that is correct, I considered this.

So for src.rpm packages, it's a solved problem. For binary packages,
the identity should specifically exclude disttag. It will no longer
satisfy the definition of ID for rpm (substitution will break for
subpackages with strict dependencies). Therefore for binary packages,
we need to track <ID,disttag> tuples. This is a one-to-many relation:
for each ID, there may be a few disttags.  So for binary packages we
need a separate identity-addressable storage which maps ID to
<disttag,filehash> (while for source packages, a hardlink maps ID to
filehash).  If implemented naively, this will create many small files,
one file per ID, most files with just one line. In a more practical
implementation, you should probably group all those small files by
package name.  So you'll have:

$ cat id2f/libfoo
<libfoo-ID1> <disttag1> <libfoo-filehash1>
<libfoo-ID2> <disttag2> <libfoo-filehash2>

$ cat id2f/foo-data
<foo-data-ID1> <disttag1> <foo-data-filehash1>
<foo-data-ID1> <disttag2> <foo-data-filehash2>

Note that for libfoo, the IDs are different, but with foo-data the IDs
are the same. This indicates that the contents of libfoo have changed
after a rebuild, while the contents of foo-data have not.

Suppose you have such a store, and foo.src.rpm is getting rebuilt
again (or copied to p9). You can then check up with the store and see
if the outcome can be replaced with either libfoo-filehash1 +
foo-data-filehash1 (with disttag1) or libfoo-filehash2 +
foo-data-filehash2 (with disttag2), but not in other combinations.
You'll need an elaborate algorithm which coordinates substitutions
across architectures, but this seems doable.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo
  2020-04-16 21:51       ` Alexey Tourbin
@ 2020-04-17 13:54         ` Dmitry V. Levin
  2020-04-20  9:05           ` [devel] stopping a cascade of rebuilds Alexey Tourbin
  2020-04-20  8:36         ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Alexey Tourbin
  1 sibling, 1 reply; 22+ messages in thread
From: Dmitry V. Levin @ 2020-04-17 13:54 UTC (permalink / raw)
  To: devel

On Fri, Apr 17, 2020 at 12:51:51AM +0300, Alexey Tourbin wrote:
[...]
> So what's the right set of tags for a binary package, and what is its
> identity? (I'm not sure identity is the right word, I would rather
> call it ID. Identity is who you are and what you believe in, for
> example a black person who votes for Obama.)  I've already hinted that
> identity can be defined via substitution: if you replace a package
> with a different package but the same identity, there should be no
> functional difference, and furthermore no difference "for all intents
> and purposes", except for a few observable differences which we deem
> immaterial and permit explicitly, such as FileMtimes. So obviously you
> need to hash at least FileDigests and
> Requires/Provides/Obsoletes/Conflicts. This should satisfy the
> definition of ID for rpm (the dependencies are satisfied in the same
> way, and file conflicts are the resolved in the same way, so rpm can't
> tell the difference if we make a substitution.)

What about various types of scripts?

$ rpmquery --querytags | grep 'PROG$'
PREINPROG
POSTINPROG
PREUNPROG
POSTUNPROG
VERIFYSCRIPTPROG
TRIGGERSCRIPTPROG

> It isn't clear whether you should hash informational sections such as
> %description. It can be argued that under the same NEVR, the
> description shouldn't change anyway. Is it possible that nothing
> changes in a package but the description? Would we still want to
> update/replace the package then?
> 
> Finally, your identity hash need not to be fixed once and forever. It
> is used only for internal bookkeeping, so once in a while you are
> allowed to change the hash and rebuild the identity-addressable
> storage. You should have a script for that in girar/admin. It may take
> an hour or so to complete, but that's not too bad.

-- 
ldv


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo
  2020-04-16 21:51       ` Alexey Tourbin
  2020-04-17 13:54         ` Dmitry V. Levin
@ 2020-04-20  8:36         ` Alexey Tourbin
  1 sibling, 0 replies; 22+ messages in thread
From: Alexey Tourbin @ 2020-04-20  8:36 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Fri, Apr 17, 2020 at 12:51 AM Alexey Tourbin
<alexey.tourbin@gmail.com> wrote:
> So for src.rpm packages, it's a solved problem. For binary packages,
> the identity should specifically exclude disttag. It will no longer
> satisfy the definition of ID for rpm (substitution will break for
> subpackages with strict dependencies). Therefore for binary packages,
> we need to track <ID,disttag> tuples. This is a one-to-many relation:
> for each ID, there may be a few disttags.  So for binary packages we
> need a separate identity-addressable storage which maps ID to
> <disttag,filehash> (while for source packages, a hardlink maps ID to
> filehash).  If implemented naively, this will create many small files,
> one file per ID, most files with just one line. In a more practical
> implementation, you should probably group all those small files by
> package name.  So you'll have:
>
> $ cat id2f/libfoo
> <libfoo-ID1> <disttag1> <libfoo-filehash1>
> <libfoo-ID2> <disttag2> <libfoo-filehash2>
>
> $ cat id2f/foo-data
> <foo-data-ID1> <disttag1> <foo-data-filehash1>
> <foo-data-ID1> <disttag2> <foo-data-filehash2>
>
> Note that for libfoo, the IDs are different, but with foo-data the IDs
> are the same. This indicates that the contents of libfoo have changed
> after a rebuild, while the contents of foo-data have not.

It may even make sense to group the mappings by src.rpm name instead
of package name. At first it seems less intuitive, but in return it
can give you a consistent view similar to MVCC snapshot.  Of course,
these files should be updated atomically, with rename(2). To check a
set subpackages, you first need to copy the file to a local dir. This
should rule out the case in which some subpackages have been added to
the file and some not.

These files are to be updated during the task-commit stage, under the
exclusive lock. This is also the right moment to detect race
conditions. Suppose you build the same package for sisyphus and p9 in
parallel, and the build result is the same. Before adding new
packages, you recheck if the whole set can be replaced with the
already existing packages.  One of the two tasks then should fail (or
automatically scheduled for another iteration).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [devel] stopping a cascade of rebuilds
  2020-04-17 13:54         ` Dmitry V. Levin
@ 2020-04-20  9:05           ` Alexey Tourbin
  2020-04-23 19:21             ` Vladimir D. Seleznev
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Tourbin @ 2020-04-20  9:05 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Fri, Apr 17, 2020 at 4:54 PM Dmitry V. Levin <ldv@altlinux.org> wrote:
> > So what's the right set of tags for a binary package, and what is its
> > identity? (I'm not sure identity is the right word, I would rather
> > call it ID. Identity is who you are and what you believe in, for
> > example a black person who votes for Obama.)  I've already hinted that
> > identity can be defined via substitution: if you replace a package
> > with a different package but the same identity, there should be no
> > functional difference, and furthermore no difference "for all intents
> > and purposes", except for a few observable differences which we deem
> > immaterial and permit explicitly, such as FileMtimes. So obviously you
> > need to hash at least FileDigests and
> > Requires/Provides/Obsoletes/Conflicts. This should satisfy the
> > definition of ID for rpm (the dependencies are satisfied in the same
> > way, and file conflicts are the resolved in the same way, so rpm can't
> > tell the difference if we make a substitution.)
>
> What about various types of scripts?
>
> $ rpmquery --querytags | grep 'PROG$'
> PREINPROG
> POSTINPROG
> PREUNPROG
> POSTUNPROG
> VERIFYSCRIPTPROG
> TRIGGERSCRIPTPROG

Sure, slipped my mind.

These IDs can have another application. Suppose a task is resumed, and
subtask #100 needs a rebuild. It is likely that, after the rebuild,
the result won't change (all subpackage end up with the same IDs).
However, we replace the packages anyway.  This will trigger the
rebuild of subtask #200, if it depends on subtrask #100. There are two
approaches to spare the unnecessary rebuild:

1) Don't overwrite local files with those from the remote build, if
they are identical.
2) When tracking the contents of BuildRoot, take into account IDs
instead of HeaderSHA1.

There is a peculiarity with the first approach: if subtask #100 has
both arch and noarch subpackages, the decision to overwrite files
cannot be made locally / per architecture. If you need to overwrite
subpackages for at least one architecture, then you have to overwrite
for all architectures, otherwise there will be unmet dependencies
between arch and noarch subpackages.

I think we already have this problem: the decision to rebuild a
subtask is made locally (or rather remotely), per architecture. So
occasionally we've seen unmet dependencies due to disttag mismatch.
Has the problem been addressed?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] stopping a cascade of rebuilds
  2020-04-20  9:05           ` [devel] stopping a cascade of rebuilds Alexey Tourbin
@ 2020-04-23 19:21             ` Vladimir D. Seleznev
  2020-04-23 20:54               ` Dmitry V. Levin
  2020-04-27  5:38               ` Alexey Tourbin
  0 siblings, 2 replies; 22+ messages in thread
From: Vladimir D. Seleznev @ 2020-04-23 19:21 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Mon, Apr 20, 2020 at 12:05:26PM +0300, Alexey Tourbin wrote:
> On Fri, Apr 17, 2020 at 4:54 PM Dmitry V. Levin <ldv@altlinux.org> wrote:
> > > So what's the right set of tags for a binary package, and what is its
> > > identity? (I'm not sure identity is the right word, I would rather
> > > call it ID. Identity is who you are and what you believe in, for
> > > example a black person who votes for Obama.)  I've already hinted that
> > > identity can be defined via substitution: if you replace a package
> > > with a different package but the same identity, there should be no
> > > functional difference, and furthermore no difference "for all intents
> > > and purposes", except for a few observable differences which we deem
> > > immaterial and permit explicitly, such as FileMtimes. So obviously you
> > > need to hash at least FileDigests and
> > > Requires/Provides/Obsoletes/Conflicts. This should satisfy the
> > > definition of ID for rpm (the dependencies are satisfied in the same
> > > way, and file conflicts are the resolved in the same way, so rpm can't
> > > tell the difference if we make a substitution.)
> >
> > What about various types of scripts?
> >
> > $ rpmquery --querytags | grep 'PROG$'
> > PREINPROG
> > POSTINPROG
> > PREUNPROG
> > POSTUNPROG
> > VERIFYSCRIPTPROG
> > TRIGGERSCRIPTPROG

Well, so, your position is that we control and understand our packages,
and we can explicitly define what tags are essential for the task.
Indeed, my idea was to filter all unrelated tags because in the future
rpm-build there can be new tags, but it seems like a needless
reinsurance. You are right that we always can add more tags to calculate
the identity.

> Sure, slipped my mind.

> So for src.rpm packages, it's a solved problem. For binary packages,
> the identity should specifically exclude disttag. It will no longer
> satisfy the definition of ID for rpm (substitution will break for
> subpackages with strict dependencies). Therefore for binary packages,
> we need to track <ID,disttag> tuples.

Why should we track them? If we rebuild a package, we should check
whether identity of its binary packages had changed. If it had not, we
shouldn't replace its binary packages by rebuilt packages. That simple.

The things get more complicated in case of "copying" packages. In that
case this schema could help. But we need also track buildtime. Just to
prevent package replacing with earlier build. So, it is triple now.

> This is a one-to-many relation: for each ID, there may be a few
> disttags.  So for binary packages we need a separate
> identity-addressable storage which maps ID to <disttag,filehash>
> (while for source packages, a hardlink maps ID to filehash).  If
> implemented naively, this will create many small files, one file per
> ID, most files with just one line. In a more practical implementation,
> you should probably group all those small files by package name.  So
> you'll have:
> 
> $ cat id2f/libfoo
> <libfoo-ID1> <disttag1> <libfoo-filehash1>
> <libfoo-ID2> <disttag2> <libfoo-filehash2>
> 
> $ cat id2f/foo-data
> <foo-data-ID1> <disttag1> <foo-data-filehash1>
> <foo-data-ID1> <disttag2> <foo-data-filehash2>
>
> Note that for libfoo, the IDs are different, but with foo-data the IDs
> are the same. This indicates that the contents of libfoo have changed
> after a rebuild, while the contents of foo-data have not.

That is correct.

> It may even make sense to group the mappings by src.rpm name instead
> of package name. At first it seems less intuitive, but in return it
> can give you a consistent view similar to MVCC snapshot.  Of course,
> these files should be updated atomically, with rename(2). To check a
> set subpackages, you first need to copy the file to a local dir. This
> should rule out the case in which some subpackages have been added to
> the file and some not.

I don't get this idea, please expand it.

> These files are to be updated during the task-commit stage, under the
> exclusive lock. This is also the right moment to detect race
> conditions. Suppose you build the same package for sisyphus and p9 in
> parallel, and the build result is the same. Before adding new
> packages, you recheck if the whole set can be replaced with the
> already existing packages.  One of the two tasks then should fail (or
> automatically scheduled for another iteration).

> These IDs can have another application. Suppose a task is resumed, and
> subtask #100 needs a rebuild. It is likely that, after the rebuild,
> the result won't change (all subpackage end up with the same IDs).
> However, we replace the packages anyway.  This will trigger the
> rebuild of subtask #200, if it depends on subtrask #100. There are two
> approaches to spare the unnecessary rebuild:
> 
> 1) Don't overwrite local files with those from the remote build, if
> they are identical.
> 2) When tracking the contents of BuildRoot, take into account IDs
> instead of HeaderSHA1.

I like the idea.

> There is a peculiarity with the first approach: if subtask #100 has
> both arch and noarch subpackages, the decision to overwrite files
> cannot be made locally / per architecture. If you need to overwrite
> subpackages for at least one architecture, then you have to overwrite
> for all architectures, otherwise there will be unmet dependencies
> between arch and noarch subpackages.

That is correct.

> I think we already have this problem: the decision to rebuild a
> subtask is made locally (or rather remotely), per architecture. So
> occasionally we've seen unmet dependencies due to disttag mismatch.
> Has the problem been addressed?

Hmm, I'll think about that.

P.S.

> I'm not sure identity is the right word, I would rather call it ID.

It's just a term, we can define it as we want. There was another option
to name it BuildID, but for some reason it was refused.

-- 
   WBR,
   Vladimir D. Seleznev


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] stopping a cascade of rebuilds
  2020-04-23 19:21             ` Vladimir D. Seleznev
@ 2020-04-23 20:54               ` Dmitry V. Levin
  2020-04-27  5:38               ` Alexey Tourbin
  1 sibling, 0 replies; 22+ messages in thread
From: Dmitry V. Levin @ 2020-04-23 20:54 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Thu, Apr 23, 2020 at 10:21:28PM +0300, Vladimir D. Seleznev wrote:
[...]
> It's just a term, we can define it as we want. There was another option
> to name it BuildID, but for some reason it was refused.

The reason why it was rejected is to avoid confusion with Build ID:

$ readelf -n /bin/true |grep -i 'build.*id'
Displaying notes found in: .note.gnu.build-id
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: a8463ba44bcb8caad1b9c466a28dc573ff7878f2


-- 
ldv


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [devel] stopping a cascade of rebuilds
  2020-04-23 19:21             ` Vladimir D. Seleznev
  2020-04-23 20:54               ` Dmitry V. Levin
@ 2020-04-27  5:38               ` Alexey Tourbin
  1 sibling, 0 replies; 22+ messages in thread
From: Alexey Tourbin @ 2020-04-27  5:38 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Thu, Apr 23, 2020 at 10:21 PM Vladimir D. Seleznev
<vseleznv@altlinux.org> wrote:
> > So for src.rpm packages, it's a solved problem. For binary packages,
> > the identity should specifically exclude disttag. It will no longer
> > satisfy the definition of ID for rpm (substitution will break for
> > subpackages with strict dependencies). Therefore for binary packages,
> > we need to track <ID,disttag> tuples.
>
> Why should we track them? If we rebuild a package, we should check
> whether identity of its binary packages had changed. If it had not, we
> shouldn't replace its binary packages by rebuilt packages. That simple.

Because in the identity-addressable storage, we can have a few
packages with the same ID but different disttags, as in the example
below.  The fact that foo-data-ID1 hasn't changed doesn't mean you can
immediately grab foo-data-filehash1.

> > $ cat id2f/libfoo
> > <libfoo-ID1> <disttag1> <libfoo-filehash1>
> > <libfoo-ID2> <disttag2> <libfoo-filehash2>
> >
> > $ cat id2f/foo-data
> > <foo-data-ID1> <disttag1> <foo-data-filehash1>
> > <foo-data-ID1> <disttag2> <foo-data-filehash2>

> The things get more complicated in case of "copying" packages. In that
> case this schema could help. But we need also track buildtime. Just to
> prevent package replacing with earlier build. So, it is triple now.

Hmm, haven't thought of buildtime. We've got a package in the repo, a
freshly rebuilt package with the same EVR, and an identical package
(or a few identical packages) in the identity-addressable storage. We
need to prevent the outcome in which the buildtime of the package in
the repo goes down.  That's what you mean by "prevent package
replacing with earlier build", right?  (Otherwise replacing with an
earlier build is not a problem.)  Such an outcome is unlikely to occur
in practice, but not impossible.

Three parties are now involved.  You cannot simply ask, "can I
overwrite build/100 with packages from identity-addressable storage?"
The third piece of information is what you already have in the repo.

> > It may even make sense to group the mappings by src.rpm name instead
> > of package name. At first it seems less intuitive, but in return it
> > can give you a consistent view similar to MVCC snapshot.  Of course,
> > these files should be updated atomically, with rename(2). To check a
> > set subpackages, you first need to copy the file to a local dir. This
> > should rule out the case in which some subpackages have been added to
> > the file and some not.
>
> I don't get this idea, please expand it.

These are implementation details (that are not all that important
unless we agree on the data model).  How do you access the
identity-addressable storage concurrently?  The worst case is that
id2f/libfoo is written and read simultaneously, so you read a
half-written file.  To avoid this, files should be replaced
atomically:

$ mv local/libfoo id2f/.tmp$$.libfoo && mv id2f/.tmp$$.libfoo id2f/libfoo

You are now guaranteed that you open either old id2f/libfoo or new
id2f/libfoo, but not something in between. But this is not enough,
because there is a higher-order inconsistency: you don't want to
process new id2f/libfoo with old id2f/foo-data (when id2f/foo-data is
just being replaced).  To get a "consistent view" across subpackages,
you can combine id2f/libfoo and id2f/foo-data into a single file. To
process subpackages, you first need to make a local copy of that file.
Because with command-line tools, you'd want to open the file a few
times, so you can't do it directly on id2f/libfoo.

This is an alternative to the "single writer, multiple readers" model.
In this model, the reader first has to obtain a shared lock on id2f.
In what I describe, the reader can proceed without any locking.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-04-27  5:38 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-10 23:10 [devel] RFC: girar: optimize rebuild Vladimir D. Seleznev
2020-04-10 23:10 ` [devel] [PATCH 1/2] gb/gb-sh-functions: introduce pkg_identity() Vladimir D. Seleznev
2020-04-13 18:01   ` Dmitry V. Levin
2020-04-13 19:32     ` Vladimir D. Seleznev
2020-04-10 23:10 ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Vladimir D. Seleznev
2020-04-11 11:29   ` Alexey Tourbin
2020-04-14 16:42     ` Vladimir D. Seleznev
2020-04-16 21:51       ` Alexey Tourbin
2020-04-17 13:54         ` Dmitry V. Levin
2020-04-20  9:05           ` [devel] stopping a cascade of rebuilds Alexey Tourbin
2020-04-23 19:21             ` Vladimir D. Seleznev
2020-04-23 20:54               ` Dmitry V. Levin
2020-04-27  5:38               ` Alexey Tourbin
2020-04-20  8:36         ` [devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo Alexey Tourbin
2020-04-11 10:36 ` [devel] RFC: girar: optimize rebuild Andrey Savchenko
2020-04-11 15:33   ` Vladimir D. Seleznev
2020-04-11 23:31   ` Alexey V. Vissarionov
2020-04-14 14:57     ` Andrey Savchenko
2020-04-14 16:20       ` Vladimir D. Seleznev
2020-04-11 11:04 ` Gleb Fotengauer-Malinovskiy
2020-04-11 15:21   ` Vladimir D. Seleznev
2020-04-11 16:41     ` Gleb Fotengauer-Malinovskiy

ALT Linux Team development discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror http://lore.altlinux.org/devel/0 devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 devel devel/ http://lore.altlinux.org/devel \
		devel@altlinux.org devel@altlinux.ru devel@lists.altlinux.org devel@lists.altlinux.ru devel@linux.iplabs.ru mandrake-russian@linuxteam.iplabs.ru sisyphus@linuxteam.iplabs.ru
	public-inbox-index devel

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://lore.altlinux.org/org.altlinux.lists.devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git