ALT Linux Team development discussions
* [devel] rpmlndup
@ 2009-02-20 21:19 Igor Vlasenko
  2009-02-20 21:20 ` Dmitry V. Levin
  2009-02-20 21:22 ` Mikhail Gusarov
  0 siblings, 2 replies; 9+ messages in thread
From: Igor Vlasenko @ 2009-02-20 21:19 UTC (permalink / raw)
  To: devel

[-- Attachment #1: Type: text/plain, Size: 663 bytes --]

Since the topic of scripts has come up,
I'll share my rpmlndup script.

=head1	NAME

rpmlndup - a tool that reduces rpm repositories size by hardlinking identical rpms.

identical = name, real size and sha1/md5 sig are the same.

When I ran it on my own machine, the hard drive slimmed down by 70 GB,
even though I already use --link-dest.
You can't pass --link-dest for every possible dest.

If there is interest, I'll write the help and upload it to Sisyphus.


-- 

Dr. Igor Vlasenko
--------------------
Topology Department
Institute of Math
Kiev, Ukraine


[-- Attachment #2: rpmlndup --]
[-- Type: text/plain, Size: 5402 bytes --]

#!/usr/bin/perl -w

use strict;
use warnings;
use File::Find;
use RPM::Header;
use Getopt::Long;

my $verbose=1;
my $skipnosum=0;

my $result = GetOptions (
    'quiet'=> sub {$verbose=0},
    "skip-no-sum"  => \$skipnosum,
    "verbose+"  => \$verbose,
);

my @directories = @ARGV;

map {-d $_ or die "argument is not a directory: $_\n"} @directories;

# first step is just a usual find; to find dup names
my %rpmbyname;
find(\&wanted,  @directories);
sub wanted {
# $File::Find::dir  = /some/path/
# $_                = foo.ext
# $File::Find::name = /some/path/foo.ext
    my $name=$_;
    return unless /\.rpm$/ and not -l $_;
    $rpmbyname{$name}=[] unless defined $rpmbyname{$name};
    my @stat=stat $name;
#  0 dev      device number of filesystem
#  1 ino      inode number
#  2 mode     file mode  (type and permissions)
#  3 nlink    number of (hard) links to the file
#  4 uid      numeric user ID of file's owner
#  5 gid      numeric group ID of file's owner
#  6 rdev     the device identifier (special files only)
#  7 size     total size of file, in bytes
#  8 atime    last access time in seconds since the epoch
#  9 mtime    last modify time in seconds since the epoch
# 10 ctime    inode change time in seconds since the epoch (*)
# 11 blksize  preferred block size for file system I/O
# 12 blocks   actual number of blocks allocated
    my $size = $stat[7];
    push @{$rpmbyname{$name}}, {
	NAME=> $name,
#	DIR => $File::Find::dir,
	PATH=> $File::Find::name,
	INODE => $stat[1],
	SIZE => $stat[7],
    };
}

# second step is to find genuine dups; the same size and sha1/md5sum.
my %rpmbysum;
while (my ($rpm, $lptr)=each %rpmbyname) {
    next if $#{$lptr}<1;
    my %inodes;
    map {$inodes{$_->{INODE}}=1} @$lptr;
    next if scalar keys(%inodes) < 2;
    map {&bysum($_)} @$lptr;
}

undef %rpmbyname;
my $dupcount=0;
my $economy=0;

my @rpmtolink;
while (my ($rpm, $lptr)=each %rpmbysum) {
    next if $#{$lptr}<1;
    my %inodes;
    map {$inodes{$_->{INODE}}=1} @$lptr;
    next if scalar keys(%inodes) < 2;
    my $dupnum=keys(%inodes)-1;
    #print "$rpm\n";
    $economy+=$lptr->[0]->{SIZE}*$dupnum;
    $dupcount+=$dupnum;
    push @rpmtolink, $lptr;
}
undef %rpmbysum;

print STDERR "hardlinking duplicate rpms will give total economy:
$economy bytes in $dupcount rpms.\n";
print STDERR "Do you want to continue (y/n)?\n";
@ARGV=();
$_=<>;
exit 0 unless (/^\s*y/i);
print "continue with ".scalar @rpmtolink." dups\n";

foreach my $lref (@rpmtolink) {
    die "internal error! not enough files!" if @$lref < 2;
    my $master=$lref->[0];
    my $masterinode=$master->{INODE};
    my $masterpath=$master->{PATH};
    for (my $i=1; $i < @$lref; $i++) {
	my $slave=$lref->[$i];
	my $slavepath=$slave->{PATH};
	#warn "already linked $masterpath $slavepath\n" if $slave->{INODE} == $masterinode;
	if ($slave->{INODE} != $masterinode) {
	    die "impossible :(" if $slavepath eq $masterpath;
	    rename $slavepath, $slavepath.'.bak' or die "rename $slavepath, $slavepath.bak failed: $!";
	    unless (link $masterpath, $slavepath) {
		warn "link $masterpath, $slavepath failed: $!";
		rename $slavepath.'.bak', $slavepath;
		die "execution aborted.";
	    }
	    system('touch','-acm','-r',$slavepath.'.bak','--',$slavepath);
	    unlink $slavepath.'.bak' or die "cleanup of $slavepath failed: $!";
	    print "linked successfully: $masterpath -> $slavepath\n" if $verbose;
	}
    }
}

sub bysum {
    my $rpm=$_[0];
    my $size = $rpm->{SIZE};
    my $header;
    eval {
	$header=new RPM::Header $rpm->{PATH};
    };
    if ($@) {
	warn "$rpm->{PATH} skipped: $@\n" if $verbose;
	return;
    }
    my $sum = $header->{SHA1HEADER}->[0];
    unless ($sum) {
	warn "no sha1sum for $rpm->{NAME} - trying MD5\n" if $verbose;
	$sum = $header->{SIGMD5}->[0];
	unless ($sum) {
	    warn "no md5sum for $rpm->{NAME}\n" if $verbose;
	    return if $skipnosum;
	    # at least let the declared size be the same
	    $sum=$header->{SIGSIZE}->[0];
	    $sum||=$size;
	}
    }
    $rpm->{SUM}=$sum;
    my $key=$rpm->{NAME}.'!'.$sum.'|'.$size;
    $rpmbysum{$key}=[] unless defined $rpmbysum{$key};
    push @{$rpmbysum{$key}}, $rpm;
}

=head1	NAME

rpmlndup - a tool that reduces rpm repositories size by hardlinking identical rpms.

=head1	SYNOPSIS

B<rpmlndup>
[B<-h|--help>]
[B<-v|--verbose>]
[B<-q|--quiet>]
[B<-y|--yes|--batch>]
[B<-a|--ask|--interactive>]
[B<-n|--no|--count>]
[B<-s|--skip-no-sum>]
[I<DIR>...] 

=head1	DESCRIPTION

B<rpmlndup> 

=head1	OPTIONS

=over

=item	B<-h, --help>

Display this help and exit.

=item	B<-v, --verbose>, B<-q, --quiet>

Verbosity level. Multiple -v increase the verbosity level, -q sets it to 0.

=item	B<-y|--yes>, B<--batch>

Batch mode. Links identical rpms after counting.

=item	B<-n|--no>, B<--count>

Do not link identical rpms; only count the space that would be freed.

=item	B<-a|--ask>, B<--interactive>

Interactive mode (default). Counts the space to be freed and asks whether to proceed with linking.

=item	B<-s|--skip-no-sum>

Skip unsigned rpms (that have no sha1 or md5 sum).

=back

=head1	AUTHOR

Written by Igor Vlasenko <viy@altlinux.org>.

=head1	COPYING

Copyright (c) 2009 Igor Vlasenko, ALT Linux Team.

This is free software; you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later version.

=cut
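
A minimal sketch of the rename / link / rollback step that the script above performs for each duplicate, pulled out into a standalone sub for illustration. The sub name `relink` is hypothetical and not part of rpmlndup; the `.bak` suffix follows the script. It uses `or die` throughout, because `||` binds tighter than a list operator's arguments and would attach to the last argument, so the `die` would never fire on failure.

```perl
use strict;
use warnings;

# Replace $dup with a hard link to $master, keeping a backup copy of $dup
# until the link has succeeded. Returns 1 on success, dies on failure.
# (relink is a hypothetical name used only for this sketch.)
sub relink {
    my ($master, $dup) = @_;
    my $bak = "$dup.bak";    # same backup suffix as in the script

    # "or" has lower precedence than the argument list of rename,
    # so die runs when rename itself returns false.
    rename($dup, $bak) or die "rename $dup -> $bak failed: $!\n";

    unless (link($master, $dup)) {
        my $err = $!;
        rename($bak, $dup);  # roll back so the duplicate is not lost
        die "link $master -> $dup failed: $err\n";
    }

    unlink($bak) or die "cleanup of $bak failed: $!\n";
    return 1;
}
```

Calling `relink('/repo/a/foo.rpm', '/repo/b/foo.rpm')` (hypothetical paths) leaves both names pointing at one inode. Both paths have to be on the same filesystem, since hard links cannot cross devices.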



* Re: [devel] rpmlndup
  2009-02-20 21:19 [devel] rpmlndup Igor Vlasenko
@ 2009-02-20 21:20 ` Dmitry V. Levin
  2009-02-20 21:28   ` Igor Vlasenko
  2009-02-20 21:22 ` Mikhail Gusarov
  1 sibling, 1 reply; 9+ messages in thread
From: Dmitry V. Levin @ 2009-02-20 21:20 UTC (permalink / raw)
  To: ALT Devel discussion list

[-- Attachment #1: Type: text/plain, Size: 303 bytes --]

On Fri, Feb 20, 2009 at 11:19:13PM +0200, Igor Vlasenko wrote:
> Since the topic of scripts has come up,
> I'll share my rpmlndup script.
> 
> =head1	NAME
> 
> rpmlndup - a tool that reduces rpm repositories size by hardlinking identical rpms.

Wouldn't hardlink(1) from the package of the same name do the job?


-- 
ldv


* Re: [devel] rpmlndup
  2009-02-20 21:19 [devel] rpmlndup Igor Vlasenko
  2009-02-20 21:20 ` Dmitry V. Levin
@ 2009-02-20 21:22 ` Mikhail Gusarov
  2009-02-20 21:23   ` [devel] [jt] rpmlndup Dmitry V. Levin
  1 sibling, 1 reply; 9+ messages in thread
From: Mikhail Gusarov @ 2009-02-20 21:22 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 425 bytes --]


Twas brillig at 23:19:13 20.02.2009 UTC+02 when vlasenko@imath.kiev.ua did gyre and gimble:

 IV> Since the topic of scripts has come up, I'll share my rpmlndup script.

Recently someone tried to upload a fourth package with exactly the same
functionality to Debian :) The uploader nearly got hounded.

-- 


* Re: [devel] [jt] rpmlndup
  2009-02-20 21:22 ` Mikhail Gusarov
@ 2009-02-20 21:23   ` Dmitry V. Levin
  2009-02-20 21:25     ` Mikhail Gusarov
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry V. Levin @ 2009-02-20 21:23 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 418 bytes --]

On Sat, Feb 21, 2009 at 03:22:06AM +0600, Mikhail Gusarov wrote:
> 
> Twas brillig at 23:19:13 20.02.2009 UTC+02 when vlasenko@imath.kiev.ua did gyre and gimble:
> 
>  IV> Since the topic of scripts has come up, I'll share my rpmlndup script.
> 
> Recently someone tried to upload a fourth package with exactly the same
> functionality to Debian :) The uploader nearly got hounded.

They're mean for no reason, that's all. :(


-- 
ldv


* Re: [devel] [jt] rpmlndup
  2009-02-20 21:23   ` [devel] [jt] rpmlndup Dmitry V. Levin
@ 2009-02-20 21:25     ` Mikhail Gusarov
  0 siblings, 0 replies; 9+ messages in thread
From: Mikhail Gusarov @ 2009-02-20 21:25 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 637 bytes --]


Twas brillig at 00:23:55 21.02.2009 UTC+03 when ldv@altlinux.org did gyre and gimble:

 >> Recently someone tried to upload a fourth package with exactly the same
 >> functionality to Debian :) The uploader nearly got hounded.

 DVL> They're mean for no reason, that's all. :(

Not quite. "Hounded" there means roughly the same level of heated emotion
and anger as a typical pleasant conversation on this mailing list :)

-- 


* Re: [devel] rpmlndup
  2009-02-20 21:20 ` Dmitry V. Levin
@ 2009-02-20 21:28   ` Igor Vlasenko
  2009-02-20 21:32     ` Mikhail Gusarov
  0 siblings, 1 reply; 9+ messages in thread
From: Igor Vlasenko @ 2009-02-20 21:28 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Sat, Feb 21, 2009 at 12:20:49AM +0300, Dmitry V. Levin wrote:
> On Fri, Feb 20, 2009 at 11:19:13PM +0200, Igor Vlasenko wrote:
> > Since the topic of scripts has come up,
> > I'll share my rpmlndup script.
> > 
> > =head1	NAME
> > 
> > rpmlndup - a tool that reduces rpm repositories size by hardlinking identical rpms.
> 
> Wouldn't hardlink(1) from the package of the same name do the job?

Live and learn :)

On the other hand, hardlink compares the contents of the files,
while rpmlndup only checks the md5 sig embedded in the RPM header.
In theory it should run considerably faster,
but I didn't know about hardlink, so I haven't compared them.

-- 

Dr. Igor Vlasenko
--------------------
Topology Department
Institute of Math
Kiev, Ukraine




* Re: [devel] rpmlndup
  2009-02-20 21:28   ` Igor Vlasenko
@ 2009-02-20 21:32     ` Mikhail Gusarov
  2009-02-20 21:44       ` Led
  0 siblings, 1 reply; 9+ messages in thread
From: Mikhail Gusarov @ 2009-02-20 21:32 UTC (permalink / raw)
  To: ALT Linux Team development discussions

[-- Attachment #1: Type: text/plain, Size: 656 bytes --]


Twas brillig at 23:28:53 20.02.2009 UTC+02 when vlasenko@imath.kiev.ua did gyre and gimble:

 IV> On the other hand, hardlink compares the contents of the files,
 IV> while rpmlndup only checks the md5 sig embedded in the RPM header.
 IV> In theory it should run considerably faster, but I didn't know about
 IV> hardlink, so I haven't compared them.

hardlink(1) compares contents only when the sizes match, so the
difference will be negligible.
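
The size-first strategy mentioned here can be sketched as follows. This only illustrates the idea described in this reply and is not hardlink(1)'s actual code: files are grouped by size, and contents are read only within a group of equal-sized files. The sub name `content_dups` is an assumption made for this sketch; File::Compare is a core Perl module. For brevity the sketch does not check whether files are already hard-linked to each other.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Group paths by size, then compare contents only inside same-size groups.
# Returns a list of array refs, each holding paths with identical contents.
# (content_dups is a hypothetical name used only for this sketch.)
sub content_dups {
    my @paths = @_;

    my %by_size;
    for my $path (@paths) {
        my $size = (stat $path)[7];
        next unless defined $size;       # skip missing or unreadable files
        push @{ $by_size{$size} }, $path;
    }

    my @groups;
    for my $same_size (values %by_size) {
        next if @$same_size < 2;         # unique size: no file is read at all
        my @classes;                     # each class holds files equal to its first member
        FILE: for my $path (@$same_size) {
            for my $class (@classes) {
                if (compare($class->[0], $path) == 0) {
                    push @$class, $path;
                    next FILE;
                }
            }
            push @classes, [$path];
        }
        push @groups, grep { @$_ > 1 } @classes;
    }
    return @groups;
}
```

With this scheme a file whose size is unique in the tree costs one stat call and nothing more, which is the reason the difference is expected to be small.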

-- 


* Re: [devel] rpmlndup
  2009-02-20 21:32     ` Mikhail Gusarov
@ 2009-02-20 21:44       ` Led
  2009-02-20 21:49         ` Igor Vlasenko
  0 siblings, 1 reply; 9+ messages in thread
From: Led @ 2009-02-20 21:44 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Friday, 20 February 2009 23:32:28 Mikhail Gusarov wrote:
> Twas brillig at 23:28:53 20.02.2009 UTC+02 when vlasenko@imath.kiev.ua did
> gyre and gimble:
>
>  IV> On the other hand, hardlink compares the contents of the files,
>  IV> while rpmlndup only checks the md5 sig embedded in the RPM header.
>  IV> In theory it should run considerably faster, but I didn't know about
>  IV> hardlink, so I haven't compared them.
>
> hardlink(1) compares contents only when the sizes match, so the
> difference will be negligible.

...and it's unclear in whose favor (if rpmlndup doesn't compare sizes before
digging out the md5) :)

-- 
Led


* Re: [devel] rpmlndup
  2009-02-20 21:44       ` Led
@ 2009-02-20 21:49         ` Igor Vlasenko
  0 siblings, 0 replies; 9+ messages in thread
From: Igor Vlasenko @ 2009-02-20 21:49 UTC (permalink / raw)
  To: ALT Linux Team development discussions

On Fri, Feb 20, 2009 at 11:44:18PM +0200, Led wrote:
> On Friday, 20 February 2009 23:32:28 Mikhail Gusarov wrote:
> >  IV> On the other hand, hardlink compares the contents of the files,
> >  IV> while rpmlndup only checks the md5 sig embedded in the RPM header.
> >  IV> In theory it should run considerably faster, but I didn't know about
> >  IV> hardlink, so I haven't compared them.
> >
> > hardlink(1) compares contents only when the sizes match, so the
> > difference will be negligible.
> 
> ...and it's unclear in whose favor (if rpmlndup doesn't compare sizes before
> digging out the md5) :)

It digs the md5 out of duplicate candidates only.
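
A rough sketch of that candidate filter, mirroring the first pass of the posted script: records are grouped by file name, and a group counts as a candidate only if it spans at least two distinct inodes. Only those files would then have their RPM header opened. The sub name `dup_candidates` is hypothetical; the record fields NAME, PATH and INODE follow the script.

```perl
use strict;
use warnings;

# Each record is a hash ref: { NAME => basename, PATH => full path, INODE => inode }.
# Returns only the records whose basename occurs under two or more distinct
# inodes; everything else can be skipped without reading any header.
# (dup_candidates is a hypothetical name used only for this sketch.)
sub dup_candidates {
    my @records = @_;

    my %by_name;
    push @{ $by_name{ $_->{NAME} } }, $_ for @records;

    my @candidates;
    for my $group (values %by_name) {
        my %inodes;
        $inodes{ $_->{INODE} } = 1 for @$group;
        next if keys(%inodes) < 2;   # a single file, or copies already hard-linked
        push @candidates, @$group;
    }
    return @candidates;
}
```

Like the script, this compares inode numbers only, so it assumes all scanned directories are on one filesystem.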

-- 

Dr. Igor Vlasenko
--------------------
Topology Department
Institute of Math
Kiev, Ukraine




