From: Alexey Tourbin
To: devel@altlinux.ru
Date: Wed, 12 Mar 2003 04:56:16 +0300
Subject: [devel] gcc -O2 vs gcc -Os performance
Message-ID: <20030312015616.GA2882@solemn.turbinal.org>

Greetings!

Subj: prompted by the thread of the same name on linux-kernel -- see
http://lists.insecure.org/lists/linux-kernel/2003/Feb/0984.html

Andi Kleen (ak_at_suse.de):

  -Os on 2.95 is not too useful. It only started becoming useful on 3.1+,
  even more so on the upcoming 3.3. E.g. there was one report of ACPI
  shrinking by >60k by recompiling it with -Os on 3.1. ACPI is only
  slow-path code, so that is completely reasonable.

Alan Cox (alan_at_lxorguk.ukuu.org.uk):

  gcc 3.2 is a lot smarter about -Os and it makes a very big size
  difference according to the numbers from the ACPI guys.

Martin J.
Bligh (mbligh_at_aracnet.com):

  2901299 vmlinux.O2
  2667827 vmlinux.Os

  Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                          Elapsed   User     System   CPU
  2.5.59-mjb3-gcc32-O2    45.86     564.75   110.91   1472.67
  2.5.59-mjb3-gcc32-Os    45.74     563.96   111.06   1475.17

Linus Torvalds (torvalds_at_transmeta.com):

  That's since a large part of the premise of the -Os speed advantage is
  that it is better for icache (usually not an issue for microbenchmarks)
  and that it is better for load/startup times (generally not a huge issue
  for kernels, since the real startup costs of kernels tend to be entirely
  elsewhere).

  So I suspect -Os tends to be more appropriate for user-mode code, and
  especially code with low repeat rates. Possibly the "low repeat rate"
  thing ends up being true of certain kernel subsystems too.

  Think of it this way: if you win 10% in size, you're likely to map and
  load 10% less code pages at run-time. Which is not a big issue for
  traditional data-centric loads, but can be a _huge_ deal for things like
  GUI programs etc where there is often more code than data.

I decided to run a small "userland" study of my own. For that, I rebuilt
the dillo and perl packages with "%define _optlevel s".
dillo -O2:
   300792 /usr/bin/dillo
   217358 dillo-0.7.1-alt1.i686.rpm

dillo -Os:
   247544 /usr/bin/dillo
   206148 dillo-0.7.1-alt1.i686.rpm

perl -O2:
  1213020 /usr/lib/libperl.so.5.8
  3345684 perl-5.8.0-alt1.1.i686.rpm
  1400875 perl-base-5.8.0-alt1.1.i686.rpm
   820888 perl-devel-5.8.0-alt1.1.i686.rpm
    44051 perl-suidperl-5.8.0-alt1.1.i686.rpm

$ perl -MBenchmark -e 'timethis(2**20, "\$y=sin(3.14)+cos(3.15);\$y=~s/\$y/./igs;")'
timethis 1048576: 37 wallclock secs (37.05 usr + 0.01 sys = 37.06 CPU) @ 28294.01/s (n=1048576)
$ perl -MBenchmark -e 'timethis(2**20, "\$y=sin(3.14)+cos(3.15);\$y=~s/\$y/./igs;")'
timethis 1048576: 36 wallclock secs (36.84 usr + 0.00 sys = 36.84 CPU) @ 28462.98/s (n=1048576)
$ perl -MBenchmark -e 'timethis(2**20, "\$y=sin(3.14)+cos(3.15);\$y=~s/\$y/./igs;")'
timethis 1048576: 37 wallclock secs (37.00 usr + 0.01 sys = 37.01 CPU) @ 28332.23/s (n=1048576)
$

perl -Os:
  1057258 /usr/lib/libperl.so.5.8
  3333843 perl-5.8.0-alt1.1.i686.rpm
  1345924 perl-base-5.8.0-alt1.1.i686.rpm
   818170 perl-devel-5.8.0-alt1.1.i686.rpm
    42578 perl-suidperl-5.8.0-alt1.1.i686.rpm

$ perl -MBenchmark -e 'timethis(2**20, "\$y=sin(3.14)+cos(3.15);\$y=~s/\$y/./igs;")'
timethis 1048576: 35 wallclock secs (34.19 usr + 0.00 sys = 34.19 CPU) @ 30669.08/s (n=1048576)
$ perl -MBenchmark -e 'timethis(2**20, "\$y=sin(3.14)+cos(3.15);\$y=~s/\$y/./igs;")'
timethis 1048576: 34 wallclock secs (34.33 usr + 0.00 sys = 34.33 CPU) @ 30544.01/s (n=1048576)
$ perl -MBenchmark -e 'timethis(2**20, "\$y=sin(3.14)+cos(3.15);\$y=~s/\$y/./igs;")'
timethis 1048576: 34 wallclock secs (34.33 usr + 0.01 sys = 34.34 CPU) @ 30535.12/s (n=1048576)
$

The machine is a Celeron 333. This looks very attractive: binary size
shrinks by 10-20%, while the performance penalty is another story: the
kernel barely slows down at all, and perl, in this particular case,
actually gets 8-9% faster! Most likely that is precisely due to the
Celeron's small cache.
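The 10-20% and 8-9% figures can be rechecked directly from the numbers
quoted above; a small shell sketch (awk just does the arithmetic, the
inputs are the sizes and throughputs from this message):

```shell
# Percentage by which the -Os size is smaller than the -O2 size.
shrink()  { awk -v o2="$1" -v os="$2" 'BEGIN { printf "%.1f\n", 100 * (o2 - os) / o2 }'; }
# Percentage by which the -Os throughput exceeds the -O2 throughput.
speedup() { awk -v o2="$1" -v os="$2" 'BEGIN { printf "%.1f\n", 100 * (os - o2) / o2 }'; }

shrink 300792  247544     # /usr/bin/dillo:    prints 17.7
shrink 1213020 1057258    # libperl.so.5.8:    prints 12.8
speedup 28294.01 30669.08 # perl Benchmark /s: prints 8.4
```

So both the "10-20% smaller" and the "8-9% faster" claims follow from
the raw measurements.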
Note also that this measures a somewhat abstract performance under ideal
conditions; under real-world conditions the actual gain may be even
larger. In addition, the RPM packages themselves get smaller (though not
by as much), which matters both for putting together single-CD
distributions and for cutting internet traffic. And for building minimal
systems, too! :)

What are your opinions?