Date: Thu, 13 Mar 2003 20:33:12 +0300
From: "Alexey V. Lubimov"
To: sisyphus@altlinux.ru
Message-Id: <20030313203312.5b36a942.avl@l14.ru>
Subject: [sisyphus] xfs on software raid5.

Looks like I've found the explanation for the constant messages in the log:

raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512
...
raid5: switching cache buffer size, 4096 --> 512
raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512
raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512

and the suspicion that this is inefficient has also been confirmed.

==========================================================
This seems to be an area of constant confusion. Let me describe why XFS sucks on Linux software RAID5. It has nothing to do with controllers, physical disk layout, or anything like that.

RAID5 works by saving N-1 chunks of data followed by a chunk of parity information (the location of the parity chunk is actually interleaved between devices with RAID5, but whatever). These N-1 data chunks + the parity blob make up a stripe. Every time you update any chunk of data you need to read in the rest of the data chunks in that stripe, recalculate the parity, and then write out the modified data chunk + parity. This sucks performance-wise because in the worst case a write ends up causing N-2 reads (at that point you already have your updated chunk in memory) followed by 2 writes. The Linux RAID5 personality isn't quite that stupid and actually uses a slightly different algorithm involving reading the old data + parity off disk, masking, and then writing the new data + parity back. In any case, Linux software RAID keeps a stripe cache around to cut down on the disk I/Os caused by parity updates, and this cache really improves performance.

Now. Unlike the other Linux filesystems, XFS does not stick to one I/O size. The filesystem data blocks are 4K (on a PC, anyway), but log entries are written in 512-byte chunks. Unfortunately these 512-byte I/Os cause the RAID5 code to flush its entire stripe cache and reconfigure it for 512-byte I/O sizes. Then, a few ms later, we come back and do a 4K data write, causing the damn thing to be flushed again. And so on. IOW, the Linux software RAID5 code was written for filesystems like ext2 that only do fixed-size I/Os.

So the real problem is that because XFS keeps switching the I/O size, the RAID5 code effectively runs without a stripe cache. And that's what's making the huge sucking sound. This will be fixed - eventually...

By moving the XFS journal to a different device (like a software RAID1, as we suggest in the FAQ), you can work around this problem.

And finally - all the hardware RAID controllers I have worked with stick to one I/O size internally and don't have this problem. They do read-modify-write on their own preferred I/O size anyway.
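For a freshly created filesystem that workaround can be set up at mkfs time. A minimal sketch, assuming /dev/md0 is the RAID5 array holding the data and /dev/md1 is a small RAID1 that will hold the journal (both device names, the 32768-block log size and the /xfs mount point are placeholders, not anything from the original post):

# external log at mkfs time: data on the RAID5, journal on the RAID1
mkfs.xfs -l logdev=/dev/md1,size=32768b /dev/md0
# the mount has to name the log device explicitly
mount -t xfs -o logdev=/dev/md1 /dev/md0 /xfs

This keeps the 512-byte log writes off the RAID5 array entirely, so its stripe cache stays configured for 4K I/O. For an existing filesystem the procedure is messier - see below.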
======================================================================
There is a solution, though - move the log to another device:
=====================================================================

Of course the thing I missed is that if you run growfs to grow a log it comes back and says:

xfs_growfs: log growth not supported yet

xfs_db also has some endian issues with the write command. I did, however, manage to grow a log:

1. Select the partition that is to become the log and dd a bunch of zeros over the complete range you want to be the log, so for 32768 blocks that would be:

   dd if=/dev/zero of=/dev/XXX bs=32768 count=4096

2. Run xfs_db -x on the original unmounted filesystem and use "sb 0" to get to the superblock.

3. Reset the log offset using:

   write logstart 0

4. Set the new log size using:

   write logblocks xxxx

   where xxxx is the size in 4K blocks. xfs_db will come back with a new value which is different from this; feed that new value back into the same command and it will then report the correct one. This is the endian conversion bug in xfs_db:

   xfs_db: write logblocks 32768
   logblocks = 8388608
   xfs_db: write logblocks 8388608
   logblocks = 32768

5. Mount the filesystem using the logdev option to point at the new log:

   mount -t xfs -o logbufs=4,osyncisdsync,logdev=/dev/sda6 /dev/sda5 /xfs

You now have a filesystem with a new external log. Going back is harder, since you need to zero the old log and reset the logstart and logblocks fields.

It does occur to me that by using a logstart other than zero you could put two external logs on the same partition; I am not sure what happens with the device open/close logic if you do that, though.

Steve
============================================================================

So there you have it...

--
Regards,
Alexey Lubimov
avl@cad.ru
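The interactive session above can be collapsed into a short command sequence. This is only a sketch of the same procedure, using the device names from Steve's example (/dev/sda5 as the filesystem, /dev/sda6 as the new log device) and xfs_db's -c option to pass the commands non-interactively; the second xfs_db invocation exists only to compensate for the endian bug described in step 4:

# zero the partition that will hold the log (32768 x 4K blocks, as in step 1)
dd if=/dev/zero of=/dev/sda6 bs=32768 count=4096
# rewrite the superblock of the unmounted filesystem: log offset 0, 32768 blocks
xfs_db -x -c 'sb 0' -c 'write logstart 0' -c 'write logblocks 32768' /dev/sda5
# feed the byte-swapped value back so the stored size ends up as 32768
xfs_db -x -c 'sb 0' -c 'write logblocks 8388608' /dev/sda5
# mount with the external log
mount -t xfs -o logbufs=4,osyncisdsync,logdev=/dev/sda6 /dev/sda5 /xfs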