ALT Linux Sisyphus discussions
* [sisyphus] xfs on software raid5.
@ 2003-03-13 17:33 Alexey V. Lubimov
  2003-03-14 10:57 ` Владимир
  0 siblings, 1 reply; 2+ messages in thread
From: Alexey V. Lubimov @ 2003-03-13 17:33 UTC (permalink / raw)
  To: sisyphus

Looks like I've found an explanation for the constant messages in the log:


raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512
...
raid5: switching cache buffer size, 4096 --> 512
raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512
raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512

and the assumption that this is inefficient has also been confirmed.

==========================================================

This seems to be an area of constant confusion.  Let me describe why
XFS sucks on Linux software RAID5.  It has nothing to do with
controllers, physical disk layout, or anything like that.

RAID5 works by saving N-1 chunks of data followed by a chunk of
parity information (the location of the parity chunk is actually
interleaved between devices with RAID5, but whatever).  These N-1 data
chunks + the parity chunk make up a stripe.

Every time you update any chunk of data you need to read in the rest
of the data chunks in that stripe, calculate the parity, and then
write out the modified data chunk + parity.

This sucks performance-wise because in the worst case a write could end
up causing N-2 reads (at this point you already have your updated chunk
in memory) followed by 2 writes.  The Linux RAID5 personality isn't
quite that stupid and actually uses a slightly different algorithm:
reading the old data + parity off disk, masking, and then writing the
new data + parity back.
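
To make the arithmetic concrete, here is a tiny sketch of the parity
identity that read-modify-write relies on (plain shell arithmetic on
made-up example values, not the md driver itself):

	# a 4-disk stripe: three data chunks and one parity chunk
	D1=0x11; D2=0x22; D3=0x33
	P=$(( D1 ^ D2 ^ D3 ))          # parity = XOR of all data chunks

	# rewriting D2 only needs the old D2 and the old parity:
	NEW=0x44
	P_NEW=$(( P ^ D2 ^ NEW ))      # old parity ^ old data ^ new data
	test "$P_NEW" -eq $(( D1 ^ NEW ^ D3 )) && echo "parity identity holds"

So a single-chunk update costs two reads (old data, old parity) plus two
writes (new data, new parity), no matter how many disks are in the array -
and if the old data and parity are already in the stripe cache, the reads
go away entirely.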

In any case Linux software RAID keeps a stripe cache around to cut
down on the disk I/Os caused by parity updates.  And this cache really
improves performance.

Now.  Unlike the other Linux filesystems, XFS does not stick to one
I/O size.  The filesystem data blocks are 4K (on PC anyway), but log
entries will be written in 512 byte chunks.

Unfortunately these 512 byte I/Os will cause the RAID5 code to flush
its entire stripe cache and reconfigure it for 512 byte I/O sizes.
Then, a few ms later, we come back and do a 4K data write, causing the
damn thing to be flushed again.  And so on.

IOW, Linux software RAID5 code was written for filesystems like ext2
that only do fixed size I/Os.

So the real problem is that because XFS keeps switching the I/O size,
the RAID5 code effectively runs without a stripe cache.  And that's
what's making the huge sucking sound.  This will be fixed -
eventually...

By moving the XFS journal to a different device (like a software RAID1
as we suggest in the FAQ), you can work around this problem.
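
For reference, a minimal sketch of that workaround (the device names and
the 32m log size here are just examples, not taken from the FAQ): build
the filesystem with an external log on a separate device such as a small
RAID1, and mount it with the matching logdev option:

	# /dev/md0 = RAID5 data device, /dev/md1 = small RAID1 for the log
	mkfs.xfs -l logdev=/dev/md1,size=32m /dev/md0
	mount -t xfs -o logdev=/dev/md1 /dev/md0 /xfs

An already existing filesystem cannot simply be remounted this way, which
is what the xfs_db procedure quoted below is about.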

And finally - all the hardware RAID controllers I have worked with
stick to one I/O size internally and don't have this problem.  They do
read-modify-write at their own preferred I/O size anyway.

======================================================================

There is a solution, though - move the log to another device:

=====================================================================

Of course the thing I missed is that if you run growfs to grow a log
it comes back and says:

xfs_growfs: log growth not supported yet

xfs_db also has some endian issues with the write command. I did however
manage to grow a log:

1. select the partition that is to become the log and dd a bunch of zeros
   over the complete range you want to be the log; for 32768 (4K) blocks
   that would be:

   dd if=/dev/zero of=/dev/XXX bs=32768 count=4096

2. run xfs_db -x on the original, unmounted filesystem and use sb 0 to get
   to the superblock.

3. reset the log offset using 

	write logstart 0

4. set the new log size using

	write logblocks xxxx

   where xxxx is the size in 4K blocks.  xfs_db will come back with a new
   value that differs from what you entered; feed this new value back into
   the same command and it will then report the correct value.  This is the
   endian conversion bug in xfs_db:

	xfs_db: write logblocks 32768
	logblocks = 8388608
	xfs_db: write logblocks 8388608
	logblocks = 32768

5. mount the filesystem using the logdev option to point at the new log:

         mount -t xfs -o logbufs=4,osyncisdsync,logdev=/dev/sda6 /dev/sda5 /xfs

You now have a filesystem with a new external log. Going back is harder since
you need to zero the old log and reset the logstart and logblocks fields.
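
Putting the five steps together, a condensed transcript might look like
this (device names follow the example in step 5; the double write of
logblocks is the endian-bug workaround described in step 4):

	# zero the future external log (32768 x 4K blocks = 128 MB)
	dd if=/dev/zero of=/dev/sda6 bs=32768 count=4096

	# point the superblock at the new log (filesystem must be unmounted)
	xfs_db -x /dev/sda5
	xfs_db: sb 0
	xfs_db: write logstart 0
	xfs_db: write logblocks 32768
	logblocks = 8388608
	xfs_db: write logblocks 8388608
	logblocks = 32768
	xfs_db: quit

	# mount with the new external log
	mount -t xfs -o logdev=/dev/sda6 /dev/sda5 /xfs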

It does occur to me that by using a logstart other than zero you could
put two external logs on the same partition; I'm not sure what happens
with the device open/close logic if you do this, though.

Steve

============================================================================



So there you have it...




-- 
With best regards, Alexey Lubimov avl@cad.ru



* Re: [sisyphus] xfs on software raid5.
  2003-03-13 17:33 [sisyphus] xfs on software raid5 Alexey V. Lubimov
@ 2003-03-14 10:57 ` Владимир
  0 siblings, 0 replies; 2+ messages in thread
From: Владимир @ 2003-03-14 10:57 UTC (permalink / raw)
  To: sisyphus

Alexey V. Lubimov writes:

>Looks like I've found an explanation for the constant messages in the log:
>
>
>raid5: switching cache buffer size, 512 --> 4096
>raid5: switching cache buffer size, 4096 --> 512
>...
>raid5: switching cache buffer size, 4096 --> 512
>raid5: switching cache buffer size, 512 --> 4096
>raid5: switching cache buffer size, 4096 --> 512
>raid5: switching cache buffer size, 512 --> 4096
>raid5: switching cache buffer size, 4096 --> 512
>
>and the assumption that this is inefficient has also been confirmed.
>  
>

I created xfs on soft-raid5 and started seeing the same thing.
I wanted to outsmart it: version 1.2 lets you
specify the block size when creating the filesystem.  I did

mkfs.xfs -f -b size=512 /dev/md2

and started a bulk copy onto it.
Some time later I see that the cache size
still keeps jumping (not as often, but still).
Apparently moving the journal to a separate partition is the only
way to improve XFS performance on soft-raid5.

Oops, Linux hung itself (kernel panic).  So much for the "trick".
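
A quick way to check whether the cache is still being resized (a sketch;
it just greps the kernel ring buffer for the message quoted above):

	# count how many times the RAID5 cache has been switched since boot
	dmesg | grep -c 'raid5: switching cache buffer size'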
 



-- 
Best regards
Vladimir



