Date: Thu, 13 Mar 2003 20:33:12 +0300
From: "Alexey V. Lubimov"
To: sisyphus@altlinux.ru
Message-Id: <20030313203312.5b36a942.avl@l14.ru>
Subject: [sisyphus] xfs on software raid5.

Looks like I've found the explanation for the constant messages in the log:

raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512
...
raid5: switching cache buffer size, 4096 --> 512
raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512
raid5: switching cache buffer size, 512 --> 4096
raid5: switching cache buffer size, 4096 --> 512

and the suspicion that this is inefficient has also been confirmed.

==========================================================
This seems to be an area of constant confusion. Let me describe why XFS sucks on Linux software RAID5. It has nothing to do with controllers, physical disk layout, or anything like that.

RAID5 works by saving N-1 chunks of data followed by a chunk of parity information (the location of the parity chunk is actually interleaved between devices with RAID5, but whatever). These N-1 data chunks + the parity blob make up a stripe. Every time you update any chunk of data you need to read in the rest of the data chunks in that stripe, recalculate the parity, and then write out the modified data chunk + parity. This sucks performance-wise because in the worst case a write ends up causing N-2 reads (at that point you already have your updated chunk in memory) followed by 2 writes. The Linux RAID5 personality isn't quite that stupid and actually uses a slightly different algorithm involving reading the old data + parity off disk, masking, and then writing the new data + parity back. In any case, Linux software RAID keeps a stripe cache around to cut down on the disk I/Os caused by parity updates, and this cache really improves performance.

Now. Unlike the other Linux filesystems, XFS does not stick to one I/O size. The filesystem data blocks are 4K (on a PC, anyway), but log entries are written in 512-byte chunks. Unfortunately these 512-byte I/Os cause the RAID5 code to flush its entire stripe cache and reconfigure it for 512-byte I/O sizes. Then, a few ms later, we come back and do a 4K data write, causing the damn thing to be flushed again. And so on. IOW, the Linux software RAID5 code was written for filesystems like ext2 that only do fixed-size I/Os.

So the real problem is that because XFS keeps switching the I/O size, the RAID5 code effectively runs without a stripe cache. And that's what's making the huge sucking sound. This will be fixed - eventually...

By moving the XFS journal to a different device (like a software RAID1, as we suggest in the FAQ), you can work around this problem.

And finally - all the hardware RAID controllers I have worked with stick to one I/O size internally and don't have this problem. They do read-modify-write on their own preferred I/O size anyway.
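For a freshly created filesystem that workaround can be set up at mkfs time. A minimal sketch, assuming /dev/md0 is the RAID5 array holding the data and /dev/md1 is a small RAID1 that will hold the journal (both device names, the 32768-block log size and the /xfs mount point are placeholders, not anything from the original post):

# external log at mkfs time: data on the RAID5, journal on the RAID1
mkfs.xfs -l logdev=/dev/md1,size=32768b /dev/md0
# the mount has to name the log device explicitly
mount -t xfs -o logdev=/dev/md1 /dev/md0 /xfs

This keeps the 512-byte log writes off the RAID5 array entirely, so its stripe cache stays configured for 4K I/O. For an existing filesystem the procedure is messier - see below.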
======================================================================
There is a solution, though - move the log to another device:
=====================================================================

Of course the thing I missed is that if you run growfs to grow a log it comes back and says:

xfs_growfs: log growth not supported yet

xfs_db also has some endian issues with the write command. I did, however, manage to grow a log:

1. Select the partition that is to become the log and dd a bunch of zeros over the complete range you want to be the log, so for 32768 blocks that would be:

   dd if=/dev/zero of=/dev/XXX bs=32768 count=4096

2. Run xfs_db -x on the original unmounted filesystem and use "sb 0" to get to the superblock.

3. Reset the log offset using:

   write logstart 0

4. Set the new log size using:

   write logblocks xxxx

   where xxxx is the size in 4K blocks. xfs_db will come back with a new value which is different from this; feed that new value back into the same command and it will then report the correct one. This is the endian conversion bug in xfs_db:

   xfs_db: write logblocks 32768
   logblocks = 8388608
   xfs_db: write logblocks 8388608
   logblocks = 32768

5. Mount the filesystem using the logdev option to point at the new log:

   mount -t xfs -o logbufs=4,osyncisdsync,logdev=/dev/sda6 /dev/sda5 /xfs

You now have a filesystem with a new external log. Going back is harder, since you need to zero the old log and reset the logstart and logblocks fields.

It does occur to me that by using a logstart other than zero you could put two external logs on the same partition; I am not sure what happens with the device open/close logic if you do that, though.

Steve
============================================================================

So there you have it...

--
Regards,
Alexey Lubimov
avl@cad.ru
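The interactive session above can be collapsed into a short command sequence. This is only a sketch of the same procedure, using the device names from Steve's example (/dev/sda5 as the filesystem, /dev/sda6 as the new log device) and xfs_db's -c option to pass the commands non-interactively; the second xfs_db invocation exists only to compensate for the endian bug described in step 4:

# zero the partition that will hold the log (32768 x 4K blocks, as in step 1)
dd if=/dev/zero of=/dev/sda6 bs=32768 count=4096
# rewrite the superblock of the unmounted filesystem: log offset 0, 32768 blocks
xfs_db -x -c 'sb 0' -c 'write logstart 0' -c 'write logblocks 32768' /dev/sda5
# feed the byte-swapped value back so the stored size ends up as 32768
xfs_db -x -c 'sb 0' -c 'write logblocks 8388608' /dev/sda5
# mount with the external log
mount -t xfs -o logbufs=4,osyncisdsync,logdev=/dev/sda6 /dev/sda5 /xfs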