[afnog] Read/Write Comparison

Brian Candler B.Candler at pobox.com
Thu Aug 24 09:46:30 SAST 2006


On Thu, Aug 24, 2006 at 08:45:25AM +0300, Tony Kinyua wrote:
> At the risk of starting a flame war I would like to request the list's opinion 
> on how UFS (read freeBSD default) compares to other journalled file systems 

Unfortunately, UFS is not a journalled filesystem. There is very recent work
in FreeBSD-CURRENT to add a gjournal layer as part of the GEOM subsystem.
From what I've read, this looks like it will be very good.

But as of now, there's no native journalling filesystem. With UFS you get
softupdates, which can help reduce the occurrence of filesystem corruption
after a sudden power loss, but doesn't eliminate the need for fsck. You also
get background fsck these days.

> like XFS, Reiserfs or ext3 (read Linux default) when it comes to multiple 
> heavy read/write operations on a fairly large raided partition >100GB at the 
> rate of thousands per second.

You also need to consider what type of RAID setup you have, regardless of
which filesystem runs on top of it.

With RAID 1 (mirroring), your write performance will be slightly slower than
a single disk, but you will be able to sustain roughly twice as many read
operations per second, because there are two copies of the same data to read
from.

With traditional RAID 5, reads will be around the same speed, but writes
will be *much* slower than a single disk. This is because a write of a
single block requires four disk transactions across two disks:
- read the old data block
- read the old parity block
- write the new data block
- write the new parity block (calculated as old parity XOR old data XOR
  new data; see the sketch below)
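
In case it makes the cost concrete, here's a rough Python sketch of that
read-modify-write cycle. The dict-of-blocks "disk" and the function name are
invented purely for illustration; no real controller is coded like this:

def raid5_small_write(disk, data_disk, parity_disk, block, new_data):
    # 'disk' is a toy model: a dict of (disk number, block number) -> bytes
    old_data = disk[(data_disk, block)]      # 1. read old data block
    old_parity = disk[(parity_disk, block)]  # 2. read old parity block
    # new parity = old parity XOR old data XOR new data
    new_parity = bytes(p ^ od ^ nd
                       for p, od, nd in zip(old_parity, old_data, new_data))
    disk[(data_disk, block)] = new_data      # 3. write new data block
    disk[(parity_disk, block)] = new_parity  # 4. write new parity block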

It doesn't matter if you have an external RAID5 controller: this is a "laws
of physics" thing. It's how RAID5 works.

Things are improved to some degree if you have a battery-backed write-back
cache and the same blocks are being written over and over.
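
To illustrate what I mean (purely a toy sketch, not any particular
controller's firmware): a write-back cache keeps only the latest copy of
each dirty block in memory, so repeated writes to the same block collapse
into a single read-modify-write when the cache is eventually flushed.

class WriteBackCache:
    def __init__(self, flush_fn):
        self.dirty = {}            # block number -> latest data
        self.flush_fn = flush_fn   # does the expensive RAID5 update

    def write(self, block, data):
        self.dirty[block] = data   # overwrites in memory: no disk I/O yet

    def flush(self):
        # each dirty block costs one read-modify-write, however many times
        # it was rewritten while sitting in the (battery-backed) cache
        for block, data in self.dirty.items():
            self.flush_fn(block, data)
        self.dirty.clear()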

The only way you will get good performance on a RAID5-type setup is to
combine a custom filesystem with a battery-backed write cache, and NetApp's
WAFL filesystem is the only one I know of that can do that. By organising
the location of writes appropriately, combined with battery backup, it can
write a full stripe at a time, avoiding the need to read old data and parity
off the disk.
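
The idea, roughly (again just an illustrative sketch, not NetApp's code): if
enough new data has been gathered to fill one block per data disk in a
stripe, parity can be computed from the new data alone, with no reads at all.

def raid5_full_stripe_write(disk, data_disks, parity_disk, stripe, new_blocks):
    # one new block per data disk in the stripe
    assert len(new_blocks) == len(data_disks)
    parity = bytes(len(new_blocks[0]))       # start with all zeroes
    for d, blk in zip(data_disks, new_blocks):
        parity = bytes(p ^ b for p, b in zip(parity, blk))
        disk[(d, stripe)] = blk              # write each new data block
    disk[(parity_disk, stripe)] = parity     # write parity - no reads needed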

The performance of NetApp fileserver appliances is astonishing. There's a
high price tag to pay, but if you are serious about having thousands of
writes per second, I strongly recommend this route.

OTOH, 100GB is not a lot of data. A typical disk can handle 200-250
operations per second. If you had 8 disks, set up as four mirrored pairs,
with your data spread appropriately across the four volumes, you might be
able to achieve 1,000 operations per second. It depends on the actual usage
patterns - are these lots of small files, or fewer large files? Mostly
reads, or an equal mix of reads and writes? If it's a mail system it will be
a mix of reads and writes of small files, with lots of inode creations and
deletions. You'll certainly be better off with mirroring than RAID5.
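
For what it's worth, the back-of-the-envelope arithmetic behind that 1,000
figure looks something like this (200 ops/sec per disk and an even
read/write mix are assumptions, not measurements):

OPS_PER_DISK = 200
DISKS = 8                                # four mirrored pairs

total_disk_ops = DISKS * OPS_PER_DISK    # 1600 raw disk ops/sec

# On RAID1 a read is served by one disk, but a write has to go to both
# mirrors, so with an even mix each logical operation costs 1.5 disk ops.
logical_ops = total_disk_ops / 1.5

print(round(logical_ops))                # ~1066, i.e. roughly 1,000 ops/sec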

> What am looking at is
> 1. Is UFS consistent enough in its journalizing to reduce the risk of file 
> system errors?

As I say, there is no journalling :-(

> 2. Why do a majority of large email services prefer Linux for core email 
> services?

Probably because it's what they know or have heard of. Possibly because it's
easier to buy a support contract for a Linux system than a BSD one. Possibly
because several hardware vendors support Linux.

I built a large E-mail cluster using FreeBSD at the front, and NetApp as the
storage, using Maildir over NFS, about 5 years ago. It has hundreds of
thousands of users. It's still running and has needed very little
administration.

HTH,

Brian.

