One of the things ZFS boasts about most is its scalability -- Z is for zettabyte, after all. Trivia question: what is the first thing you do after you put data on your production ZFS volume? That's right, you back it up to your backup infrastructure. A lot of systems use tar or other archive-like derivatives to manage backups. This technique is particularly awful with databases. Databases usually consist of very large (multi-gigabyte) files that see minimal changes. With full-archive systems, any attempt at incremental backups is horribly inefficient in both space and time, because a small (8192 byte) change in a datafile forces the whole file to be backed up in the next incremental.
Enter block-level incremental (BLI) backups. The idea here is that you ask your filesystem to track which blocks change from a certain moment in time. You can ask the filesystem for all blocks of a filesystem (view consistent, of course) and then later ask it for the changeset. In other words, it's like doing:
- snapshot FS as 'base', backup 'base'.
- wait a day
- snapshot FS as 'inc1', backup diff 'base' 'inc1'
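As a rough sketch, the three steps above map onto ZFS commands along these lines. The dataset name tank/pgdata and the /backup paths are hypothetical, and the script only prints the commands rather than running them, so it can be tried on a machine without ZFS.

```shell
#!/bin/sh
# Dry-run sketch of a base + incremental cycle with zfs send.
# FS and OUT are hypothetical placeholders, not names from this setup.
FS="tank/pgdata"
OUT="/backup"

full_backup_cmds() {
    # Snapshot the filesystem as 'base' and send the full stream.
    echo "zfs snapshot $FS@base"
    echo "zfs send $FS@base > $OUT/base.full"
}

incremental_backup_cmds() {
    # $1 = name of the new snapshot.
    # 'zfs send -i' sends only the blocks that changed since 'base'.
    echo "zfs snapshot $FS@$1"
    echo "zfs send -i $FS@base $FS@$1 > $OUT/$1.incr"
}

full_backup_cmds
incremental_backup_cmds inc1
```

Restoring would mean feeding the full stream and then each incremental, in order, to 'zfs recv'.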
Filesystems have supported this type of behavior for a while now (Veritas VxFS has a magnificent implementation). Needless to say, I was ecstatic when I read the zfs manpage and learned of the 'zfs send' and 'zfs recv' operations. Functionally, they implement BLIs.
We have a database on which we have around 1TB of information on zfs. So, I figured we'd whip together a script to tie in zfs send (including incremental support) to our Veritas NetBackup infrastructure.
We have three mount points that we need to snap and send to NetBackup, so I create three FIFOs on disk and fork off three parallel 'zfs send' operations. Then I fork off three parallel NetBackup jobs (one for each FIFO). We have three tape heads, so they all actually run in parallel and should fly like the wind (all over GigE).
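The FIFO fan-out described above can be sketched roughly as follows. This is not the actual zbackup.sh: SEND_CMD and SINK_CMD are hypothetical hooks, defaulting to harmless stand-ins (echo and cat) so the sketch runs anywhere. In a real run they would be something like '/sbin/zfs send' and the NetBackup job that reads the FIFO.

```shell
#!/bin/sh
# Sketch: one FIFO, one sender and one reader per snapshot, all in parallel.
# SEND_CMD and SINK_CMD are hypothetical hooks with stand-in defaults.
SEND_CMD=${SEND_CMD:-"echo stream-of"}   # stand-in for: /sbin/zfs send
SINK_CMD=${SINK_CMD:-"cat"}              # stand-in for the backup job

backup_parallel() {
    dir=$(mktemp -d) || return 1
    for snap in "$@"; do
        # pool/fs@snap -> pool:fs.snap, the naming used for the scratch files
        fifo="$dir/$(echo "$snap" | tr '/@' ':.')"
        mkfifo "$fifo" || return 1
        $SEND_CMD "$snap" > "$fifo" &          # blocks until a reader opens the FIFO
        $SINK_CMD < "$fifo" > "$fifo.out" &    # drains the FIFO concurrently
    done
    wait                    # every sender and every reader must finish
    cat "$dir"/*.out        # show what each reader captured
    rm -rf "$dir"
}

backup_parallel intmirror/xlogs@lastfull xsr_slow_1/pgdata@lastfull
```

Because each sender blocks until its reader attaches, nothing is staged on disk; the FIFO just couples the two processes.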
# date; ./zbackup.sh -s -l 2006121402; date
Thu Dec 14 12:58:43 EST 2006
./zbackup.sh:
backuplabel: 2006121402
full
zfs destroy intmirror/xlogs@lastfull
zfs destroy xsr_slow_1/pgdata@lastfull
zfs destroy xsr_slow_2/pgdata@lastfull
Backing up as '2006121402'
starting postgres backup on label 2006121402
zfs snapshot intmirror/xlogs@lastfull
zfs snapshot xsr_slow_1/pgdata@lastfull
zfs snapshot xsr_slow_2/pgdata@lastfull
stopping postgres backup on label 2006121402
/sbin/zfs send intmirror/xlogs@lastfull >> /pgods/scratch/intmirror:xlogs.lastfull.full &
/sbin/zfs send xsr_slow_1/pgdata@lastfull >> /pgods/scratch/xsr_slow_1:pgdata.lastfull.full &
/sbin/zfs send xsr_slow_2/pgdata@lastfull >> /pgods/scratch/xsr_slow_2:pgdata.lastfull.full &
Sat Dec 23 15:39:47 EST 2006
SWEET JESUS! That's a backup of 9 days, 2 hours, 41 minutes and 4 seconds. Somehow I think that doing daily incremental backups is out of the question. I tried zfs send redirected to /dev/null (just to demonstrate that NetBackup was not the bottleneck) and, as expected, there was no noticeable speedup. I've tested this on some other machines and got the send operation to run quite fast. However, any time a very competitive I/O load is added, it suffers miserably and becomes so slow that it is useless.
Reading the source code of the ZFS layer leads me to believe that all the operations for doing the send are scheduled serially (each after the previous completes) and compete equally for system I/O with all other processes. I saw no intuitive way to make the ioctl()s with ZFS act as if they were more important than other things going on in the system. This leads me to believe that it may not be so easy to fix. However, those Sun engineers have wicked tricks up their sleeves and tend to pull off some amazing feats. So, here's hoping!
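For the curious, the elapsed time and a rough throughput figure can be checked with a little shell arithmetic. The two timestamps are the ones printed by date above. The 1TB figure is the approximate size mentioned earlier, taken here as 10^9 KB, which is an assumption, so treat the rate as a ballpark only.

```shell
#!/bin/sh
# Seconds since the start of the month for: day hour minute second.
month_secs() { echo $(( $1 * 86400 + $2 * 3600 + $3 * 60 + $4 )); }

start=$(month_secs 14 12 58 43)   # Thu Dec 14 12:58:43
end=$(month_secs 23 15 39 47)     # Sat Dec 23 15:39:47
elapsed=$(( end - start ))

days=$(( elapsed / 86400 ))
hours=$(( elapsed % 86400 / 3600 ))
mins=$(( elapsed % 3600 / 60 ))
secs=$(( elapsed % 60 ))

# Assumption: "around 1TB" is treated as 10^9 KB (decimal units).
total_kb=1000000000
rate_kbs=$(( total_kb / elapsed ))

echo "elapsed: ${elapsed}s (${days}d ${hours}h ${mins}m ${secs}s)"
echo "rate: about ${rate_kbs} KB/s"
```

That works out to a bit under 1.3 MB/s sustained across all three sends combined.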
Until then, I hereby suggest that the 'zfs send' be renamed 'zfs trickle'.
Wednesday, March 21. 2007 at 02:57
This blog entry was mentioned on zfs-discuss@opensolaris.org, and I responded:
The author notes that it took 9 days to do a full zfs send. Elsewhere they note that they have "about 1TB of information on ZFS", so I'm left to guess that their zfs send went at about 1.3MB/s. Without knowing their underlying storage hardware, I couldn't say what a reasonable expectation would be, but even a single modern spindle could do more sequential reads. Random I/O is another matter, so the layout of the data on disk would play in. Any other load on the system would also impact the time that the 'zfs send' would be expected to take.
That said, I'd still guess that they are right -- 'zfs send' could be a lot faster if it issued more i/o in parallel. Finding the right balance of 'zfs send' performance vs. other i/o priority will be tricky, but it's something we're going to work on.
Based on the time it took to do a full zfs send, the author says "Somehow I think that doing daily incremental backups is out of the question." However, the data does not support this conclusion. If the amount of data changed is small (which the author claims: "very very large files ... that have minimal changes to them"), then the incremental zfs send will be quite fast.
While I agree that large improvements are possible, the data presented does not support the conclusion that zfs send is not an acceptable solution for daily incremental backups for this workload.
Wednesday, March 21. 2007 at 10:53
We use ZFS send to back up a large variety of ZFS filesystems around our organization and it works pretty well -- we use ZetaBack (https://labs.omniti.com/trac/zetaback). This is the only system we have this problem on, and I believe it is due to the workload.
This system sports 36 SCSI spindles in one array, 7 SATA spindles in a second, 7 SATA spindles in a third and 2 SCSI spindles internally. The issue is that the 36+7+7 are commonly (over 99% of the time) fully saturated (100% utilization). And as I have many, many processes competing for these resources, the single zfs send really loses out.
What I think would work well is to be able to (1) tune the parallelism of the zfs send ioctl() and (2) make all of its I/O ops jump to the front of the I/O queue. If these were tunables, I could likely strike the balance between zfs send speed and regular workload slowdown to my liking.