Datto backs up data, a lot of it. At the time of writing Datto has over 500 PB of data stored on ZFS. This count includes both the backup appliances that are sent to customer sites and the cloud storage servers that are used for secondary and tertiary backup of those appliances. At this scale, drive swaps are a daily occurrence and data corruption is inevitable. How we handle this corruption when it happens determines whether we truly lose data or successfully restore from secondary backup. In this post we'll show you how we intentionally cause corruption in our testing environments at Datto, to ensure we're building software that can properly handle these scenarios.
Disclaimer: You should absolutely not attempt these instructions on any system containing any data you would like to keep. I provide no guarantees that the commands within this post will not completely destroy your zpool and all its contained data. But we'll try to only destroy it a little bit.
What is ZFS?
ZFS is a filesystem and volume manager with a variety of properties that make it ideal for storing large amounts of data. Amongst these are:
- Automatic detection/repair of silent corruption ("bit rot")
- Constant time filesystem snapshotting
- Constant time restore of those snapshots (i.e. no delta merges)
- Transactional disk writes
- The ability to send and receive snapshots for off-site backup
ZFS forms the foundation for both the data backup and disaster recovery mechanisms at Datto.
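For anyone who hasn't used ZFS before, the snapshot, restore, and send/receive features above are each a single command. Here's a minimal sketch, with a purely illustrative dataset name and remote host:
$ zfs snapshot tank/data@monday     # point-in-time snapshot, constant time
$ zfs rollback tank/data@monday     # constant time restore to that snapshot
$ zfs send tank/data@monday | ssh backup-host zfs recv backup/data     # off-site copy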
What is "Corruption"?
ZFS has mechanisms (referred to as "scrubbing") to detect and repair silent data errors. ZFS also gracefully handles drive failure and drive swaps.
In this case, by "corruption" we mean permanent data loss, where all of ZFS's internal backup replicas of the data are lost or destroyed. In a real life scenario this would be a trigger to recover the destroyed data from off-site replication (i.e. the Datto cloud), or from secondary cloud backup.
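As a concrete (if oversimplified) example of what that trigger can look like, the detection side boils down to watching zpool status for permanent errors. A minimal sketch; the real tooling is more involved than a one-liner:
$ zpool status -v tank | grep -q "Permanent errors" \
    && echo "permanent corruption detected, restore from secondary backup" \
    || echo "no permanent errors detected"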
In practice these scenarios are rare, since ZFS is designed around preventing permanent corruption. But therein lies the problem. Since these corruption events are so rare, it's hard to write code to handle these scenarios. Unless, of course, you can cause corruption yourself!
Setup
For this walkthrough we'll use a realistic setup: a zpool with a single mirror vdev backed by two physical disks:
$ zpool status -L
pool: tank
state: ONLINE
scan: scrub repaired 0B in 1h0m with 0 errors
...
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
...
errors: No known data errors
...
As you can see this pool has been recently scrubbed and has no known errors.
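If you'd like to follow along without spare hardware, you can build a similar throwaway pool out of sparse files. A sketch (the paths are arbitrary, the specific block addresses shown later in this post will differ on your system, and with file vdevs you'd dd into the backing files rather than /dev/sdb1):
$ truncate -s 1G /tmp/zfs-mirror-a /tmp/zfs-mirror-b
$ zpool create tank mirror /tmp/zfs-mirror-a /tmp/zfs-mirror-b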
Causing Corruption
Since this is a mirror setup, a naive way to cause corruption would be to randomly dd over the same sectors of both /dev/sdb and /dev/sdc (sketched below). This works, but is equally likely to overwrite random unused space or to take down the zpool entirely. What we really want is to corrupt a specific snapshot, or even a specific file in that snapshot, to simulate a more realistic minor corruption event. Luckily we have a tool called zdb that lets us view some low level information about datasets.
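For illustration, that shotgun approach would look something like this. The offset here is completely arbitrary, which is exactly the problem (and, as with everything else in this post, don't run it on a pool you care about):
$ dd if=/dev/urandom of=/dev/sdb bs=1M seek=123 count=1     # clobber 1MiB at an arbitrary offset
$ dd if=/dev/urandom of=/dev/sdc bs=1M seek=123 count=1     # same offset on the other mirror member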
First let's create a dataset:
$ zfs create tank/corrupt_me
Add some dummy data:
$ echo "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed" \
"do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
> /tank/corrupt_me/test.txt
And create a snapshot:
$ zfs snap tank/corrupt_me@snap
We'll use this as our corruption target.
Let's see what zdb can tell us about this dataset:
$ zdb -ddd tank/corrupt_me@snap
Dataset tank/corrupt_me@snap [ZPL], ID 6100, cr_txg 7407961, 100K, 7 objects
...
Object lvl iblk dblk dsize dnsize lsize %full type
0 6 128K 16K 56.0K 512 32K 10.94 DMU dnode
-1 1 128K 512 0 512 512 100.00 ZFS user/group used
-2 1 128K 512 0 512 512 100.00 ZFS user/group used
...
2 1 128K 512 4K 512 512 100.00 ZFS plain file
...
Note: For those following along, if any of these commands fail, try running zdb -eddd. This will bypass the zpool cache.
As you can see, this gives us a list of the ZFS objects associated with the specified dataset/snapshot. Since we only created one file in this dataset, this "ZFS plain file" must be what we're looking for. We can dive even deeper into that object with extra verbosity (more d's!):
$ zdb -ddddd tank/corrupt_me@snap
...
Object lvl iblk dblk dsize dnsize lsize %full type
2 1 128K 512 4K 512 512 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path /test.txt
...
Indirect blocks:
0 L0 0:4f110bf000:1000 200L/200P F=1 B=7407956/7407956
segment [0000000000000000, 0000000000000200) size 512
Notice path /test.txt. That's the file we created earlier, so this confirms we're on the right track. This output also gives us the indirect block address (0:4f110bf000:1000). Without getting too deep into ZFS internals, the indirect block in this case stores the contents of our test.txt file. But you don't have to believe me; we can prove it with another zdb command:
$ zdb -R tank 0:4f110bf000:1000 | head
Found vdev type: mirror
0:4f110bf000:1000
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 4c6f72656d206970 73756d20646f6c6f Lorem ipsum dolo
000010: 722073697420616d 65742c20636f6e73 r sit amet, cons
000020: 6563746574757220 6164697069736369 ectetur adipisci
000030: 6e6720656c69742c 2073656420646f20 ng elit, sed do
000040: 656975736d6f6420 74656d706f722069 eiusmod tempor i
000050: 6e6369646964756e 74207574206c6162 ncididunt ut lab
000060: 6f72652065742064 6f6c6f7265206d61 ore et dolore ma
As you can see, the -R option takes a ZFS block address and displays its contents in several formats. Now we're really getting somewhere: we can see the lorem ipsum text we added to this file earlier. The indirect block address is in the format [vdev]:[byte offset in hex]:[size]; in our case that's vdev 0, byte offset 0x4f110bf000, size 0x1000 (4 KiB).
From the zpool status output earlier we know that our mirror vdev has two physical backing disks, sdb and sdc. Let's pick one of them (sdb) to see if we can use the indirect block offset to read our data directly off the disk (and eventually corrupt it).
Failed attempt:
$ dd if=/dev/sdb1 bs=512 skip=$((0x4f110bf000 / 512)) count=1 \
| hexdump -C
1+0 records in
1+0 records out
512 bytes copied, 0.0123137 s, 41.6 kB/s
00000000 1f 80 26 b1 14 00 00 00 01 80 48 b1 14 00 00 00 |..&.......H.....|
00000010 08 80 4c b1 14 00 00 00 01 80 56 b1 14 00 00 00 |..L.......V.....|
00000020 00 80 5b b1 14 00 00 00 01 80 5e b1 14 00 00 00 |..[.......^.....|
00000030 00 80 61 b1 14 00 00 00 1a 80 63 b1 14 00 00 00 |..a.......c.....|
00000040 05 80 81 b1 14 00 00 00 8b 80 88 b1 14 00 00 00 |................|
00000050 4f 80 15 b2 14 00 00 00 c2 82 67 b2 14 00 00 00 |O.........g.....|
00000060 1a 80 2c b5 14 00 00 00 00 80 4b b5 14 00 00 00 |..,.......K.....|
It's just junk. Let's break down what we're attempting: read from /dev/sdb1 (our physical mirror member) with a block size of 512 bytes, skip ahead to our offset 0x4f110bf000 (divided by the block size, because skip takes a count of blocks), and read a single block. So why didn't it work? To find the answer we need to dive into the ZFS on-disk specification. The relevant section is:
The value stored in offset is the offset in terms of sectors
(512 byte blocks). To find the physical block byte offset from
the beginning of a slice, the value inside offset must be
shifted over (<<) by 9 (2^9 = 512) and this value must be added
to 0x400000 (size of two vdev_labels and boot block).
physical block address = (offset << 9) + 0x400000 (4MB)
To further add to the confusion, zdb automatically converts the offset to bytes (rather than sectors), so we actually don't have to shift. But this gives us the information we need: we just need to skip an extra 4MB (0x400000 bytes) at the start of the physical disk.
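If you'd like to sanity-check the arithmetic before touching the disk, you can let the shell do the conversion (these are the same numbers used in the dd command below):
$ printf '%d\n' $((0x4f110bf000))     # DVA offset in bytes
339588411392
$ printf '%d\n' $(((0x4f110bf000 + 0x400000) / 512))     # 512-byte sectors to skip, labels included
663266808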
$ dd if=/dev/sdb1 bs=512 \
skip=$(((0x4f110bf000 / 512) + (0x400000 / 512))) count=1 \
| hexdump -C
1+0 records in
1+0 records out
512 bytes copied, 0.0182629 s, 28.0 kB/s
00000000 4c 6f 72 65 6d 20 69 70 73 75 6d 20 64 6f 6c 6f |Lorem ipsum dolo|
00000010 72 20 73 69 74 20 61 6d 65 74 2c 20 63 6f 6e 73 |r sit amet, cons|
00000020 65 63 74 65 74 75 72 20 61 64 69 70 69 73 63 69 |ectetur adipisci|
00000030 6e 67 20 65 6c 69 74 2c 20 73 65 64 20 64 6f 20 |ng elit, sed do |
00000040 65 69 75 73 6d 6f 64 20 74 65 6d 70 6f 72 20 69 |eiusmod tempor i|
00000050 6e 63 69 64 69 64 75 6e 74 20 75 74 20 6c 61 62 |ncididunt ut lab|
00000060 6f 72 65 20 65 74 20 64 6f 6c 6f 72 65 20 6d 61 |ore et dolore ma|
00000070 67 6e 61 20 61 6c 69 71 75 61 2e 0a 00 00 00 00 |gna aliqua......|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
There's our lorem ipsum! Now, instead of looking at the data, we need to overwrite it. Don't forget to overwrite both physical disks in the mirror:
$ dd if=/dev/urandom of=/dev/sdb1 bs=512 \
seek=$(((0x4f110bf000 / 512) + (0x400000 / 512))) count=1
1+0 records in
1+0 records out
512 bytes copied, 0.0199633 s, 25.6 kB/s
$ dd if=/dev/urandom of=/dev/sdc1 bs=512 \
seek=$(((0x4f110bf000 / 512) + (0x400000 / 512))) count=1
1+0 records in
1+0 records out
512 bytes copied, 0.000700771 s, 731 kB/s
At this point we just need to trigger a read of the data in our snapshot. One option is to clone the snapshot and read the bad block back through the clone.
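A rough sketch of that approach (the clone name is arbitrary, and note that data still cached in the ARC from our earlier reads may have to be evicted before the read actually hits disk):
$ zfs clone tank/corrupt_me@snap tank/corrupt_me_clone
$ cat /tank/corrupt_me_clone/test.txt     # should fail with an I/O error once ZFS reads the block from disk
But you can also just trigger a scrub: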
$ zpool scrub tank
After the scrub is complete:
$ zpool status -Lv
...
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
sdb ONLINE 0 0 2
sdc ONLINE 0 0 2
...
errors: Permanent errors have been detected in the following files:
tank/corrupt_me@snap:/test.txt
Congratulations, you've successfully destroyed your data!
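In a real incident this is the point where recovery from a replica kicks in. A very rough sketch, assuming an intact copy of the dataset exists on another machine; the hostname is hypothetical, and in practice an incremental send is the more likely path:
$ zfs destroy -r tank/corrupt_me     # drop the damaged dataset and its snapshots
$ ssh backup-host zfs send -R tank/corrupt_me@snap | zfs recv tank/corrupt_me
$ zpool clear tank     # reset the pool's error counters once the data is back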
Conclusion
At the 500 PB scale, it's not a matter of if data corruption will happen but when. Intentionally causing corruption is one of the strategies we use to ensure we're building software that can handle these rare (but inevitable) events.
To others out there using ZFS: I'm curious to hear how you've solved this problem. We did quite a bit of experimentation with zinject before going with this more brute-force method, so I'd be especially interested to hear if you've had luck simply simulating corruption with zinject.