A few police-action stories from working on the Datto Linux backup agent

In the beginning there was data.

And it was good.

Good data. Good data. pant pant pant. bark bark. sluuuuuurp.

And data wanted to be free!

It wanted to run through the fields and feel the wind in its hair.

But data also wanted to be backed up.

Then the dinosaurs came but they got too big and fat and they all turned into oil yada yada yada, fast forward a few million years and Datto was born.

Datto’s Business Continuity line of products has historically been centered on backing up Windows machines. However, our customers often have important Linux machines deployed as well, and they’d been asking us for a while to expand our product offering to support Linux. So we wrote the Datto Linux Agent (DLA).

The Linux Agent is an application that runs on a Linux machine and enables Datto to back up some or all of the volumes on that machine to Datto’s backup appliance, called a SIRIS. The SIRIS manages all aspects of Business Continuity and Disaster Recovery. The agent is responsible for getting the backup data from the protected machine to the SIRIS.

This is the story of some of the fun we had while working to make the Linux Agent a mature product (read: dealing with all of the unknown unknowns).

Background

The Datto Linux Agent is composed of two pieces: a kernel module that fabricates a virtual block device representing a point-in-time snapshot of a real block device for a volume on disk, and a userspace application that reads from this virtual device and sends the data to the SIRIS.

A friend of mine says “everything in IT is about taking data from one place, mushing it up a little, and putting it somewhere else.” And if you think about it, it’s kind of true. Web applications read information from a database, format it in HTML, and write it to an output stream of some kind that eventually gets to a browser. Batch processing applications read data from one data store (a database, a flat file, etc.), do a little manipulation, and then write it out to some other data store.

At Datto, our job is even easier. We’re taking backups. We just have to take the data from here, and put it there. We don’t even have to worry about any kind of manipulation. How hard can that possibly be? Read data from here, write it there.

Well, it’s not as simple as you might think.

Microsoft supplies a variety of versions of Windows, and various service packs. A known and reasonable quantity. As of this writing, distrowatch.com says there are 285 active Linux distributions.

And even though they are all based on the same Linux kernel, and many of them don’t apply their own combination of patches to it, that’s still quite a few permutations of Linux distributions and kernel versions. Then there’s initrd, initramfs, and dracut. And package managers ...

Microsoft includes, in modern versions of Windows, something called VSS (the Volume Shadow Copy Service), which allows a userspace application to get a point-in-time snapshot of a volume.

Linux has no such thing. So we had to write our own, thus dattobd was born. Dattobd is an open-source Linux kernel module that creates a point-in-time snapshot of a block device, which is used by dlad (Datto’s userspace backup daemon) to perform the actual backup of the volume to the SIRIS backup appliance.

Architecture

During the process of taking a backup, dlad instructs dattobd (via ioctl) to create a virtual volume (the snapshot block device) that represents a physical volume as it was at the point in time the snapshot was taken. Dattobd creates this point-in-time snapshot by keeping track of which blocks are written to during live use of the physical volume. When a block is about to be changed by a write, the dattobd kernel module copies the data that was on disk before the write started into a copy-on-write (COW) file referenced by the snapshot virtual volume. Dattobd can then present a point-in-time image of the volume by combining unchanged data from the live volume with the pre-overwrite copies from the COW file.

After the snapshot is created, dlad then copies all of the data off the virtual volume and sends it to the SIRIS appliance. When the copy is complete, dlad tells dattobd to transition to a state where it keeps track of the list of blocks that are written to the live volume (which is a lightweight operation) but it no longer has to maintain the point-in-time snapshot (COW file). This list of block changes is used for the next incremental backup to only send the blocks that have changed since the previous backup.
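
For the curious, here is a rough sketch of what that setup-then-transition dance might look like from dlad’s side. The control device path, ioctl request codes, and parameter struct below are made-up placeholders for illustration only; dattobd’s real interface is defined in its own public headers.

```c
/*
 * Illustrative sketch only: the control device path, ioctl request codes,
 * and parameter struct below are placeholders standing in for dattobd's
 * real interface, which is defined in the dattobd headers.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request codes and parameter block. */
struct example_setup_params {
    const char        *bdev;      /* live block device, e.g. /dev/sda1    */
    const char        *cow;       /* COW file stored on that same volume  */
    unsigned long long cow_size;  /* preallocated COW size in bytes       */
    unsigned int       minor;     /* minor number of the snapshot device  */
};

#define EXAMPLE_IOCTL_SETUP_SNAP     _IOW('D', 1, struct example_setup_params)
#define EXAMPLE_IOCTL_TRANSITION_INC _IOW('D', 2, unsigned int)

int main(void)
{
    int ctl = open("/dev/datto-ctl", O_RDWR);   /* hypothetical control node */
    if (ctl < 0) { perror("open"); return 1; }

    struct example_setup_params p = {
        .bdev     = "/dev/sda1",
        .cow      = "/var/lib/datto/sda1.cow",
        .cow_size = 10ULL * 1024 * 1024 * 1024,
        .minor    = 0,
    };

    /* Create the point-in-time snapshot device (say, /dev/datto0). */
    if (ioctl(ctl, EXAMPLE_IOCTL_SETUP_SNAP, &p) < 0) {
        perror("setup snapshot");
        return 1;
    }

    /* ... dlad reads /dev/datto0 here and ships the blocks to the SIRIS ... */

    /* Backup done: drop the COW file, keep tracking which blocks change. */
    unsigned int minor = 0;
    if (ioctl(ctl, EXAMPLE_IOCTL_TRANSITION_INC, &minor) < 0) {
        perror("transition to incremental");
        return 1;
    }

    close(ctl);
    return 0;
}
```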

When writing a complex kernel module like dattobd, you get to learn a lot about the innards of the block device layer in the Linux kernel, and about the myriad debugging tools available to the kernel developer (read: almost none). But we learned a lot.

If not for the courage of the fearless crew, the minnow would be lost.

Below is a small selection of some of the more interesting problems we ran into during our journey to making the Linux agent a stable and robust product.

A lesson in supporting old hardware.

It doesn’t matter if you don’t use old hardware; you can still get burned by it. Because somewhere, somebody still uses it, and support for it is still in the kernel, ready to trip you up when you least expect it. So it went with the floppy driver.

Some Linux distributions load the floppy driver in as a default module since somebody might someday again use a floppy disk for something other than playing music on.

The Datto Linux Agent uses libblkid to get a list of devices on the machine. Libblkid probes each device to get information about it, but when it tries to probe the floppy device… it hangs. Your process ends up in the “uninterruptible sleep” state. Short of fixing the floppy driver itself, there’s no way around it except to unload the floppy driver before probing.

We actually got a few support calls about this, and it took us a while to figure out what was causing the problem, because so few systems nowadays have floppy support enabled, installed and loaded. We were finally able to reproduce this when testing with VMs, because none of us actually had a real floppy drive lying around. We have a lot of weird hardware bouncing around at Datto, but not so much with the floppy drives. So, as it stands, we cannot back up your machine if you use a floppy drive. Thankfully we haven’t actually run into this requirement. The few customers who did have a problem were happy to unload the floppy driver kernel module.
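
To make the failure mode concrete, here is a minimal sketch of probing a device with libblkid, with a guard bolted on that skips floppy devices (block major 2) before the probe can wedge the process. The guard is just an illustration; as noted above, the practical fix was to unload the floppy driver module before probing.

```c
/*
 * Minimal sketch of probing a device with libblkid (link with -lblkid).
 * The floppy guard is for illustration; the shipped workaround was to
 * unload the floppy driver module before probing.
 */
#include <blkid/blkid.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

#define FLOPPY_MAJOR 2   /* legacy floppy block major number */

static void probe_device(const char *devname)
{
    struct stat st;
    if (stat(devname, &st) != 0 || !S_ISBLK(st.st_mode))
        return;

    /* Probing a floppy device can leave the process in uninterruptible sleep. */
    if (major(st.st_rdev) == FLOPPY_MAJOR) {
        fprintf(stderr, "skipping %s: floppy device, probe may hang\n", devname);
        return;
    }

    blkid_probe pr = blkid_new_probe_from_filename(devname);
    if (!pr)
        return;

    const char *fstype = NULL;
    if (blkid_do_safeprobe(pr) == 0 &&
        blkid_probe_lookup_value(pr, "TYPE", &fstype, NULL) == 0)
        printf("%s: %s\n", devname, fstype);

    blkid_free_probe(pr);
}

int main(void)
{
    probe_device("/dev/sda1");
    probe_device("/dev/fd0");   /* skipped by the guard above */
    return 0;
}
```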

A lesson in journaling filesystem drivers operating on non-journaling filesystems.

Part of the process of creating a point-in-time snapshot of a block device is to get it into as consistent an on-disk state as possible by flushing all pending writes to disk, and then atomically switching the kernel module into a mode where it tracks all changes made after the point-in-time snapshot you’re creating. You can achieve this atomicity by blocking all access to the filesystem while you switch.

Linux provides a tool for freezing a filesystem called, curiously enough, fsfreeze (a thin wrapper around the kernel’s FIFREEZE ioctl). So you sync sync sync your filesystem, then fsfreeze it, and then you have a point in time you can start tracking changes from.
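
In code, that sync/freeze/switch/thaw sequence boils down to something like the sketch below, which calls the same FIFREEZE and FITHAW ioctls that the fsfreeze utility uses. The mount point is an example, and the actual mode switch in the kernel module is elided.

```c
/*
 * Rough sketch of the sync/freeze/switch/thaw dance.  FIFREEZE and FITHAW
 * are the kernel ioctls that the fsfreeze utility wraps; the mount point
 * is an example and the kernel-module mode switch is elided.
 */
#include <fcntl.h>
#include <linux/fs.h>    /* FIFREEZE, FITHAW */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    const char *mountpoint = "/mnt/data";   /* example mount point */

    sync();                                 /* flush pending writes everywhere */

    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (ioctl(fd, FIFREEZE, 0) < 0) {       /* block all writes to this fs */
        perror("FIFREEZE");
        close(fd);
        return 1;
    }

    /* ... atomically switch the kernel module into change-tracking mode ... */

    if (ioctl(fd, FITHAW, 0) < 0)           /* let writes flow again */
        perror("FITHAW");

    close(fd);
    return 0;
}
```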

That procedure works well enough, unless you happen to have an ext2 boot volume and another ext4 volume mounted at the same time, which happened to be what we had for a QA setup to test with. Calling fsfreeze on this combination of filesystems causes the ext4 driver to try and recover the journal on the ext2 volume… which doesn’t have a journal.

The practical upshot of calling fsfreeze on an ext2 filesystem that’s being handled by the ext4 driver is that it makes your ext2 block device disappear. I suppose it could be worse; it could render your /boot partition unbootable.

As it turns out, it does that too, if you don’t unmount and remount /boot again before rebooting. I’m glad we found this problem before our customers did.

From this experience I also learned that ‘derp’ is a technical term, which you can read about here. Kudos to Eric Sandeen from Red Hat, who jumped on this less than 20 minutes after I reported it.

Since this was on Fedora, we were lucky to find the problem and get it fixed before it made it to Debian or Red Hat Enterprise Linux, so it ended up working out. Another feather in our collective caps.

A lesson in finding out if you’re really copying everything you think you are.

As it turns out, it’s really really hard to find out if you are copying everything you want to copy from a filesystem while it’s in use. When you’re talking about backing up some large number of 4K blocks on a 250, maybe 500 gig volume, making sure you got every block you wanted to is not a trivial task. How can you tell if you’re backing up everything you expect to? What can you use as your original reference point? You can’t use the original disk because it has changed while you were copying from it. You can’t take a copy of it beforehand, because, well, what are you going to copy it with? dd? Plenty of stuff is going to change on your disk while you’re trying to dd the data away somewhere else.

I suppose you could run dlad and dattobd in a VM, then take a snapshot of that VM and mount the VM disk image on another machine, and while that would give you the state of the disk before you tried to copy it, it wouldn’t give you the state of the disk while you were trying to copy it.

After some issues from the field came in, we were pretty sure we had a problem. Some of our own tests were yielding backups with data from the future in them. In other words, data that had changed after the point-in-time snapshot it was ostensibly a part of. Where was this changed data coming from? The kernel module, or the userspace program? How do you find out where the system as a whole went wrong when you’re missing a handful of 4k blocks amidst 500 gigs of data? Little pieces of needles in a very large haystack. A little calculator magic tells me that 4k in 500 gigs is 0.0000007629% of the disk.

That’s a lot of zeros after that little dot.

But we kept testing and kept testing, and we had an amazing QA guy, and in all his testing, he was able to reproduce the problem reasonably consistently on a 500meg volume.

That, finally, was something I could work with. If I did my calculator right, that’s about 131072 4k blocks. Nothing I could sift through by hand, but pretty manageable with the mighty power of the Intel x86 processor line, a bunch of scripts, and a lot of spare disk.

So I figured what I’d do is make a version of the Linux Agent that recorded what it was doing with every block it read and wrote at every step of the backup process, including what was on the live disk at the time a block was being backed up. Then I could write a bunch of scripts to sift through the recordings and look for places where blocks were not set to what they should be, and from there the source of the problem should make itself more obvious.

So I hacked up the Linux Agent to do just that. When reading from the snapshot device (the source of the backup data), I’d make an extra copy of what I read and write it to some other disk, out of the way. Then I’d read the same block off the live disk as close to the same time as I could, and write that somewhere out of the way as well. Those two blocks, I figured, should be the same.

Then I’d change the data in that block on the live disk, and again read from the snapshot volume and the live volume at the same time, and write those somewhere else out of the way. Those should be different.

When I wrote to the backup appliance, I’d write those blocks to another place as well (that I could match up to my other copies), and then since I had no idea where the process was going wrong, I’d re-read the data from the backup appliance, and write another copy of that as well. Then I wrote a few scripts to validate that every block in each step was what it should be compared with the original, and compared with the real backup the agent was taking.
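
Stripped of the details, the recording side of that harness amounted to something like the sketch below: read each block from the snapshot device and the live device as close together in time as possible, and append both copies (plus the block number) to a log on a separate disk for the comparison scripts to chew on later. The device paths, log location, and block count are just examples.

```c
/*
 * Stripped-down sketch of the recording harness: for each block, read the
 * snapshot device and the live device as close together in time as possible
 * and append both copies (plus the block number) to a log on a separate
 * disk.  Device paths, the log location, and the block count are examples.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    int snap = open("/dev/datto0", O_RDONLY);           /* snapshot device     */
    int live = open("/dev/sda1", O_RDONLY);             /* live device         */
    int log  = open("/mnt/scratch/blocks.log",          /* out-of-the-way disk */
                    O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (snap < 0 || live < 0 || log < 0) { perror("open"); return 1; }

    uint8_t snap_buf[BLOCK_SIZE], live_buf[BLOCK_SIZE];
    uint64_t nblocks = 131072;                          /* ~500 meg volume */

    for (uint64_t blk = 0; blk < nblocks; blk++) {
        off_t off = (off_t)blk * BLOCK_SIZE;

        if (pread(snap, snap_buf, BLOCK_SIZE, off) != BLOCK_SIZE ||
            pread(live, live_buf, BLOCK_SIZE, off) != BLOCK_SIZE)
            break;

        /* Record the block number plus both copies for the comparison scripts. */
        if (write(log, &blk, sizeof(blk)) < 0 ||
            write(log, snap_buf, BLOCK_SIZE) < 0 ||
            write(log, live_buf, BLOCK_SIZE) < 0) {
            perror("write");
            break;
        }

        /* Until something writes to this block, the two copies should match. */
        if (memcmp(snap_buf, live_buf, BLOCK_SIZE) != 0)
            fprintf(stderr, "block %llu: snapshot and live disk differ\n",
                    (unsigned long long)blk);
    }

    close(snap); close(live); close(log);
    return 0;
}
```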

After setting all this up, all I had to do was take a bunch of backups and wait until something ended up not having the correct data in it. Even with a small 500 meg volume, it took a while and a number of repetitions before it happened. And then of course, there was always the risk that my test program wasn’t working right, or my scripts bungled it somehow.

But after many days of slogging through all these tests, we were able to nail down the source of the bad data.

It turned out to be an intermittent bug in the snapshot driver that caused it to sometimes pick up the live disk data instead of the copy-on-write data.

It was a long slog, but what we learned from this was: if you have a manageable amount of data to work with, and enough space to write out your test data to, and a lot of time, you can write a program to do most of the work of finding your problem for you.

Always use the computer to tell you what is wrong with the computer.

You can’t delete files.

Thankfully somebody thought of this before it caused a problem that would have been hard to figure out.

Since you don’t know how long your backup is going to take when you start it, you don’t know how much new data is going to be written during the backup, nor do you know how much disk you’re going to need to store the copy-on-write information for all of that data. So it seems to make sense to just let the COW file holding that copy-on-write information grow as large as it needs to during the backup. Makes sense. No problem.

We decided early on that we were going to store a volume’s COW file directly on the volume that the snapshot was being taken of. Never end your sentence with a preposition. This way we can always guarantee there will be some disk available to write COW information to, rather than hoping there’s another disk around to use for scratch space.

But it turns out that having the COW file grow dynamically doesn’t work. Consider this:

Let’s say you start a backup, and you have created your point-in-time snapshot. Then an application deletes a file. The space that file took up is now free for use by the filesystem. Then a new write comes in for somewhere else on the disk, and we have to copy the before-the-write data that’s on that part of the disk to the COW file. Being zero length when we start, the COW file isn’t large enough, so we increase its size. The filesystem decides to give the COW file the space from the previously deleted file.

Well, before we increase the COW file size to use that space, we have to copy that before-the-cow-write data to the COW file, so we can preserve what was there before the backup started (before the file was initially deleted), which means we have to increase the COW file size.

Well, before we increase the COW file size to use that space, we have to copy the before-the-cow-write data to the COW file... Glossary: “Infinite loop: see Infinite loop.”

So the answer is to pre-allocate the COW file to make sure we will never have to allocate more space for it during the backup (specifically avoiding writing over space from a file deleted during the backup). How much space do we need to reserve? Great question. The best minds at Datto spent months on this question, and the answer is a secret, and I will only divulge the secret to the highest bidder who pays in non-traceable cash. U.S. Dollars please.

As this was the first time we’d be backing up Linux machines, we didn’t have any historical data to pore over to see what real usage patterns were like and how much COW file space we’d need, but we had to come up with something. So we picked 10% of the volume size, with a maximum of 10 gigs, and made it configurable, so we could change it if some particular system needed more COW file space.
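
As a sketch (using the same 10%-capped-at-10-gigs numbers, with the COW file path and volume size made up for illustration), the pre-allocation looks something like this. posix_fallocate reserves every block of the COW file up front, so the filesystem can never hand the growing COW file space freed by a file deleted mid-backup.

```c
/*
 * Sketch of the sizing policy above: preallocate the COW file at 10% of the
 * volume size, capped at 10 gigs, with a configurable override.  The file
 * path and volume size are examples; posix_fallocate() does the reservation.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define COW_MAX_BYTES (10ULL * 1024 * 1024 * 1024)   /* 10 gig cap        */
#define COW_PERCENT   10                             /* 10% of the volume */

/* Pick the COW size: 10% of the volume, capped, unless overridden. */
static uint64_t cow_size_for(uint64_t volume_bytes, uint64_t override_bytes)
{
    if (override_bytes)                              /* configurable knob */
        return override_bytes;
    uint64_t size = volume_bytes / 100 * COW_PERCENT;
    return size > COW_MAX_BYTES ? COW_MAX_BYTES : size;
}

int main(void)
{
    uint64_t volume_bytes = 250ULL * 1024 * 1024 * 1024;   /* example volume */
    uint64_t cow_bytes = cow_size_for(volume_bytes, 0);

    int fd = open("/var/lib/datto/sda1.cow", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve every COW block up front, before the snapshot goes live, so the
     * filesystem can never hand us space freed by a file deleted mid-backup. */
    if (posix_fallocate(fd, 0, (off_t)cow_bytes) != 0) {
        fprintf(stderr, "could not preallocate %llu bytes\n",
                (unsigned long long)cow_bytes);
        close(fd);
        return 1;
    }

    printf("preallocated %llu-byte COW file\n", (unsigned long long)cow_bytes);
    close(fd);
    return 0;
}
```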

Had we not realized this could be a problem before it happened, we probably would have had another similar slog: setting up a super-recordable environment and waiting for the problem to occur, until we could watch it fail in slow motion and see what was going on.

In summary of the experience.

We had a lot of fun working on the Datto Linux Agent, lots of lessons learned, lots of experience gained, and in the end we had a really solid Linux Agent for backing up live Linux systems.

These are just some of the more memorable moments working on the project. There’s plenty more.

I gotta say though, we get to do some pretty fun stuff at Datto.

About the Author

Stu Mark

Experienced software developer who's fun to be with.
