Datto backs up data, a lot of it. At the time of writing Datto has over 500 PB of data stored on ZFS. This count includes both backup appliances that are sent to customer sites, as well as cloud storage servers that are used for secondary and tertiary backup of those appliances. At this scale drive swaps are a daily occurrence, and data corruption is inevitable. How we handle this corruption when it happens determines whether we truly lose data, or successfully restore from secondary backup. In this post we'll be showing you how at Datto we intentionally cause corruption in our testing environments, to ensure we're building software that can properly handle these scenarios.
In this post, we describe how we moved from Debian-based deployments in our fleet of >80,000 devices to image based upgrades. We show the nitty gritty details of how we use Grub and loop devices to boot from image to image seamlessly, every two weeks.