Putting the devices you know and love into our cloud

What Datto does well

Datto’s traditional BCDR appliances, such as SIRIS and ALTO, come with the relief of cloud backups. In the event your hardware or virtual appliance isn’t accessible or is simply unusable, Datto has long offered cloud restores and virtualizations to keep your business afloat in our cloud. We’ve been doing this successfully for many years.

If you think about it, the job that a SIRIS does with respect to supporting restores locally is substantially similar to the job that our cloud node performs.

One of Datto’s ongoing missions is to continually improve and advance our cloud offerings. The SIRIS has proven itself over and over again with local restores, so why haven’t we put SIRISes in Datto’s cloud, where we would get the innovations and feature advancements that occur on SIRIS “for free”?

The idea of putting SIRISes in the cloud has been tossed around for many years; there just hasn’t been anything new to the cloud that prompted the powers that be to make this a reality. Rather than putting SIRIS in the cloud, we’ve been focusing on denser storage and improved hardware to make our customers’ cloud restores perform at their best.

What if everything in our cloud were a SIRIS?

Putting a SIRIS in the cloud meant we would need to offer a solution that allows our customers to back up their data without the use of an appliance such as SIRIS or ALTO. Essentially, this would mean a “direct-to-cloud” solution. At DattoCon ’18 in Austin, Texas, we introduced a new product, Cloud Continuity. Cloud Continuity is Datto’s first image-based BCDR product that backs up directly to the cloud, with no on-premises appliance involved. It proved popular with our customer base: by the beginning of 2019, over 1500 machines were being protected by Cloud Continuity.

As early as the fall of 2018, however, we had identified a need to expand our cloud’s capabilities to match the feature set of the SIRIS product. The “Everything’s a SIRIS” team, which included me and one other engineer at the time, was created to make this a reality. We were tasked with enabling backups to a SIRIS running on our storage node hardware, and with smoothly converting a traditional Datto cloud storage node to a new Datto cloud SIRIS node.

Cloud Continuity was the first product to move to our new SIRIS-based cloud. If a Datto customer was using Cloud Continuity through March of 2019, their backups were being sent to a traditional Datto cloud storage node. Within a few weeks we upgraded all of those storage nodes to new SIRIS nodes, and those customers probably didn’t even notice.

We were actually quite surprised at how simple it was to configure SIRIS-inbound backups from the Cloud Continuity agent. Just as with a regular SIRIS, though, we also needed redundancy. This meant being able to securely transfer ZFS data from one SIRIS to another, and from one datacenter to another. Datto uses an in-house data transfer mechanism called SpeedSync, an intelligent rsync- and zfs-based offsiting tool in which all datasets are tracked at a vector level. The schema of a vector is pretty basic: an auto-incrementing identifier, the identifier of the dataset itself, and a reference to the source and target machines.

These vectors are essentially source to target machine mappings, and many can exist even for a single dataset. Historically, however, the primary and secondary machine types were device to server or server to server, never device to device. We had to refactor parts of the SpeedSync project to allow device to device vectors to exist, for both primary and secondary replications.
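
To make the idea of a vector concrete, here is a minimal sketch of a vector table and two lookups against it. Everything in it is an assumption for illustration: the column layout, the `<type>:<id>` machine labels, the sample rows, and the function names are hypothetical and are not SpeedSync's actual schema or tooling.

```shell
# Hypothetical vector table, one vector per line:
#   id  dataset_id  source  target
# Machines are written as "<type>:<id>"; these labels are illustrative only.
vector_table() {
    printf '%s\n' \
        '1 123 device:1001 server:17' \
        '2 123 server:17 server:42' \
        '3 456 device:2002 device:9001'
}

# List every source -> target mapping recorded for one dataset.
vectors_for_dataset() {
    vector_table | awk -v id="$1" '$2 == id { print $3 " -> " $4 }'
}

# Succeed if any vector for the dataset targets a device (a SIRIS)
# rather than a server (a traditional storage node).
replicates_to_device() {
    vector_table | awk -v id="$1" \
        '$2 == id && $4 ~ /^device:/ { found = 1 } END { exit !found }'
}

vectors_for_dataset 123
replicates_to_device 456 && echo "dataset 456 replicates to a device"
```

In this sketch, dataset 123 has the historical shape (device to server, then server to server), while dataset 456 has the device-to-device vector that the refactoring had to allow.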

Restoring data agnostically

The majority of Datto’s cloud is made up of traditional storage nodes holding thousands of customers’ data. We’ve only ever had communication from one API to another, but now we’re putting SIRISes in our cloud. Our next task was to figure out how to perform cloud restores for datasets living on a SIRIS in the cloud. Back in 2017 we started on what we call “Cloud API”, a service that drives restores and has full knowledge of where a customer’s backup data is stored. However, it was designed to work with traditional storage nodes and never had a reason to interact with the SIRIS platform, as SIRISes typically live outside of our datacenters. So the team’s next “epic” was to get Cloud API to talk to SIRISes in the cloud and restore the datasets that lived on them, without changing the API contracts that the Datto Partner Portal was using. Since datasets are tracked with a vector, we can determine where a ZFS dataset exists in the cloud and what kind of machine it exists on. This made it effortless to keep the API contract the same.

The simplest Cloud API request to create a virtual machine in our cloud looks like:

{
  "assetId": 123,
  "snapshotTimestamp": 1568402268
}

That’s it: two parameters. The asset ID maps to the dataset identifier mentioned above, and the snapshot timestamp is just a UNIX timestamp representing which backup you want to restore. Being able to map the asset to the vector was crucial. Next we had to hook up to a different API in the cloud.
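
As a small worked example of the timestamp parameter, the sketch below converts a UNIX timestamp into a readable UTC time and builds the same two-field request body. The function names are hypothetical, and the `-d "@..."` form assumes GNU `date`.

```shell
# Convert a UNIX timestamp to an ISO 8601 UTC string (GNU date syntax).
snapshot_time() {
    date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}

# Print a request body with the two fields shown in the example request.
# $1 = asset ID, $2 = snapshot timestamp (both integers).
restore_request() {
    printf '{"assetId": %d, "snapshotTimestamp": %d}\n' "$1" "$2"
}

snapshot_time 1568402268
restore_request 123 1568402268
```

For the request shown earlier, `snapshot_time 1568402268` prints `2019-09-13T19:17:48Z`.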

Using two APIs to accomplish the same task

Traditional Datto cloud storage nodes are running an API called “Node API,” and our new cloud SIRISes are running the SIRIS platform API, which we call “SIRIS API.” Behind the scenes in Cloud API we can resolve the incoming request to the correct service. The simplest approach here was to use the factory pattern: given the resolved request, return the correct API service based on the type of secondary machine. Something similar to:

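public function getService(Request $request): ServiceInterface
{
    // Look up the volume behind the requested asset.
    $volume = $this->repository->getVolumeById(
        $request->getAssetId()
    );

    // Volumes that replicate to a cloud SIRIS get the cloud device service.
    if ($volume->replicatesToCloudDevice()) {
        return $this->cloudDeviceService;
    }

    // Everything else is still on a traditional storage node.
    return $this->storageNodeService;
}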

We created an interface from the existing storage node service to allow us to simply drop in this factory wherever we needed to differentiate between a storage node and a cloud SIRIS. This also enforced a contractual agreement when creating restores in either environment.

Now that we had these separate services, we were able to create secure communications between Cloud API and either Node API or SIRIS API, and keep the logic for communicating with each one separate from the other.

So far this process had actually been relatively simple; we had not yet encountered any true complexity. However, given the relatively small scale of Cloud Continuity at the time, we thought this was a perfect opportunity to answer “can we convert a traditional storage node to a cloud SIRIS?”

Making “everything’s a SIRIS”

So what does it actually take to convert one software stack into a completely different software stack? We had to figure this out before we could even develop a plan for these conversions. Traditional storage nodes and SIRISes have completely different software stacks, API surfaces, and platform requirements, and they differ in many small ways, even down to the names of the zpools. Even the web servers running on each are different! The only things they had in common were that both used ZFS to manage snapshots and both used the same hypervisor stack (KVM + QEMU + libvirt).

(items in red are what changed between upgrades)

On traditional storage nodes, the datasets are grouped by a common identifier. This is not the case with SIRISes: they are not multi-tenant, they just have many assets. This was one constraint we had to work within: we did not want to change the dataset paths on the SIRIS. Changing the dataset path for our cloud SIRISes would have meant changing code for all SIRISes, our customers’ SIRISes included. Renaming ZFS datasets that have a large number of snapshots also turns out to be an extremely expensive operation, so we couldn’t just rename the top level dataset and hope it worked.

This turned out to be one of the trickier problems we faced, but we solved it with grace when converting the existing Cloud Continuity storage nodes into cloud SIRISes. One of the problems with Cloud Continuity that made this especially tricky was that we didn’t have device identifiers, since the asset was being backed up directly from the protected machine to the storage node, not through a SIRIS, which would have a device identifier. Since we had the idea for Cloud Continuity storage nodes to eventually be the first cloud SIRISes, we had already created the top level dataset we needed.

Our approach for Cloud Continuity storage nodes was simple:

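#!/bin/bash
# Clone the last snapshot of an agent's dataset into its SIRIS-style dataset.

if [ -z "$1" ]; then
    echo "usage: $0 <uuid>" >&2
    exit 1
fi

# Trace each command and stop on the first failure.
set -xe
uuid="$1"
originDataset=...

# Last snapshot listed for the origin dataset (empty if there are none).
origin=$(zfs list -Hrt snapshot -o name "$originDataset" | tail -n 1)

if [ -z "$origin" ]; then
    echo "NO SNAPSHOTS FOUND"
    # do not create dataset
else
    echo "SNAPSHOT FOUND, USING $origin"

    targetDataset=...
    # A clone starts from the snapshot's contents, so no new full backup is needed.
    zfs clone "$origin" "$targetDataset"
fi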

We determined whether a Cloud Continuity agent had a snapshot. If it didn’t, we ignored it; otherwise, we cloned the last snapshot for that dataset into the new dataset. This allowed customers to continue taking backups without needing a new full backup. Since Cloud Continuity was still in beta during this process, we were able to pull this off without any disruption to paying customers.

Making “everything’s an actual SIRIS”

So now we’ve defined a process for converting datasets, but we still need to convert the entire software stack to a SIRIS. This wasn’t particularly hard to accomplish since we could simply plug in a USB stick that contained the Datto Imager and re-image the box as a SIRIS, but we weren’t about to have the team travel hundreds of miles to the datacenter and do this. Thankfully our Cloud Engineering team has hardware standards for everything and each storage node is accessible via IPMI. We were able to use our existing Datto imaging process to convert these servers in place. The imager itself has a lot of different configuration options, one of which is to only image the OS and not recreate the zpool. Recreating the zpool in this situation would have ended up with Datto losing hundreds of terabytes of data.

Now that these boxes were running SIRIS software, we were finally able to convert the datasets from their old name to the SIRIS-accepted format. For each dataset on the box we ran the script above and were able to see people taking new backups very quickly.
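
Running the script for every dataset on a box could be driven by a small loop like the one below. This is only a sketch: the article does not say how the UUIDs were gathered or where the script lived, so `convert_all`, the UUID list on stdin, and the command passed as the first argument are all hypothetical.

```shell
# Run a conversion command once per UUID read from stdin.
# $1 is the command to run for each UUID, for example the path to the
# clone script above (not given in the article, so hypothetical here).
convert_all() {
    convert_cmd="$1"
    failed=0
    while IFS= read -r uuid; do
        [ -n "$uuid" ] || continue              # skip blank lines
        # Redirect stdin so the command cannot swallow the UUID list.
        if "$convert_cmd" "$uuid" </dev/null; then
            echo "converted $uuid"
        else
            echo "FAILED $uuid" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every conversion succeeded.
    [ "$failed" -eq 0 ]
}
```

A call might look like `convert_all ./convert-dataset.sh < uuids.txt`, where both file names are placeholders. Continuing past a failure and reporting it at the end is a design choice made for this sketch, so that one bad dataset does not block the rest of the node.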

An evolving process

I left out a few details about reconfiguring the networking, but the process to convert a storage node into a cloud SIRIS was roughly 40 detailed steps, including some “time killer” steps between long-running steps, such as movie facts, and a step for making sure no one rebooted the box (this happened once, so we made it an official step). I printed out a checklist for myself and everyone else involved, and got us some very nice fine-tipped pens, so we could have the satisfaction of physically checking each box as we completed each step. We did this for every storage node we converted. Overall the process took between 30 and 60 minutes per node, depending on how many agents were already on the machine. We spread the upgrades over a few weeks so as not to drain ourselves, and to ensure we could monitor the newly converted cloud SIRISes and their agents being backed up.

Today the process is much simpler, since we can just image the hardware immediately as a SIRIS and configure it as a cloud SIRIS. The great part of having these cloud SIRISes be real SIRISes is that we get to use Image Based Upgrades (IBU). So when new features become available for your local SIRIS, the cloud SIRIS gets them too. IBU in the cloud allows us to develop features quickly without having to coordinate two engineering teams to make a feature available for both the SIRIS and the cloud. It also makes supporting cloud restores easier for our Technical Support staff: they only need to know how to support a SIRIS to effectively troubleshoot an issue on a cloud SIRIS.

Overall, putting SIRIS in our cloud has taught us that process is of the utmost importance. By defining each step and documenting every wrong turn we made, we were able to converge on an extremely detailed list of steps that ultimately led to swapping the rug out from under our customers’ feet without them knowing it even happened.

About the Author

Marcus Recck

I make the cloud more Datto.
