The Public Key Infrastructure (PKI) is one of the most critical, and often least understood, components of the Internet. This infrastructure enables you, me, and everyone else to conduct business online securely, with full confidence that we are communicating with the correct entity. This is achieved through many different technologies and systems, but the area we’d like to discuss today is public key certificates. In this post we will discuss how this technology is used within Datto to establish trust between our fleet of Datto devices and the Datto applications that run on protected systems. We will also explore the software and systems we use to seamlessly transition our entire fleet of devices and applications to a new set of public key certificates as those certificates expire, with zero noticeable impact to our customers!
To start, let’s talk a little bit about how PKI is used to protect Internet traffic in the most basic sense. When you access a secure website, you can be sure that you are communicating with the exact entity you intend to communicate with because your web browser does a few things before it allows you to visit that website. When you first navigate to a secure website, your web browser receives a public key certificate from the web server hosting that site. This certificate includes identifying information about the server, such as the domains that can be accessed via that server. The certificate also includes a cryptographic signature created with the private key of a known and trusted entity referred to as a certificate authority (CA). It is the responsibility of the certificate authority to ensure that the requester actually controls the domains referenced within the certificate before it adds its signature. This validation takes place when an entity obtains a certificate from a CA. There are many of these certificate authorities, and web browsers and operating systems package the public keys associated with these entities into lists of certificates called trusted root stores. Trusted root stores are one of the many reasons to keep your operating system and web browsers up to date. These stores are updated as certificate authorities issue new root certificates because old certificates have expired or have been compromised by malicious entities.
When your browser receives the public key certificate from the server, it validates that the certificate is signed by one of the trusted certificates in its trusted root store (this could be the operating system’s trusted root store or a proprietary browser-based trusted root store). It then validates that the domain or IP address you used to access this server is listed in the certificate provided. Because the certificate is cryptographically signed by a trusted third party (the certificate authority), it is virtually impossible for the list of acceptable domains to be forged. As long as your browser has validated the certificate’s signature and you have not accepted any security exceptions for the website, you can be sure that the web server you connected to is operated by the entity that controls the domain you used to access the site. Note that we have simplified this explanation a little. Intermediate certificates usually come into play during the certificate signature validation process, but a discussion of chains of trust is out of scope for this article.
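The two checks described above (a signature that leads back to a trusted root, and a hostname that appears in the certificate) can be sketched with Python’s standard `ssl` module. This is only an illustration of the general mechanism, not anything Datto-specific; the function names `build_verifying_context` and `fetch_peer_certificate` are made up for this example.

```python
import socket
import ssl


def build_verifying_context() -> ssl.SSLContext:
    """Return a client context that enforces both checks described above."""
    # create_default_context() loads the platform's trusted root store,
    # requires a valid certificate chain, and turns on hostname checking.
    return ssl.create_default_context()


def fetch_peer_certificate(hostname: str, port: int = 443) -> dict:
    """Connect to hostname and return its validated certificate as a dict.

    Raises ssl.SSLCertVerificationError if the chain does not lead to a
    trusted root or if hostname is not listed in the certificate.
    """
    context = build_verifying_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        # server_hostname is the name that gets matched against the certificate.
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()
```

If either check fails, the handshake is aborted before any application data is exchanged, which is the programmatic equivalent of the browser warning page.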
Common certificate authorities include companies like DigiCert, Verisign, or Let’s Encrypt. However, these entities are only allowed to issue certificates for public IP addresses and domains for which companies and entities can prove ownership. So how can a company like Datto ensure safe communication between devices and software with private IP addresses living within private networks? Before we answer that question, let’s talk a little about the Datto ecosystem which allows us to protect the data on end user systems.
Our customers use Datto technology to protect the data on their critical infrastructure by installing a piece of software called an agent on these end user systems. As can be seen in the diagram below, these systems are contained within the customer’s network, which is entirely outside of Datto’s control. They can be, and often are, behind firewalls or NAT servers and do not necessarily have public IP addresses. To allow the quick transfer of data between customer systems and the Datto backup appliance, the backup appliance is also deployed within that same network. To ensure that an agent can safely send its data to our device, we maintain our own private certificate authority, which manages a trusted root certificate, and use a system of public key certificates to establish trust between our devices and the agent software running on protected systems. The trusted root certificate is used to sign both client and server certificates, which are then used to secure the data flowing between our appliances and end user devices. Because the trusted root certificate is used to validate communication between our appliance and agent applications, it must be distributed to all of these devices and updated any time it is compromised or expires. More on that later!
This certificate authority lives in our Datto Cloud within an application that has a public IP address and public domain. Its server certificate is signed by a public certificate authority, which means that we can use the public key infrastructure to ensure safe communication between our deployed agents and backup appliances and this private certificate authority. When our agents and backup appliances are initially provisioned, they receive their own public key certificate and a trusted root certificate from this Datto private certificate authority, and these certificates can then be used to authenticate the communication between all of our systems.
Consistent with industry best practices, the trusted root certificate that we initially created was given a five-year validity period, which meant that it would expire in early May of 2020. Roughly a year prior to that expiration we began designing a system that would allow us to transition our entire fleet to an updated certificate. Since Datto is responsible for protecting hundreds of thousands of end-user systems from disaster, our system for rolling out new certificates had to be implemented in a way that caused no interruption to our customers’ backups. With this strict requirement in mind, and a deadline that we could not move, the entire operation had to be orchestrated precisely.
The first thing we realized during our investigation into the scope of this problem was that the agent application, which runs on the end user systems that we protect, would have to be updated. Our initial implementation of this agent would contact our private certificate authority to receive the trusted root certificate upon startup of the service if, and only if, it did not already have one stored in its filesystem. This was great, but now we needed all of those systems which had already received and saved a trusted root certificate to update to a new one.
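To make the limitation concrete, here is a minimal sketch of the “fetch only if nothing is stored” startup logic, next to one possible replacement. The original behavior follows the description above; the updated behavior is an assumption for illustration, since the exact change made to the agent is not spelled out here. Both function names and the `fetch` callback (a stand-in for the call to the private certificate authority) are hypothetical.

```python
from pathlib import Path
from typing import Callable


def load_root_original(path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Original startup behavior: fetch only when no root is stored yet."""
    if not path.exists():
        path.write_bytes(fetch())
    # A root that is already on disk is never replaced.
    return path.read_bytes()


def load_root_updated(path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Assumed fix: replace the stored root when the CA offers a different one."""
    latest = fetch()
    if not path.exists() or path.read_bytes() != latest:
        path.write_bytes(latest)
    return path.read_bytes()
```

With the original logic, an agent that stored the first root at install time keeps returning it no matter what the certificate authority currently serves, which is why the agent itself had to change.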
Next, we reviewed how our backup appliances handled expiration of our trusted root certificate. Our initial implementation checked every 10 minutes for expiration of its certificates, but would only download a new one if the current date was within a month of expiration. That was a great start, but we have hundreds of thousands of protected systems to update and not a lot of time in which to update them. Further, what would happen if all of our devices suddenly started hammering our certificate authority every 10 minutes once we were within a month of expiration? Could it hold up? Maybe, but we were not willing to take that risk. Our solution was twofold.
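The original appliance rule can be expressed as a small predicate. This is a sketch of the rule as described, not the actual appliance code; “within a month” is approximated here as 30 days, and the names `RENEWAL_WINDOW` and `should_download_new_certificate` are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "within a month of expiration" is treated as 30 days.
RENEWAL_WINDOW = timedelta(days=30)


def should_download_new_certificate(not_after: datetime, now: datetime) -> bool:
    """Return True once the current time is inside the renewal window."""
    return now >= not_after - RENEWAL_WINDOW


if __name__ == "__main__":
    # Illustrative expiry date only; the real certificate's exact date is not given.
    expiry = datetime(2020, 5, 1, tzinfo=timezone.utc)
    print(should_download_new_certificate(expiry, datetime.now(timezone.utc)))
```

Because every appliance evaluates the same rule against the same expiration date, every appliance crosses the threshold at the same moment, which is the source of the load concern raised above.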
Datto appliances have always implemented a check-in system which allows them to phone home to Datto’s central servers every 10 minutes. We updated that mechanism to check whether the trusted root certificate on our private certificate authority had changed and, if it had, to have the device run a command to download the new certificate. This command would wait a random length of time before actually downloading the new certificate, so that the load from devices hitting our certificate authority would be spread fairly evenly over time. This would allow us to push out a new trusted root certificate to our appliances without DDoSing ourselves.
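A check-in handler along these lines might look like the sketch below. It assumes the check-in response carries some identifier for the current root (called a fingerprint here) and that the random wait is drawn uniformly from a fixed window; neither detail is stated above, and the function name and parameters are hypothetical.

```python
import random
from typing import Optional


def plan_root_download(
    local_fingerprint: str,
    advertised_fingerprint: str,
    max_delay_seconds: float,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return how long to wait before downloading, or None if up to date.

    local_fingerprint identifies the root the device currently holds;
    advertised_fingerprint is what the check-in response reports.
    """
    if local_fingerprint == advertised_fingerprint:
        return None  # Nothing changed, so no download is needed.
    rng = rng or random.Random()
    # A uniformly random delay spreads the fleet's downloads across the window.
    return rng.uniform(0, max_delay_seconds)
```

Spreading N downloads uniformly over a window of W seconds yields an average of N / W requests per second, instead of N requests arriving in the same check-in cycle.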
Finally, we realized that even with the two changes described above, backups would still fail if the rollout of these certificates was not sequenced properly. If a backup agent were to receive the new trusted root certificate from our CA before the agent was updated, the backup appliance would reject any communications with the agent, and vice versa. When you’re responsible for the business continuity of hundreds of thousands of end users in the event of ransomware or other disasters, failed backups are not an option. Our solution was to introduce a fallback mechanism on our backup appliances which ensures that backups cannot fail because of this certificate rollout.
With this solution we modified the software running on our backup appliances to save all previous certificates received from our certificate authority as it downloaded new ones. We also modified the software to try all of these certificate sets in descending date order until one of the sets enabled successful communication with a given agent! As part of this, we ensured that the cadence at which the agent checked for new certificates was much slower than the cadence at which the backup device checked for new certificates, and we also released the new certificates to backup devices first. This ensured that our backup appliances would receive the new certificates first, that they would be able to utilize this new fallback mechanism in their communications with the agents, and that backups would not fail due to the certificate transition.
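The fallback loop can be sketched in a few lines. This is a simplified model under stated assumptions: `CertificateSet` is a hypothetical container standing in for whatever certificate material the appliance actually stores, and `try_connect` stands in for a real handshake attempt against the agent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class CertificateSet:
    issued_at: datetime  # When this set was received from the CA.
    label: str           # Placeholder for the real certificate material.


def connect_with_fallback(
    sets: Iterable[CertificateSet],
    try_connect: Callable[[CertificateSet], bool],
) -> Optional[CertificateSet]:
    """Try certificate sets newest first; return the first one that works."""
    for cert_set in sorted(sets, key=lambda s: s.issued_at, reverse=True):
        if try_connect(cert_set):
            return cert_set
    return None  # No stored set was accepted by the agent.
```

Trying the newest set first means an updated agent is reached on the first attempt, while an agent that still holds the old root costs one extra attempt rather than a failed backup.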
In order to ensure a successful transition to the updated certificate, it was critical that we monitor the rollout as it was proceeding. We accomplished this by having our backup devices report back to our cloud a variety of information, including an identifier representing the exact trusted root certificate the backup appliance had most recently received from the CA. Our devices also included a list of agents on that system that were falling back to the old certificate. This allowed us to quickly identify situations such as agents which could not update for one reason or another which, in turn, allowed our exceptional support team to reach out to partners who were responsible for these systems.
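As a rough illustration of how such reports could be rolled up on the cloud side, the sketch below aggregates per-device reports into a rollout summary. The report fields mirror the two pieces of information mentioned above (a root certificate identifier and the list of agents on fallback); the class name, field names, and summary keys are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class DeviceReport:
    device_id: str
    root_id: str  # Identifier of the root most recently received from the CA.
    fallback_agents: List[str] = field(default_factory=list)


def summarize_rollout(reports: Iterable[DeviceReport], new_root_id: str) -> Dict:
    """Summarize which devices and agents still depend on an older root."""
    reports = list(reports)
    return {
        # How many devices hold each root certificate.
        "devices_by_root": dict(Counter(r.root_id for r in reports)),
        # Devices that have not yet picked up the new root.
        "devices_pending": sorted(
            r.device_id for r in reports if r.root_id != new_root_id
        ),
        # Agents still being reached via an older certificate set.
        "agents_on_fallback": {
            r.device_id: r.fallback_agents for r in reports if r.fallback_agents
        },
    }
```

A summary like this makes it straightforward to watch the pending and fallback lists shrink over time and to hand the stragglers to a support team.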
As early May approached, we closely monitored our analytics as the rollout of these certificates proceeded. Even though we had modified every single Datto backup appliance and every backup agent running on our customers’ servers, we measured no noticeable increase in either backup failures or support call volume. A risky operation that could have led to a disaster ended with little more than a shrug. And that is exactly the way we wanted it to happen. Go Datto!
The good news is that all of the work we accomplished over the last year will make the next transition even easier when our trusted root certificate expires once again!