We all know that the movement to cloud infrastructure and cloud hosted services has been growing rapidly for a long time. The pandemic has only accelerated that growth. Here at Datto, our SaaS Protection service protects SaaS offerings such as Microsoft M365 and Google Workspace (formerly G Suite). Oftentimes with rapid growth comes significant change and occasionally instability as well. Even for companies like Microsoft and Google, who operate at hyper scale, this is a really hard problem.
Unless you're using the ACME protocol with a certificate authority such as Let’s Encrypt, you're probably well aware of the annoyance of certificate rotation. Here at Datto, we use certificates in many places with a validity period of around a year, depending on the Certificate Authority. Last February, we noticed that several production hosts were providing expired certificates for one of our major Internet-facing domains - a mistake that many other companies suffer from, as well. This caused several problems, and it was decided that after the issues were addressed, we needed to take a very proactive stance in monitoring certificates for all of our TLS-enabled services. I will not dive into the details about why the certificates weren't properly rotated, but rather, what we're doing from now on so this sort of issue never occurs again.
We take data protection seriously at Datto, which is why we’ve been increasingly using mutual TLS authentication to secure communications between components in our application stack. Our use of Hashicorp Vault has accelerated this security pattern, as Vault makes it easy to deploy and manage multiple CAs. Recently, we saw an increase in TLS-related errors for one of our mutually-authenticated application endpoints. In this article, I’ll walk you through how we debugged and resolved this problem. I’ll also take you on a deep dive into reproducing this issue, and I’ll hopefully teach you some fun OpenSSL commands along the way.