We all know that the movement to cloud infrastructure and cloud-hosted services has been growing rapidly for a long time, and the pandemic has only accelerated that growth. Here at Datto, our SaaS Protection service protects SaaS offerings such as Microsoft 365 and Google Workspace (formerly G Suite). Rapid growth often brings significant change, and occasionally instability as well. Even for companies like Microsoft and Google, who operate at hyperscale, maintaining stability through that kind of growth is a genuinely hard problem.
MySQL is a mature technology. It’s been around for a quarter of a century and is one of the most popular database management systems in the world. As an engineer, one therefore expects basic features such as replication and failover to be fleshed out, stable, and ideally even easy to set up. And while MySQL does ship with replication functionality out of the box, automated failover and topology management are not part of its feature set. On top of that, it turns out to be rather difficult not to shoot yourself in the foot when configuring replication. This is a blog post about setting up lossless MySQL replication with automated failover: ensuring that not a single transaction is lost during a failover, and that failovers happen entirely without human intervention.
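To give a flavor of what "lossless" means in practice: it typically implies GTID-based, semi-synchronous replication where the source waits for a replica acknowledgment before telling the client the commit succeeded. A minimal sketch of the relevant settings, using MySQL 5.7 option names (values are illustrative, not a complete or production config):

```ini
# my.cnf on the source server -- illustrative sketch only
[mysqld]
# GTIDs let a failover tool determine exactly which transactions
# each replica has already applied.
gtid_mode                        = ON
enforce_gtid_consistency         = ON
log_bin                          = mysql-bin
log_slave_updates                = ON

# Semi-synchronous replication: COMMIT returns to the client only
# after at least one replica has received and synced the event.
# AFTER_SYNC is the "lossless" wait point introduced in MySQL 5.7.
plugin-load-add                  = semisync_master.so
rpl_semi_sync_master_enabled     = ON
rpl_semi_sync_master_wait_point  = AFTER_SYNC
# Semi-sync falls back to async after this timeout (ms); an
# effectively infinite timeout is what preserves the zero-loss
# guarantee at the cost of availability.
rpl_semi_sync_master_timeout     = 2147483647
```

The availability/durability trade-off lives in that last line: a short timeout keeps writes flowing when replicas lag, while a huge one refuses to acknowledge commits no replica has seen.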
Unless you're using the ACME protocol with a certificate authority such as Let’s Encrypt, you're probably well aware of the annoyance of certificate rotation. Here at Datto, we use certificates in many places, typically with a validity period of around a year, depending on the Certificate Authority. Last February, we noticed that several production hosts were serving expired certificates for one of our major Internet-facing domains - a mistake that many other companies suffer from as well. This caused several problems, and once the immediate issues were addressed, we decided to take a very proactive stance on monitoring certificates for all of our TLS-enabled services. I won't dive into the details of why the certificates weren't properly rotated, but rather into what we're doing from now on so this sort of issue never occurs again.
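The core of any such monitor is simple: connect, read the peer certificate's expiry, and alert when it's close. A minimal sketch using only Python's standard library (the function names and the 30-day threshold are my own choices, not Datto's actual tooling):

```python
import socket
import ssl
import time

def cert_days_left(not_after: str, now_ts: float) -> float:
    """Days between now_ts and a certificate's notAfter field, as
    returned by getpeercert(), e.g. 'Jan 01 00:00:00 2030 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - now_ts) / 86400

def check_host(host: str, port: int = 443, warn_days: float = 30) -> bool:
    """Fetch the peer certificate for host:port and return True if it
    is more than warn_days away from expiring."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    left = cert_days_left(cert["notAfter"], time.time())
    print(f"{host}: certificate expires in {left:.1f} days")
    return left > warn_days
```

Run `check_host("example.com")` from cron and page on a False return, and you have the skeleton of an expiry monitor; a real one would also cover internal endpoints, client certs, and chains.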
Over the past year or so, I’ve been focused on fuzzing research and the different areas where I could apply the techniques and tools I’ve come across or created. Feeling a bit burnt out, I decided to take a break and went back to web pentesting. While hunting for web vulnerabilities, I focused heavily on XXE (XML External Entity) injection as an attack vector. To understand how PHP 7 mitigates this class of vulnerability, I looked at the SOAPClient library, which parses the XML data returned from a SOAP server. After some trial and error, I identified a null-pointer dereference bug in the PHP SOAP library that resulted in CVE-2021-21702.
"What do you want?" A question I had to find the answer to in order to preserve my sanity and my career.
We take data protection seriously at Datto, which is why we’ve been increasingly using mutual TLS authentication to secure communications between components in our application stack. Our use of HashiCorp Vault has accelerated this security pattern, as Vault makes it easy to deploy and manage multiple CAs. Recently, we saw an increase in TLS-related errors for one of our mutually-authenticated application endpoints. In this article, I’ll walk you through how we debugged and resolved this problem. I’ll also take you on a deep dive into reproducing this issue, and I’ll hopefully teach you some fun OpenSSL commands along the way.
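For readers new to the pattern: mutual TLS just means the client verifies the server against a trusted CA *and* presents its own certificate for the server to verify. A minimal client-side sketch in Python's stdlib `ssl` (the function name and file paths are placeholders, not our actual stack):

```python
import ssl

def mtls_client_context(ca_file=None, cert_file=None, key_file=None):
    """Client-side SSLContext for a mutually-authenticated endpoint:
    verify the server against a (private) CA, and present a client
    certificate so the server can verify us in turn."""
    # PROTOCOL_TLS_CLIENT enables CERT_REQUIRED and hostname
    # checking by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        # Trust anchor: the internal CA that signed the server cert.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Our client identity, sent during the handshake.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

When debugging, `openssl s_client -connect host:443 -CAfile ca.pem -cert client.pem -key client.key` exercises the same handshake from the command line, which is where those fun OpenSSL commands come in.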
Recently, the Datto SaaS Protection SRE team was challenged to add authentication to an open source web application that didn’t come with a strong authentication story. We knew that we didn’t want to write an entire authentication layer just for this one application, as the return on that time investment would be rather low. Instead, we looked for a solution that would be easy to implement, easy to automate, and easy to understand months down the road, long after the shine had worn off and it had become just another application we managed.
We’ve all had a hard drive fail on us, and often it’s as sudden as booting your machine and realizing you can’t access a bunch of your files. It’s not a fun experience. It’s especially not fun when you have an entire data center full of drives that are all important to keeping your business running. What if we could predict when one of those drives would fail, and get ahead of it by preemptively replacing the hardware before the data is lost? This is where the history of predictive drive failure at Datto begins.
Upgrading thousands of servers is challenging and filled with uncertainty. This article describes how we leveraged Ansible to build automation that increases confidence in our upgrade process.
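The pattern at the heart of such automation is a rolling upgrade: touch a small slice of the fleet, verify health, then continue. A minimal Ansible sketch of that shape (the inventory group, drain script, and health endpoint are hypothetical, not Datto's actual playbook):

```yaml
# rolling-upgrade.yml -- illustrative sketch only
- hosts: app_servers              # hypothetical inventory group
  serial: "10%"                   # upgrade 10% of the fleet at a time
  max_fail_percentage: 0          # halt the rollout on any failure
  tasks:
    - name: Drain the node from the load balancer
      ansible.builtin.command: /usr/local/bin/drain-node   # hypothetical helper
    - name: Upgrade packages
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
    - name: Verify the service came back healthy
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"  # hypothetical endpoint
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
```

`serial` plus `max_fail_percentage` is what converts "upgrade everything and hope" into a rollout that stops itself before a bad change reaches the whole fleet.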
Rebooting Ubuntu is hard. I don’t really know why, but in my twelve years as an Ubuntu user, I’ve encountered countless “stuck at reboot” scenarios. Somehow, typing reboot always comes with that extra special feeling of uncertainty and the thrill of danger. This post describes the short story of how we managed to make Ubuntu machines reliably reboot using Linux watchdogs.
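For context on the mechanism: a watchdog is a timer (hardware or kernel-emulated) that resets the machine unless software keeps "petting" it, so a hung shutdown can no longer wedge a box forever. systemd can drive this directly; a sketch of the relevant settings (option names per systemd 243+, values illustrative):

```ini
# /etc/systemd/system.conf -- illustrative values
[Manager]
# PID 1 keeps petting the hardware watchdog; if systemd itself
# hangs, the hardware resets the machine after 30 seconds.
RuntimeWatchdogSec=30s
# If a reboot hangs (the classic "stuck at reboot"), the watchdog
# fires and forces the reset after 10 minutes.
RebootWatchdogSec=10min
```

On systemd versions older than 243, `RebootWatchdogSec=` was called `ShutdownWatchdogSec=`.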