Datto Engineering Blog

A peek under the covers of Datto SaaS Protection

We all know that the movement to cloud infrastructure and cloud hosted services has been growing rapidly for a long time. The pandemic has only accelerated that growth. Here at Datto, our SaaS Protection service protects SaaS offerings such as Microsoft M365 and Google Workspace (formerly G Suite). Oftentimes with rapid growth comes significant change and occasionally instability as well. Even for companies like Microsoft and Google, who operate at hyper scale, this is a really hard problem.

Lossless MySQL semi-sync replication and automated failover

MySQL is a really mature technology. It’s been around for a quarter of a century and it’s one of the most popular DBMS in the world. As such, as an engineer, one expects basic features such as replication and failover to be fleshed out, stable and ideally even easy to set up. And while MySQL comes with replication functionality out of the box, automated failover and topology management is not part of its feature set. On top of that, it turns out that it is rather difficult to not shoot yourself in the foot when configuring replication. This is a blog post about setting up lossless MySQL replication with automated failover, i.e. ensuring that not a single transaction is lost during a failover, and that failovers happen entirely without human intervention.

Proactively Monitoring TLS Certificate Deployment

Unless you're using the ACME protocol with a certificate authority such as Let’s Encrypt, you're probably well aware of the annoyance of certificate rotation. Here at Datto, we use certificates in many places with a validity period of around a year, depending on the Certificate Authority. Last February, we noticed that several production hosts were providing expired certificates for one of our major Internet-facing domains - a mistake that many other companies suffer from, as well. This caused several problems, and it was decided that after the issues were addressed, we needed to take a very proactive stance in monitoring certificates for all of our TLS-enabled services. I will not dive into the details about why the certificates weren't properly rotated, but rather, what we're doing from now on so this sort of issue never occurs again.

How I stumbled upon CVE-2021-21702 in PHP’s SOAP extension

Over the past year or so, I’ve really been focused on fuzzing research and the different areas I could apply the techniques and tools I’ve come across/created. During this time, I decided to take a break mainly due to feeling burnt out and went back into web pentesting. While looking for some classes of web vulnerabilities, I focused heavily on XXE (XML External Entity) injection as an attack vector. In order to understand how PHP7 mitigates this class of vulnerability, I looked at the SOAPClient library for parsing returned XML data from a SOAP server. After some trial and error, I was able to identify a null dereference bug in the PHP SOAP library that resulted in CVE-2021-21702.

Engineer to Manager, There and Back Again

"What do you want?" A question I had to find the answer to in order to preserve my sanity and my career.

Troubleshooting TLS Session Re-Use and Mutual Authentication in HAProxy

We take data protection seriously at Datto, which is why we’ve been increasingly using mutual TLS authentication to secure communications between components in our application stack. Our use of Hashicorp Vault has accelerated this security pattern, as Vault makes it easy to deploy and manage multiple CAs. Recently, we saw an increase in TLS-related errors for one of our mutually-authenticated application endpoints. In this article, I’ll walk you through how we debugged and resolved this problem. I’ll also take you on a deep dive into reproducing this issue, and I’ll hopefully teach you some fun OpenSSL commands along the way.

Integrating Okta authentication into Nginx reverse proxies

Recently the Datto SaaS Protection SRE team was met with a challenge to add authentication onto an open source web application that didn’t have a strong authentication story with it. We knew that we didn’t want to write an entire authentication layer just for this one application as the return on that time investment would be rather low. Instead we looked for a solution that would be easy to implement, easy to automate, and easy to understand months down the road long after the shine had worn off and was just another application we managed.

Predicting Hard Drive Failure with Machine Learning

We’ve all had a hard drive fail on us, and often it’s as sudden as booting your machine and realizing you can’t access a bunch of your files. It’s not a fun experience. It’s especially not fun when you have an entire data center full of drives that are all important to keeping your business running. What if we could predict when one of those drives would fail, and get ahead of it by preemptively replacing the hardware before the data is lost? This is where the history of predictive drive failure at Datto begins.

Automated OS Qualification with Ansible

Upgrading thousands of servers is challenging and filled with uncertainty. This article describes how we leveraged Ansible to build automation that increases confidence in our upgrade process.

Reliably rebooting Ubuntu using watchdogs

Rebooting Ubuntu is hard. I don’t really know why, but in my twelve years as an Ubuntu user, I’ve encountered countless “stuck at reboot” scenarios. Somehow, typing reboot always comes with that extra special feeling of uncertainty and the thrill of danger. This post describes the short story of how we managed to make Ubuntu machines reliably reboot using Linux watchdogs.