Evolving Seat Management 2 for Microsoft 365
When I joined Datto, my team was close to code complete on an upgrade to our Seat Management experience for Microsoft 365 clients. This was an exciting update that included an improved UI experience with detailed seat status information and the ability to easily and granularly view and adjust which seats are backed up by SaaS Protection.
Behind the scenes, seat management changes included upgrading our technology stack and moving away from deprecated service APIs to the new Microsoft Graph API.
However, as with any large-scale software release, we learned valuable lessons along the way, and as a result we continually adjust and improve the robustness of our product. Our most recent improvement dealt with how we interact with Microsoft APIs. While we're declaring our recent endeavors a success, we're still learning more lessons and growing.
How Does Seat Management Work?
At the end of the day, Seat Management is the system responsible for ensuring that Datto SaaS Protection knows about all of the remote services attached to seats for the tenants that we back up. This means discovering newly created seats, identifying changes to seats, and knowing when seats have been remotely deleted and thus should be archived in SaaS Protection (or when previously deleted seats have been recovered and are now active again).
We differentiate seats from services, because this allows us to associate multiple services to the same seat. An example of this association can be found in User seats, where a Microsoft User seat can have an Exchange and/or a OneDrive service associated with it. Protecting a User seat with both Exchange and OneDrive services will protect both services attached to this seat.
The following diagram shows our seat and service model:
There are four valid states for seats that we’ve discovered:
- Unprotected - This is a seat that we’ve discovered, but it’s not backed up. There are various reasons that a seat may be unprotected in SaaS Protection. For example, if auto-add is not enabled, newly discovered seats will not be automatically protected.
- Active - This is a seat that we’ve discovered, and we are actively backing up.
- Paused - This is a seat that we’ve discovered and backed up in the past, but the customer has decided they want to pause backups to this seat. This allows us to stop taking new backups for this seat, but we keep the previous backups for the seat. Unpausing this seat will resume backups for the seat.
- Archived - This is a seat that we’ve discovered and backed up in the past, but the seat no longer exists in the remote tenant so we can no longer take new backups.
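In code, the seat/service model and these four states might be sketched roughly as follows (the type and field names are illustrative, not our actual schema):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class SeatState(Enum):
    UNPROTECTED = "unprotected"  # discovered, but not backed up
    ACTIVE = "active"            # discovered and actively backed up
    PAUSED = "paused"            # backups paused by the customer; history kept
    ARCHIVED = "archived"        # deleted in the remote tenant; no new backups

@dataclass
class Service:
    kind: str  # e.g. "exchange" or "onedrive" for a User seat

@dataclass
class Seat:
    seat_id: str
    state: SeatState
    services: List[Service] = field(default_factory=list)

# A User seat with both Exchange and OneDrive services attached;
# protecting the seat protects both services.
user_seat = Seat("user-123", SeatState.ACTIVE,
                 [Service("exchange"), Service("onedrive")])
```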
Every night we kick off a seat discovery process for each of our tenants. This process entails the following flow:
- Discovery: We pull a complete list of seats on a tenant using the Microsoft Graph API, to give us a point in time view of all the seats across the tenant for Users, Shared Mailboxes, Sites/TeamSites, and Teams.
- Compute Seat Changes: We pull a complete list of seats from the same tenant from our database and compare the two lists to find newly created, changed, or remotely deleted seats.
- Process changes: For each change, we perform the following checks:
- For customers who have auto-add enabled, newly discovered seats result in a seat change to protect the seat. Before protecting the seat, we check to ensure that protecting this seat wouldn’t violate any license caps in place for the customer; if the license cap would be exceeded we don’t protect the seat but send an email notification to the customer.
- We store the updated seat change information locally - this includes updating our database and triggering backups for new seats and services.
- We update a search index so this seat and its services now appear in our UI for display and search.
- All seat changes are audit logged for our records.
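The nightly flow above can be sketched in simplified Python (the function names, the version comparison, and the action tuples are illustrative stand-ins for our real pipeline, which also triggers backups, updates the search index, and audit-logs each change):

```python
def compute_seat_changes(remote, local):
    """Compare a point-in-time remote view (seat_id -> version) against
    the seats stored in our database."""
    new = remote.keys() - local.keys()
    deleted = local.keys() - remote.keys()
    changed = {s for s in remote.keys() & local.keys() if remote[s] != local[s]}
    return new, changed, deleted

def process_changes(new, changed, deleted, auto_add, protected_count, license_cap):
    actions = []
    for seat in sorted(new):
        if auto_add and protected_count < license_cap:
            actions.append(("protect", seat))  # store, back up, index, audit-log
            protected_count += 1
        elif auto_add:
            # License cap would be exceeded: don't protect, notify the customer.
            actions.append(("notify_cap_exceeded", seat))
        else:
            actions.append(("store_unprotected", seat))
    for seat in sorted(changed):
        actions.append(("update", seat))
    for seat in sorted(deleted):
        actions.append(("archive", seat))  # remotely deleted -> archived
    return actions

remote = {"a": 1, "b": 2, "c": 3}  # pulled via the Microsoft Graph API
local = {"b": 2, "c": 9, "d": 4}   # pulled from our database
new, changed, deleted = compute_seat_changes(remote, local)
actions = process_changes(new, changed, deleted,
                          auto_add=True, protected_count=99, license_cap=100)
```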
Some pitfalls in how we handled Microsoft calls
One of the huge parts of Seat Management is the client code that accesses the Microsoft Graph API as our source of truth. When we designed the Microsoft interaction portion of Seat Management, we made some assumptions:
- There would be a manageable number of distinct 400-level HTTP responses from Microsoft’s API that we could easily classify and bucket into a handful of categories. For example, 404 errors indicate that a seat no longer exists.
- Unique 400-level HTTP responses from Microsoft’s API outside of these known buckets would be few and far between, and indeed exceptional cases - we took a hard line and failed the entire nightly seat discovery run when these happened so we would immediately be notified about the failures and we could resolve the errors going forward.
- It would be easy for us to investigate and appropriately handle these 400-level errors with a quick turnaround time. The end result would be a comprehensive API client that handled any output from Microsoft in the appropriate manner.
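As a sketch, the bucketing we originally assumed would suffice might look like this (the bucket names are ours, and the status/error-code pairs shown are purely illustrative):

```python
from typing import Optional

# Known (status, Graph "code" field) pairs mapped to our internal buckets.
# A None code means "any error code with this status".
KNOWN_BUCKETS = {
    (404, None): "seat_deleted",           # seat no longer exists remotely
    (429, None): "throttled",
    (403, "accessDenied"): "permission_issue",
}

def classify(status: int, graph_code: Optional[str] = None) -> str:
    bucket = (KNOWN_BUCKETS.get((status, graph_code))
              or KNOWN_BUCKETS.get((status, None)))
    # Under our original hard-line strategy, any "unclassified" 4xx
    # failed the entire nightly seat discovery run for the tenant.
    return bucket or "unclassified"
```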
This turned out to be short-sighted for a few reasons.
First, there are a lot of different potential 400-level errors that can be returned from Microsoft, and some error codes and messages aren’t explicitly or widely documented. Since we have a relationship with Microsoft, our typical course of action would be to open support tickets to inquire about these errors. This would result in Datto engineers having dialogues with Microsoft support engineers, and sometimes even developers, to identify the root cause of these errors. Once we found the root cause, the resolution would vary: sometimes we uncovered bugs that required a quick bugfix/hotfix on Microsoft’s side, other times we worked with Microsoft to come up with alternate paths to get the data we wanted, and sometimes we’d uncover fixes that were more involved and had to be put on the product roadmap - and we’d have to implement workarounds while we waited for the issues to be addressed at that level.
Regardless of the resolution, the turnaround time and mean time to resolve the issue as it related to our seat discovery process was significantly higher and more customer-impactful than we wanted. Furthermore, our team would sometimes hotfix workarounds for specific errors only to run into yet more, different errors as soon as the fix was in place! This started what effectively seemed to be a long game of whack-a-mole.
Second, while Microsoft does not change their API signatures, the specific text of an error message may change, for example to become more specific. Our solution of classifying based on error messages was too fragile, and this type of messaging change would sometimes break our code, even for known errors.
Third and finally, our hard-line strategy of failing seat discovery at the first non-classified error meant that we’d fail the complete seat discovery process for that tenant. If you’ll recall, seat discovery involves comparing the full list of seats we pull from Microsoft against our own records. If we can’t pull the list in its entirety, seat discovery fails - not only do we have no way of knowing how many seats we weren’t able to discover, we never get to the point of computing seat changes - so no new seats were stored in our system, increasing how long it took us to discover and back up newly added seats.
An idea and a fix! … with some fallout
In an effort to make our code as robust as possible, the appropriate fix was to turn our client-side error handling on its head. It didn’t make sense to take a hard line and fail on every never-before-seen 400-level error code; it made a lot more sense to extensively log new errors, move on and process the next seat, and so on until we successfully completed discovery of all the seats that the API made it possible to discover. An important aspect of this change is logging - when we see these errors, we still want engineers to investigate and work with Microsoft support to resolve them, and we need enough information to reliably follow up. However, we don’t want new errors to stop all progress, so continuing to discover all the seats we’re able to makes sense.
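A minimal sketch of this log-and-continue approach (`fetch_seat` and the exception handling are stand-ins for our real Graph client, not its actual interface):

```python
import logging

logger = logging.getLogger("seat_discovery")

def discover_all_seats(seat_ids, fetch_seat):
    """Fetch every seat we can; log never-before-seen failures with enough
    context to follow up, and keep going instead of failing the whole run."""
    discovered, failed = [], []
    for seat_id in seat_ids:
        try:
            discovered.append(fetch_seat(seat_id))
        except Exception as exc:
            # Enough detail for engineers to investigate with Microsoft support.
            logger.warning("seat %s failed discovery: %r", seat_id, exc)
            failed.append(seat_id)
    return discovered, failed

# Simulated client: one seat hits an undocumented 400-level error.
def fake_fetch(seat_id):
    if seat_id == "s2":
        raise RuntimeError("HTTP 400: undocumented error code")
    return {"id": seat_id}

discovered, failed = discover_all_seats(["s1", "s2", "s3"], fake_fetch)
```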
This isn’t perfect - we still fail seat discovery when we encounter new 500-level errors, as these typically indicate more serious problems on the server side. There are also certain 400-level errors that can cause our seat discovery, as it stands today, to fail - the most common is 429 (Too Many Requests). When we get this error, it means that Microsoft’s API is handling too many requests and we need to back off; if we continue to see consecutive throttling errors even after backing off, we fail seat discovery and try again another time.
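The throttling behavior can be sketched like this (the threshold and the `Throttled` type are illustrative; the real client would take the wait time from Microsoft's `Retry-After` response header):

```python
import time

MAX_CONSECUTIVE_THROTTLES = 3  # illustrative threshold, not our actual value

class Throttled(Exception):
    """Stand-in for a 429 (Too Many Requests) response."""
    def __init__(self, retry_after: float):
        self.retry_after = retry_after  # seconds to wait before retrying

def call_with_backoff(request, sleep=time.sleep):
    """Back off on 429s; after too many consecutive throttling errors,
    give up (failing seat discovery, to be retried another night)."""
    throttles = 0
    while True:
        try:
            return request()
        except Throttled as exc:
            throttles += 1
            if throttles >= MAX_CONSECUTIVE_THROTTLES:
                raise  # fail this tenant's discovery; try again later
            sleep(exc.retry_after)

# Simulated request that is throttled twice, then succeeds.
attempts = []
def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise Throttled(retry_after=0)
    return "ok"

result = call_with_backoff(flaky_request, sleep=lambda s: None)
```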
When you write down this approach in a development ticket or a blog post, it makes a ton of sense - and we were able to articulate the clear business value of addressing the major pain point behind several customer-reported support issues all at once. Our goal was to always complete as much as possible, and continue to follow up with Microsoft support as necessary based on the prevalence of these errors and continuously improve our client-side code calling Microsoft.
While the benefit to this approach was clear, we couldn’t be sure of the total impact of this change. The “chicken and egg” problem comes into play here - how do you know how many seats you haven’t been able to discover, until you’ve discovered those seats? We were confident that this change would have a net positive impact on the total number of seats we discovered.
… Well, we were certainly correct in our intuition that this change would be impactful! After extensively testing the client-side functionality, we rolled out this change in one of our releases. With 20/20 hindsight, the one thing we didn’t foresee was how this change would mix with the sheer scale of our operation. We discovered a lot of SharePoint sites. Let me repeat: a lot of sites. All at once. In a single day, we initiated backups for all of those sites with auto-add enabled. Our data ingestion rate increased noticeably enough to warrant attention from our network folks, as it taxed our bandwidth.
I don’t want to understate the sheer number of hours that our engineering team spent resolving this issue - both development time implementing fixes as well as time engaging with Microsoft engineers via our long-standing relationship. We emerged from this lengthy process with a hard-fought improved understanding of Microsoft's APIs and workarounds, a code base that's significantly more resilient to failure, and we're doing the right thing by our customers by discovering as many seats as we can.
As with any major software change, we learned lessons on how to improve this next time around - not only as we design client-side API functionality for seat discovery for other applications, but also how to improve our deployment process going forward. One of the great parts about working on the Datto engineering team (we also happen to be hiring!) is that there are always interesting problems to solve.
We are continually improving the robustness and reliability of our seat discovery platform. Now that we’ve tackled the most egregious issues with our Microsoft API client strategy, we’re taking a hard look at other places where we can make impactful improvements.
Some of the items on our short list of ideas include mitigating the risk of fallout from big seat discovery changes, better handling of throttling errors when Microsoft indicates there’s too much API traffic, and improving scalability by saving what we’ve discovered when we encounter an error rather than failing outright and starting from scratch the next day.
There are several ways to attack these problems, including changes to our deployment process, up to and including changes to our overall approach. One of the biggest opportunities for another impactful improvement is to incrementally save our progress in batches, rather than failing outright the first time we encounter a non-recoverable error. This would also make our entire software solution more scalable in general, which is always a good thing as we add larger and larger clients to our platform.
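Incremental batch saving could be sketched like this (`save_batch`, `fetch_seat`, and the batch size are illustrative stand-ins for our real persistence layer and Graph client):

```python
BATCH_SIZE = 2  # tiny for illustration; a real batch would be much larger

def discover_with_checkpoints(seat_ids, fetch_seat, save_batch):
    """Persist discovered seats in batches as we go, so a later
    non-recoverable error doesn't discard everything discovered so far."""
    batch = []
    saved = 0
    try:
        for seat_id in seat_ids:
            batch.append(fetch_seat(seat_id))
            if len(batch) >= BATCH_SIZE:
                save_batch(batch)
                saved += len(batch)
                batch = []
    finally:
        if batch:  # flush any partial batch, even when an error interrupts us
            save_batch(batch)
            saved += len(batch)
    return saved

# Simulated run: the fifth seat hits a non-recoverable error, but the
# first four seats have already been saved in two batches.
store = []
def fetcher(seat_id):
    if seat_id == "s5":
        raise RuntimeError("non-recoverable error")
    return seat_id

try:
    discover_with_checkpoints(["s1", "s2", "s3", "s4", "s5"], fetcher, store.append)
except RuntimeError:
    pass
```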
We’re also continually looking for ways to streamline the sheer volume of API calls we make to Microsoft to ensure we’re making the most efficient usage of our API requests. This helps us to discover seats as quickly and efficiently as possible, helps us avoid triggering throttling errors in the first place, and also allows us to be a more well-behaved API client when we don’t hammer Microsoft with too many concurrent requests.
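One concrete way to cut down call volume is Microsoft Graph's JSON batching endpoint (`POST /$batch`), which accepts up to 20 requests per round trip. This sketch only builds the batch payloads; actually sending them, and whether our client uses this exact approach, is outside its scope:

```python
BATCH_LIMIT = 20  # Graph's documented per-$batch request limit

def build_batch_payloads(user_ids):
    """Group per-user GET requests into $batch bodies of at most 20 each,
    turning N API calls into roughly N/20 round trips."""
    payloads = []
    for start in range(0, len(user_ids), BATCH_LIMIT):
        chunk = user_ids[start:start + BATCH_LIMIT]
        payloads.append({
            "requests": [
                {"id": str(i), "method": "GET", "url": f"/users/{uid}"}
                for i, uid in enumerate(chunk)
            ]
        })
    return payloads

# 45 per-user requests collapse into 3 batched calls (20 + 20 + 5).
payloads = build_batch_payloads([f"user{i}" for i in range(45)])
```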
Finally, as all of the software engineering challenges we’re solving also involve problems at significant scale, we’re continuously learning quite a few lessons as we go about how to ensure these changes don’t negatively impact our ecosystem as they’re deployed to production. It’s a learning process, and also an exciting environment to work in as an engineer.