The Datto systems engineering team manages over 8,000 Ubuntu servers in our environment, and that isn’t even counting the ephemeral VMs created for development. Ubuntu ships major OS versions on a two-year cadence, and performing major operating system upgrades across our fleet can feel like the proverbial painting of the Golden Gate Bridge. Even worse than the actual work of rolling out the upgrades is the uncertainty that comes with it.
Software engineering teams write a variety of tests to validate their software. Unit tests help prevent regressions right down to the function level, while full integration test suites can ensure that a change doesn’t have ripple effects on the broader environment. Conversely, systems administrators often lack robust, end-to-end integration tests for many of their workflows. Just consider how fraught with peril the patching process is at many organizations: it’s hard to establish confidence that an upgrade to a particular package won’t break some key functionality, and this results in a fair amount of trepidation during patch time.
We thought carefully about these problems as we began working on a strategy for upgrading to Ubuntu 20.04. We eventually settled on using Ansible to perform automated, end-to-end tests of certain infrastructure software and workflows. In this article, we’ll take you through our approach for building an automated OS qualification system.
What this looked like before
Before we harnessed Ansible as a testing toolset, our processes were considerably more manual. We had a test suite for our in-house applications, but it consisted of a series of bash scripts that had to be maintained by hand. We had no portable framework for testing third-party applications, so testing those packages meant manual installation and configuration.
Without a proper framework, much of this earlier ad-hoc testing went undocumented. As anyone who has worked with Infrastructure as Code (IaC) can attest, having reviewable code in source control gets you halfway home on coverage and documentation; well-commented code gets you even further. Without documentation, repeatability also suffers. Several years ago, the infrastructure specific to Systems Engineering ran on more bare metal servers, and virtualization has since increased the velocity of our testing improvements, but testing remains slow and error-prone without repeatable tests.
As a result of manual testing, issues were sometimes discovered in the field. While these issues were quickly dispatched upon discovery, remediation is generally more difficult once something is in production, and downstream changes become correspondingly harder to implement. For example, changing a variable name or a default state may require reworking many different customizations after deploying to thousands of servers. If we can catch that bug or change ahead of time with repeatable, iterative test suites, we save ourselves that effort.
Why Ansible?
There’s no shortage of tools that we could have chosen to test our infrastructure: Chef, InSpec, Python, or even just a collection of shell scripts can be used to build confidence in infrastructure software and configurations. However, Ansible checked a number of boxes for us:
- It can handle orchestration and configuration management. We can use Ansible to spin up Openstack instances (orchestration) and correctly set up software under test (configuration management).
- Ansible has a great set of base modules. We can use shell and command modules to run raw commands on hosts, the URI module to interact with external APIs, and much more.
- Many members of our team have experience with Ansible.
- It’s very easy to learn for those without previous experience. This is great for newer members of our team, and it also encourages other teams to contribute their own tests to our codebase.
- Ansible playbooks are written in YAML and benefit from inline validation by tools such as YAML linters.
Overall, Ansible’s simplicity, familiarity, and feature set made it a natural choice for building out an automated platform.
Our Implementation
We implemented our test suite using Ansible roles, with the directory structure described in the Ansible Best Practices guide. Each piece of software has a role, and each role has a few required playbooks.
$ ls
ansible.cfg housekeeping/ README.md site-dynamic.yml
group_vars/ openstack.yml roles/ site.yml
$ tree -L 1 roles/
roles/
├── salt
├── telegraf
├── zabbix
...
We use a set of playbooks in the housekeeping/ directory to set up the Openstack environment, spin up virtual machines, and tear them down when the automation is done running. The collection of task playbooks for each role can vary, but each role must have at least three specific task playbooks:
$ ls roles/telegraf/tasks/
install.yml main.yml test.yml
The Install Playbook
We use Puppet and our own custom repositories to handle software installation. Bootstrapping all of this can take time: repositories must be mirrored to our local mirrors, Puppet manifests need to be updated to support new major version releases, etc. In the meantime, we still want to test upstream packages. To work around this, each role has an install.yml playbook that can handle basic package installation in the absence of Puppet. The install playbook is only run if the user sets the run_install_playbooks parameter to true:
$ cat roles/telegraf/tasks/main.yml
---
- name: Run telegraf install tasks
  include_tasks: tasks/install.yml
  when: run_install_playbooks

- name: Run telegraf test tasks
  include_tasks: tasks/test.yml
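For illustration, a minimal install.yml for the telegraf role might look something like the sketch below; it assumes the package is available from an already-configured apt repository, which is an assumption for this example rather than a description of our exact playbook.
# roles/telegraf/tasks/install.yml -- illustrative sketch only
# Assumes the telegraf package is available from a repository that is already configured.
- name: Install telegraf
  apt:
    name: telegraf
    state: present
    update_cache: yes

- name: Ensure the telegraf service is running and enabled
  service:
    name: telegraf
    state: started
    enabled: yes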
This results in a small amount of duplication between our Ansible test suite and our Puppet environment, but the tradeoff proved worthwhile: we could test our latest major OS upgrade without waiting on internal packaging and Puppet changes. In reality, the “duplication” is just a few Ansible tasks to install and configure our software. Once our internal repositories and Puppet code are set up correctly, we simply set run_install_playbooks to false and can qualify a major OS exactly as it would be set up by Puppet.
The Test Playbook
All roles must also have a test.yml playbook, and this is where the real magic happens. These tasks perform realistic tests that ensure a piece of software or a given infrastructure workflow is working as designed.
The goal is a full integration test. We don’t just want to see that the software installed, that an associated daemon is running, or other simplistic checks; we want to simulate realistic conditions to ensure that our software is working properly. A few examples of tests that we wrote include:
- Leveraging the Ansible URI module to ensure that a host is properly secured by our endpoint protection platform by hitting the platform’s API
- Ensuring that some of our common Salt modules work correctly by using delegate_to and running jobs on our Salt masters
- Validating that our live kernel-patching works by installing a vulnerable kernel version, performing live patching, and ensuring that vulnerabilities have been mitigated
- Confirming that hosts show up properly in external monitoring tools, such as OpenTSDB and Zabbix. Again, we used the URI module to perform these interactions via the APIs that these tools expose (a minimal sketch of one such check follows this list)
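As a concrete illustration of that last pattern, a test task along the following lines can confirm a host’s presence in Zabbix. This is a hedged sketch: the zabbix_api_url and zabbix_api_token variables are assumed names for this example, not the exact variables from our playbooks.
# Sketch only: zabbix_api_url and zabbix_api_token are assumed variable names.
- name: Query the Zabbix API for the host under test
  uri:
    url: "{{ zabbix_api_url }}/api_jsonrpc.php"
    method: POST
    body_format: json
    body:
      jsonrpc: "2.0"
      method: host.get
      params:
        filter:
          host:
            - "{{ inventory_hostname }}"
      auth: "{{ zabbix_api_token }}"
      id: 1
    return_content: yes
  register: zabbix_response

- name: Assert that the host is registered in Zabbix
  assert:
    that:
      - zabbix_response.json.result | length > 0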
The Main Playbook
The main.yml playbook in each role ties the rest of the playbooks together. A simple main.yml just calls the install.yml and test.yml playbooks:
$ cat roles/telegraf/tasks/main.yml
---
- name: Run telegraf install tasks
  include_tasks: tasks/install.yml
  when: run_install_playbooks

- name: Run telegraf test tasks
  include_tasks: tasks/test.yml
More complex test setups may call additional playbooks to perform post-test teardown (e.g., deleting a host from the monitoring system), or they may split the testing portion across additional playbooks.
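For example, a role that registers a host with an external system might use a main.yml shaped like the sketch below; the tasks/teardown.yml playbook here is a hypothetical addition for illustration, not part of the required set.
# Sketch of a more involved main.yml; tasks/teardown.yml is hypothetical.
- name: Run zabbix install tasks
  include_tasks: tasks/install.yml
  when: run_install_playbooks

- name: Run zabbix test tasks
  include_tasks: tasks/test.yml

- name: Clean up test artifacts (e.g., remove the test host from monitoring)
  include_tasks: tasks/teardown.yml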
With all of this role-based infrastructure in place, performing a full qualification of a new major OS version with our key software is as simple as running our site.yml:
ansible-playbook -i openstack.yml --ask-vault-pass -e @group_vars/secrets.yml site.yml
This will spin up an instance in our Openstack environment, run qualification tests against all of our software, and then tear down the environment. We use Ansible Vault to store sensitive variables, such as API tokens, right in our codebase. Assuming everything works, the playbook runs to completion without issue and confirms that all of our tools and workflows work on the given OS version.
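To give a rough idea of how the pieces hang together, a site.yml of the following shape would drive the whole run. The housekeeping playbook names and the qualification_hosts group are illustrative assumptions rather than our exact layout.
# Rough shape of a site.yml; playbook and group names here are assumed.
- import_playbook: housekeeping/create-instances.yml

- hosts: qualification_hosts
  become: yes
  remote_user: ubuntu
  roles:
    - salt
    - telegraf
    - zabbix

- import_playbook: housekeeping/teardown-instances.yml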
Beyond Basic Software
Outside of testing the functionality of infrastructure software and workflows, we thought it would be valuable to write tests that validate our more common infrastructure configurations. One of the more popular combinations of open source software we use at Datto is HAProxy and Keepalived, which together provide a robust failover solution for many of our in-house applications.
As with most open source software, we can integrate customized scripts to handle specific use cases. Another notable attribute of open source software is the often copious stream of feature additions and bug fixes, depending on the popularity of the upstream repository. When a package lands in the stable repositories of our favorite Linux distribution, it often carries feature changes that can either supersede or collide with our in-house customizations, so it is essential to be able to spin up test environments to validate that these updates won’t break our configurations. Through the power of Ansible cloud modules and Jinja templating, we can quickly spin up and configure multiple testing environments that would otherwise take weeks, if not months, to set up by hand.
In the case of HAProxy and Keepalived, we typically rely on custom tracking and dynamic routing scripts to handle various failure scenarios. Some of these environments can be quite complex, so the number of control variables increases correspondingly when setting up test environments. During testing between versions, we discovered that some features we previously had to script, such as process and virtual IP (VIP) tracking, are now included in the upstream versions of Keepalived. Using Jinja templating, we can easily create dynamic configurations to match any scenario we want in our test environments. In our testing roles, we also set a variable to control the installation of HAProxy alongside Keepalived. For example:
{% if 'haproxy' in ansible_facts.packages and ansible_lsb.major_release | int > 18 %}
vrrp_track_process track_haproxy {
    process haproxy
    weight 2
}
{% endif %}

{% if ansible_lsb.major_release | int < 20 %}
vrrp_script checkVip {
    script "/usr/local/sbin/checkVip.sh"
    interval 3
    weight -2
}
{% endif %}
In the above example, if HAProxy is present on the host, a configuration stanza is automatically created to track the health of the “haproxy” process; and if the LSB major release is less than 20 (i.e., older than Ubuntu 20.04), our legacy VRRP script is added to check VIP health (VIP tracking, as of this writing, is included by default in the latest upstream version of Keepalived). The possibilities are nearly endless and allow for a multitude of testing variations. Given the variety of configuration permutations in our environment, being able to quickly build and test different configs is essential.
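The tasks that drive this template are straightforward. In the sketch below, install_haproxy and the keepalived.conf.j2 filename are assumed names, and package facts are gathered explicitly so that ansible_facts.packages is available to the template.
# Illustrative tasks; install_haproxy and keepalived.conf.j2 are assumed names.
- name: Optionally install HAProxy alongside Keepalived
  apt:
    name: haproxy
    state: present
  when: install_haproxy | bool

- name: Gather package facts so ansible_facts.packages is available to the template
  package_facts:
    manager: apt

- name: Render the Keepalived configuration from the Jinja2 template
  template:
    src: keepalived.conf.j2
    dest: /etc/keepalived/keepalived.conf
    mode: "0644"

- name: Restart Keepalived to pick up the new configuration
  service:
    name: keepalived
    state: restarted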
By combining Jinja templating with Ansible cloud modules, we have some serious horsepower at our fingertips, but it doesn’t stop there. Since we want to dynamically create as many instances as necessary to simulate production environments, Ansible’s dynamic inventory features provide the optimal solution. Simply provide some variables and voilà:
- name: Create {{ instance_count }} instances for testing
  os_server:
    name: use1-{{ host_pattern }}-{{ item }}
    image: "{{ openstack_image_id }}"
    key_name: "{{ openstack_key_name }}"
    flavor: "{{ openstack_flavor_id }}"
    nics:
      - net-id: "{{ openstack_net_id }}"
    security_groups:
      - "{{ openstack_cloud_name }}-common"
      - "{{ openstack_cloud_name }}-VRRP"
      - "{{ openstack_cloud_name }}-testport"
    availability_zone: "{{ openstack_availability_zone }}"
    timeout: 300
    state: present
  loop: "{{ range(1, instance_count + 1) | list }}"
  register: newinstances
This allows us to spin up as many on-demand Openstack instances as needed. We can then add the new hosts to groups using Ansible’s built-in registered variables, and create subsequent tasks for those dynamic hosts without ever needing to specify exact hostnames in the playbooks:
- name: Add hosts to in-memory inventory and dynamic groups
  add_host:
    name: "{{ item.openstack.name }}"
    groups: newinstances_all
    ansible_host: "{{ item.openstack.accessIPv4 }}"
  loop: "{{ newinstances.results }}"

- hosts: newinstances_all
  gather_facts: no
  become: yes
  remote_user: ubuntu
  tasks:
    - include_tasks: housekeeping/tasks/instance-setup.yml
Using Ansible’s conditionals (like those in the above Jinja2 templates) along with dynamic inventory and Ansible’s orchestration capabilities allows us to quickly build test environments with our commonly used configurations. We have been able to identify breaking or deprecating changes between versions of common software that we use, and we expect this capability to grow as we add additional tests over time.
Results
While we don’t have much in the way of hard data from OS qualification in previous years, we can say a few things with certainty:
- Previous efforts were ad-hoc, with no clearly defined success criteria
- Qualification could effectively take months, as an issue might not be discovered until an OS version was deployed into production and something didn’t work
- There was little, if any, documentation of testing procedures
Our approach using Ansible addresses all of these problems. We can now qualify an OS for our most important software in about 45 minutes, instead of months. Both testing procedures and success criteria are clearly defined and documented inside of our Ansible playbooks. As we inevitably find issues during a production rollout, we can incorporate those back into our testing suite so that they are caught in the future.
Ansible has proven to be a notably good tool for this job. It was easy for less experienced members of our team to pick up and begin writing tests with, and we hope that its approachability will help encourage others to write their own tests. Modularizing our playbooks into clearly defined roles, and using Ansible features such as tags, allows us to selectively run tasks if we only want to qualify a particular piece of software.
Interesting Finds
By alleviating the burden of manually setting up test environments, we were able to focus more on the software under validation and to observe less obvious differences between versions. For example, we noticed that we could not use exactly the same steps in the Ansible testing playbooks across Keepalived versions; we needed to introduce delays between service restarts for proper quorum to be established between the active and passive nodes. We were also able to quickly determine whether older custom scripts would continue to function and coexist with newer feature sets. As another example, we found that the VIP and process tracking built into Keepalived 2.0+ will give us more flexibility for future deployments, allowing us to retire some of our aging helper scripts.
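Returning to the restart delays mentioned above, a minimal sketch of that pattern in a playbook might look like this (the 10-second pause is an arbitrary illustration, not our tuned value):
# Illustrative only; the pause length is an assumed value.
- name: Restart keepalived on the active node
  service:
    name: keepalived
    state: restarted

- name: Give VRRP time to re-establish quorum before exercising failover
  pause:
    seconds: 10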
Another interesting side effect of using open source software is the communal process of experiencing and resolving bugs firsthand. We hit a few outstanding bugs in Ansible while working with the Openstack integration. After the initial frustration of not understanding why a particular module was misbehaving came the rewarding experience of finding the solution in a GitHub thread posted by a fellow open source enthusiast, complete with a recommended diff to provide a quick workaround.
Conclusion
Major operating system upgrades have been a pain point at Datto. A lack of comprehensive, end-to-end testing of our internal infrastructure workflows meant that bugs were discovered only after an upgrade had been completed in production. The absence of good documentation and procedures for discovering issues prevented knowledge from flowing back into the team, and we had to repeat a series of brittle, manual tests for each upgrade cycle.
Our new approach gives us a way to reliably qualify a major OS release in well under an hour, and it was pleasantly simple to implement. Following Ansible’s best practices allows us to easily extend the test suite as we discover new edge cases, or as we encourage other infrastructure engineering teams to contribute their own tests. As we look toward a world where immutable infrastructure is the norm, this type of testing at image-build time will only become more important.
Thanks to everyone on the team who helped contribute to our effort: Anthony Critelli, Liam Curtis, David Wiesmore, and others. Interested in working on automation projects like this? We’re hiring.