Act 1 - Predicting drive failure && an introduction to machine learning
Drive prediction @ Datto
We’ve all had a hard drive fail on us, and often it’s as sudden as booting your machine and realizing you can’t access a bunch of your files. It’s not a fun experience. It’s especially not fun when you have an entire data center full of drives that are all important to keeping your business running. What if we could predict when one of those drives would fail, and get ahead of it by preemptively replacing the hardware before the data is lost? This is where the history of predictive drive failure at Datto begins.
First and foremost, to make a prediction you need data. Hard drives have a built-in utility called SMART (Self-Monitoring, Analysis and Reporting Technology) that reports an array of statistics about how the drive is functioning. Here’s an abbreviated view of what that looks like:
Datto collects a report like this from each hard drive in its storage servers once per day. Each attribute in the report has three important numbers associated with it: value, thresh, and worst. Each attribute also has a feature named raw_value, but this is discarded due to inconsistent reporting standards between drive manufacturers.
Value: A number between 1 and 253, inclusive. The value reflects how well the drive is operating with respect to the attribute, with 1 being the worst and 253 being the best. The initial value is arbitrarily determined by the manufacturer, and can vary by drive model.
Thresh: A threshold below which the value should not fall in normal operation. If the value falls below the threshold, there is likely something wrong with the drive.
Worst: A record of the lowest value ever recorded for the attribute.
A quick approach to using these values to make useful predictions is to pick a couple of attributes that seem important and raise an alert if any of their values falls below the associated threshold. This is how the first iteration of predictive drive failure at Datto worked. It wasn’t perfect, but it was definitely better than nothing!
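That first iteration can be sketched in a few lines. This assumes each SMART report has been parsed into a dict of attributes; the attribute names and report layout here are illustrative, not Datto’s actual schema:

```python
# Hypothetical first-generation check: alert if any watched attribute's
# value has dropped below its manufacturer-defined threshold.
WATCHED_ATTRIBUTES = ["spin_up_time", "reallocated_sector_count"]  # illustrative picks

def should_alert(report: dict) -> bool:
    """Return True if any watched attribute's value is below its thresh."""
    for name in WATCHED_ATTRIBUTES:
        attr = report.get(name)
        if attr is not None and attr["value"] < attr["thresh"]:
            return True
    return False

# Two made-up reports: one well above its thresholds, one below.
healthy = {"spin_up_time": {"value": 100, "thresh": 21},
           "reallocated_sector_count": {"value": 200, "thresh": 140}}
failing = {"spin_up_time": {"value": 15, "thresh": 21},
           "reallocated_sector_count": {"value": 200, "thresh": 140}}
```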
The next iteration of drive failure prediction was assigning a weighted health score to each drive. This score was defined by assigning weight to several different attributes based on how severe they appeared to be, then adding them together. This prediction method was better than its predecessor, but could potentially be improved even further.
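The weighted-score idea might look something like this sketch; the weights and attribute names are invented for illustration, not the values Datto actually used:

```python
# Hypothetical severity weights: how much each below-threshold attribute
# contributes to the drive's overall health score.
WEIGHTS = {
    "reallocated_sector_count": 3.0,
    "current_pending_sector": 2.0,
    "seek_error_rate": 0.5,
}

def health_score(report: dict) -> float:
    """Sum weighted penalties for every attribute below its threshold."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        attr = report.get(name)
        if attr is not None and attr["value"] < attr["thresh"]:
            score += weight * (attr["thresh"] - attr["value"])
    return score
```

A higher score means a less healthy drive; an alert fires once the score crosses some chosen cutoff.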
This brings us to the most recent iteration of drive failure prediction at Datto and the topic of this article: smarterCTL, a machine learning model using most attributes reported by SMART to make the most informed prediction possible.
Smartctl is the command line utility used to collect SMART reports. SmarterCTL is a machine learning model making use of smartctl, so it’s the smart-ER version of it. Yeah, it’s a bad pun, but I’m an engineer not a comedian.
Machine learning && SMART
Before we get into the details of smarterCTL, let’s briefly go over machine learning. At its core, machine learning is the process of applying statistics to a dataset to find patterns in it. Once the patterns are found, they can be applied to new data to make assumptions about what that new data means. The defining feature of machine learning is that the programmer doesn’t have any input in figuring out the patterns in the data; the algorithm does that on its own through trial and error. For a great high level introduction to how machine learning works, check out this two minute video.
To get a feel for workflow and terminology, let’s walk through a simplified application of machine learning on drive failure prediction. This example glosses over details and makes some leaps of logic to keep the scope broad. For a more accurate look under the hood of decision tree-based machine learning models, check out this article and the other resources linked throughout this section.
To make predictions, we need a dataset to train the model on. In this case, the data points are hard drives and the features of those data points are the attributes SMART provides.
These data points are labelled as members of the negative or positive class, which in this case means “hard drive operates normally” or “hard drive has failed.” Note that the “positive” in “positive class” doesn’t mean “good.” Instead, it means “this sample exhibits the behavior we’re looking out for.” A machine learning model would read this dataset, then look for patterns in the features that determine why each hard drive ended up in its class.
Normally there would be enough data points and features that a human couldn’t read the whole dataset—let alone spot a pattern in it! This example is simplified enough for us to step through the process that a model might follow. Let’s look for a pattern in each feature:
Nothing to see here; neither class is homogeneous when it comes to seek_error_rate.
Again, we can’t make a determination based on this attribute alone. There isn’t an obvious split, like high numbers being good and low numbers being bad.
This time it looks like there is a pattern! Drives with high power-on hours are healthy and drives with low power-on hours will fail. This doesn’t make logical sense though—a drive with low power-on hours should be the healthiest, since it’s the closest to mint condition. Let’s look a little deeper, and see if we can find a correlation between this feature and another that tells us something more logical.
A-ha! There’s still a difference in magnitude between the two classes, but this time there’s an explanation for it that makes sense: an older drive with a high spin-up time is aging and degrading normally, while a very young drive with a high spin-up time indicates that there might be something like a factory defect.
Now that we’ve figured out the pattern, let’s see how our theoretical model would classify these new SMART reports:
The ratio of spin-up time over power-on hours for drive X is 0.0002, which indicates that it will remain healthy. The ratio for drive Y is 0.045, pointing toward failure.
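The decision rule our toy model landed on can be written out directly. The 0.01 cutoff and the input values below are invented to reproduce the ratios above:

```python
def classify(spin_up_time: float, power_on_hours: float, cutoff: float = 0.01) -> str:
    """Toy classifier: a high spin-up time relative to age points toward failure."""
    ratio = spin_up_time / power_on_hours
    return "will fail" if ratio > cutoff else "healthy"

# Hypothetical inputs matching the ratios above: 2/10000 = 0.0002, 9/200 = 0.045
drive_x = classify(2, 10_000)   # healthy
drive_y = classify(9, 200)      # will fail
```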
So how do we know if the model made correct predictions? Well, we just have to wait and see. One of the trickier parts of this problem space is verifying results, because the whole point is to allow us to take action before the thing we’re predicting ever happens. Keep this in mind, it’ll come back to bite us later.
SmarterCTL’s job is to classify hard drives as “failing” or “not failing” so Datto can avoid being blindsided by lost drives and preemptively swap them for healthy ones. If there are patterns in SMART stats that indicate drive failure, smarterCTL will learn those patterns from SMART data Datto has collected over the past 3 years. Then smarterCTL can monitor new daily SMART reports and produce an alert when a drive is exhibiting a pattern that indicates failure.
Act 2 - Data preparation
We have several hundred gigabytes of SMART files spanning 3 years and a relatively beefy server to train a machine learning model on—the next step is to set it up and let it churn through the data, right? Unfortunately the work has just begun. Even though machine learning models have a reputation for taking days, or even weeks, to train, preparing the data is often the most time-consuming part of the process!
Software folks know that computers are “dumb” and will only do exactly what they’re instructed to. Therefore, the data fed into a machine learning model needs to be very carefully curated. The model has no intuition about whether the conclusions it’s coming to are logical or not, so data that doesn’t accurately describe the whole problem space will lead to a model that confidently spouts nonsense.
In machine learning, the job of the programmer isn’t to spot the patterns in the data: it’s to spot the antipatterns that the model might fall into along the way. Good data treatment and preparation are the first steps in avoiding those antipatterns.
Shaping the data
The first step is turning the human readable SMART reports into something more machine readable.
I condensed all of the data in each day of SMART reports into its own csv file, but some key information was still missing. Since this data will be used to train a model, every data point needs to have a class label associated with it—the model needs to know whether these are healthy or failed drives so it can start to learn about potential patterns.
Producing the class labels ended up being a little tricky—if a drive is in poor enough health to be a member of the failing class, it might already be failing badly enough to not report SMART stats. To get a clearer idea of how drives were behaving and when they were failing out, I wrote a script that walked through every day of SMART reports and tracked how and when each drive failed.
Our servers use the ZFS filesystem, which pools storage devices together into groups called “zpools”. The zpool reports a health status for each drive in the pool. When my script comes across a failing zpool status for a drive, it notes the day the failure occurred, then goes through every daily csv and adds the following features to that drive’s row:
- Whether this drive ever fails (class label)
- If so, how many days after this csv’s date the failure would occur, and
- The zpool status associated with the failure
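The labelling pass described above can be sketched per-row like this; the row layout and zpool statuses are simplified stand-ins for the real data:

```python
from datetime import date

def label_row(row: dict, report_day: date, failures: dict) -> dict:
    """Annotate one daily csv row.

    failures maps drive serial -> (failure_day, zpool_status) for every
    drive the script saw fail out of a zpool.
    """
    failure = failures.get(row["serial"])
    labelled = dict(row)
    labelled["ever_fails"] = failure is not None                          # class label
    labelled["days_to_failure"] = (failure[0] - report_day).days if failure else None
    labelled["failure_status"] = failure[1] if failure else None          # e.g. "FAULTED"
    return labelled
```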
With the addition of those features, our data now looks like this:
At this point the dataset is a collection of csv files, one per day, that contain:
- One row per drive
- The SMART data reported by each drive on that day
- Whether each drive is currently failed
- Whether each drive ever failed while we tracked it
Seems like enough data to start making predictions, right? Well, kind of! At this point, the data is well-formed enough that a model could understand it and start producing predictions. The quality of those predictions would be poor, though; the data needs a lot more treatment to remove the pitfalls lurking within.
Cleaning the data
Right now, the data is well formed but messy. It’s in the right shape, but it’s full of missing values and incorrect typing. Here are some real samples from the first iteration of the dataset:
Unless the data is cleaned, the model will totally trust that “No device found, FAILED to get smart data!” is a valid hard drive size. To clean the data, the malformed values need to be sniffed out and replaced with one standardized NaN (Not A Number) representation. Once the missing values are standardized, the model can handle them appropriately.
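With pandas, the sniffing-out step can be as simple as coercing every column that should be numeric; anything that doesn’t parse (error strings, blanks) becomes NaN. This is a sketch, not the exact cleanup code:

```python
import pandas as pd

def clean(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Replace every unparseable entry in the numeric columns with NaN."""
    out = df.copy()
    for col in numeric_cols:
        # errors="coerce" turns anything non-numeric into NaN
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out
```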
The random error values are gone, but in the 1_val column alone over 1 in 10 entries are NaN. Even though the model knows to overlook those values, having so many unused features can actually skew the results. The non-NaN features associated with a data point will have stronger relative weights when many others are missing. There are two ways to approach this problem: deleting the rows with missing data or recreating the missing data.
If there aren’t too many rows with missing values, it makes sense to drop them. There won’t be too much data loss, and we can be completely sure that the data isn’t skewed by NaNs.
If there are many rows with missing values, first and foremost that’s a red flag that the data may not be complete enough to move forward. If that concern is addressed and the data is sound, then the NaNs can be recreated by imputation: inferring what the missing value would likely be based on other values related to it. There are several ways to do this, and figuring out which is best for any problem is a process of trial and error.
What ended up being the right path for smarterCTL was a combination of the two approaches: dropping the rows that had tons of missing values, but keeping and repairing the ones that only had a couple.
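In pandas terms, that combination might look like the sketch below: drop any row missing more than a handful of values, then fill the survivors’ gaps with a simple column-median imputation (the real imputation strategy was settled by trial and error):

```python
import pandas as pd

def drop_and_impute(df: pd.DataFrame, max_missing: int = 2) -> pd.DataFrame:
    """Drop rows with too many NaNs, then impute the rest with column medians."""
    kept = df[df.isna().sum(axis=1) <= max_missing]
    return kept.fillna(kept.median(numeric_only=True))
```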
Making the data consistent
At this point the data is nicely shaped and super clean, which seems like a signal to move forward to predictions. Unfortunately, hard drive manufacturers decided to throw a wrench in that plan. While SMART attribute names and SMART report formatting follow a consistent standard across the board, the values associated with the attributes vary by brand.
These are both fresh-off-the-line drives with the default “no errors” value for this attribute. The numbers are very different because they’re produced by two different manufacturers.
A machine learning model doesn’t have the intuition to know that while these values are referring to the same attribute, they are on different scales. It’s like asking someone to describe the difference between something that weighs “10” and something that weighs “40” without telling them that the 10 is in pounds and the 40 is in grams. It’s misleading, and will lead to the model massively favoring certain manufacturers over others when predicting failures.
To take the difference in scale out of the equation, I split the dataset into many subsets where each subset had only drives from one manufacturer. This worked great for keeping the data consistent, but it introduced a new problem: most of the datasets were too small. We’ll get into that later in Act 4.
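Assuming the manufacturer can be read from a column (the column name here is illustrative), the split is a one-liner with pandas groupby:

```python
import pandas as pd

def split_by_manufacturer(df: pd.DataFrame, col: str = "manufacturer") -> dict:
    """Return one sub-DataFrame per manufacturer, keyed by manufacturer name."""
    return {name: group.reset_index(drop=True) for name, group in df.groupby(col)}
```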
At this point, nothing has been added to the data; we haven’t computed new features or made new data points. What we have done is get the data in a clean, consistent state that can be used to train a model without falling into any obvious traps.
Act 3 - Training the first model && initial results
Preparing the data (again)
Finally it’s time to boot up our machine learning library of choice, XGBoost, and let it rip! Just kidding, there’s actually even more data processing to do first. The data needs to be split into a set of features and a set of class labels (known as X and Y), then into training and testing sets.
It’s important to split X and Y to ensure that the class label doesn’t influence predictions—otherwise the model would be able to “cheat” and get the right answer every time without figuring out any real patterns.
The training set is 30-40% of the full dataset, selected randomly. It’s what’s given to the model to figure out the patterns in the data. During training, the model will have access to the class labels to “check its answer” and see if its suspected patterns are correct.
The rest of the data comprises the testing set—this portion of the data is held out and kept secret from the model until after training is complete. After training, the model can make predictions using the testing set as input to get a sneak peek into how the model might perform in production on unseen data. Oftentimes testing sets are smaller, only around 20-30% of the data, but I prefer leaving more data for the testing set when working with imbalanced data (more on imbalanced data later!).
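With scikit-learn, both splits take only a few lines. This sketch assumes the labelled data sits in a pandas DataFrame with the class label in an ever_fails column (a name carried over from the labelling step, not necessarily the real one):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split(df: pd.DataFrame):
    X = df.drop(columns=["ever_fails"])  # features only: the label must not leak in
    y = df["ever_fails"]                 # class labels
    # train_size=0.35 keeps ~65% of the data held out for testing, and
    # stratify=y preserves the (tiny) failure rate in both halves.
    return train_test_split(X, y, train_size=0.35, stratify=y, random_state=0)

# X_train, X_test, y_train, y_test = split(labelled_df)
```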
Holding out a testing set is really important! It’s the main way to check if the model is overfitting. Overfitting is when a model learns “too much” about the training set, and loses track of the overall trend it's looking for.
Training the first model
Now that the data is split into training and testing, it can finally be chucked into XGBoost and the machines can learn! At first we’ll just use the classifier model with default settings.
At this point, classifier is our trained model. To check how accurate it is, let’s see what it says about the testing set.
This produces the following (simplified for readability) output:
Understanding the output
Let’s walk through what each of those numbers means.
Accuracy is the ratio of correct predictions over total predictions. At first glance, 99.92% accuracy looks incredible! Digging a little deeper reveals why accuracy is actually a misleading measure for this problem: the classes are incredibly imbalanced. There are 400,150 reports from healthy drives and only 215 reports from failed drives. If the model were to predict that every single drive would never fail, then that would still be 99.94% accurate while providing nothing useful.
Matthews Correlation Coefficient is a measure that can handle imbalanced data by taking into account the difference between true and false positives and negatives. MCC is a decimal value between -1 and +1. An MCC of zero is random prediction, +1 is perfect prediction, and -1 is perfectly opposite prediction. In this case an MCC of 0.1975 is low, but still better than random.
The confusion matrix is the breakdown of true and false positives and negatives.
The confusion matrix is a great look into where the model’s strengths and weaknesses are. In this case, the model is great at predicting true negatives; it got 400,000 of them correct. One of the model’s weak points is false negatives. Out of the total 215 failing drives in this sample, it incorrectly predicted that 175 of them would be healthy.
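Given the test labels and the model’s predictions, all three numbers come straight from scikit-learn; the tiny arrays here are stand-ins for the real test set:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

y_true = [0] * 8 + [1] * 2        # 8 healthy drives, 2 failures
y_pred = [0] * 8 + [1, 0]         # the model catches one failure, misses one

accuracy = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```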
A model will never get every prediction correct, and the wrong answers will often skew to one side: many more false positives or many more false negatives. Which one a model should favor is a business decision. False negatives represent drives that can’t be preemptively swapped, wasting time and manpower. False positives represent healthy drives that will be swapped regardless, wasting money on new unnecessary hardware. Either could be favorable depending on the target audience, and the confusion matrix can be used to monitor how well the model is being trained to favor either direction.
Improving the results
The data is well formed and describes the problem space well, but the model is producing mediocre results. The first place for improvement is to move beyond the default settings and start tuning hyperparameters. A hyperparameter is an external configuration of the model; it’s something the programmer picks, not something learned from the data. Think of hyperparameters like tuning knobs on an instrument. Here’s an overview of how I tuned the hyperparameters that turned out to be most important to smarterCTL:
eval_metric: The eval_metric is the measure that the model uses to judge how accurate its predictions are while it’s training. By default, XGBoost uses error rate (inverse of accuracy) as the eval_metric. Any measure related to accuracy isn’t helpful when it comes to imbalanced data, so I replaced the eval_metric with a custom function that checks MCC instead.
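In XGBoost’s native-API style, a custom eval function receives the raw predictions plus the DMatrix and returns a (name, value) pair. A sketch using scikit-learn’s MCC implementation:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_eval(preds, dtrain):
    """Custom eval_metric: report MCC instead of error rate."""
    labels = dtrain.get_label()
    hard = (np.asarray(preds) > 0.5).astype(int)  # threshold probabilities into classes
    return "mcc", matthews_corrcoef(labels, hard)
```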
early_stopping_rounds: XGBoost trains models in rounds. Every round of training, a change is made and the eval_metric is compared to the last version. Without early_stopping_rounds, a model will train for an arbitrary number of rounds then return the best iteration. With early_stopping_rounds, the model will stop training if its eval_metric on the test set hasn’t improved in X rounds. The goal is to stop training before the model begins overfitting, even though its accuracy on the training set might still be increasing.
scale_pos_weight: Scale_pos_weight is the ratio of positive samples (failing drives) over negative samples (healthy drives). Defining the scale_pos_weight helps improve accuracy when dealing with imbalanced data, because it lets the model know roughly what proportion of classifications should be positive. This parameter can be set higher or lower than the real ratio of positive samples to encourage the model to favor false positives or false negatives. A very low scale_pos_weight tells the model to assume that the vast majority of classifications should be negative.
base_score: When making a classification, the model gives each data point a score between 0 (negative) and 1 (positive). The closer to 1 the more sure the model is that the classification is positive. Base_score helps handle imbalanced data similarly to scale_pos_weight, but does so through defining the default classification score to encourage the model to err toward either positives or negatives. A base_score of 0.1 tells the model to err on the side of assuming that any classification should be negative.
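Put together, one hypothetical configuration along these lines (the exact numbers are illustrative, not the tuned production values):

```python
n_negative, n_positive = 400_150, 215   # class counts from the testing set above

params = {
    # The true positive/negative ratio is ~0.0005; values at or below it keep
    # the model leaning heavily toward negative (healthy) predictions.
    "scale_pos_weight": n_positive / n_negative,
    # Start every prediction score near 0, the negative class.
    "base_score": 0.1,
    "n_estimators": 1000,         # upper bound; early stopping usually ends sooner
    "early_stopping_rounds": 50,  # stop once the eval metric stalls for 50 rounds
}
# model = xgb.XGBClassifier(**params)
```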
Some of these hyperparameter values were decided upon through trial and error, and some were figured out with the help of XGBoost’s cv (cross validation) function. The cv function is essentially automated trial and error; it runs through several rounds of boosting using different hyperparameter values and returns the values from the most accurate iteration. Cross validation also provides some other important information relating to statistical significance, which can be read about here.
After fiddling with hyperparameters, the next model’s results looked like this:
Definitely better! MCC is significantly higher. Proportionately, true positives are up and false negatives are down considering the total amount of failures randomly sampled into this testing set. This is still a long way from being accurate enough to be useful though, so it’s back to data treatment.
Act 4 - Revisions && improvements
Working through different combinations of these data treatments was an iterative process that spanned a long period of time, and hundreds of different models were trained along the way. I didn’t keep every model so there won’t be accuracy summaries for each data treatment, but I will briefly go over what worked and what didn’t then we’ll reconvene in Act 5 to see how the final model turned out.
Splitting the dataset by manufacturer
Back in Act 2 we discussed how different manufacturers have different scales for the same SMART attributes. I addressed that problem by splitting the dataset such that drives from different manufacturers were in separate sets, but this introduced a new problem: the whole pre-split dataset already had too few members of the positive class, and splitting it would reduce that number even more.
Only one drive manufacturer had enough failed drives in the dataset to produce a large enough subset, which we’ll refer to as Manufacturer A. Because this data treatment was so effective, and because most of the drives that had been in our fleet long enough to begin failing were already from Manufacturer A, we decided to reduce the scope of smarterCTL to only their drives.
Before introducing this data treatment, the most accurate models had MCC scores between 0.2 and 0.4. After reducing the dataset to only one manufacturer, MCC scores reached 0.6 to 0.85.
Oversampling and undersampling
Oversampling and undersampling are techniques used to adjust the ratio between class sizes in datasets. For smarterCTL, oversampling would entail creating more data points of failed drives to make both classes closer in size. This can be done by copying existing data points or by producing new data points that are statistically similar to existing ones. Similarly, undersampling would entail discarding some of the healthy drives. This can be done randomly, or by following an algorithm to only remove statistically insignificant data points.
I used the python library imbalanced-learn, an offshoot of scikit-learn focused on handling imbalanced datasets, to test out several oversampling and undersampling methods. While some were better than others, none produced great results. I have a couple of guesses as to why:
- There’s enough variance within each class (a drive can fail in tons of different ways) that sampling methods skew the makeup of the class too much
- The minority class is too small to begin with, so even sampling can’t fix the problem
- Adding more data points to the failed drives class makes the model expect more drive failures, leading to more false positive predictions
In the end, using random oversampling combined with lowering the scale_pos_weight hyperparameter to combat false positives marginally improved results. MCC scores improved by about 0.05-0.075 using this technique.
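imbalanced-learn’s RandomOverSampler handles the duplicating; a minimal numpy equivalent of what it does under the hood, for illustration:

```python
import numpy as np

def random_oversample(X, y, ratio=1.0, seed=0):
    """Duplicate minority-class (y == 1) rows until minority/majority == ratio."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    target = int(len(neg) * ratio)
    extra = rng.choice(pos, size=max(0, target - len(pos)), replace=True)
    keep = np.concatenate([neg, pos, extra])
    return X[keep], y[keep]
```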
Instance hardness threshold sampling
Instance hardness threshold sampling is a method of undersampling that removes the “hardest” samples. “Hard” samples are data points that the model has the most difficulty labelling as positive or negative. This article is a great writeup on IHT for a more detailed explanation.
I didn’t actually use IHT to resample my dataset, but it did teach me some important things about how my model was classifying samples! I used IHT to find out which samples were the hardest, and learned that there were tons of instances of drives failing without any change in SMART stats in the days leading up to failure. These instances were really hard for the model to classify because a lack of changes generally indicates no problems.
To give the model a better shot at differentiating the classes clearly, I culled all of the “asymptomatic” failures from the training set. This meant that I had even fewer positive samples to work with, but there would have been no way to catch asymptomatic failures ahead of time anyway, no matter how accurate the model. After pruning the dataset in this way, MCC scores increased by about 0.1.
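The pruning can be sketched with pandas: drop every failed drive whose feature columns never changed across its daily rows (the column names here are illustrative):

```python
import pandas as pd

def drop_asymptomatic(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Remove failed drives whose SMART features show zero variance over time."""
    def symptomatic(group):
        if not group["ever_fails"].iloc[0]:
            return True  # healthy drives always stay
        # Keep a failed drive only if at least one feature ever changed.
        return group[feature_cols].nunique().gt(1).any()
    return df.groupby("serial").filter(symptomatic).reset_index(drop=True)
```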
Act 5 - Final results && looking back
The final models
After all was said and done, I produced two final models. Both focused on hard drives produced by Manufacturer A. One was optimized for the highest MCC and the other was optimized for the lowest false positive rate.
On paper, these looked great! The high MCC model had solid predictive power with an MCC of 0.848, with the tradeoff of a moderately high false positive rate. The low false positive model had barely any false positives, with the tradeoff of a higher false negative rate and lower MCC. Both of these models passed the test of a large held-out testing set and had promising cross validation results. Unfortunately, once they were scaled up to production levels of use, their tradeoffs turned out to be too much to bear. These models are good—especially given how imbalanced the data for this problem is—but with too many false positives from one and too little predictive power from the other, they’re not worth betting so much money on. In a production setting these models would make thousands of predictions a day, so even small flaws would quickly add up to several thousand dollars of wasted hardware and person-hours.
What went wrong
So, why weren’t these models better? I don’t know for sure, or they would be better! Here are my ideas:
1: Lackluster technology and/or implementation
Either XGBoost wasn’t the right tool for the job or I implemented it poorly. I don’t think this is the most likely option, but it’s absolutely not ruled out. XGBoost is an incredibly powerful algorithm, but it is true that it struggles with imbalanced data. Imbalance-XGBoost is an offshoot of XGBoost that is built to handle imbalanced data. I explored this option, but at the time it didn’t seem to be much better than vanilla XGBoost. Maybe there’s some winning combination of Imbalance-XGBoost and imbalanced-learn that I didn’t figure out!
2: Not enough data
This is definitely a contributor to smarterCTL’s shortcomings. The dataset contained nowhere near as many failures as it did healthy drives, so the usable sample size was actually small despite having 3 years of data. Hard drives just fail too rarely to produce a substantial dataset in a reasonable amount of time. Specifically, hard drives don’t fail in useful ways often enough. To be a valid data point, the failed drive needs to:
- Not error out when providing a SMART report
- Live long enough before failing to provide baseline data
- Fail in a way that is reflected in SMART
On top of that, enough drives need to fail from each manufacturer to produce a manufacturer-specific dataset in order to get around the inconsistent value scaling problem from back in Act 2.
Imbalanced data problems are really tricky, and the best way around them is to collect enough data that the excess majority class samples can be thrown out to make the dataset balanced. If we did that with our current amount of data, the biggest manufacturer specific dataset would only have around 1000 data points.
3: SMART stats just aren’t the way to go
Unfortunately this option is looking the most likely, for several reasons. Predicting failure using SMART stats is a cool idea and sounds logical, but there might not be enough predictive power in SMART to make anything better than “pretty good.”
Asymptomatic drive failure
The first indicator of this is how many drives fail “asymptomatically.” I discovered that in our fleet, depending on the sample, 35-45% of drives fail with no change at all in their SMART stats.
It appears that this isn’t unique to our fleet of drives, because during a field study conducted by Google, 36% of drives in their sample failed without any changes in SMART stats.
Either there are things that cause drives to fail that aren’t reflected in SMART stats or SMART stats don’t update reliably. It’s likely a combination of both, and that makes it almost impossible to determine when SMART stats should be trusted to accurately reflect the reason for failure.
Another problem using SMART stats as the sole data source is that predicting drive failure in this way isn’t a true binary classification problem. Classifiers work best when there is a clear distinction between classes.
For smarterCTL, the negative class is clear. A healthy drive is a healthy drive, and is accurately described by its class label. The positive class is a lot less clear. Two drives can fail in completely different ways for completely different reasons—lumping them into the same class makes it harder for the model to figure out exactly what’s going on.
This is another problem that could be solved with more data: different types of failure could be broken out into different datasets or different classes. The smaller the scope of a model, the more accurate it will be. Unfortunately, we don’t have enough data to have anything less than a broad scope.
Finally, other folks have tried to tackle this same problem and come to similar conclusions. Backblaze has been publishing SMART data from their drive fleet since 2013. They have more data and more manpower, but haven’t come up with a prediction algorithm more reliable than “inconclusive” (at least that’s as much as their published work implies!). This isn’t Backblaze’s fault—they’re doing an incredible job. It’s just looking more and more like SMART stats might be the bottleneck in producing accurate predictors. Backblaze has come up with some awesome insights into what real world conclusions can be drawn from SMART stats, but it looks like those conclusions aren’t sure enough to bet the budget for a datacenter on.
Where do we go from here? I can confidently say that the problem space as it’s currently defined has been exhausted and we won’t get any better results without changing some things. SmarterCTL would be worth revisiting if:
- We collect lots more SMART data
- This alone likely won’t be enough, evidenced by how large Backblaze’s sample is and how similar their results are to mine.
- We collect other data in addition to SMART stats
- I’m not sure what other data is available that I don’t already have, but if any becomes available then it would be worth revisiting this to see if external information can add clarity to SMART stats.
- We refine the problem space
- If there’s a need to track very specific failure cases rather than all failure in general, then a machine learning model may have better luck. If, hypothetically, many 6TB drives from Manufacturer A started failing in similar ways, a new model could be produced to make predictions for only that situation. Narrowing the scope of the problem should increase accuracy.
Machine learning is really cool but it’s not a magic bullet. It can make sense of patterns in data and extrapolate them into predictions, but it can’t come up with inferences that aren’t soundly supported by the data. Machine learning is essentially the same as human learning, just really fast and with better memory—it can’t solve the unsolvable. SmarterCTL was a fun project that revealed a lot about how our fleet of hard drives have behaved and evolved, but it’s unlikely that SMART stats alone will ever be the key to predicting drive failure with a high degree of accuracy.