In our previous post, we went over two of the most common problems machine learning engineers face when developing a model: underfitting and overfitting.
We saw how an underfitting model simply fails to learn from the data, while an overfitting one learns the training data almost by heart and therefore fails to generalize to new data.
Now, how do you solve these problems?
Let’s start with the most common and complex problem: overfitting. There are four main techniques you can try:
- Adding more data
Your model is overfitting when it fails to generalize to new data. That means the data it was trained on is not representative of the data it encounters in production. So, retraining your algorithm on a bigger, richer and more diverse dataset should improve its performance. Unfortunately, getting more data can prove very difficult, either because collecting it is expensive or because very few new samples are generated regularly. In that case, it might be a good idea to use data augmentation.
- Data augmentation
This is a set of techniques used to artificially increase the size of a dataset by applying transformations to the existing data. For instance, with images, you can flip them horizontally or vertically, crop them or rotate them. You can also convert them to grayscale or change the color saturation. As far as the algorithm is concerned, new data has been created. Of course, not all transformations are useful in every case, and sometimes your algorithm won’t be fooled… In short, data augmentation can be a very powerful tool, but it requires careful examination and understanding of your data.
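The geometric transformations above can be sketched in a few lines of numpy, using a toy array as a stand-in for a real image (in practice you would use a library such as torchvision or albumentations, but the idea is the same):

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy 4x4 grayscale "image" standing in for a real training sample.
image = rng.integers(0, 256, size=(4, 4))

# Simple geometric augmentations: each yields a new, valid training sample
# as long as the label is unchanged by the transformation.
augmented = [
    np.fliplr(image),   # horizontal flip
    np.flipud(image),   # vertical flip
    np.rot90(image),    # 90-degree rotation
]

# One original image has become four training samples.
dataset = [image] + augmented
print(len(dataset))  # 4 samples from 1 original
```

Note the caveat in the comment: a horizontal flip is fine for a cat photo, but it would silently corrupt the label of a handwritten digit like "6".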
- Regularization
Regularization actually refers to a large range of techniques, and we won’t list them all or go into detail here. The main idea to remember is that these techniques introduce a “complexity penalty” for your model. To avoid incurring that penalty, the model has to focus on the most prominent patterns, which have a better chance of generalizing well. Regularization techniques are very powerful, and almost all the models you build will use them in some way.
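One classic example of such a complexity penalty is L2 regularization (ridge regression), which penalizes large weights. A minimal numpy sketch of its closed-form solution, on illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Few samples, many features: a setup where an unregularized
# least-squares fit tends to inflate its weights to chase the noise.
n, d = 30, 20
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]  # only 3 features actually matter
X = rng.normal(size=(n, d))
y = X @ true_w + rng.normal(0, 0.5, n)

def ridge(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha * I)^{-1} X^T y.
    # alpha controls the strength of the L2 "complexity penalty".
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_plain = ridge(X, y, alpha=0.0)   # ordinary least squares, no penalty
w_reg = ridge(X, y, alpha=10.0)    # penalized fit

# The penalty shrinks the weight vector toward zero.
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

The penalized weights always have a smaller norm: the model is discouraged from using large coefficients unless they clearly pay off.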
- Removing features from data
Sometimes, your model may fail to generalize simply because the data it was trained on had too many features, giving it extra dimensions to latch onto instead of the patterns it should have detected. Removing some features and making your data simpler can help reduce overfitting.
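One simple version of this idea is variance-based feature selection: a column that barely varies carries almost no signal, yet still gives the model room to overfit. A small numpy sketch (the threshold value is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 samples: two informative features plus one near-constant noise column.
informative = rng.normal(size=(100, 2))
near_constant = np.full((100, 1), 3.0) + rng.normal(0, 1e-3, size=(100, 1))
X = np.hstack([informative, near_constant])

# Drop features whose variance falls below a threshold: they contribute
# almost nothing but still add dimensions the model can overfit on.
threshold = 0.01
variances = X.var(axis=0)
X_reduced = X[:, variances > threshold]

print(X.shape, X_reduced.shape)  # (100, 3) -> (100, 2)
```

In practice you would combine this with domain knowledge or more principled selection methods, but the goal is the same: a simpler input space for the model to learn from.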
It is important to understand that overfitting is a complex problem. You will almost systematically face it when you develop a deep learning model and you should not get discouraged if you are struggling to address it. Even the most experienced ML engineers spend a lot of time trying to solve it.
Let’s now switch to the problem of underfitting. Here is what you can try:
- Increasing the model complexity
Your model may be underfitting simply because it is not complex enough to capture patterns in the data. Using a more complex model, for instance by switching from a linear to a non-linear model or by adding hidden layers to your neural network, will very often help solve underfitting.
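The switch from a linear to a non-linear model can be sketched with numpy polynomials on illustrative synthetic data: a straight line cannot capture a quadratic relationship, however long you train it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic ground truth with a little noise.
x = np.linspace(-1, 1, 50)
y = x**2 + rng.normal(0, 0.05, 50)

def fit_error(degree):
    # Fit a polynomial of the given degree and measure its error.
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return np.mean((pred - y) ** 2)

linear_err = fit_error(1)     # degree-1 model: too simple, underfits
quadratic_err = fit_error(2)  # degree-2 model: matches the data's shape
print(linear_err, quadratic_err)
```

The linear model's error stays high no matter what, because the model family simply cannot represent the pattern; one extra degree of complexity fixes it.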
- Reducing regularization
The algorithms you use often include regularization parameters by default, meant to prevent overfitting. Sometimes these defaults are too aggressive and prevent the algorithm from learning at all. Reducing their values generally helps.
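Reusing the ridge-regression penalty from earlier, here is a small numpy sketch of what an over-aggressive penalty does (the alpha values are illustrative): with a huge alpha the weights are squashed toward zero and the model cannot fit even a strong, clean signal.

```python
import numpy as np

rng = np.random.default_rng(3)

# A strong, nearly noise-free linear signal.
X = rng.normal(size=(200, 5))
w_true = np.array([3.0, -2.0, 1.5, 0.5, -1.0])
y = X @ w_true + rng.normal(0, 0.1, 200)

def ridge_train_mse(alpha):
    # Closed-form ridge fit, then training error.
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    return np.mean((X @ w - y) ** 2)

strong = ridge_train_mse(1e4)  # penalty dominates: weights shrink, underfits
weak = ridge_train_mse(0.1)    # milder penalty: the model can actually learn
print(strong, weak)
```

Same data, same model family; only the regularization strength changes, and the over-penalized fit fails to capture the signal.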
- Adding features to training data
In contrast to overfitting, your model may be underfitting because the training data is too simple. It may lack the features that will make the model detect the relevant patterns to make accurate predictions. Adding features and complexity to your data can help overcome underfitting.
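A common concrete case is a missing interaction feature: if the target depends on the product of two raw columns, a linear model on the raw columns alone cannot represent it, but adding the product as a new feature makes the pattern learnable. A numpy sketch on illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Target depends on the *product* of two raw features.
a = rng.normal(size=200)
b = rng.normal(size=200)
y = a * b + rng.normal(0, 0.1, 200)

def lstsq_mse(X):
    # Least-squares fit, then training error.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ w - y) ** 2)

X_raw = np.column_stack([a, b, np.ones(200)])
X_rich = np.column_stack([a, b, a * b, np.ones(200)])  # add interaction

mse_raw = lstsq_mse(X_raw)    # raw features only: the model underfits
mse_rich = lstsq_mse(X_rich)  # with the interaction feature: it fits well
print(mse_raw, mse_rich)
```

The model family never changed; giving the data the feature it was missing is what resolved the underfitting.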
Did you notice? That’s right: adding more data is not among the techniques for solving underfitting. Indeed, if your data lacks the decisive features that would let your model detect patterns, you can multiply your training set size by 2, 5 or even 10, and it won’t make your algorithm any better!
Unfortunately, it has become a reflex in the industry: no matter what problem their model is facing, many engineers assume that throwing more data at it will solve it. Given how time-consuming and expensive collecting data can be, this is a mistake that can seriously harm or even jeopardize a project.
Being able to diagnose and tackle underfitting/overfitting is an essential part of the process of developing a good model. Of course, there are lots of other techniques to solve these problems on top of the ones we just listed. But these are the most important ones and if you manage to master them, you will be well equipped to start your machine learning journey!