Why Dropout is Not Substantially Powerful for Regularizing CNNs
...and here's an alternative technique you should use.
In yesterday’s post, I shared CNN Explainer — an interactive tool to visually understand CNNs and their internal working.
Today, we shall continue our discussion on CNNs and understand an overlooked issue while regularizing these networks using Dropout.
If you haven’t read yesterday’s issue yet, it’s okay. This post does not require you to read that. You can read it after reading this post. Here’s the link: CNN Explainer: An Interactive Tool You Always Wanted to Try to Understand CNNs.
Let’s begin!
Background
When it comes to training neural networks, it is commonly recommended to use Dropout (and other regularization techniques) to improve their generalization power.
This applies not just to CNNs but to all other neural networks.
And I am sure you already know the above details, so let’s get into the interesting part.
The problem of using Dropout in CNNs
The core operation that makes CNNs so powerful is convolution, which allows them to capture local patterns, such as edges and textures, and helps extract relevant information from the input.
The animation below depicts how the convolution operation works.
From a purely mathematical perspective, we slide a filter (shown in yellow below) over the input (shown in green below), multiply the filter element-wise with the overlapped input, and sum the products to get the convolution output:
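To make the sliding-window computation concrete, here is a minimal sketch in plain Python. The function name conv2d_valid and the 5x5 input and 3x3 filter values are hypothetical, chosen only for illustration. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the overlapped patch, then sum.
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 input and 3x3 filter (illustrative values only).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Each output value is computed from a whole 3x3 neighborhood of the input, which is why neighboring input pixels matter so much in the discussion that follows.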
Here, if we were to apply the traditional Dropout, the input features would look something like this:
In fully connected layers, we zero out neurons. In CNNs, however, we randomly zero out the pixel values before convolution, as depicted above.
But this has not been found to be very effective for convolution layers specifically.
To understand this, consider we have some image data. In every image, we would find that nearby features (or pixels) are highly correlated spatially.
For instance, imagine zooming in on the pixel level of the digit ‘9’.
Here, we would notice that the red pixel (or feature) is highly correlated with other features in its vicinity:
Thus, dropping the red feature using Dropout will likely have little effect, because its information can still reach the next layer through its correlated neighbors.
Simply put, the nature of the convolution operation defeats the entire purpose of the traditional Dropout procedure.
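A toy illustration of this argument, under the assumption of a perfectly smooth image: the helper neighbor_mean and the 5x5 ramp image below are hypothetical and only meant to show how much a pixel's neighbors reveal about it. When a single pixel is zeroed, its neighbors still reconstruct it. When the surrounding block is zeroed as well, that information is gone.

```python
def neighbor_mean(image, row, col):
    """Average the 8 neighbors of (row, col), excluding the pixel itself."""
    values = [
        image[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        if (r, c) != (row, col)
    ]
    return sum(values) / len(values)


# A smooth 5x5 "image": intensity rises steadily, so nearby pixels
# are strongly correlated (an idealized assumption).
image = [[float(r + c) for c in range(5)] for r in range(5)]
original_value = image[2][2]

# Dropout-style: zero out only the center pixel.
pixel_dropped = [row[:] for row in image]
pixel_dropped[2][2] = 0.0
estimate_after_pixel_drop = neighbor_mean(pixel_dropped, 2, 2)

# Block-style: zero out the whole 3x3 region around the center.
block_dropped = [row[:] for row in image]
for r in range(1, 4):
    for c in range(1, 4):
        block_dropped[r][c] = 0.0
estimate_after_block_drop = neighbor_mean(block_dropped, 2, 2)

print(original_value)             # 4.0
print(estimate_after_pixel_drop)  # 4.0 (neighbors still carry the value)
print(estimate_after_block_drop)  # 0.0 (the local information is removed)
```

Real images are not this smooth, so the recovery would be approximate rather than exact, but the effect points in the same direction.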
The solution
DropBlock is a more effective and intuitive way to regularize CNNs.
The core idea in DropBlock is to drop a contiguous region of features (or pixels) rather than individual pixels.
This is depicted below:
With Dropout in fully connected layers, the network learns more robust ways to fit the data in the absence of some activations. Similarly, with DropBlock, the convolution layers learn to fit the data despite the absence of a whole block of features.
Moreover, the idea of DropBlock also makes intuitive sense: when a contiguous region of features is dropped, the correlated neighbors are dropped together, so the network can no longer recover the missing information from them.
DropBlock parameters
DropBlock has two main parameters:
block_size: The size of the box to be dropped.
drop_rate: The drop probability of the central pixel.
To apply DropBlock, first, we create a binary mask on the input sampled from the Bernoulli distribution:
Next, around every sampled pixel, we create a block of size block_size*block_size that has the sampled pixel at its center:
Done!
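The two steps above can be sketched in plain Python as follows. This is a simplified illustration, not the reference implementation: the function name dropblock_mask is hypothetical, blocks are simply clipped at the image border, and the sketch does not rescale the kept activations as a full implementation typically would.

```python
import random


def dropblock_mask(height, width, block_size, drop_rate, rng):
    """Return a height x width mask of 1s (keep) and 0s (drop)."""
    # Step 1: sample block centers from a Bernoulli(drop_rate) distribution.
    centers = [
        (r, c)
        for r in range(height)
        for c in range(width)
        if rng.random() < drop_rate
    ]

    # Step 2: zero out a block_size x block_size square around each center.
    mask = [[1] * width for _ in range(height)]
    half = block_size // 2
    for r, c in centers:
        for rr in range(r - half, r - half + block_size):
            for cc in range(c - half, c - half + block_size):
                # Clip the block at the border (a simplifying assumption).
                if 0 <= rr < height and 0 <= cc < width:
                    mask[rr][cc] = 0
    return mask


# Example: a 7x7 feature map with 3x3 blocks and a 10% center drop rate.
mask = dropblock_mask(7, 7, block_size=3, drop_rate=0.1, rng=random.Random(0))
for row in mask:
    print(" ".join(str(v) for v in row))
```

Multiplying a feature map element-wise by this mask zeroes whole square regions instead of isolated pixels. Note that because each sampled center removes up to block_size*block_size features, the fraction of features actually dropped is larger than drop_rate.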
The efficacy of DropBlock over Dropout is evident from the results table below:
On the ImageNet classification dataset:
DropBlock provides a 1.33 percentage-point accuracy gain over Dropout.
DropBlock with Label smoothing provides a 1.55 percentage-point accuracy gain over Dropout.
What is label smoothing? We covered it in this newsletter issue: Label Smoothing: The Overlooked and Lesser-Talked Regularization Technique.
Thankfully, DropBlock is also available in the PyTorch ecosystem: torchvision provides it as torchvision.ops.DropBlock2d.
There’s also a library for DropBlock, called “dropblock,” which also provides a linear scheduler for drop_rate.
This matters because the researchers who proposed DropBlock found the technique to be more effective when the drop_rate was increased gradually during training.
The DropBlock library implements the scheduler. But of course, there are ways to do this in PyTorch as well. So it’s entirely up to you which implementation you want to use:
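If you prefer to roll your own schedule, a linear ramp is only a few lines. The sketch below is a hypothetical helper in plain Python, not the API of the dropblock library or of PyTorch. The name linear_drop_rate, its parameters, and the default rates are assumptions made for illustration.

```python
def linear_drop_rate(step, total_steps, start_rate=0.0, final_rate=0.1):
    """Linearly ramp drop_rate from start_rate to final_rate."""
    if total_steps <= 0:
        return final_rate
    # Clamp progress to [0, 1] so the rate stays fixed after the ramp ends.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_rate + (final_rate - start_rate) * progress


# Example: print the rate at a few points of a 1000-step ramp.
for step in (0, 250, 500, 1000, 1500):
    print(step, round(linear_drop_rate(step, total_steps=1000), 4))
```

In a training loop, you would compute this value at every step (or epoch) and pass it to whichever DropBlock layer you are using before the forward pass.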
Isn’t that a simple, intuitive, and better regularization technique for CNNs?
👉 Over to you: What are some other ways to regularize CNNs specifically?
Thanks for reading!