Multivariate Covariate Shift — Part 2

Dealing with some real-world problems of ML models.

Aug 01, 2024

Daily Dose of Data Science

Yesterday’s post on multivariate covariate shift was appreciated by many of you. In that post, I left you with a question:

How can we detect multivariate covariate shift?

Today, let’s discuss this.

To recap, covariate shift happens when the distribution of the input features (covariates) changes over time after model deployment.

It is a serious problem because the deployed model was trained on one distribution.

But gradually, covariate shift steps in and degrades the model’s performance because the production environment starts feeding the model data from another distribution.

Common approaches to determine covariate shift between training data and production data include:

Comparing their summary statistics — mean, median, etc.
Inspecting differences visually using distribution plots.
Performing hypothesis testing.
Measuring distances between training/production distributions using the Bhattacharyya distance, the KS statistic, etc.
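To make the distance idea concrete, here is a minimal sketch of a per-feature check using the two-sample KS statistic (the largest gap between two empirical CDFs). The data is synthetic and made up for illustration, and `ks_statistic` is a hypothetical helper written in plain NumPy. In practice, `scipy.stats.ks_2samp` gives the same statistic along with a p-value.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
# Hypothetical data: two features, only the second one drifts in production.
train = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(2000, 2))
prod = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(2000, 2))

# One statistic per feature: each column is compared on its own.
per_feature_ks = [ks_statistic(train[:, j], prod[:, j]) for j in range(train.shape[1])]
print(per_feature_ks)
```

The first feature should get a small statistic and the shifted one a clearly larger value. Note that the loop looks at each column in isolation.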

But the problem is that, as commonly applied, these approaches only detect univariate covariate shift: they look at one feature at a time.

And as we saw yesterday, real-world models may encounter multivariate covariate shift as well, which is evident from the image below:

Blue → Training data; Red → Production data.

From the KDE plots on the top and the right, it is clear that the individual (marginal) distribution of each feature (covariate) is almost the same in training and production.
But the scatter plot reveals that their joint distribution in training (Blue) differs from that in production (Red).
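Here is a small synthetic sketch of the same situation. The numbers are my own assumptions (standard-normal features, correlation +0.8 in training and -0.8 in production), not the data behind the plot, but they reproduce the pattern: per-feature summaries barely move while the relationship between the features flips.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
# Assumed setup: same marginals, opposite correlation between the two features.
train = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)
prod = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], size=n)

# Per-feature summaries: largest gap in mean and in standard deviation.
mean_gap = np.abs(train.mean(axis=0) - prod.mean(axis=0))
std_gap = np.abs(train.std(axis=0) - prod.std(axis=0))

# Joint behaviour: correlation between feature 0 and feature 1.
corr_train = np.corrcoef(train, rowvar=False)[0, 1]
corr_prod = np.corrcoef(prod, rowvar=False)[0, 1]

print("max mean gap:", mean_gap.max(), "max std gap:", std_gap.max())
print("correlation (train vs. prod):", corr_train, corr_prod)
```

Any check that only looks at one column at a time sees nearly identical data here, even though the two datasets occupy different regions of the feature space.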

So how can we detect multivariate covariate shift?

To begin, it is important to understand that multivariate covariate shift is a hard problem, and there is no single, direct approach to handling it.

Below, I will share a couple of ideas that I often use myself and have seen others use as well to detect multivariate covariate shift.

Idea #1: Rare possibility of multivariate covariate shift

As long as we are checking two (or even three) features at a time, visual inspection can be used to detect covariate shift.

So, at times, many practitioners simply ignore the possibility of covariate shift that involves more than three features at once.

Select <= 3 features to detect multivariate covariate shift

The rationale is that beyond three features, it is pretty unlikely that:

The individual distributions P(X1), P(X2), P(X3), P(X4), … all remain almost the same.
But their joint distribution P(X1, X2, X3, X4, …) changes.

Thus, it might be fair to limit multivariate covariate shift analysis to just one, two, and three features at a time.
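Visual inspection does not scale well once there are many pairs to look at, so here is one possible numeric stand-in (my own substitution, not a standard recipe). For every pair of features, it bins both datasets on a shared 2D grid and measures the total variation distance between the two histograms. The helper name `pairwise_joint_distance`, the bin count, and the toy data are all assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def pairwise_joint_distance(train, prod, bins=10):
    """Total variation distance between 2D histograms, for every feature pair."""
    distances = {}
    for i, j in combinations(range(train.shape[1]), 2):
        both = np.vstack([train[:, [i, j]], prod[:, [i, j]]])
        # Shared bin edges so the two histograms are directly comparable.
        edges = [np.linspace(both[:, k].min(), both[:, k].max(), bins + 1) for k in range(2)]
        h_train, _, _ = np.histogram2d(train[:, i], train[:, j], bins=edges)
        h_prod, _, _ = np.histogram2d(prod[:, i], prod[:, j], bins=edges)
        p, q = h_train / h_train.sum(), h_prod / h_prod.sum()
        distances[(i, j)] = 0.5 * np.abs(p - q).sum()
    return distances

rng = np.random.default_rng(1)
n = 5000
# Assumed toy data: features 0 and 1 flip their correlation, feature 2 is untouched.
cov_train = [[1, 0.8, 0], [0.8, 1, 0], [0, 0, 1]]
cov_prod = [[1, -0.8, 0], [-0.8, 1, 0], [0, 0, 1]]
train = rng.multivariate_normal(np.zeros(3), cov_train, size=n)
prod = rng.multivariate_normal(np.zeros(3), cov_prod, size=n)

distances = pairwise_joint_distance(train, prod)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

The pair (0, 1) should stand out with a much larger distance than the other two pairs. The number of pairs (and triples) grows quickly with the number of features, which is one reason this style of check is usually limited to small subsets.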

But of course, this may not always be true, which brings us to another idea:

Idea #2: Data reconstruction

This is another cool and practical idea that I used in one of my projects.

Data reconstruction, as the name suggests, revolves around learning a mapping that projects the data to low dimensions and then reconstructs the original data from the low dimensions.

Data reconstruction aims to construct the original data after projection

On a side note, this is precisely what Autoencoders do.

They are a class of neural networks that learn to encode data in a lower-dimensional space, and then decode it back to the original data space.

Using Autoencoders for data reconstruction

The objective is to minimize the data reconstruction error.

It’s like asking the model to learn a mapping that:

Takes some data.
Projects the data to low dimensions.
And then gives us the exact input data back.

So here’s what we can do:

Train an Autoencoder on the original training dataset. This will give us the weights for the neural network model that reconstructs the dataset.
Use this model on new data to check multivariate covariate shift:
- If the reconstruction loss is high, it indicates that the distribution has changed.
- If the reconstruction loss is low, it indicates that the distribution is almost the same.

This makes intuitive sense as well.
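The two steps above can be sketched with a tiny Autoencoder written directly in NumPy. This is only an illustration under my own assumptions: toy two-feature data with flipped correlation, a single tanh unit as the bottleneck, and hand-picked hyperparameters (learning rate, number of steps). A real project would more likely use a deep learning framework and a larger network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Assumed toy data: same marginals, correlation +0.9 (training) vs. -0.9 (production).
train = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=n)
prod = rng.multivariate_normal([0, 0], [[1, -0.9], [-0.9, 1]], size=n)

# Tiny Autoencoder: 2 features -> 1 tanh unit (bottleneck) -> 2 features.
W1, b1 = rng.normal(0, 0.5, (2, 1)), np.zeros(1)
W2, b2 = rng.normal(0, 0.5, (1, 2)), np.zeros(2)

def reconstruct(X):
    H = np.tanh(X @ W1 + b1)   # encode to one dimension
    return H, H @ W2 + b2      # decode back to two dimensions

def reconstruction_loss(X):
    return float(np.mean((reconstruct(X)[1] - X) ** 2))

lr = 0.05
for _ in range(3000):          # full-batch gradient descent on training data only
    H, X_hat = reconstruct(train)
    d_out = 2 * (X_hat - train) / train.size   # gradient of the mean squared error
    d_hidden = (d_out @ W2.T) * (1 - H ** 2)   # backpropagate through tanh
    W2 -= lr * (H.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (train.T @ d_hidden)
    b1 -= lr * d_hidden.sum(axis=0)

train_loss = reconstruction_loss(train)
prod_loss = reconstruction_loss(prod)   # no labels needed, only the features
print("train loss:", train_loss, "production loss:", prod_loss)
```

Since the network only learned to compress data that looks like the training set, the production loss should come out noticeably higher than the training loss in this setup.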

But you know, one of the best things about this approach is that it does not need a labeled dataset.

Essentially, Autoencoders aim to reconstruct the dataset without labels.

This is quite useful in real-world models because, as we discussed yesterday, in most cases, the true labels for production data are not immediately available.

Instead, they usually take some time to arrive.

But using Autoencoders, we can still check data reconstruction errors on the unlabeled data.

In fact, we don’t necessarily have to use Autoencoders. It was just an example here.

Other data reconstruction techniques can also be used, and PCA is one of them.
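For reference, a PCA version of the same check might look like the sketch below. It fits PCA on the training data with an SVD, keeps one component, and compares reconstruction errors. The toy data and the choice of one component are my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3000
# Assumed toy data: same marginals, correlation +0.9 (training) vs. -0.9 (production).
train = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=n)
prod = rng.multivariate_normal([0, 0], [[1, -0.9], [-0.9, 1]], size=n)

# Fit PCA on training data only: center, then take the top principal direction.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:1]   # keep 1 of the 2 components (assumed choice)

def pca_reconstruction_error(X):
    projected = (X - mean) @ components.T      # project to low dimensions
    restored = projected @ components + mean   # map back to the original space
    return float(np.mean((restored - X) ** 2))

train_err = pca_reconstruction_error(train)
prod_err = pca_reconstruction_error(prod)
print("train error:", train_err, "production error:", prod_err)
```

PCA only captures linear structure, so it is a reasonable first pass when the relationships between features are roughly linear, while an Autoencoder can pick up nonlinear ones.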

Nonetheless, when using any data reconstruction approach, it is important to take care of one thing.

Earlier, we discussed that:

If the reconstruction loss is high, it indicates that the distribution has changed.
If the reconstruction loss is low, it indicates that the distribution is almost the same.

But interpreting reconstruction loss can be pretty subjective, and it also needs some context.

For instance, if the reconstruction loss is 0.4 (say), how do we determine whether this is significant?

Yet again, considering the importance of this topic, I don’t want to rush through it.

We shall continue our discussion on this topic tomorrow. In the meantime, it’s over to you:

Think about how you would interpret the reconstruction loss to determine whether covariate shift has stepped in or not.
Also, what could be some limitations of data reconstruction approaches for detecting covariate shift?

I would love to hear from you :)
