Imagine you have two different models (or sub-networks) in your ML pipeline. Both generate a representation/embedding of the input with the same number of dimensions (say, 200).
These could also be pre-trained models used to generate embeddings (BERT, XLNet, etc.), or any other embedding network for that matter.
Here, many folks get tempted to make them interact. They would:
compare these representations
compute their Euclidean distance
compute their cosine similarity, and more.
The rationale is that as the representations have the same dimensions, they can seamlessly interact.
However, that is NOT true, and you should NEVER do that.
Why?
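This is because even though these embeddings have the same length (number of dimensions), they are not in the same space: each model learns its own coordinate system.
In other words, their axes are not aligned, so dimension i of one embedding does not mean the same thing as dimension i of the other.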
To simplify, imagine both embeddings were in a 3D space.
Now, assume that their z-axes are aligned, but the x and y axes of the first are at an angle to the x and y axes of the second.
Now, of course, both embeddings have the same dimensions — 3.
But can you compare them?
No, right?
Similarly, comparing the embeddings from the two networks above would inherently assume that all axes are perfectly aligned.
But this is highly unlikely because there are infinitely many ways axes may orient relative to each other.
Thus, the representations can NEVER be meaningfully compared unless they were generated by the same model (or by models that were explicitly trained or aligned to share one space).
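To make the 3D picture above concrete, here is a minimal sketch in plain Python. It assumes a made-up "model B" that has learned exactly the same geometry as "model A", except that its x and y axes are rotated 90 degrees about the shared z-axis. The vectors and names (`cat_a`, `dog_a`, `rotate_about_z`) are hypothetical and only serve the illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate_about_z(v, angle_deg):
    """Rotate a 3D vector about the z-axis (z stays aligned, x/y do not)."""
    t = math.radians(angle_deg)
    x, y, z = v
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t),
            z)

# Hypothetical embeddings of two similar inputs from "model A".
cat_a = (1.0, 0.0, 0.2)
dog_a = (0.9, 0.3, 0.2)

# Assumption: "model B" encodes the same geometry in a rotated coordinate system.
cat_b = rotate_about_z(cat_a, 90)
dog_b = rotate_about_z(dog_a, 90)

within_a = cosine(cat_a, dog_a)   # comparison inside model A's space
within_b = cosine(cat_b, dog_b)   # comparison inside model B's space
across = cosine(cat_a, cat_b)     # SAME input, compared across spaces

print(f"within A: {within_a:.3f}")
print(f"within B: {within_b:.3f}")
print(f"across (same input): {across:.3f}")
```

Within each space, the two inputs look equally similar, because a rotation preserves angles. Across spaces, the very same input looks almost unrelated to itself, which is exactly the failure mode described above.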
I vividly remember making this mistake once, and it caused serious trouble in my ML pipeline.
And I think if you are not aware of this, then it is something that can easily go unnoticed.
Instead, I have always found that concatenation is a much better way to leverage multiple embeddings.
The good thing is that concatenation works even if they have unequal dimensions.
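Here is a minimal sketch of concatenating two embeddings of unequal size, in plain Python. The vectors are made up, and the L2 normalization of each embedding before concatenation is my own assumption (a common way to keep one model's scale from dominating), not something the approach requires.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def concat_embeddings(*embeddings):
    """Concatenate any number of embeddings into one feature vector.

    Each embedding is normalized first (an optional, assumed step), then
    appended as-is. No axis of one embedding is ever compared with an
    axis of another.
    """
    combined = []
    for emb in embeddings:
        combined.extend(l2_normalize(emb))
    return combined

emb_a = [0.5, -1.2, 0.3, 2.0]   # hypothetical 4-d embedding from model A
emb_b = [10.0, -4.0, 7.5]       # hypothetical 3-d embedding from model B

features = concat_embeddings(emb_a, emb_b)
print(len(features))  # 4 + 3 = 7 dimensions
```

The combined vector is then fed to a downstream model, which learns how to weigh each part instead of assuming the two spaces line up.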
👉 Over to you: How do you typically handle embeddings from multiple models?
You could retrain the two embedding models jointly by minimizing the L2 distance between the embeddings of the same input and maximizing it for different inputs.
This training can be achieved using a contrastive loss.
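As a rough sketch of that idea, the per-pair loss could look like the following in plain Python. This assumes the common margin-based form of the pairwise contrastive loss. The function names and the `margin` parameter are illustrative, and a real setup would compute this in an autodiff framework and backpropagate into both models.

```python
import math

def euclidean(u, v):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb_a, emb_b, same_input, margin=1.0):
    """Pairwise contrastive loss for one (model A, model B) embedding pair.

    same_input=True:  pull the pair together (penalize squared distance).
    same_input=False: push the pair apart until it is at least `margin`
                      away (no penalty beyond the margin).
    """
    d = euclidean(emb_a, emb_b)
    if same_input:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Same input embedded far apart by the two models: large loss.
loss_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_input=True)

# Different inputs already farther apart than the margin: zero loss.
loss_diff = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_input=False)

print(loss_same, loss_diff)
```

Minimizing this over many pairs drives both models toward a shared space, at which point distances between their embeddings start to carry meaning.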
That's an interesting point. Thank you, AVI CHAWLA!