Don't Make This Blunder When Using Multiple Embedding Models in Your ML Pipeline
Here's a mistake that often goes unnoticed.
Imagine you have two different models (or sub-networks) in your whole ML pipeline.
Both generate a representation/embedding of the input with the same dimensionality (say, 200).
These could be pre-trained models used to generate embeddings (BERT, XLNet, etc.), or any other embedding network for that matter.
Here, many folks get tempted to make them interact.
They would:
compare these representations
compute their Euclidean distance
compute their cosine similarity, and more.
The rationale is that as the representations have the same dimensions, they can seamlessly interact.
However, that is NOT true, and you should NEVER do that.
Why?
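This is because even though these embeddings have the same length (or dimensionality), they do not live in the same vector space.
Not being in the same space means that their axes are not aligned: a given coordinate in one model's embedding does not carry the same meaning as that coordinate in the other's.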
To simplify, imagine both embeddings were in a 3D space.
Now, assume that their z-axes are aligned.
But the x-axis of one of them is at an angle to the x-axis of the other.
Now, of course, both embeddings have the same number of dimensions: 3.
But can you compare them?
No, right?
Similarly, comparing the embeddings from the two networks above would inherently assume that all axes are perfectly aligned.
But this is highly unlikely because there are infinitely many ways axes may orient relative to each other.
Thus, the representations should NEVER be compared directly unless they were generated by the same model (or have been explicitly mapped into a shared space).
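To make the 3D picture above concrete, here is a minimal NumPy sketch. It is a toy setup, not how real models differ: "model B" is simulated (my assumption) as model A with its x and y axes rotated by 90 degrees about a shared z-axis, and the vectors `a_cat` and `a_dog` are made-up embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_about_z(theta):
    """3x3 rotation matrix that keeps the z-axis fixed."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical embeddings of two similar inputs from "model A".
a_cat = np.array([1.0, 0.0, 0.0])
a_dog = np.array([0.9, 0.1, 0.0])

# Assumption: "model B" encodes the same information, but its x/y axes
# are rotated by 90 degrees relative to model A (z-axes aligned).
R = rotation_about_z(np.pi / 2)
b_cat = R @ a_cat
b_dog = R @ a_dog

within_a = cosine(a_cat, a_dog)          # similarity inside model A's space
within_b = cosine(b_cat, b_dog)          # similarity inside model B's space
cross_same_input = cosine(a_cat, b_cat)  # same input, compared across spaces

print(f"within A: {within_a:.3f}")                     # about 0.994
print(f"within B: {within_b:.3f}")                     # about 0.994
print(f"A vs. B, same input: {cross_same_input:.3f}")  # about 0.000
```

Within each space, the two inputs look almost identical. Yet the very same input compared across the two spaces looks unrelated, even though both vectors have 3 dimensions. Real models differ by far more than a clean rotation, so the mismatch is typically harder to spot than in this sketch.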
I vividly remember making this mistake once, and it caused serious trouble in my ML pipeline.
And I think if you are not aware of this, then it is something that can easily go unnoticed.
Instead, concatenation is a better way to leverage multiple embeddings.
The good thing is that concatenation works even if the embeddings have unequal dimensions, and it leaves it to the downstream model to learn how to use each part.
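As a rough sketch of concatenation, the snippet below joins embeddings of different sizes along the feature axis. The arrays `emb_a` and `emb_b` are random stand-ins for real model outputs, the sizes 200 and 128 are arbitrary, and the L2-normalization step is my own optional assumption (it keeps one model's scale from dominating the other), not something the approach requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 4

# Random stand-ins for the outputs of two different embedding models.
emb_a = rng.normal(size=(n_samples, 200))  # hypothetical model A: 200 dims
emb_b = rng.normal(size=(n_samples, 128))  # hypothetical model B: 128 dims

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (optional preprocessing step)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Concatenate along the feature axis; rows still correspond to the same inputs.
combined = np.concatenate([l2_normalize(emb_a), l2_normalize(emb_b)], axis=1)

print(combined.shape)  # (4, 328)
```

The `combined` matrix can then be fed to a downstream classifier or network, which never has to assume that the two blocks of features share the same axes.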
👉 Over to you: How do you typically handle embeddings from multiple models?
Thanks for reading!