How To Avoid Getting Misled by t-SNE Projections?
Some key and lesser-known observations from t-SNE results.
t-SNE is among the most powerful dimensionality reduction techniques to visualize high-dimensional datasets.
In my experience, most folks have at least heard of the t-SNE algorithm.
In fact, did you know that it was first proposed 15 years ago?
So there’s definitely a reason why it continues to be one of the most powerful dimensionality reduction approaches today.
If you are curious to learn more, I have a full 25-minute deep dive on t-SNE that explains everything from scratch: Formulating and Implementing the t-SNE Algorithm From Scratch.
Despite its popularity, many people consistently draw misleading conclusions from the t-SNE projections of their high-dimensional data.
In this post, I want to point out a few of these mistakes so that you can avoid them.
To begin, the performance of the t-SNE algorithm is primarily reliant on perplexity, a hyperparameter of t-SNE.
That is why it is considered the most important hyperparameter in the t-SNE algorithm.
Simply put, the perplexity value provides a rough estimate for the number of neighbors a point may have in a cluster.
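To see why perplexity reads as a "number of neighbors", it helps to compute it for a toy distribution. The t-SNE paper defines perplexity as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. Here is a minimal sketch of that definition; the function name and the two example distributions are made up for illustration and are not part of any library:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a discrete probability distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with zero probability contribute nothing to the entropy
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits

# Hypothetical neighbor distributions for a single point
uniform_over_5 = np.full(5, 1 / 5)                  # five equally likely neighbors
peaked = np.array([0.96, 0.01, 0.01, 0.01, 0.01])   # one dominant neighbor

print(perplexity(uniform_over_5))  # close to 5.0
print(perplexity(peaked))          # much closer to 1 than to 5
```

A distribution spread evenly over k neighbors has perplexity k, which is where the "effective number of neighbors" reading comes from.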
And different values of perplexity create very different low-dimensional cluster spaces, as depicted below:
As shown above, most projections do depict the original clusters. However, they vary significantly in shape.
There are five takeaways from the above image:
NEVER make any conclusions about the original cluster shape by looking at these projections.
Different projections have different low-dimensional cluster shapes, and they do not resemble the original cluster shape.
For low perplexity values (5 and 10), cluster shapes significantly differ from the original ones.
In this case, the clusters were color-coded, which provided more clarity. But that may not always be possible, as t-SNE is an unsupervised algorithm.
Cluster sizes in a t-SNE plot do not convey anything either.
The dimensions (or coordinates of data points) created by t-SNE in low dimensions have no inherent meaning.
The axis tick labels of the low-dimensional plots differ from one projection to the next and are essentially arbitrary.
Unlike PCA's principal components, which are at least linear combinations of the original features, they offer no interpretability.
The distances between clusters in a projection do not mean anything.
In the original dataset, the blue and red clusters are close.
Yet, most projections do not preserve the global structure of the original dataset.
Strange things happen at perplexity=2 and perplexity=100.
At perplexity=2, the low-dimensional mapping conveys nothing. As discussed earlier, the perplexity value provides a rough estimate of the number of neighbors a point may have in a cluster. With perplexity=2, t-SNE tries to maintain approximately 2 points per cluster, which explains the distortion.
At perplexity=100, the global structure is preserved, but the local structure gets distorted.
Thus, tweaking the perplexity hyperparameter is critical here.
That is why I mentioned above that it is the most important hyperparameter of this algorithm.
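The point about cluster sizes is easy to check with a small experiment. The sketch below is not the dataset used in this post: it assumes two synthetic 10-dimensional blobs from scikit-learn's make_blobs, one ten times more spread out than the other, and compares their relative spread before and after t-SNE. The spread helper and all parameter values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Two well-separated blobs; the second is ~10x more spread out (assumed toy data)
centers = np.array([[0.0] * 10, [50.0] * 10])
X, y = make_blobs(n_samples=[100, 100], centers=centers,
                  cluster_std=[0.5, 5.0], random_state=0)

def spread(points):
    """Mean distance of the points from their own centroid."""
    return np.linalg.norm(points - points.mean(axis=0), axis=1).mean()

# Size ratio of the two clusters in the original space
original_ratio = spread(X[y == 1]) / spread(X[y == 0])

# Size ratio of the same two clusters in the 2D t-SNE projection
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
embedded_ratio = spread(embedding[y == 1]) / spread(embedding[y == 0])

print(f"size ratio in original space: {original_ratio:.2f}")
print(f"size ratio in t-SNE plot:     {embedded_ratio:.2f}")
```

If the two ratios differ a lot, the apparent cluster sizes in the plot are not telling you about the spread in the original data. The exact numbers will depend on the data, the perplexity, and the random seed.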
As a concluding note, the original t-SNE paper notes that typical perplexity values lie in the range [5, 50].
So try experimenting in that range and see what looks promising.
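A perplexity sweep like this takes only a few lines with scikit-learn. This is a minimal sketch, assuming a synthetic make_blobs dataset in place of your own data; the three perplexity values are arbitrary picks from the suggested range.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Stand-in dataset (an assumption for this sketch); replace X with your own features
X, y = make_blobs(n_samples=200, centers=3, n_features=10, random_state=0)

embeddings = {}
for perp in (5, 30, 50):
    # Fixing init and random_state keeps the perplexity as the only thing that changes
    tsne = TSNE(n_components=2, perplexity=perp, init="pca", random_state=0)
    embeddings[perp] = tsne.fit_transform(X)
    print(f"perplexity={perp}: shape={embeddings[perp].shape}, "
          f"KL divergence={tsne.kl_divergence_:.3f}")
```

Each entry in embeddings can then be drawn as a scatter plot (colored by y when labels are available) and compared side by side. Note that scikit-learn requires the perplexity to be smaller than the number of samples.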
Next time you use t-SNE, consider the above points, as these plots can get tricky to interpret.
This is especially true if you don’t understand the internal workings of this algorithm.
Understanding the algorithm will go a long way toward building an intuition for interpreting its output.
👉 Over to you: What are some other common mistakes people make when using t-SNE?
Thanks for reading!