Effortlessly Scale tSNE to Millions of Data Points With openTSNE
The most optimized CPU implementation of tSNE.
In yesterday’s post, we discussed accelerating the tSNE algorithm using GPUs for faster processing.
Here’s the visual from that post for a quick recap:
In short, the idea was to use tSNE-CUDA, an optimized CUDA implementation of tSNE that, as the name suggests, can leverage GPU hardware acceleration.
And why is an optimized implementation needed in the first place?
It’s needed because the biggest issue with tSNE (which we also discussed here) is that, in its exact formulation, its run time grows quadratically with the number of data points.
Thus, it can get pretty difficult to use tSNE from Sklearn for large datasets.
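To get a feel for what quadratic growth means here, the short sketch below counts the point pairs that an exact, all-pairs formulation has to consider. The function name `pairwise_interactions` is made up for this illustration, and the count is only a rough proxy for work, not a run-time measurement.

```python
def pairwise_interactions(n_points: int) -> int:
    """Number of unordered point pairs among n_points data points."""
    return n_points * (n_points - 1) // 2


# Illustrative dataset sizes; the last two match the sizes mentioned in this post.
for n in (10_000, 250_000, 1_000_000):
    print(f"{n:>9,} points -> {pairwise_interactions(n):>15,} pairs")
```

Going from 250k to 1 million points is a 4x increase in data but roughly a 16x increase in pairs, which is why an all-pairs approach stops being practical so quickly.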
tSNE-CUDA addressed this by providing immense speedups over the standard Sklearn implementation using a GPU.
However, after that post went out, a few readers raised concerns about not having dedicated access to a GPU to run tSNE-CUDA.
They asked whether there is an alternative optimized implementation of tSNE that can run on a CPU.
Answer: Of course, there is!
openTSNE is another optimized Python implementation of tSNE that I discovered during my exploration. It provides massive speed improvements and lets us scale tSNE to millions of data points, a scale the Sklearn implementation may never reach.
The effectiveness is evident from the image below:
As depicted above, the openTSNE implementation:
is 20 times faster than the Sklearn implementation.
produces clustering of similar quality to the Sklearn implementation.
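If you want to try this yourself, here is a minimal sketch of the basic openTSNE workflow. It assumes openTSNE and NumPy are installed (`pip install openTSNE`). The helper names `make_synthetic_data` and `embed_with_opentsne`, the random blob data, and the parameter values are assumptions made for this illustration; they are not taken from the benchmark above.

```python
import numpy as np


def make_synthetic_data(n_samples=2_000, n_features=50, n_clusters=3, seed=0):
    """Build toy Gaussian blobs (a stand-in for a real dataset)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=10.0, size=(n_clusters, n_features))
    labels = rng.integers(0, n_clusters, size=n_samples)
    # Each point is its cluster center plus unit Gaussian noise.
    return centers[labels] + rng.normal(size=(n_samples, n_features))


def embed_with_opentsne(X, perplexity=30, n_jobs=-1, seed=42):
    """Run openTSNE on X and return a 2D embedding as a plain array."""
    # Imported here so this helper can be defined without openTSNE installed.
    from openTSNE import TSNE

    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        n_jobs=n_jobs,        # -1 uses all available CPU cores
        random_state=seed,
    )
    # fit() returns an embedding object that behaves like a NumPy array.
    return np.asarray(tsne.fit(X))
```

Calling `embed_with_opentsne(make_synthetic_data())` should return an array with one 2D coordinate per input row, which you can pass straight to a scatter plot.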
The authors have also provided the following benchmarking results:
As depicted above, openTSNE can produce a low-dimensional visualization of a million data points in just ~15 minutes.
In contrast, their benchmarks show that the run time of the Sklearn implementation already reaches a couple of hours at just ~250k data points.
Isn’t that an insane speedup, and all without ever using a GPU?
Download the notebook here to try out openTSNE: openTSNE Jupyter notebook.
👉 Over to you: What are some other ways to boost the tSNE algorithm?
Thanks for reading!