The t-SNE algorithm is an improved version of the SNE algorithm, both used for dimensionality reduction.

While I assume you know the ins and outs of the SNE and t-SNE algorithms (if you don’t, you can read this detailed article I published earlier, where we also implemented it from scratch)…

…the core idea of **SNE (not t-SNE)** is the following:

Step 1) For every point (`i`) in the given high-dimensional data, convert the high-dimensional Euclidean distances to all other points (`j`) into conditional Gaussian probabilities.

For instance, consider the marked red point in the dataset on the left below.

Converting the Euclidean distances to all other points into Gaussian probabilities (the distribution on the right above) shows that the other red points have a higher probability of being its neighbors than the remaining points.
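Step 1 can be sketched in a few lines of NumPy. Note that the function name and the fixed bandwidth `sigma` are my own choices for illustration; the full algorithm tunes a per-point bandwidth via a binary search over perplexity:

```python
import numpy as np

def conditional_probs(X, i, sigma=1.0):
    """P(j|i): probability that point j is a neighbor of point i,
    based on a Gaussian kernel over squared Euclidean distances."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distances to every point
    logits = -d2 / (2 * sigma ** 2)
    logits[i] = -np.inf                    # a point is never its own neighbor
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
P = conditional_probs(X, 0)                # the nearer point gets higher probability
```

The probabilities for each point sum to 1, so each row of the high-dimensional data gets its own conditional distribution.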

Step 2) For every data point xᵢ, randomly initialize its counterpart yᵢ in 2-dimensional space. These will be our projections.

Step 3) Just like we defined conditional probabilities in the high-dimensional space in Step 1, we define the conditional probabilities in the low-dimensional space,

**using a Gaussian distribution again**.

Step 4) Now, every data point (`i`) has a high-dimensional probability distribution and a corresponding low-dimensional distribution:

The objective is to match these two probability distributions.

Thus, we can make the positions of counterpart yᵢ’s learnable such that this difference is minimized.

Using KL divergence as a loss function helps us achieve this. It measures how much information is lost when we use distribution `Q` to approximate distribution `P`.

Ideally, we want to have the minimum loss value (which is zero), and this will be achieved when `P = Q`.
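As a small sketch (the function name and the epsilon guard against `log(0)` are my own), KL divergence and its key property look like this:

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q): information lost when Q is used to approximate P."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

P = np.array([0.7, 0.2, 0.1])
kl_divergence(P, P)                   # 0.0: no loss when Q matches P exactly
kl_divergence(P, [0.1, 0.2, 0.7])     # positive: Q approximates P poorly
```

Because the loss is zero only at `P = Q`, minimizing it pulls the low-dimensional distributions toward the high-dimensional ones.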

The model can be trained using gradient descent, and it works pretty well.
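Here is a minimal, self-contained sketch of that training loop on toy data. The fixed bandwidth of 1, the learning rate, and the iteration count are all assumptions on my part (real implementations tune the bandwidth via perplexity and add momentum), but the gradient is the standard SNE gradient:

```python
import numpy as np

def pairwise_probs(Z):
    """Conditional probabilities P(j|i) for every row of Z (Gaussian kernel,
    fixed bandwidth of 1; the real algorithm tunes it via perplexity)."""
    d2 = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=2)
    logits = -d2 / 2.0
    np.fill_diagonal(logits, -np.inf)        # no self-neighbors
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_loss(P, Q, eps=1e-12):
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
X[:10] += 4.0                                # two toy clusters in 10-D
P = pairwise_probs(X)

Y = rng.normal(scale=1e-2, size=(20, 2))     # Step 2: random 2-D projections
loss_before = kl_loss(P, pairwise_probs(Y))
for _ in range(300):
    Q = pairwise_probs(Y)
    # SNE gradient: dC/dy_i = 2 * sum_j (p_j|i - q_j|i + p_i|j - q_i|j)(y_i - y_j)
    coeff = (P - Q) + (P - Q).T
    grad = 2 * (coeff[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
    Y -= 0.05 * grad                         # plain gradient descent
loss_after = kl_loss(P, pairwise_probs(Y))
```

After a few hundred updates, the KL loss drops noticeably and the two toy clusters separate in the 2-D projections.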

For instance, the following image depicts a 2-dimensional visualization produced by the SNE algorithm on 256-dimensional handwritten digits:

SNE produces good clusters.

What’s even more astonishing is that properties like orientation, skew, and stroke thickness vary smoothly across the space within each cluster. This is depicted below:

Nonetheless, it has some limitations, which the t-SNE algorithm addresses.

Notice that the clusters produced by SNE are not well separated.

Here, it is fair to assume that the original data clusters, the ones in the 256-dimensional space, **most likely would have been well separated**. Thus:

All zeros must have been together but well separated from other digits.

All ones must have been together but well separated from other digits.

And so on.

Yet, SNE still produces tightly packed clusters. This is also called the **“crowding problem.”**

To eliminate this problem, t-SNE, which stands for **t-distributed** Stochastic Neighbor Embedding, was proposed.

Here’s the difference.

Recall that in SNE, we used a **Gaussian distribution** to define the low-dimensional conditional probabilities. But it does not produce well-separated clusters.

One solution is to use some other probability distribution such that, for distant points, we get the same conditional probability value that a Gaussian distribution would produce, **but at a larger Euclidean distance**.

Let me simplify that a bit.

Compare the following two distributions:

Notice that the Gaussian reaches a given low probability density at a smaller distance, while the t-distribution reaches the same density at a larger distance.

This is precisely what we intend to achieve.

We need a heavier tail distribution so that we can still minimize the difference between the two probability distributions **but at a larger distance in the low-dimensional space.**

The Student t-distribution is a perfect fit for it.
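To make the tail difference concrete, compare the (unnormalized) kernels the two algorithms use on the low-dimensional side. The unit-variance Gaussian is my own choice for illustration:

```python
import numpy as np

gaussian = lambda d: np.exp(-d ** 2 / 2)   # SNE's low-dimensional kernel
student_t = lambda d: 1 / (1 + d ** 2)     # t-SNE's kernel (Student t, 1 dof)

gaussian(3.0)     # ~0.011
student_t(3.0)    # ~0.1 -- an order of magnitude larger at the same distance
# The t-kernel only drops to ~0.011 near d = 9.4, i.e., much farther out.
```

So a pair of points can match the target probability while sitting much farther apart in the 2-D map, which is exactly what relieves the crowding problem.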

The following image depicts the difference this change brings:

As shown above:

SNE produces closely packed clusters.

t-SNE produces well-separated clusters.

And that’s why t-distribution is used in t-SNE.

That said, besides producing well-separated clusters, using the Student t-distribution has many more advantages.

For instance, it is computationally much faster to evaluate the density of a point under a Student t-distribution than under a Gaussian.
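That speed-up comes from the kernel itself: the t-density with one degree of freedom is proportional to 1/(1 + d²), so evaluating it needs no exponential. A rough sketch (array sizes and values are arbitrary):

```python
import numpy as np

d2 = np.random.rand(2000, 2000)     # toy squared pairwise distances
q_t = 1.0 / (1.0 + d2)              # t-kernel: just one add and one divide
q_g = np.exp(-d2 / 2.0)             # Gaussian kernel: one exp() per entry
```

Over millions of pairwise terms per gradient step, avoiding the exponential adds up.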

We went into much more detail in a beginner-friendly manner in the full article: Formulating and Implementing the t-SNE Algorithm From Scratch.

Similar to this, we also formulated the PCA algorithm from scratch here: Formulating the Principal Component Analysis (PCA) Algorithm From Scratch.

👉 Over to you: Can you identify a bottleneck in the t-SNE algorithm?

**Are you overwhelmed with the amount of information in ML/DS?**

Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.

For instance:

A Beginner-friendly Introduction to Kolmogorov Arnold Networks (KANs).

5 Must-Know Ways to Test ML Models in Production (Implementation Included).

Understanding LoRA-derived Techniques for Optimal LLM Fine-tuning

8 Fatal (Yet Non-obvious) Pitfalls and Cautionary Measures in Data Science

Implementing Parallelized CUDA Programs From Scratch Using CUDA Programming

You Are Probably Building Inconsistent Classification Models Without Even Realizing.

How To (Immensely) Optimize Your Machine Learning Development and Operations with MLflow.

And many many more.

Join below to unlock all full articles:

**SPONSOR US**

Get your product in front of 77,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space **here** or reply to this email.
