3 Comments
Jul 23, 2023 · Liked by Avi Chawla

An extremely comprehensible discussion of an extremely complex topic.


This topic is very relevant and crucial, thanks for shedding light on it. Quick question: while GloVe and Word2Vec models directly output a static word embedding vector, BERT and other Transformer-based models output a sequence of token embeddings; they compute the contextualized word embeddings implicitly. So, *what* vector exactly is being used from these Transformer-based models as a contextualized embedding (is it one of the Q, K, V vectors)?

author

A transformer has an encoder and a decoder. BERT is entirely encoder-based. So, in each encoder, you have the query, key and value manipulations for multi-head attention, whose output is then passed to the feed-forward layer.
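To make that concrete, here is a toy, single-head simplification of that attention step (the tensor names and sizes here are illustrative only, not BERT's actual code, which uses multiple heads and learned weights):

```python
# Toy single-head self-attention sketch; shapes/names are illustrative.
import torch

d_model = 768                      # bert-base hidden size
x = torch.randn(5, d_model)        # 5 token embeddings entering the encoder

# Learned projection matrices (randomly initialized here for illustration)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: each token mixes in information from
# every other token, which is what makes the output contextual.
scores = (Q @ K.T) / (d_model ** 0.5)
weights = torch.softmax(scores, dim=-1)
attended = weights @ V             # this then goes to the feed-forward layer
```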

BERT has multiple such stacked encoders.

The final embedding is the one that comes out of the last encoder of the entire BERT architecture. The configuration with fewer stacked encoders and a smaller embedding size (12 encoders, 768-dimensional embeddings) is called bert-base.

The larger one, with more layers and a larger embedding size (24 encoders, 1024-dimensional embeddings), is bert-large.
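For a concrete illustration, here is a minimal sketch of pulling those final-encoder embeddings out of bert-base, assuming the Hugging Face transformers library (the model name and API calls are my choice of example, not something prescribed above):

```python
# Minimal sketch: extract contextualized embeddings from bert-base
# using Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds the output of the final (12th) encoder:
# one 768-dimensional contextualized vector per input token.
embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(embeddings.shape)
```

So the contextualized embedding is not one of the Q, K, V vectors themselves; it is the hidden state each token ends up with after the last encoder block has applied attention and the feed-forward layer.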
