Jul 23, 2023

An extremely comprehensible discussion of an extremely complex topic.


This topic is very relevant and crucial, thanks for shedding light on it. Quick question: GloVe and Word2Vec models directly output a static embedding vector for each word, whereas BERT and other Transformer-based models output a sequence of vectors, one per token, and compute the contextualized word embeddings implicitly. So which vector exactly is used from these Transformer-based models as the contextualized embedding (is it one of the Q, K, V vectors)?


The original Transformer has an encoder and a decoder. The BERT model is entirely encoder-based. In each encoder layer, the query, key, and value projections are computed for multi-head attention, and the attention output is then passed to the feed-forward layer.

BERT has multiple such stacked encoders.

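So the contextualized embedding is not one of the Q, K, or V vectors; those are intermediate values inside each attention layer. The final embedding is the hidden state that comes out of the last encoder of the entire BERT architecture, one vector per input token. The configuration with fewer stacked encoders and a smaller embedding size (12 layers, 768 dimensions) is called bert-base.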

The larger one, with more layers and a larger embedding size (24 layers, 1024 dimensions), is bert-large.
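
To make the point about "which vector" concrete, here is a minimal NumPy sketch of a stack of encoder layers. It is a toy under stated assumptions, not BERT itself: the weights are random, attention is single-head, there are no positional embeddings, ReLU stands in for BERT's GELU, and all names (`encoder_layer`, `encode`, `embed`, `layers`) are made up for illustration. It only shows that Q, K, and V live inside each layer, and that the per-token output of the last layer is what gets used as the contextualized embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(h, p):
    # Q, K, V are intermediate projections, used only inside attention.
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    h = layer_norm(h + (weights @ v) @ p["Wo"])   # attention + residual
    ff = np.maximum(0.0, h @ p["W1"]) @ p["W2"]   # feed-forward (ReLU here)
    return layer_norm(h + ff)                     # hidden state passed onward

def encode(token_ids, embed, layers):
    h = embed[token_ids]          # static lookup: (seq_len, hidden)
    for p in layers:              # stacked encoders
        h = encoder_layer(h, p)
    return h                      # last-layer hidden states, one per token

# Toy sizes, chosen arbitrarily for the sketch.
rng = np.random.default_rng(0)
vocab, hidden, ff_dim, n_layers = 10, 8, 16, 2
embed = rng.normal(size=(vocab, hidden))
layers = [
    {
        "Wq": rng.normal(scale=0.5, size=(hidden, hidden)),
        "Wk": rng.normal(scale=0.5, size=(hidden, hidden)),
        "Wv": rng.normal(scale=0.5, size=(hidden, hidden)),
        "Wo": rng.normal(scale=0.5, size=(hidden, hidden)),
        "W1": rng.normal(scale=0.5, size=(hidden, ff_dim)),
        "W2": rng.normal(scale=0.5, size=(ff_dim, hidden)),
    }
    for _ in range(n_layers)
]

# Token id 3 appears first in both sequences, but in different contexts.
sent_a = encode(np.array([3, 1, 2]), embed, layers)
sent_b = encode(np.array([3, 4, 5]), embed, layers)
print(sent_a.shape)                         # one vector per input token
print(np.allclose(sent_a[0], sent_b[0]))    # same token, different context
```

The static lookup `embed[3]` is identical in both sequences, like a GloVe or Word2Vec vector, while the last-layer vectors for that token differ because attention mixes in the surrounding tokens. In the Hugging Face `transformers` library, the corresponding per-token output of a BERT model is exposed as `last_hidden_state`.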
