Discussion about this post

sammyne:

About the number of activations in one GPT-2 transformer block: why is it 12?

From what I can tell:

1. The text & position embeddings have no activations;

2. The masked multi-head self-attention layer has only 1 softmax (do the mask and dropout count as activations too?);

3. The feed-forward layer has only 1 GELU.

Could you help me figure out where exactly the 12 activations are? A minimal sketch of how I read the block is below.
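For reference, here is a minimal PyTorch sketch of one GPT-2-style block as I understand it, marking the only explicit nonlinearities I can find: the softmax in masked self-attention and the GELU in the feed-forward MLP. The structure and hyperparameters (d_model=768, n_head=12, as in GPT-2 small) are my own assumptions, not the post's code, and dropout is omitted.

```python
# Minimal GPT-2-style transformer block (my reconstruction, not the post's code).
# Marks the two explicit activations I can identify per block.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT2Block(nn.Module):
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(d_model)
        self.attn_qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.attn_proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp_fc = nn.Linear(d_model, 4 * d_model)
        self.mlp_proj = nn.Linear(4 * d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        # --- masked multi-head self-attention ---
        h = self.ln1(x)
        q, k, v = self.attn_qkv(h).split(C, dim=-1)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C // self.n_head)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        att = F.softmax(att, dim=-1)              # activation: softmax (attention)
        y = (att @ v).transpose(1, 2).reshape(B, T, C)
        x = x + self.attn_proj(y)                 # residual connection; dropout omitted
        # --- feed-forward MLP ---
        h = self.ln2(x)
        h = F.gelu(self.mlp_fc(h))                # activation: GELU (feed-forward)
        x = x + self.mlp_proj(h)                  # residual connection
        return x
```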

