Estimating memory consumption of GPT-2.
About the number of activations in one GPT-2 transformer block: why is it 12?

It seems that:
1. the Text & Position Embed layer has no activations;
2. the Masked Multi Self Attention layer has only 1 softmax (do the mask and dropout count as activations as well?);
3. the Feed Forward layer has only 1 GELU.

Could you help me figure out where exactly the 12 activations are?
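For reference, this is how I am tracing the intermediates in one block. It is only a rough PyTorch sketch with placeholder sizes and random weight matrices (not the real GPT-2 parameters); I name every intermediate tensor so it is clear which ones could be counted as activations:

```python
# Rough sketch of one GPT-2 transformer block, naming every intermediate
# tensor. Sizes and weights are placeholders, not real GPT-2 values.
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, C, n_head = 1, 8, 64, 4          # placeholder batch, seq len, width, heads
head_dim = C // n_head
x = torch.randn(B, T, C)

saved = []                              # names of the intermediates produced below

def track(name, tensor):
    saved.append(name)
    return tensor

# ---- masked multi-head self-attention sub-block ----
h1  = track("ln_1 output",            F.layer_norm(x, (C,)))
qkv = track("qkv projection",         h1 @ torch.randn(C, 3 * C))
q, k, v = qkv.split(C, dim=-1)
q = q.view(B, T, n_head, head_dim).transpose(1, 2)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)
att = track("attention scores QK^T",  (q @ k.transpose(-2, -1)) / math.sqrt(head_dim))
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = track("masked scores",          att.masked_fill(~mask, float("-inf")))
att = track("softmax",                F.softmax(att, dim=-1))
att = track("attention dropout",      F.dropout(att, p=0.1, training=True))
y   = track("attention @ V",          att @ v)
y   = y.transpose(1, 2).reshape(B, T, C)
y   = track("output projection",      y @ torch.randn(C, C))
y   = track("residual dropout",       F.dropout(y, p=0.1, training=True))
x   = track("residual add 1",         x + y)

# ---- feed-forward sub-block ----
h2  = track("ln_2 output",            F.layer_norm(x, (C,)))
h2  = track("fc up-projection",       h2 @ torch.randn(C, 4 * C))
h2  = track("GELU",                   F.gelu(h2))
h2  = track("fc down-projection",     h2 @ torch.randn(4 * C, C))
h2  = track("mlp dropout",            F.dropout(h2, p=0.1, training=True))
x   = track("residual add 2",         x + h2)

for i, name in enumerate(saved, 1):
    print(f"{i:2d}. {name}")
print("total intermediates:", len(saved))
```

This traces 16 intermediates per block. If the masked-fill and the three dropout outputs are excluded, 12 remain (the two LayerNorm outputs, the four attention matmuls plus the softmax, the two feed-forward matmuls plus the GELU, and the two residual adds), but I am not sure whether that is the intended counting convention, or whether the mask and dropout are supposed to be included instead of something else.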