NormFormer is a type of Pre-LN transformer that adds three normalization operations to each layer: a LayerNorm after self-attention, head-wise scaling of the self-attention outputs, and a LayerNorm after the first fully connected layer. These modifications introduce a small number of additional learnable parameters, which provide a cost-effective way for each layer to change the magnitude of its features, and therefore the magnitude of the gradients flowing to subsequent components.
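The three added operations can be illustrated on toy tensors. The sketch below is a simplified, hypothetical illustration in NumPy, not the authors' implementation: the LayerNorms omit learned gain and bias for brevity, and the function names and shapes are assumptions for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension; learned gain/bias omitted.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def normformer_extras(attn_heads, ffn_hidden, head_scale):
    """Apply the three NormFormer additions to toy tensors (illustrative only).

    attn_heads: (seq, n_heads, head_dim) per-head self-attention outputs
    ffn_hidden: (seq, d_ffn) activations after the first FFN linear layer
    head_scale: (n_heads,) learned per-head gains (new parameters)
    """
    # 1) Head-wise scaling of the self-attention outputs.
    scaled = attn_heads * head_scale[None, :, None]
    # 2) LayerNorm after self-attention (on the concatenated head outputs).
    attn_out = layer_norm(scaled.reshape(scaled.shape[0], -1))
    # 3) LayerNorm after the first fully connected layer.
    ffn_out = layer_norm(ffn_hidden)
    return attn_out, ffn_out

seq, n_heads, head_dim, d_ffn = 4, 2, 8, 32
rng = np.random.default_rng(0)
attn = rng.normal(size=(seq, n_heads, head_dim))
hidden = rng.normal(size=(seq, d_ffn))
scale = np.ones(n_heads)  # gains start at 1, so the layer begins as plain Pre-LN
a, f = normformer_extras(attn, hidden, scale)
```

Initializing the head-scale gains to one means training starts from an ordinary Pre-LN layer; the gains then let each head cheaply rescale its contribution.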
Source: *NormFormer: Improved Transformer Pretraining with Extra Normalization*
| Component | Type |
|---|---|
| Layer Normalization | Normalization |
| Position-Wise Feed-Forward Layer | Feedforward Networks |