PAR Transformer is a Transformer model that uses 63% fewer self-attention blocks, replacing them with feed-forward blocks, while retaining test accuracy. It is based on the Transformer-XL architecture and uses neural architecture search to find an efficient pattern of blocks in the transformer architecture.
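The core idea, interleaving a small number of self-attention blocks with many feed-forward blocks according to a searched pattern, can be sketched in NumPy. This is a minimal illustration only: the pattern string, weight shapes, and the omission of layer normalization and attention projections are simplifying assumptions, not the searched architecture from the paper.

```python
import numpy as np

def feed_forward(x, W1, W2):
    # Position-wise feed-forward block: a two-layer ReLU MLP
    # applied to each token independently.
    return np.maximum(x @ W1, 0.0) @ W2

def self_attention(x):
    # Single-head scaled dot-product self-attention
    # (query/key/value projections omitted for brevity).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def par_stack(x, pattern, ffn_weights):
    # Apply blocks in the order given by a pattern string,
    # e.g. "sffsff" where 's' = self-attention, 'f' = feed-forward.
    # A PAR-style pattern uses far fewer 's' than 'f' blocks.
    for i, kind in enumerate(pattern):
        if kind == "s":
            x = x + self_attention(x)           # residual connection
        else:
            W1, W2 = ffn_weights[i]
            x = x + feed_forward(x, W1, W2)     # residual connection
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                 # 4 tokens, model dim 8
pattern = "sffsff"                              # illustrative pattern only
ffn_weights = {
    i: (0.1 * rng.standard_normal((8, 16)), 0.1 * rng.standard_normal((16, 8)))
    for i, kind in enumerate(pattern) if kind == "f"
}
y = par_stack(x, pattern, ffn_weights)
```

The output `y` keeps the input shape, so blocks can be stacked in any searched order; the search in the paper chooses where the few attention blocks pay off most.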
Source: Pay Attention when Required
Task | Papers | Share |
---|---|---|
Language Modelling | 1 | 25.00% |
Paraphrase Identification | 1 | 25.00% |
Question Answering | 1 | 25.00% |
Sentiment Analysis | 1 | 25.00% |