Multi-head attention consists of multiple attention layers (heads) in parallel, each with its own linear transformations of the queries, keys, values, and outputs. Multi-query attention is identical except that the different heads share a single set of keys and values, which shrinks the key/value tensors that must be kept and loaded during incremental decoding.
Source: Fast Transformer Decoding: One Write-Head is All You Need
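A minimal sketch of the idea, assuming a PyTorch-style implementation (module and parameter names here are illustrative, not taken from the paper's code): queries are still projected into one head each, while a single key projection and a single value projection are shared and broadcast across all query heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Sketch of multi-query attention: per-head queries, one shared K/V head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Per-head query projection, exactly as in multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        # Single shared key/value projection -- the only change vs. multi-head.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Queries: (b, n_heads, t, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Keys/values: a single head, broadcast across all query heads.
        k = self.k_proj(x).unsqueeze(1)  # (b, 1, t, d_head)
        v = self.v_proj(x).unsqueeze(1)  # (b, 1, t, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, n_heads, t, t)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                   # (b, n_heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)

# Example: 8 query heads attend over a single shared key/value head.
x = torch.randn(2, 16, 512)
mqa = MultiQueryAttention(d_model=512, n_heads=8)
print(mqa(x).shape)  # torch.Size([2, 16, 512])
```

Because only one key head and one value head are cached, the key/value cache during incremental decoding is roughly `n_heads` times smaller than with standard multi-head attention.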
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 6 | 17.65% |
| Answer Generation | 1 | 2.94% |
| Decoder | 1 | 2.94% |
| Document Classification | 1 | 2.94% |
| Image Classification | 1 | 2.94% |
| Auto Debugging | 1 | 2.94% |
| Code Generation | 1 | 2.94% |
| Common Sense Reasoning | 1 | 2.94% |
| Coreference Resolution | 1 | 2.94% |