GPT-2

Introduced by Radford et al. in Language Models are Unsupervised Multitask Learners

GPT-2 is a Transformer-based language model that was notable at release for its size: 1.5 billion parameters in its largest configuration. It is pretrained on WebText, a dataset built from the text of 45 million web links. It largely follows the previous GPT architecture, with the following modifications:

Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block.
A modified initialization is used that accounts for the accumulation along the residual path as model depth grows. The weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$, where $N$ is the number of residual layers.
The vocabulary is expanded to 50,257 tokens, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.

Source: Language Models are Unsupervised Multitask Learners

Tasks

Task	Papers	Share
Language Modelling	159	18.84%
Text Generation	92	10.90%
Sentence	38	4.50%
Decoder	29	3.44%
Question Answering	24	2.84%
Retrieval	17	2.01%
In-Context Learning	12	1.42%
Translation	12	1.42%
Large Language Model	11	1.30%

Components

Component	Type
Adam	Stochastic Optimization
Attention Dropout	Regularization
BPE	Subword Segmentation
Dense Connections	Feedforward Networks
Discriminative Fine-Tuning	Fine-Tuning
Dropout	Regularization
GELU	Activation Functions
Layer Normalization	Normalization
Linear Warmup With Cosine Annealing	Learning Rate Schedules
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms
Softmax	Output Functions
Weight Decay	Regularization

Categories

Transformers

Autoregressive Transformers