DistilBERT

Introduced by Sanh et al. in DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. Knowledge distillation is applied during the pre-training phase, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and running 60% faster. To leverage the inductive biases learned by the larger model during pre-training, the authors introduce a triple loss that combines a language modeling loss, a distillation loss, and a cosine-distance loss.
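The triple loss described above can be illustrated with a minimal pure-Python sketch for a single token position. It is not the authors' implementation: the function names, the temperature of 2.0, and the equal loss weights are assumptions chosen for illustration, and a real training setup would compute these terms over batches of tensors in a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(
        t * math.log(max(p, 1e-12))  # clamp to avoid log(0)
        for t, p in zip(target_probs, predicted_probs)
        if t > 0
    )


def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triple_loss(student_logits, teacher_logits, true_index,
                student_hidden, teacher_hidden,
                temperature=2.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms (temperature and weights are assumed values)."""
    # 1. Language modeling loss: student prediction vs. the true token.
    one_hot = [1.0 if i == true_index else 0.0
               for i in range(len(student_logits))]
    lm_loss = cross_entropy(one_hot, softmax(student_logits))

    # 2. Distillation loss: student vs. teacher soft targets at the same temperature.
    distill_loss = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )

    # 3. Cosine-distance loss: align the direction of student and teacher hidden states.
    cos_loss = cosine_distance(student_hidden, teacher_hidden)

    w_lm, w_distill, w_cos = weights
    return w_lm * lm_loss + w_distill * distill_loss + w_cos * cos_loss


# Toy example: a 3-token vocabulary and 2-dimensional hidden states.
loss = triple_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[3.0, 0.2, -2.0],
    true_index=0,
    student_hidden=[0.9, 0.1],
    teacher_hidden=[1.0, 0.0],
)
print(f"triple loss: {loss:.4f}")
```

In this sketch the distillation term is a cross-entropy against the teacher's temperature-softened distribution, so it pushes the student toward the teacher's full output distribution rather than only the correct token, while the cosine term depends only on the direction of the hidden-state vectors.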

Source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Tasks

Task	Papers	Share
Text Classification	18	7.47%
Sentiment Analysis	18	7.47%
Language Modelling	17	7.05%
Classification	16	6.64%
Question Answering	12	4.98%
Sentence	9	3.73%
Quantization	7	2.90%
Natural Language Understanding	6	2.49%
Model Compression	6	2.49%

Components

Component	Type
BERT	Language Models

Categories

Transformers

Autoencoding Transformers