The Gaussian Error Linear Unit, or GELU, is an activation function defined as $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gating inputs by their sign as ReLUs do ($x\mathbf{1}_{x>0}$). Consequently, the GELU can be thought of as a smoother ReLU.
$$\text{GELU}\left(x\right) = x{P}\left(X\leq{x}\right) = x\Phi\left(x\right) = x \cdot \frac{1}{2}\left[1 + \text{erf}(x/\sqrt{2})\right],$$ if $X\sim \mathcal{N}(0,1)$.
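As a quick illustration of the definition, the following minimal sketch evaluates the erf form with Python's standard library and prints it next to the ReLU. The function names `gelu` and `relu` are chosen here for illustration and are not taken from the paper or from any library.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU, x * Phi(x), using the erf form of the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU, x * 1[x > 0], shown for comparison."""
    return max(x, 0.0)


if __name__ == "__main__":
    # Print both activations at a few sample points.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike the ReLU, the GELU is slightly negative for moderately negative inputs and approaches the identity for large positive inputs, which is where the "smoother ReLU" description comes from.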
One can approximate the GELU with $0.5x\left(1+\tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$ or $x\sigma\left(1.702x\right)$, but PyTorch's exact implementation is fast enough that these approximations may be unnecessary. (See also the SiLU, $x\sigma(x)$, which was also coined in the paper that introduced the GELU.)
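To get a feel for how close the two approximations are, the sketch below compares each against the exact erf form on a grid over $[-5, 5]$, using only the standard library. The function names, the grid, and its range are assumptions made for this illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5 x (1 + tanh[sqrt(2/pi) (x + 0.044715 x^3)])."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigma(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))


# Evaluate on an evenly spaced grid over [-5, 5] (grid choice is arbitrary).
xs = [i / 100.0 for i in range(-500, 501)]
max_err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in xs)
max_err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in xs)

if __name__ == "__main__":
    print(f"max |tanh approx - exact|    = {max_err_tanh:.6f}")
    print(f"max |sigmoid approx - exact| = {max_err_sigmoid:.6f}")
```

On this grid the tanh form should stay within about $10^{-3}$ of the exact value, while the sigmoid form deviates by roughly $0.02$, so the tanh form is the tighter of the two.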
GELUs are used in GPT-3, BERT, and many other Transformers.
Source: Gaussian Error Linear Units (GELUs)
Task | Papers | Share |
---|---|---|
Retrieval | 91 | 10.59% |
Language Modelling | 70 | 8.15% |
Question Answering | 51 | 5.94% |
Large Language Model | 44 | 5.12% |
Text Generation | 24 | 2.79% |
Sentence | 20 | 2.33% |
In-Context Learning | 18 | 2.10% |
Information Retrieval | 17 | 1.98% |
Code Generation | 16 | 1.86% |