Adam

Introduced by Kingma et al. in Adam: A Method for Stochastic Optimization

Adam is an adaptive learning rate optimization algorithm that uses both momentum and gradient scaling, combining the benefits of RMSProp and SGD with Momentum. The optimizer is designed to be appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients.

The weight updates are performed as:

$$ w_{t} = w_{t-1} - \eta\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon} $$

with

$$ \hat{m}_{t} = \frac{m_{t}}{1-\beta^{t}_{1}} $$

$$ \hat{v}_{t} = \frac{v_{t}}{1-\beta^{t}_{2}} $$

$$ m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t} $$

$$ v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2} $$
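
As a short worked example of why the bias-corrected terms matter, consider the first step, assuming the moment estimates are initialised to zero ($ m_{0} = v_{0} = 0 $, as in the original paper). The raw estimates are then shrunk towards zero, and the correction undoes this:

$$ m_{1} = (1-\beta_{1})g_{1}, \qquad \hat{m}_{1} = \frac{(1-\beta_{1})g_{1}}{1-\beta_{1}} = g_{1} $$

$$ v_{1} = (1-\beta_{2})g_{1}^{2}, \qquad \hat{v}_{1} = \frac{(1-\beta_{2})g_{1}^{2}}{1-\beta_{2}} = g_{1}^{2} $$

$$ w_{1} = w_{0} - \eta\frac{g_{1}}{|g_{1}| + \epsilon} \approx w_{0} - \eta\,\mathrm{sign}(g_{1}) $$

Under this assumption, the first update therefore has magnitude close to $ \eta $ in every coordinate with a non-zero gradient, regardless of the gradient's scale.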

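$ g_{t} $ is the gradient of the objective with respect to the weights at step $ t $, and $ g_{t}^{2} $ is its element-wise square. $ m_{t} $ and $ v_{t} $ are exponential moving averages of the gradient and the squared gradient, both initialised to zero; $ \hat{m}_{t} $ and $ \hat{v}_{t} $ are their bias-corrected versions. $ \eta $ is the step size/learning rate, around 1e-3 in the original paper. $ \epsilon $ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $ \beta_{1} $ and $ \beta_{2} $ are forgetting parameters (exponential decay rates for the two moving averages), with typical values 0.9 and 0.999, respectively.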
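
The update equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `adam_step` and `minimize_quadratic`, the toy quadratic objective, and its target values are assumptions made for the example, and the moment estimates are assumed to start at zero.

```python
import numpy as np


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update. `t` is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Parameter update, scaled element-wise by the root of v_hat.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


def minimize_quadratic(steps=2000, lr=1e-2):
    """Toy usage (hypothetical objective): minimise ||w - target||^2."""
    target = np.array([1.0, -2.0])
    w = np.zeros(2)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = 2.0 * (w - target)  # gradient of the quadratic at w
        w, m, v = adam_step(w, g, m, v, t, lr=lr)
    return w


if __name__ == "__main__":
    print(minimize_quadratic())
```

Note that the caller keeps `m`, `v`, and the step counter `t` between calls; deep learning frameworks typically store this state inside the optimizer object instead.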

Source: Adam: A Method for Stochastic Optimization

Tasks

Task	Papers	Share
Language Modelling	60	7.15%
Retrieval	42	5.01%
Question Answering	37	4.41%
Large Language Model	32	3.81%
Decoder	24	2.86%
Image Classification	15	1.79%
Text Generation	15	1.79%
Sentence	14	1.67%
Semantic Segmentation	14	1.67%

Categories

Stochastic Optimization

Optimization

Large Batch Optimization