6 code implementations • Google Research 2022 • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel
To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM).
Ranked #1 on Coreference Resolution on Winograd Schema Challenge
3 code implementations • 31 Mar 2022 • Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, Andrea Gesmundo
Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves.
2 code implementations • 17 Feb 2022 • Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus
However, advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning.
Ranked #1 on Common Sense Reasoning on ARC (Easy)
2 code implementations • 20 Jan 2022 • Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Yaguang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le
We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
Ranked #115 on Code Generation on HumanEval
4 code implementations • 17 Sep 2021 • David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le
For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X.
Ranked #1 on Language Modelling on C4
2 code implementations • 10 May 2021 • Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, Zhifeng Chen
We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations.
1 code implementation • EMNLP 2021 • Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel
The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption.
6 code implementations • 11 Jan 2021 • William Fedus, Barret Zoph, Noam Shazeer
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
2 code implementations • ICLR 2021 • Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute.
4 code implementations • 5 Mar 2020 • Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou
We introduce "talking-heads attention" - a variation on multi-head attention which includes linearprojections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additionalcomputation, talking-heads attention leads to better perplexities on masked language modeling tasks, aswell as better quality when transfer-learning to language comprehension and question answering tasks.
21 code implementations • 12 Feb 2020 • Noam Shazeer
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
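The unit itself is compact enough to write out. A short sketch, assuming plain numpy arrays for inputs and weights; the ReLU-gated variant shown is in the spirit of the variants the paper explores:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, V, b, c):
    """Gated Linear Unit: component-wise product of two linear projections,
    one of which is first passed through a sigmoid."""
    return sigmoid(x @ W + b) * (x @ V + c)

def reglu(x, W, V, b, c):
    """Variant that swaps the sigmoid gate for a ReLU (illustrative)."""
    return np.maximum(0.0, x @ W + b) * (x @ V + c)
```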
3 code implementations • EMNLP 2020 • Adam Roberts, Colin Raffel, Noam Shazeer
It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries.
no code implementations • 14 Jan 2020 • Ciprian Chelba, Mia Chen, Ankur Bapna, Noam Shazeer
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption.
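Concretely, this turns the target-side causal self-attention mask into a banded mask: each position attends only to itself and the previous N-1 tokens. A small illustrative sketch (not the paper's code):

```python
import numpy as np

def banded_causal_mask(length, n):
    """Causal self-attention mask truncated to an N-gram window:
    position i may attend to positions i-n+1 .. i, and nothing later."""
    i = np.arange(length)[:, None]
    j = np.arange(length)[None, :]
    return (j <= i) & (j > i - n)

# Example: with n=3, each target position sees itself and the two previous tokens.
print(banded_causal_mask(5, 3).astype(int))
```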
2 code implementations • 6 Nov 2019 • Noam Shazeer
Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences.
51 code implementations • arXiv 2019 • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
Ranked #1 on Sentiment Analysis on SST-2 Binary classification
1 code implementation • 6 Sep 2019 • Le Hou, Youlong Cheng, Noam Shazeer, Niki Parmar, Yeqing Li, Panagiotis Korfiatis, Travis M. Drucker, Daniel J. Blezek, Xiaodan Song
It is infeasible to train CNN models directly on such high resolution images, because neural activations of a single image do not fit in the memory of a single GPU/TPU, and naive data and model parallelism approaches do not work.
no code implementations • NAACL 2019 • Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, Simon Tong
We provide systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
no code implementations • NeurIPS 2018 • Mitchell Stern, Noam Shazeer, Jakob Uszkoreit
Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years.
1 code implementation • NeurIPS 2018 • Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman
We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model.
Ranked #10 on Language Modelling on One Billion Word
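As a toy illustration of the model-parallel part (this numpy sketch only shows the arithmetic; Mesh-TensorFlow itself expresses such splits by mapping named tensor dimensions onto a mesh of processors):

```python
import numpy as np

def model_parallel_matmul(x, W, num_devices):
    """Split a weight matrix's output dimension across 'devices' and
    concatenate the partial results; each shard would live on its own
    device in a real model-parallel setup."""
    shards = np.array_split(W, num_devices, axis=1)   # one column-shard per device
    partial = [x @ w for w in shards]                 # computed in parallel in practice
    return np.concatenate(partial, axis=-1)

x = np.random.randn(2, 8)
W = np.random.randn(8, 16)
assert np.allclose(model_parallel_matmul(x, W, 4), x @ W)
```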
no code implementations • 31 Oct 2018 • Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar
We describe an approach to Grammatical Error Correction (GEC) that is effective at making use of models trained on large amounts of weakly supervised bitext.
12 code implementations • ICLR 2019 • Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck
This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length.
Ranked #3 on Music Modeling on JSB Chorales
no code implementations • CVPR 2018 • Ravi Teja Mullapudi, William R. Mark, Noam Shazeer, Kayvon Fatahalian
On ImageNet, applying the HydraNet template improves accuracy by up to 2.5% when compared to an efficient baseline architecture with similar inference cost.
4 code implementations • ICML 2018 • Noam Shazeer, Mitchell Stern
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients.
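The memory saving behind Adafactor can be sketched in a few lines: keep only per-row and per-column moving averages of the squared gradients and reconstruct the full second-moment estimate from their outer product. This is a simplified sketch of the idea, not the full optimizer (names and the beta2 value are illustrative):

```python
import numpy as np

def factored_second_moment(G, R, C, beta2=0.999, eps=1e-30):
    """One step of a factored second-moment estimate for a matrix-shaped
    parameter (a sketch of the idea behind Adafactor).

    G: gradient, shape [n, m]
    R: running row statistics, shape [n]
    C: running column statistics, shape [m]
    """
    sq = G * G + eps
    R = beta2 * R + (1 - beta2) * sq.mean(axis=1)   # per-row mean of squared grads
    C = beta2 * C + (1 - beta2) * sq.mean(axis=0)   # per-column mean of squared grads
    # Rank-1 reconstruction of the full [n, m] second-moment matrix.
    V_hat = np.outer(R, C) / R.mean()
    update = G / np.sqrt(V_hat)                     # inverse-sqrt scaling, as above
    return update, R, C
```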
14 code implementations • WS 2018 • Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, Jakob Uszkoreit
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
no code implementations • ICML 2018 • Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, Noam Shazeer
Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models.
no code implementations • 15 Feb 2018 • Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem.
Ranked #3 on Density Estimation on CIFAR-10
4 code implementations • ICLR 2018 • Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents.
1 code implementation • 16 Jun 2017 • Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit
We present a single model that yields good results on a number of problems spanning multiple domains.
575 code implementations • NeurIPS 2017 • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.
Ranked #2 on Multimodal Machine Translation on Multi30K (BLEU (DE-EN) metric)
4 code implementations • 23 Jan 2017 • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
Ranked #14 on Language Modelling on One Billion Word
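The conditional-computation building block here is a sparsely-gated mixture-of-experts layer: a gating network selects a few experts per example, so total capacity grows with the number of experts while per-example compute stays roughly flat. A simplified top-k gating sketch (the paper's noise term and load-balancing losses are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_gate, experts, k=2):
    """Simplified sparsely-gated mixture of experts for a single example.

    x:       [d_model] input vector
    W_gate:  [d_model, num_experts] gating weights
    experts: list of callables, experts[i](x) -> [d_model]
    Only the top-k experts are evaluated; their outputs are combined with
    the renormalized gate values.
    """
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]        # indices of the k largest gate logits
    gates = softmax(logits[top])         # softmax over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, top))
```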
no code implementations • 23 Jun 2016 • Babak Damavandi, Shankar Kumar, Noam Shazeer, Antoine Bruguier
The model is trained using noise contrastive estimation (NCE), an approach that transforms the estimation problem of neural networks into one of binary classification between data samples and noise samples.
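A sketch of that binary-classification objective in its standard form (variable names are illustrative): with k noise samples per data token, the model's unnormalized score is trained to separate data tokens from tokens drawn from the noise distribution.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def nce_loss(data_score, data_log_q, noise_scores, noise_log_q, k):
    """Noise contrastive estimation as binary classification (sketch).

    data_score:   unnormalized model log-score of the observed token
    data_log_q:   log-probability of that token under the noise distribution
    noise_scores: [k] model log-scores of the sampled noise tokens
    noise_log_q:  [k] noise-distribution log-probabilities of those tokens
    """
    # Logit of "this token came from the data" vs. "from the noise distribution".
    pos = log_sigmoid(data_score - (np.log(k) + data_log_q))
    neg = log_sigmoid(-(noise_scores - (np.log(k) + noise_log_q))).sum()
    return -(pos + neg)
```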
10 code implementations • 7 Feb 2016 • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu
In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding.
Ranked #8 on Language Modelling on One Billion Word
3 code implementations • 6 Feb 2016 • Noam Shazeer, Ryan Doherty, Colin Evans, Chris Waterson
We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix.
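A heavily simplified sketch of the underlying objective (it ignores Swivel's submatrix sharding and its separate handling of unobserved co-occurrences, both assumptions of this sketch): row and column embeddings are fit so their dot products approximate the PMI implied by the co-occurrence counts.

```python
import numpy as np

def pmi_matrix(X, eps=1e-8):
    """Pointwise mutual information from a co-occurrence count matrix X."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)
    col = X.sum(axis=0, keepdims=True)
    return np.log((X * total + eps) / (row @ col + eps))

def embedding_loss(W, C, X):
    """Squared error between dot products of row/column embeddings and PMI,
    weighted by the observed counts (a simplification of Swivel's piecewise loss)."""
    pmi = pmi_matrix(X)
    pred = W @ C.T                     # [num_rows, num_cols] predicted PMI
    return np.sum(X * (pred - pmi) ** 2)
```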
no code implementations • TACL 2016 • Joris Pelemans, Noam Shazeer, Ciprian Chelba
We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus.
3 code implementations • 27 Sep 2015 • Georg Heigold, Ignacio Moreno, Samy Bengio, Noam Shazeer
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time.
9 code implementations • NeurIPS 2015 • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning.
no code implementations • 3 Dec 2014 • Noam Shazeer, Joris Pelemans, Ciprian Chelba
We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation.
Ranked #23 on Language Modelling on One Billion Word