no code implementations • 5 Jun 2024 • Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models using only forward passes.
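To make the idea concrete, here is a minimal sketch of a two-point (SPSA-style) zeroth-order step, assuming a generic `loss_fn` and toy hyperparameters; it illustrates the forward-passes-only principle rather than the paper's exact method.

```python
import numpy as np

def zo_sgd_step(params, loss_fn, lr=1e-2, eps=1e-3, rng=np.random.default_rng(0)):
    """One zeroth-order step using a two-point SPSA-style gradient estimate.

    Only two forward passes (loss evaluations) are needed: no backward pass
    and no activation storage, which is where the memory savings come from.
    """
    z = rng.standard_normal(params.shape)           # random perturbation direction
    loss_plus = loss_fn(params + eps * z)           # forward pass 1
    loss_minus = loss_fn(params - eps * z)          # forward pass 2
    grad_scale = (loss_plus - loss_minus) / (2 * eps)
    return params - lr * grad_scale * z             # step along the sampled direction

# Toy usage: drive a quadratic "loss" toward zero with forward passes only.
theta = np.ones(5)
quadratic = lambda p: float(np.sum(p ** 2))
for _ in range(500):
    theta = zo_sgd_step(theta, quadratic)
print(quadratic(theta))  # much smaller than the initial value of 5.0
```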
1 code implementation • 4 Jun 2024 • Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.
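For context, a simplified greedy-verification sketch of speculative decoding in general (not SpecExec's parallel tree construction): a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted. The `draft_next`/`target_next` callables and the toy vocabulary are stand-ins.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Greedy speculative decoding: the draft model proposes k tokens, the
    target model checks them, and the longest agreeing prefix (plus one
    target-model token) is accepted."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model re-checks each position (in a real system this is a
    #    single batched forward pass over all k draft positions).
    accepted = list(prefix)
    for t in proposed:
        target_tok = target_next(accepted)
        if target_tok == t:
            accepted.append(t)           # draft and target agree: accept for free
        else:
            accepted.append(target_tok)  # disagreement: take the target token, stop
            break
    else:
        accepted.append(target_next(accepted))  # bonus token when all k are accepted
    return accepted

# Toy usage: both "models" emit (last token + 1) mod 10, so every draft is accepted.
nxt = lambda seq: (seq[-1] + 1) % 10
print(speculative_step([1, 2, 3], nxt, nxt, k=4))  # [1, 2, 3, 4, 5, 6, 7, 8]
```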
no code implementations • 29 May 2024 • Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations.
1 code implementation • 10 May 2024 • Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou
Memory Mosaics are networks of associative memories working in concert to achieve a prediction task of interest.
no code implementations • 25 Apr 2024 • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs).
1 code implementation • 18 Apr 2024 • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
However, key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length.
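A back-of-the-envelope calculation makes the growth concrete (parameter values below are illustrative, roughly Llama-2-7B-shaped in fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each of shape
    # [batch, n_kv_heads, seq_len, head_dim]; size grows linearly in seq_len.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Roughly Llama-2-7B-shaped example: 32 layers, 32 KV heads, head_dim 128, fp16.
gib = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB for a single 32K-token sequence
```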
1 code implementation • 12 Apr 2024 • Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences; sub-quadratic alternatives such as linear attention and state space models exist, but they empirically underperform Transformers in pretraining efficiency and downstream task accuracy.
1 code implementation • 1 Apr 2024 • Harry Dong, Beidi Chen, Yuejie Chi
Transformer-based large language models (LLMs) have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment.
1 code implementation • 6 Mar 2024 • Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian
Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks.
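A minimal numpy sketch of the underlying gradient low-rank projection idea, assuming a plain momentum optimizer and hypothetical rank/learning-rate values; this simplifies GaLore rather than reproducing its released implementation.

```python
import numpy as np

def galore_like_step(W, grad, state, rank=4, lr=1e-2, beta=0.9):
    """One simplified low-rank-projected momentum step.

    Optimizer state (here, momentum) lives in an r x n subspace instead of
    the full m x n gradient, which is where the memory saving comes from.
    """
    if state.get("P") is None:                      # (re)compute the projector
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                    # m x r projection matrix
        state["m"] = np.zeros((rank, grad.shape[1]))
    P = state["P"]
    g_low = P.T @ grad                              # r x n projected gradient
    state["m"] = beta * state["m"] + (1 - beta) * g_low
    return W - lr * (P @ state["m"])                # project the update back to m x n

# Toy usage on a random "weight" matrix and gradient.
rng = np.random.default_rng(0)
W, G = rng.standard_normal((64, 32)), rng.standard_normal((64, 32))
W = galore_like_step(W, G, state={})
print(W.shape)  # (64, 32)
```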
1 code implementation • 5 Mar 2024 • Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang Wang
To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach that enhances the capacity of LLMs to handle relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead.
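A tiny sketch of one way to realize the core mechanism, assuming position indices are simply compressed by a head-specific ratio before rotary encoding; the ratio range and per-head assignment here are illustrative, not the paper's tuned values.

```python
import numpy as np

def ms_poe_positions(seq_len, n_heads, min_ratio=1.2, max_ratio=1.8):
    """Return one rescaled (fractional) position sequence per head.

    Each head compresses position indices by its own ratio; these rescaled
    positions would replace the integer positions fed to RoPE for that head.
    """
    ratios = np.linspace(min_ratio, max_ratio, n_heads)   # one ratio per head
    positions = np.arange(seq_len, dtype=np.float64)
    return positions[None, :] / ratios[:, None]           # shape: (n_heads, seq_len)

print(ms_poe_positions(seq_len=6, n_heads=3))
```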
2 code implementations • 26 Feb 2024 • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also introducing a framework based on the roofline model for systematic analysis of LLM inference techniques.
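For readers unfamiliar with it, the roofline model caps attainable throughput at the smaller of peak compute and memory bandwidth times arithmetic intensity; a minimal sketch with roughly A100-like numbers (illustrative, not taken from the survey):

```python
def roofline_tflops(flops, bytes_moved, peak_tflops=312.0, bandwidth_tbs=2.0):
    """Attainable performance = min(peak compute, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved                  # FLOPs per byte moved
    return min(peak_tflops, bandwidth_tbs * intensity)

# Single-token decoding is memory-bound: roughly 2 FLOPs per 2-byte weight read.
print(roofline_tflops(flops=2.0, bytes_moved=2.0))    # ~2 TFLOP/s, far below peak
# Large-batch prefill has high arithmetic intensity and hits the compute roof.
print(roofline_tflops(flops=400.0, bytes_moved=2.0))  # capped at the 312 TFLOP/s peak
```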
1 code implementation • 19 Feb 2024 • Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding.
1 code implementation • 14 Feb 2024 • Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen
Many computational factors limit broader deployment of large language models.
no code implementations • 9 Feb 2024 • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
1 code implementation • 5 Feb 2024 • Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
This memory demand increases with larger batch sizes and longer context lengths.
1 code implementation • 26 Oct 2023 • Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability.
1 code implementation • 1 Oct 2023 • Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Du
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures.
5 code implementations • 29 Sep 2023 • Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
In this paper, we first demonstrate that the attention sink emerges because of strong attention scores towards initial tokens, which act as a "sink" even when they are not semantically important.
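A minimal sketch of the cache policy this observation suggests (keep a few initial "sink" tokens plus a sliding window of recent tokens); the budgets below are placeholders.

```python
def streaming_cache_indices(seq_len, n_sink=4, window=1024):
    """Indices of KV entries to keep: the first n_sink "sink" tokens plus the
    most recent `window` tokens. Everything in between is evicted."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

keep = streaming_cache_indices(seq_len=5000, n_sink=4, window=8)
print(keep)  # [0, 1, 2, 3, 4992, 4993, ..., 4999]
```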
1 code implementation • 24 Jun 2023 • Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
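A simplified sketch of such an eviction policy, assuming per-token accumulated attention scores are available; the budget split and scoring below are illustrative rather than the exact H$_2$O implementation.

```python
import numpy as np

def h2o_keep_indices(acc_attention, recent_budget=4, heavy_budget=4):
    """acc_attention[i] = attention mass token i has received so far.
    Keep the last `recent_budget` tokens plus the `heavy_budget` older tokens
    with the largest accumulated attention ("heavy hitters")."""
    n = len(acc_attention)
    recent = set(range(max(0, n - recent_budget), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: acc_attention[i], reverse=True)[:heavy_budget]
    return sorted(recent | set(heavy))

scores = np.array([9.0, 0.1, 0.2, 7.5, 0.3, 0.1, 0.2, 0.4, 0.3, 0.2])
print(h2o_keep_indices(scores))  # [0, 2, 3, 4, 6, 7, 8, 9]
```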
1 code implementation • 20 Jun 2023 • Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar
To remedy this, we design a new training algorithm Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training.
no code implementations • 17 May 2023 • Zhaozhuo Xu, Zirui Liu, Beidi Chen, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, Anshumali Shrivastava
Thus, optimizing this accuracy-efficiency trade-off is crucial for the LLM deployment on commodity hardware.
1 code implementation • 13 Mar 2023 • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.
no code implementations • 6 Jan 2023 • Andrew Cohen, Weiping Dou, Jiang Zhu, Slawomir Koziel, Peter Renner, Jan-Ove Mattsson, Xiaomeng Yang, Beidi Chen, Kevin Stone, Yuandong Tian
Linear Partial Differential Equations (PDEs) govern the spatial-temporal dynamics of physical systems that are essential to building modern technology.
1 code implementation • 2 Jun 2022 • Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang
Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network.
1 code implementation • 2 Jun 2022 • Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang
Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks.
1 code implementation • 1 Apr 2022 • Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms).
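A minimal sketch of a matrix-vector product with this structure, assuming a permute-blockdiag-permute-blockdiag factorization; the specific permutation convention here is one plausible choice for illustration, not necessarily the paper's.

```python
import numpy as np

def blockdiag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (given as a list of b dense blocks of
    size n/b x n/b) by a vector without materializing the full matrix."""
    chunks = np.split(x, len(blocks))
    return np.concatenate([B @ c for B, c in zip(blocks, chunks)])

def monarch_matvec(L_blocks, R_blocks, x):
    """y = P L P^T R x, with L, R block-diagonal and P a reshape-transpose
    permutation; only the small blocks are stored instead of the full n x n matrix."""
    n, b = len(x), len(R_blocks)
    y = blockdiag_matvec(R_blocks, x)
    y = y.reshape(b, n // b).T.reshape(-1)        # apply the transpose permutation
    y = blockdiag_matvec(L_blocks, y)
    return y.reshape(n // b, b).T.reshape(-1)     # apply the inverse permutation

# Toy usage: n = 16 with b = 4 blocks of size 4 x 4.
rng = np.random.default_rng(0)
Lb = [rng.standard_normal((4, 4)) for _ in range(4)]
Rb = [rng.standard_normal((4, 4)) for _ in range(4)]
print(monarch_matvec(Lb, Rb, rng.standard_normal(16)).shape)  # (16,)
```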
no code implementations • NeurIPS 2021 • Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, Anshumali Shrivastava
However, as one of the most influential and practical MT paradigms, iterative machine teaching (IMT) is impractical on IoT devices due to its inefficient and unscalable algorithms.
1 code implementation • ICLR 2022 • Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré
To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices.
1 code implementation • NeurIPS 2021 • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré
Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.
no code implementations • ICLR 2021 • Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, Christopher Re
Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training.
no code implementations • 1 Jan 2021 • Shabnam Daghaghi, Tharun Medini, Beidi Chen, Mengnan Zhao, Anshumali Shrivastava
Softmax classifiers with a very large number of classes naturally occur in many applications such as natural language processing and information retrieval.
no code implementations • 31 Dec 2020 • Shabnam Daghaghi, Tharun Medini, Nicholas Meisburger, Beidi Chen, Mengnan Zhao, Anshumali Shrivastava
Unfortunately, due to the dynamically updated parameters and data samples, there is no sampling scheme that is provably adaptive and samples the negative classes efficiently.
no code implementations • ICLR 2021 • Tharun Medini, Beidi Chen, Anshumali Shrivastava
The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse.
no code implementations • 2 Jul 2020 • Zichang Liu, Zhaozhuo Xu, Alan Ji, Jonathan Li, Beidi Chen, Anshumali Shrivastava
Efficient inference for wide output layers (WOLs) is an essential yet challenging task in large scale machine learning.
no code implementations • ICML 2020 • Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, Anima Anandkumar
We also find that AVH has a statistically significant correlation with human visual hardness.
1 code implementation • NeurIPS 2019 • Beidi Chen, Yingchen Xu, Anshumali Shrivastava
In this paper, we break this barrier by providing the first demonstration of a scheme, Locality Sensitive Hashing (LSH)-sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling.
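As a generic illustration of the LSH-sampling mechanism (not the exact LGD estimator): hash all examples once with SimHash, then hash the query at each step and sample from the colliding bucket, so the per-iteration sampling cost stays close to that of uniform sampling. The dataset sizes and hyperplane counts below are toy values.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def simhash(x, hyperplanes):
    """SimHash signature: the sign pattern of x against random hyperplanes."""
    return tuple((hyperplanes @ x > 0).astype(int))

# Index a toy dataset of 1000 examples (dim 16) into SimHash buckets, once.
X = rng.standard_normal((1000, 16))
H = rng.standard_normal((8, 16))                  # 8 random hyperplanes
buckets = defaultdict(list)
for i, x in enumerate(X):
    buckets[simhash(x, H)].append(i)

def lsh_sample(query, k=32):
    """Sample up to k example indices colliding with the query's bucket;
    colliding examples are, with high probability, similar to the query."""
    ids = buckets.get(simhash(query, H), [])
    if not ids:
        return rng.choice(len(X), size=k, replace=False).tolist()  # fall back to uniform
    return rng.choice(ids, size=min(k, len(ids)), replace=False).tolist()

print(len(lsh_sample(X[0])))  # a handful of examples similar to X[0]
```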
no code implementations • 30 Oct 2019 • Beidi Chen, Yingchen Xu, Anshumali Shrivastava
In this paper, we break this barrier by providing the first demonstration of a scheme, Locality Sensitive Hashing (LSH)-sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling.
3 code implementations • 7 Mar 2019 • Beidi Chen, Tharun Medini, James Farwell, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava
On the same CPU hardware, SLIDE is over 10x faster than TensorFlow (TF).
no code implementations • ICLR 2018 • Beidi Chen, Yingchen Xu, Anshumali Shrivastava
In this paper, we break this barrier by providing the first demonstration of a sampling scheme which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling.
no code implementations • 6 Dec 2016 • Beidi Chen, Anshumali Shrivastava
WTA (Winner Take All) hashing has been successfully applied in many large scale vision applications.
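For context, a minimal sketch of a WTA hash, assuming the standard construction of applying a random permutation and recording the argmax among the first K permuted coordinates:

```python
import numpy as np

def wta_hash(x, permutations, K=4):
    """Winner-Take-All hash: for each random permutation, look at the first K
    permuted coordinates and output the index (0..K-1) of the largest one.
    The resulting codes depend only on the rank order of x's entries."""
    return [int(np.argmax(x[perm[:K]])) for perm in permutations]

rng = np.random.default_rng(0)
perms = [rng.permutation(8) for _ in range(6)]   # 6 hash functions over dim-8 vectors
x = np.array([0.1, 5.0, 0.3, 2.0, 0.0, 1.0, 0.2, 4.0])
print(wta_hash(x, perms))  # one code in 0..K-1 per permutation
```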
no code implementations • 6 Dec 2016 • M. Sadegh Riazi, Beidi Chen, Anshumali Shrivastava, Dan Wallach, Farinaz Koushanfar
In Near-Neighbor Search (NNS), a new client queries a database (held by a server) for the most similar data (near-neighbors) given a certain similarity metric.