no code implementations • 5 Jun 2024 • Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models using only forward passes.
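To make the idea concrete, here is a minimal sketch of a two-point (SPSA-style) zeroth-order step, assuming a generic `loss_fn` and toy hyperparameters; it illustrates the forward-passes-only principle rather than the paper's exact method.

```python
import numpy as np

def zo_sgd_step(params, loss_fn, lr=1e-2, eps=1e-3, rng=np.random.default_rng(0)):
    """One zeroth-order step using a two-point SPSA-style gradient estimate.

    Only two forward passes (loss evaluations) are needed: no backward pass
    and no activation storage, which is where the memory savings come from.
    """
    z = rng.standard_normal(params.shape)           # random perturbation direction
    loss_plus = loss_fn(params + eps * z)           # forward pass 1
    loss_minus = loss_fn(params - eps * z)          # forward pass 2
    grad_scale = (loss_plus - loss_minus) / (2 * eps)
    return params - lr * grad_scale * z             # step along the sampled direction

# Toy usage: drive a quadratic "loss" toward zero with forward passes only.
theta = np.ones(5)
quadratic = lambda p: float(np.sum(p ** 2))
for _ in range(500):
    theta = zo_sgd_step(theta, quadratic)
print(quadratic(theta))  # much smaller than the initial value of 5.0
```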
1 code implementation • 4 Jun 2024 • Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.
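For context, a simplified greedy-verification sketch of speculative decoding in general (not SpecExec's parallel tree construction): a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted. The `draft_next`/`target_next` callables and the toy vocabulary are stand-ins.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Greedy speculative decoding: the draft model proposes k tokens, the
    target model checks them, and the longest agreeing prefix (plus one
    target-model token) is accepted."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model re-checks each position (in a real system this is a
    #    single batched forward pass over all k draft positions).
    accepted = list(prefix)
    for t in proposed:
        target_tok = target_next(accepted)
        if target_tok == t:
            accepted.append(t)           # draft and target agree: accept for free
        else:
            accepted.append(target_tok)  # disagreement: take the target token, stop
            break
    else:
        accepted.append(target_next(accepted))  # bonus token when all k are accepted
    return accepted

# Toy usage: both "models" emit (last token + 1) mod 10, so every draft is accepted.
nxt = lambda seq: (seq[-1] + 1) % 10
print(speculative_step([1, 2, 3], nxt, nxt, k=4))  # [1, 2, 3, 4, 5, 6, 7, 8]
```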
no code implementations • 29 May 2024 • Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations.
1 code implementation • 10 May 2024 • Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou
Memory Mosaics are networks of associative memories working in concert to achieve a prediction task of interest.
no code implementations • 25 Apr 2024 • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs).
1 code implementation • 18 Apr 2024 • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
However, key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length.
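A back-of-the-envelope calculation makes the growth concrete (parameter values below are illustrative, roughly Llama-2-7B-shaped in fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each of shape
    # [batch, n_kv_heads, seq_len, head_dim]; size grows linearly in seq_len.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Roughly Llama-2-7B-shaped example: 32 layers, 32 KV heads, head_dim 128, fp16.
gib = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB for a single 32K-token sequence
```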
1 code implementation • 12 Apr 2024 • Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences; sub-quadratic alternatives such as linear attention and state space models exist, but they empirically underperform Transformers in pretraining efficiency and downstream task accuracy.
1 code implementation • 1 Apr 2024 • Harry Dong, Beidi Chen, Yuejie Chi
Transformer-based large language models (LLMs) have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment.
1 code implementation • 6 Mar 2024 • Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian
Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks.
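A minimal numpy sketch of the underlying gradient low-rank projection idea, assuming a plain momentum optimizer and hypothetical rank/learning-rate values; this simplifies GaLore rather than reproducing its released implementation.

```python
import numpy as np

def galore_like_step(W, grad, state, rank=4, lr=1e-2, beta=0.9):
    """One simplified low-rank-projected momentum step.

    Optimizer state (here, momentum) lives in an r x n subspace instead of
    the full m x n gradient, which is where the memory saving comes from.
    """
    if state.get("P") is None:                      # (re)compute the projector
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                    # m x r projection matrix
        state["m"] = np.zeros((rank, grad.shape[1]))
    P = state["P"]
    g_low = P.T @ grad                              # r x n projected gradient
    state["m"] = beta * state["m"] + (1 - beta) * g_low
    return W - lr * (P @ state["m"])                # project the update back to m x n

# Toy usage on a random "weight" matrix and gradient.
rng = np.random.default_rng(0)
W, G = rng.standard_normal((64, 32)), rng.standard_normal((64, 32))
W = galore_like_step(W, G, state={})
print(W.shape)  # (64, 32)
```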
1 code implementation • 5 Mar 2024 • Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang Wang
To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach that enhances the capacity of LLMs to handle relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead.
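A tiny sketch of one way to realize the core mechanism, assuming position indices are simply compressed by a head-specific ratio before rotary encoding; the ratio range and per-head assignment here are illustrative, not the paper's tuned values.

```python
import numpy as np

def ms_poe_positions(seq_len, n_heads, min_ratio=1.2, max_ratio=1.8):
    """Return one rescaled (fractional) position sequence per head.

    Each head compresses position indices by its own ratio; these rescaled
    positions would replace the integer positions fed to RoPE for that head.
    """
    ratios = np.linspace(min_ratio, max_ratio, n_heads)   # one ratio per head
    positions = np.arange(seq_len, dtype=np.float64)
    return positions[None, :] / ratios[:, None]           # shape: (n_heads, seq_len)

print(ms_poe_positions(seq_len=6, n_heads=3))
```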
2 code implementations • 26 Feb 2024 • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also introducing a framework based on the roofline model for systematic analysis of LLM inference techniques.
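For readers unfamiliar with it, the roofline model caps attainable throughput at the smaller of peak compute and memory bandwidth times arithmetic intensity; a minimal sketch with roughly A100-like numbers (illustrative, not taken from the survey):

```python
def roofline_tflops(flops, bytes_moved, peak_tflops=312.0, bandwidth_tbs=2.0):
    """Attainable performance = min(peak compute, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved                  # FLOPs per byte moved
    return min(peak_tflops, bandwidth_tbs * intensity)

# Single-token decoding is memory-bound: roughly 2 FLOPs per 2-byte weight read.
print(roofline_tflops(flops=2.0, bytes_moved=2.0))    # ~2 TFLOP/s, far below peak
# Large-batch prefill has high arithmetic intensity and hits the compute roof.
print(roofline_tflops(flops=400.0, bytes_moved=2.0))  # capped at the 312 TFLOP/s peak
```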
1 code implementation • 19 Feb 2024 • Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding.
1 code implementation • 14 Feb 2024 • Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen
Many computational factors limit broader deployment of large language models.
no code implementations • 9 Feb 2024 • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
1 code implementation • 5 Feb 2024 • Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
This memory demand increases with larger batch sizes and longer context lengths.
1 code implementation • 26 Oct 2023 • Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability.
1 code implementation • 1 Oct 2023 • Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Du
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures.
5 code implementations • 29 Sep 2023 • Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
In this paper, we first demonstrate that the attention sink emerges because of strong attention scores towards initial tokens, which act as a "sink" even when they are not semantically important.
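A minimal sketch of the cache policy this observation suggests (keep a few initial "sink" tokens plus a sliding window of recent tokens); the budgets below are placeholders.

```python
def streaming_cache_indices(seq_len, n_sink=4, window=1024):
    """Indices of KV entries to keep: the first n_sink "sink" tokens plus the
    most recent `window` tokens. Everything in between is evicted."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

keep = streaming_cache_indices(seq_len=5000, n_sink=4, window=8)
print(keep)  # [0, 1, 2, 3, 4992, 4993, ..., 4999]
```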
1 code implementation • 24 Jun 2023 • Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
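A simplified sketch of such an eviction policy, assuming per-token accumulated attention scores are available; the budget split and scoring below are illustrative rather than the exact H$_2$O implementation.

```python
import numpy as np

def h2o_keep_indices(acc_attention, recent_budget=4, heavy_budget=4):
    """acc_attention[i] = attention mass token i has received so far.
    Keep the last `recent_budget` tokens plus the `heavy_budget` older tokens
    with the largest accumulated attention ("heavy hitters")."""
    n = len(acc_attention)
    recent = set(range(max(0, n - recent_budget), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: acc_attention[i], reverse=True)[:heavy_budget]
    return sorted(recent | set(heavy))

scores = np.array([9.0, 0.1, 0.2, 7.5, 0.3, 0.1, 0.2, 0.4, 0.3, 0.2])
print(h2o_keep_indices(scores))  # [0, 2, 3, 4, 6, 7, 8, 9]
```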
1 code implementation • 20 Jun 2023 • Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar
To remedy this, we design a new training algorithm Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training.
no code implementations • 17 May 2023 • Zhaozhuo Xu, Zirui Liu, Beidi Chen, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, Anshumali Shrivastava
Thus, optimizing this accuracy-efficiency trade-off is crucial for the LLM deployment on commodity hardware.
1 code implementation • 13 Mar 2023 • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.
no code implementations • 6 Jan 2023 • Andrew Cohen, Weiping Dou, Jiang Zhu, Slawomir Koziel, Peter Renner, Jan-Ove Mattsson, Xiaomeng Yang, Beidi Chen, Kevin Stone, Yuandong Tian
Linear Partial Differential Equations (PDEs) govern the spatial-temporal dynamics of physical systems that are essential to building modern technology.
1 code implementation • 2 Jun 2022 • Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang
Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network.
1 code implementation • 2 Jun 2022 • Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang
Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks.
1 code implementation • 1 Apr 2022 • Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms).
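A minimal sketch of a matrix-vector product with this structure, assuming a permute-blockdiag-permute-blockdiag factorization; the specific permutation convention here is one plausible choice for illustration, not necessarily the paper's.

```python
import numpy as np

def blockdiag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (given as a list of b dense blocks of
    size n/b x n/b) by a vector without materializing the full matrix."""
    chunks = np.split(x, len(blocks))
    return np.concatenate([B @ c for B, c in zip(blocks, chunks)])

def monarch_matvec(L_blocks, R_blocks, x):
    """y = P L P^T R x, with L, R block-diagonal and P a reshape-transpose
    permutation; only the small blocks are stored instead of the full n x n matrix."""
    n, b = len(x), len(R_blocks)
    y = blockdiag_matvec(R_blocks, x)
    y = y.reshape(b, n // b).T.reshape(-1)        # apply the transpose permutation
    y = blockdiag_matvec(L_blocks, y)
    return y.reshape(n // b, b).T.reshape(-1)     # apply the inverse permutation

# Toy usage: n = 16 with b = 4 blocks of size 4 x 4.
rng = np.random.default_rng(0)
Lb = [rng.standard_normal((4, 4)) for _ in range(4)]
Rb = [rng.standard_normal((4, 4)) for _ in range(4)]
print(monarch_matvec(Lb, Rb, rng.standard_normal(16)).shape)  # (16,)
```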
no code implementations • NeurIPS 2021 • Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, Anshumali Shrivastava
However, as one of the most influential and practical MT paradigms, iterative machine teaching (IMT) is impractical on IoT devices due to its inefficient and unscalable algorithms.
1 code implementation • ICLR 2022 • Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré
To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices.
1 code implementation • NeurIPS 2021 • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré
Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.
no code implementations • ICLR 2021 • Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, Christopher Re
Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training.
no code implementations • 1 Jan 2021 • Shabnam Daghaghi, Tharun Medini, Beidi Chen, Mengnan Zhao, Anshumali Shrivastava
Softmax classifiers with a very large number of classes naturally occur in many applications such as natural language processing and information retrieval.
no code implementations • 31 Dec 2020 • Shabnam Daghaghi, Tharun Medini, Nicholas Meisburger, Beidi Chen, Mengnan Zhao, Anshumali Shrivastava
Unfortunately, due to the dynamically updated parameters and data samples, there is no sampling scheme that is provably adaptive and samples the negative classes efficiently.
no code implementations • ICLR 2021 • Tharun Medini, Beidi Chen, Anshumali Shrivastava
The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse.
no code implementations • 2 Jul 2020 • Zichang Liu, Zhaozhuo Xu, Alan Ji, Jonathan Li, Beidi Chen, Anshumali Shrivastava
Efficient inference for wide output layers (WOLs) is an essential yet challenging task in large scale machine learning.
no code implementations • ICML 2020 • Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, Anima Anandkumar
We also find that AVH has a statistically significant correlation with human visual hardness.
1 code implementation • NeurIPS 2019 • Beidi Chen, Yingchen Xu, Anshumali Shrivastava
In this paper, we break this barrier by providing the first demonstration of a scheme, Locality Sensitive Hashing (LSH)-sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling.
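As a generic illustration of the LSH-sampling mechanism (not the exact LGD estimator): hash all examples once with SimHash, then hash the query at each step and sample from the colliding bucket, so the per-iteration sampling cost stays close to that of uniform sampling. The dataset sizes and hyperplane counts below are toy values.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def simhash(x, hyperplanes):
    """SimHash signature: the sign pattern of x against random hyperplanes."""
    return tuple((hyperplanes @ x > 0).astype(int))

# Index a toy dataset of 1000 examples (dim 16) into SimHash buckets, once.
X = rng.standard_normal((1000, 16))
H = rng.standard_normal((8, 16))                  # 8 random hyperplanes
buckets = defaultdict(list)
for i, x in enumerate(X):
    buckets[simhash(x, H)].append(i)

def lsh_sample(query, k=32):
    """Sample up to k example indices colliding with the query's bucket;
    colliding examples are, with high probability, similar to the query."""
    ids = buckets.get(simhash(query, H), [])
    if not ids:
        return rng.choice(len(X), size=k, replace=False).tolist()  # fall back to uniform
    return rng.choice(ids, size=min(k, len(ids)), replace=False).tolist()

print(len(lsh_sample(X[0])))  # a handful of examples similar to X[0]
```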
no code implementations • 30 Oct 2019 • Beidi Chen, Yingchen Xu, Anshumali Shrivastava
In this paper, we break this barrier by providing the first demonstration of a scheme, Locality Sensitive Hashing (LSH)-sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling.
3 code implementations • 7 Mar 2019 • Beidi Chen, Tharun Medini, James Farwell, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava
On the same CPU hardware, SLIDE is over 10x faster than TensorFlow (TF).
no code implementations • ICLR 2018 • Beidi Chen, Yingchen Xu, Anshumali Shrivastava
In this paper, we break this barrier by providing the first demonstration of a sampling scheme which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling.
no code implementations • 6 Dec 2016 • Beidi Chen, Anshumali Shrivastava
WTA (Winner Take All) hashing has been successfully applied in many large scale vision applications.
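For context, a minimal sketch of a WTA hash, assuming the standard construction of applying a random permutation and recording the argmax among the first K permuted coordinates:

```python
import numpy as np

def wta_hash(x, permutations, K=4):
    """Winner-Take-All hash: for each random permutation, look at the first K
    permuted coordinates and output the index (0..K-1) of the largest one.
    The resulting codes depend only on the rank order of x's entries."""
    return [int(np.argmax(x[perm[:K]])) for perm in permutations]

rng = np.random.default_rng(0)
perms = [rng.permutation(8) for _ in range(6)]   # 6 hash functions over dim-8 vectors
x = np.array([0.1, 5.0, 0.3, 2.0, 0.0, 1.0, 0.2, 4.0])
print(wta_hash(x, perms))  # one code in 0..K-1 per permutation
```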
no code implementations • 6 Dec 2016 • M. Sadegh Riazi, Beidi Chen, Anshumali Shrivastava, Dan Wallach, Farinaz Koushanfar
In Near-Neighbor Search (NNS), a new client queries a database (held by a server) for the most similar data (near-neighbors) given a certain similarity metric.