no code implementations • 26 May 2024 • Jiayi Yao, Hanchen Li, YuHan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang
To speed up the prefill of long LLM inputs, one can pre-compute the KV cache of a text and reuse it whenever that context reappears as the prefix of another LLM input.
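As an illustration of that prefix-reuse pattern (a generic sketch, not the paper's own system), here is a minimal example using the Hugging Face transformers API; the model choice and texts are placeholders:

```python
# Minimal sketch of prefix KV-cache reuse with Hugging Face transformers.
# The model and strings are illustrative, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

shared_prefix = "A long shared document that many requests start with..."
prefix_ids = tok(shared_prefix, return_tensors="pt").input_ids

# Pre-compute the KV cache for the shared prefix once.
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

# A later request with the same prefix only needs to prefill its suffix.
suffix_ids = tok(" Question: summarize the document.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(suffix_ids, past_key_values=prefix_cache, use_cache=True)
# out.logits covers only the suffix tokens; the prefix was never recomputed.
```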
1 code implementation • 24 Oct 2023 • Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, Lingming Zhang
Nonetheless, prompting LLMs with compiler source-code information remains a missing piece of research in compiler testing.
1 code implementation • 11 Oct 2023 • YuHan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, YuYang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang
Compared to recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x, with negligible impact on LLM response quality as measured by accuracy or perplexity.
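CacheGen's actual encoder is more elaborate than plain quantization, but a naive per-tensor int8 sketch (shapes illustrative) shows the basic idea of trading cache precision for size:

```python
# NOT CacheGen's codec: a naive per-tensor int8 quantization sketch, only to
# illustrate shrinking a KV cache by lowering numerical precision.
import torch

def quantize_kv(t: torch.Tensor):
    scale = t.abs().max() / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Illustrative shape: (layers, key/value, heads, tokens, head_dim).
kv = torch.randn(12, 2, 16, 1024, 64)
q, s = quantize_kv(kv)
print(kv.element_size() / q.element_size())  # 4.0: float32 -> int8
```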
1 code implementation • 29 Oct 2022 • Jiayi Yao, Ping Li, Xiatao Kang, Yuzhe Wang
First, we train a sparse model with a group-lasso (GL) penalty and impose an angle-dissimilarity constraint on the channels and filters of the convolutional network to obtain a sparser structure.
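A hypothetical PyTorch rendering of those two regularizers (a group penalty over filters plus a pairwise angle-dissimilarity term); the exact grouping and weights in the paper may differ:

```python
# Hypothetical sketch of the two regularizers described above; the weighting
# factors and exact constraint form are assumptions, not from the paper.
import torch
import torch.nn.functional as F

def group_lasso(conv_weight: torch.Tensor) -> torch.Tensor:
    # conv_weight: (out_channels, in_channels, kH, kW); one group per filter.
    return conv_weight.flatten(1).norm(dim=1).sum()

def angle_dissimilarity(conv_weight: torch.Tensor) -> torch.Tensor:
    # Penalize pairwise cosine similarity so surviving filters point in
    # dissimilar directions.
    w = F.normalize(conv_weight.flatten(1), dim=1)
    cos = w @ w.t()
    n = cos.size(0)
    off_diag = cos - torch.eye(n, device=cos.device)
    return off_diag.abs().sum() / (n * (n - 1))

# Added to the task loss during training, e.g.:
# loss = ce_loss + 1e-4 * group_lasso(w) + 1e-3 * angle_dissimilarity(w)
```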
1 code implementation • 27 Sep 2022 • Xiatao Kang, Ping Li, Jiayi Yao, Chengxi Li
Pruning neural networks before training not only compresses the original models but also accelerates the training phase, which has substantial practical value.
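As a sketch of pruning before training, assuming a simple magnitude criterion at initialization (the paper's actual criterion may differ):

```python
# Hypothetical sketch of pruning at initialization: build binary masks before
# training and keep pruned weights at zero. Magnitude is one simple criterion;
# it is an assumption here, not necessarily the paper's.
import torch
import torch.nn as nn

def prune_at_init(model: nn.Module, sparsity: float = 0.9) -> dict:
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:  # prune weight matrices/filters, skip biases
            k = max(1, int(p.numel() * sparsity))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
            p.data.mul_(masks[name])  # zero out pruned weights
    return masks

# During training, re-apply the masks after each optimizer step so pruned
# weights stay at zero:
# for name, p in model.named_parameters():
#     if name in masks:
#         p.data.mul_(masks[name])
```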