Efficient Deployment Algorithms for Large Language Models

Full description

Large language models (LLMs) have achieved impressive performance on various natural language tasks. However, their massive computational and memory requirements hinder widespread deployment. Additionally, deploying them on extensive inputs presents efficiency and accuracy challenges. This proposal introduces two techniques to enable efficient and accurate quantization and streaming deployment of LLMs, facilitating their application in real-world systems with limited resources. First, we develop SmoothQuant, an accurate post-training 8-bit quantization method for both weights and activations in LLMs of up to 530B parameters. By smoothing outliers in activations, SmoothQuant enables the use of efficient INT8 kernels for all matrix multiplications with negligible accuracy loss. Second, we present StreamingLLM, which enables LLMs to handle arbitrarily long text sequences using a fixed memory budget. It exploits "attention sinks" in LLMs to stably anchor attention computation on lengthy contexts. Experiments show StreamingLLM can model over 4 million tokens with up to a 22x speedup over recomputation baselines. Together, these two techniques significantly reduce the computational and memory costs of large language models, increasing their accessibility for practical use.
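
A minimal sketch may make the smoothing idea from the abstract concrete. The snippet below is an illustrative toy under stated assumptions, not the thesis implementation: a per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) migrates activation outliers into the weights, after which both matmul operands quantize to INT8 with little error. The alpha value, tensor shapes, and the simple symmetric per-tensor quantizer are all assumptions.

```python
# Toy sketch of activation smoothing for INT8 quantization (assumed details).
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    act_max = np.abs(X).max(axis=0)     # activation range per input channel
    w_max = np.abs(W).max(axis=1)       # weight range per input channel
    return (act_max ** alpha) / (w_max ** (1 - alpha))

def quantize_int8(T):
    """Symmetric per-tensor INT8 quantization: returns (int8 values, scale)."""
    scale = np.abs(T).max() / 127.0
    return np.round(T / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                          # channel 3 carries large outliers
W = rng.normal(size=(8, 16))

s = smooth_scales(X, W)
X_s, W_s = X / s, W * s[:, None]         # (X / s) @ (s * W) == X @ W exactly

Xq, sx = quantize_int8(X_s)
Wq, sw = quantize_int8(W_s)
Y = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
print("max abs error vs full precision:", np.abs(Y - X @ W).max())
```

Because (X / s) @ (s * W) equals X @ W exactly, smoothing is lossless in full precision; it only reshapes the dynamic ranges so that quantization after the transform wastes far fewer INT8 levels on outlier channels.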

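The fixed memory budget in StreamingLLM comes from its cache policy: retain the key/value entries of the first few tokens (the "attention sinks") plus a rolling window of recent tokens, evicting everything in between. The sketch below illustrates that policy only; the class name SinkCache, the sizes, and the list-plus-deque layout are assumptions, and the published method additionally assigns attention positions relative to the cache rather than to the original text.

```python
# Toy sketch of an attention-sink KV cache policy (assumed details).
from collections import deque

class SinkCache:
    """Fixed-size KV cache: first n_sinks tokens plus a rolling recent window."""
    def __init__(self, n_sinks=4, window=8):
        self.n_sinks = n_sinks
        self.sinks = []                      # KVs of the first tokens, kept forever
        self.recent = deque(maxlen=window)   # rolling window; oldest KV auto-evicted

    def append(self, kv_entry):
        if len(self.sinks) < self.n_sinks:
            self.sinks.append(kv_entry)      # the earliest tokens become the sinks
        else:
            self.recent.append(kv_entry)

    def context(self):
        """KV entries attended to at this step; never exceeds n_sinks + window."""
        return self.sinks + list(self.recent)

cache = SinkCache(n_sinks=4, window=8)
for t in range(100_000):                     # an arbitrarily long stream
    cache.append(f"kv[{t}]")
assert len(cache.context()) == 12            # 4 sinks + 8 recent, regardless of t
print(cache.context()[:5])                   # sinks stay anchored at the stream start
```

Under this policy the attended context never grows, which is what lets generation run over streams of millions of tokens without unbounded cache growth or the cost of recomputing past states.
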
Bibliographic Details
Main Author: Xiao, Guangxuan
Other Authors: Han, Song
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: S.M., May 2024
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/)
Online Access: https://hdl.handle.net/1721.1/156332
ORCID: https://orcid.org/0000-0002-7182-9284