PockEngine: Sparse and Efficient Fine-tuning in a Pocket

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the...

Bibliographic Details
Main Authors:	Zhu, Ligeng, Hu, Lanxiang, Lin, Ji, Chen, Wei-Ming, Wang, Wei-Chen, Gan, Chuang, Han, Song
Other Authors:	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format:	Article
Language:	English
Published:	ACM|56th Annual IEEE/ACM International Symposium on Microarchitecture 2024
Online Access:	https://hdl.handle.net/1721.1/153267

collection	MIT
description	On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack optimizations for learning on the edge, which faces the challenges of limited resources and diverse edge hardware. We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. First, PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reduction while maintaining model quality. Second, PockEngine is compilation-first: the entire training graph (including forward, backward and optimization steps) is derived at compile time, which reduces runtime overhead and creates opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, which further reduce training cost. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to 15× speedup over off-the-shelf TensorFlow (Raspberry Pi) and 5.6× memory savings in backpropagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9× faster than PyTorch.
first_indexed	2024-09-23T10:49:50Z
format	Article
id	mit-1721.1/153267
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T10:49:50Z
publishDate	2024
publisher	ACM|56th Annual IEEE/ACM International Symposium on Microarchitecture
record_format	dspace
spelling	mit-1721.1/153267 2024-01-04T18:37:04Z MIT-IBM Watson AI Lab 2024-01-03T18:41:43Z 2024-01-03T18:41:43Z 2023-10-28 2024-01-01T08:48:08Z Article http://purl.org/eprint/type/ConferencePaper 979-8-4007-0329-4 https://hdl.handle.net/1721.1/153267 Zhu, Ligeng, Hu, Lanxiang, Lin, Ji, Chen, Wei-Ming, Wang, Wei-Chen et al. 2023. "PockEngine: Sparse and Efficient Fine-tuning in a Pocket." PUBLISHER_CC en https://doi.org/10.1145/3613424.3614307 Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ The author(s) application/pdf ACM|56th Annual IEEE/ACM International Symposium on Microarchitecture
title	PockEngine: Sparse and Efficient Fine-tuning in a Pocket
url	https://hdl.handle.net/1721.1/153267
