Efficient Deep Learning Computing: From TinyML to LargeLM
Main Author: | Lin, Ji
---|---
Other Authors: | Han, Song
Format: | Thesis
Published: | Massachusetts Institute of Technology, 2024
Online Access: | https://hdl.handle.net/1721.1/153837 https://orcid.org/0000-0001-6053-4344
_version_ | 1826191320010457088 |
---|---|
author | Lin, Ji |
author2 | Han, Song |
author_facet | Han, Song Lin, Ji |
author_sort | Lin, Ji |
collection | MIT |
description | Deep learning has prevailed in various fields and fundamentally changed human society. Efficiency is the key factor in democratizing deep learning and broadening its applications. It is increasingly important as Moore’s law slows down while model sizes keep scaling up. We need efficient algorithms and systems to help us bridge the gap.
In this thesis, we discuss techniques to improve the efficiency of deep learning by removing redundancies. We study efficient deep learning computing at the two extremes of scaling: tiny machine learning (TinyML) and large language models (LLMs). TinyML aims to run deep learning models on low-power IoT devices with tight memory constraints. We explored a system-algorithm co-design approach to remove redundant memory usage and enable real-life applications on commercial microcontrollers, achieving a milestone ImageNet accuracy of 70% for the first time. We further extended the solution from inference to training, enabling on-device learning under only 256 KB of memory. Similar to TinyML, the gigantic model sizes of LLMs exceed the capability of even the most advanced GPUs. We developed post-training quantization schemes for different serving workloads to reduce redundant bits in weights and activations, enabling W8A8 quantization (SmoothQuant) for compute-bound inference and W4A16 quantization (AWQ) for memory-bound inference. We further developed TinyChat, an efficient and Python-native serving system, to realize the speedup from quantization. Finally, we discuss domain-specific optimization opportunities, including efficient video recognition with the Temporal Shift Module (TSM) and image generation with Anycost GANs, where we reduce application-specific redundancies with specialized model designs. |
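The abstract names the W8A8 scheme (SmoothQuant) without spelling it out; below is a minimal NumPy sketch of the general smoothing-then-quantization idea it refers to, not the thesis implementation. The shapes, `alpha=0.5`, the per-tensor symmetric quantizer, and all helper names are illustrative assumptions.

```python
# Sketch: migrate quantization difficulty from activations to weights, then
# run the layer with 8-bit integers. Illustrative only; not the thesis code.
import numpy as np

def smooth_and_quantize(X, W, alpha=0.5, n_bits=8):
    # Per-input-channel smoothing scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha):
    # outlier activation channels are divided down, and the weights absorb the factor.
    act_max = np.abs(X).max(axis=0)            # shape [in_features]
    w_max = np.abs(W).max(axis=1)              # shape [in_features]; W is [in, out]
    s = np.maximum(act_max, 1e-5) ** alpha / np.maximum(w_max, 1e-5) ** (1 - alpha)

    X_s = X / s                                # smoothed activations, easier to quantize
    W_s = W * s[:, None]                       # rescaled weights; the layer is unchanged mathematically

    def quantize(t):                           # symmetric per-tensor quantization
        qmax = 2 ** (n_bits - 1) - 1           # 127 for INT8
        scale = np.abs(t).max() / qmax
        q = np.round(t / scale).clip(-qmax - 1, qmax).astype(np.int32)
        return q, scale

    Xq, sx = quantize(X_s)
    Wq, sw = quantize(W_s)
    return (Xq @ Wq) * (sx * sw)               # integer matmul, then dequantize

# Quick check against the FP32 reference on activations with a few outlier channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)) * np.where(np.arange(64) % 8 == 0, 50.0, 1.0)
W = rng.normal(size=(64, 32)) * 0.1
print(np.abs(smooth_and_quantize(X, W) - X @ W).mean())
```

Likewise, the temporal shift operation that TSM is built around can be summarized in a few lines; the 1/8 shift fraction and the tensor layout below are assumptions for illustration.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    # x: [batch, frames, channels, height, width]. Shift 1/shift_div of the channels
    # one step forward in time and another 1/shift_div backward, so per-frame 2D
    # convolutions can exchange information across neighboring frames at zero extra FLOPs.
    out = np.zeros_like(x)
    fold = x.shape[2] // shift_div
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shifted forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shifted backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out
```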
first_indexed | 2024-09-23T08:54:05Z |
format | Thesis |
id | mit-1721.1/153837 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T08:54:05Z |
publishDate | 2024 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/153837 2024-03-22T04:04:51Z Efficient Deep Learning Computing: From TinyML to LargeLM Lin, Ji Han, Song Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Deep learning has prevailed in various fields and fundamentally changed human society. Efficiency is the key factor in democratizing deep learning and broadening its applications. It is increasingly important as Moore’s law slows down while model sizes keep scaling up. We need efficient algorithms and systems to help us bridge the gap. In this thesis, we discuss techniques to improve the efficiency of deep learning by removing redundancies. We study efficient deep learning computing at the two extremes of scaling: tiny machine learning (TinyML) and large language models (LLMs). TinyML aims to run deep learning models on low-power IoT devices with tight memory constraints. We explored a system-algorithm co-design approach to remove redundant memory usage and enable real-life applications on commercial microcontrollers, achieving a milestone ImageNet accuracy of 70% for the first time. We further extended the solution from inference to training, enabling on-device learning under only 256 KB of memory. Similar to TinyML, the gigantic model sizes of LLMs exceed the capability of even the most advanced GPUs. We developed post-training quantization schemes for different serving workloads to reduce redundant bits in weights and activations, enabling W8A8 quantization (SmoothQuant) for compute-bound inference and W4A16 quantization (AWQ) for memory-bound inference. We further developed TinyChat, an efficient and Python-native serving system, to realize the speedup from quantization. Finally, we discuss domain-specific optimization opportunities, including efficient video recognition with the Temporal Shift Module (TSM) and image generation with Anycost GANs, where we reduce application-specific redundancies with specialized model designs. Ph.D. 2024-03-21T19:09:19Z 2024-03-21T19:09:19Z 2024-02 2024-02-21T17:18:52.793Z Thesis https://hdl.handle.net/1721.1/153837 https://orcid.org/0000-0001-6053-4344 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Lin, Ji Efficient Deep Learning Computing: From TinyML to LargeLM |
title | Efficient Deep Learning Computing: From TinyML to LargeLM |
title_full | Efficient Deep Learning Computing: From TinyML to LargeLM |
title_fullStr | Efficient Deep Learning Computing: From TinyML to LargeLM |
title_full_unstemmed | Efficient Deep Learning Computing: From TinyML to LargeLM |
title_short | Efficient Deep Learning Computing: From TinyML to LargeLM |
title_sort | efficient deep learning computing from tinyml to largelm |
url | https://hdl.handle.net/1721.1/153837 https://orcid.org/0000-0001-6053-4344 |
work_keys_str_mv | AT linji efficientdeeplearningcomputingfromtinymltolargelm |