Efficient Latent Space Compression for Lightning-Fast Fine-Tuning and Inference of Transformer-Based Models
This paper presents a technique to reduce the number of parameters in a transformer-based encoder–decoder architecture by incorporating autoencoders. To discover the optimal compression, we trained different autoencoders on the embedding space (the encoder's output) of several pre-trained models. The experiments reveal that reducing the embedding size can dramatically decrease GPU memory usage while speeding up the inference process. The proposed architecture was integrated into the BART model and tested on summarization, translation, and classification tasks. The summarization results show that a 60% reduction in decoder size (from 96 M to 40 M parameters) makes inference twice as fast and uses less than half the GPU memory during fine-tuning, with only a 4.5% drop in ROUGE-1 score. The same trend holds for translation and, partially, for classification. Our approach reduces the GPU memory usage and processing time of large-scale sequence-to-sequence models for both fine-tuning and inference. The implementation and checkpoints are available on GitHub.
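As a rough illustration of the approach described in the abstract, the sketch below trains a small autoencoder to reconstruct a frozen pre-trained encoder's output, the step used to find a compact embedding that a smaller decoder can consume. This is a minimal sketch, not the authors' released implementation: the checkpoint name, the single-linear-layer autoencoder, and the latent size of 256 are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BartModel, BartTokenizer

class EmbeddingAutoencoder(nn.Module):
    """Compress each d_model-dim encoder state to a smaller latent and reconstruct it.

    Layer shapes are assumptions for illustration; the paper compares several
    autoencoder variants to find the optimal compression.
    """
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.compress = nn.Linear(d_model, d_latent)      # encoder side of the AE
        self.reconstruct = nn.Linear(d_latent, d_model)   # decoder side, for AE training

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        latent = self.compress(hidden_states)             # (batch, seq_len, d_latent)
        return self.reconstruct(latent)                   # back to (batch, seq_len, d_model)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartModel.from_pretrained("facebook/bart-base")
ae = EmbeddingAutoencoder(d_model=bart.config.d_model, d_latent=256)

# Reconstruction training on frozen encoder outputs.
batch = tokenizer(["An example input sentence."], return_tensors="pt")
with torch.no_grad():
    enc_states = bart.encoder(**batch).last_hidden_state
loss = nn.functional.mse_loss(ae(enc_states), enc_states)
loss.backward()  # one illustrative gradient step; a real run would use an optimizer loop
```

In the paper's setup, the compressed latent (rather than the reconstruction) would then feed a correspondingly smaller decoder; the sketch stops at the autoencoder reconstruction step on which the compression search relies.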
Main Authors: | Ala Alam Falaki, Robin Gras |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-07-01 |
Series: | Machine Learning and Knowledge Extraction |
Subjects: | transformers; autoencoder (AE); sequence-to-sequence (seq2seq); compression; summarization; translation |
Online Access: | https://www.mdpi.com/2504-4990/5/3/45 |
author | Ala Alam Falaki; Robin Gras
collection | DOAJ |
description | This paper presents a technique to reduce the number of parameters in a transformer-based encoder–decoder architecture by incorporating autoencoders. To discover the optimal compression, we trained different autoencoders on the embedding space (the encoder's output) of several pre-trained models. The experiments reveal that reducing the embedding size can dramatically decrease GPU memory usage while speeding up the inference process. The proposed architecture was integrated into the BART model and tested on summarization, translation, and classification tasks. The summarization results show that a 60% reduction in decoder size (from 96 M to 40 M parameters) makes inference twice as fast and uses less than half the GPU memory during fine-tuning, with only a 4.5% drop in ROUGE-1 score. The same trend holds for translation and, partially, for classification. Our approach reduces the GPU memory usage and processing time of large-scale sequence-to-sequence models for both fine-tuning and inference. The implementation and checkpoints are available on GitHub.
format | Article |
id | doaj.art-f0e855c71a194b90bc4823f2223c9a8a |
institution | Directory Open Access Journal |
issn | 2504-4990 |
language | English |
publishDate | 2023-07-01 |
publisher | MDPI AG |
series | Machine Learning and Knowledge Extraction |
doi | 10.3390/make5030045
citation | Machine Learning and Knowledge Extraction, Vol. 5, Issue 3 (2023-07-01), pp. 847–867
affiliation | Ala Alam Falaki: School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada
affiliation | Robin Gras: School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada
title | Efficient Latent Space Compression for Lightning-Fast Fine-Tuning and Inference of Transformer-Based Models |
topic | transformers; autoencoder (AE); sequence-to-sequence (seq2seq); compression; summarization; translation
url | https://www.mdpi.com/2504-4990/5/3/45 |