Efficient Latent Space Compression for Lightning-Fast Fine-Tuning and Inference of Transformer-Based Models

This paper presents a technique to reduce the number of parameters in a transformer-based encoder–decoder architecture by incorporating autoencoders. To discover the optimal compression, we trained different autoencoders on the embedding space (the encoder's output) of several pre-trained models. The experiments reveal that reducing the embedding size can dramatically decrease GPU memory usage while speeding up inference. The proposed architecture was integrated into the BART model and tested on summarization, translation, and classification tasks. The summarization results show that a 60% decoder size reduction (from 96 M to 40 M parameters) makes inference twice as fast and uses less than half the GPU memory during fine-tuning, at the cost of only a 4.5% drop in ROUGE-1 score. The same trend holds for translation and, partially, for classification. Our approach reduces the GPU memory usage and processing time of large-scale sequence-to-sequence models for both fine-tuning and inference. The implementation and checkpoints are available on GitHub.
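
The compression idea itself is simple to sketch: an autoencoder is trained to reconstruct the encoder's output embeddings, and its bottleneck representation then feeds a narrower decoder. Below is a minimal PyTorch sketch of that wiring; the class name, layer shapes, and the 768-to-256 compression ratio are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Autoencoder over the encoder's output embedding space.

    The full module is trained with a reconstruction loss; afterwards
    only `compress` stays in the seq2seq pipeline, mapping encoder
    states down to the width the smaller decoder expects.
    """
    def __init__(self, embed_dim: int = 768, compressed_dim: int = 256):
        super().__init__()
        self.compress = nn.Linear(embed_dim, compressed_dim)      # bottleneck
        self.reconstruct = nn.Linear(compressed_dim, embed_dim)   # training-only half

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, embed_dim) from a frozen pre-trained encoder
        return self.reconstruct(self.compress(hidden))

# Reconstruction pre-training on (stand-in) frozen encoder outputs.
ae = LatentCompressor()
optimizer = torch.optim.Adam(ae.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

encoder_out = torch.randn(8, 128, 768)  # placeholder for BART encoder states
optimizer.zero_grad()
loss = loss_fn(ae(encoder_out), encoder_out)
loss.backward()
optimizer.step()

Because a transformer decoder's parameter count scales with its hidden width, rebuilding the decoder at the compressed dimension is plausibly what yields the reported 60% reduction (96 M to 40 M parameters); the exact decoder modification is described in the full paper, and the sketch above covers only the autoencoder stage.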

Bibliographic Details
Main Authors: Ala Alam Falaki, Robin Gras (School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada)
Format: Article
Language: English
Published: MDPI AG, 2023-07-01
Series: Machine Learning and Knowledge Extraction, Vol. 5, No. 3, pp. 847–867
ISSN: 2504-4990
DOI: 10.3390/make5030045
Subjects: transformers; autoencoder (AE); sequence-to-sequence (seq2seq); compression; summarization; translation
Online Access: https://www.mdpi.com/2504-4990/5/3/45