Efficient Latent Space Compression for Lightning-Fast Fine-Tuning and Inference of Transformer-Based Models

This paper presents a technique to reduce the number of parameters in a transformer-based encoder–decoder architecture by incorporating autoencoders. To discover the optimal compression, we trained different autoencoders on the embedding space (the encoder's output) of several pre-trained models. The experiments reveal that reducing the embedding size can dramatically decrease GPU memory usage while speeding up inference. The proposed architecture was integrated into the BART model and tested on summarization, translation, and classification tasks. The summarization results show that a 60% decoder size reduction (from 96 M to 40 M parameters) makes inference twice as fast and uses less than half the GPU memory during fine-tuning, at the cost of only a 4.5% drop in ROUGE-1 score. The same trend holds for translation and, partially, for classification. Our approach reduces the GPU memory usage and processing time of large-scale sequence-to-sequence models for both fine-tuning and inference. The implementation and checkpoints are available on GitHub.
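
The compression idea itself is simple to sketch: an autoencoder is trained to reconstruct the encoder's output embeddings, and its bottleneck representation then feeds a narrower decoder. Below is a minimal PyTorch sketch of that wiring; the class name, layer shapes, and the 768-to-256 compression ratio are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Autoencoder over the encoder's output embedding space.

    The full module is trained with a reconstruction loss; afterwards
    only `compress` stays in the seq2seq pipeline, mapping encoder
    states down to the width the smaller decoder expects.
    """
    def __init__(self, embed_dim: int = 768, compressed_dim: int = 256):
        super().__init__()
        self.compress = nn.Linear(embed_dim, compressed_dim)      # bottleneck
        self.reconstruct = nn.Linear(compressed_dim, embed_dim)   # training-only half

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, embed_dim) from a frozen pre-trained encoder
        return self.reconstruct(self.compress(hidden))

# Reconstruction pre-training on (stand-in) frozen encoder outputs.
ae = LatentCompressor()
optimizer = torch.optim.Adam(ae.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

encoder_out = torch.randn(8, 128, 768)  # placeholder for BART encoder states
optimizer.zero_grad()
loss = loss_fn(ae(encoder_out), encoder_out)
loss.backward()
optimizer.step()

Because a transformer decoder's parameter count scales with its hidden width, rebuilding the decoder at the compressed dimension is plausibly what yields the reported 60% reduction (96 M to 40 M parameters); the exact decoder modification is described in the full paper, and the sketch above covers only the autoencoder stage.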

Bibliographic Details
Main Authors: Ala Alam Falaki, Robin Gras (School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada)
Format: Article
Language: English
Published: MDPI AG, 2023-07-01
Series: Machine Learning and Knowledge Extraction, Vol. 5, No. 3, pp. 847–867
ISSN: 2504-4990
DOI: 10.3390/make5030045
Subjects: transformers; autoencoder (AE); sequence-to-sequence (seq2seq); compression; summarization; translation
Online Access: https://www.mdpi.com/2504-4990/5/3/45