Knowledge Distillation: A Method for Making Neural Machine Translation More Efficient

Neural machine translation (NMT) systems have greatly improved the quality available from machine translation (MT) compared to statistical machine translation (SMT) systems. However, these state-of-the-art NMT models need much more computing power and data than SMT models, a requirement that is unsu...


Bibliographic Details
Main Authors:	Wandri Jooste, Rejwanul Haque, Andy Way
Format:	Article
Language:	English
Published:	MDPI AG 2022-02-01
Series:	Information
Subjects:	NMT; Green AI; knowledge distillation; CO₂ savings
Online Access:	https://www.mdpi.com/2078-2489/13/2/88

collection	DOAJ
description	Neural machine translation (NMT) systems have greatly improved the quality available from machine translation (MT) compared to statistical machine translation (SMT) systems. However, these state-of-the-art NMT models need much more computing power and data than SMT models, a requirement that is unsustainable in the long run and of very limited benefit in low-resource scenarios. To some extent, model compression—more specifically state-of-the-art knowledge distillation techniques—can remedy this. In this work, we investigate knowledge distillation on a simulated low-resource German-to-English translation task. We show that sequence-level knowledge distillation can be used to train small student models on knowledge distilled from large teacher models. Part of this work examines the influence of hyperparameter tuning on model performance when lowering the number of Transformer heads or limiting the vocabulary size. Interestingly, the accuracy of these student models is higher than that of the teachers in some cases even though the student model training times are shorter in some cases. In a novel contribution, we demonstrate for a specific MT service provider that in the post-deployment phase, distilled student models can reduce emissions, as well as cost purely in monetary terms, by almost 50%.
first_indexed	2024-03-09T21:43:17Z
format	Article
id	doaj.art-9f0b589c1a6c4230a4b6836bec89160d
institution	Directory Open Access Journal
issn	2078-2489
language	English
last_indexed	2024-03-09T21:43:17Z
publishDate	2022-02-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-9f0b589c1a6c4230a4b6836bec89160d | 2023-11-23T20:25:38Z | eng | MDPI AG | Information | 2078-2489 | 2022-02-01 | 13(2), 88 | 10.3390/info13020088 | Knowledge Distillation: A Method for Making Neural Machine Translation More Efficient | Wandri Jooste (ML-Labs, ADAPT Centre, Dublin City University, D09 Y074 Dublin, Ireland) | Rejwanul Haque (School of Computing, National College of Ireland, D01 Y300 Dublin, Ireland) | Andy Way (ML-Labs, ADAPT Centre, Dublin City University, D09 Y074 Dublin, Ireland) | https://www.mdpi.com/2078-2489/13/2/88 | NMT; Green AI; knowledge distillation; CO₂ savings
title	Knowledge Distillation: A Method for Making Neural Machine Translation More Efficient
topic	NMT; Green AI; knowledge distillation; CO₂ savings
url	https://www.mdpi.com/2078-2489/13/2/88


Similar Items