Double Consistency Regularization for Transformer Networks

Large, deep neural networks based on the Transformer model are very powerful on sequence tasks, but they are prone to overfitting on small-scale training data. Moreover, the model's prediction quality degrades significantly when a small perturbation is applied to the input. In this work, we propose a double consistency regularization (DOCR) method for end-to-end model structures that separately constrains the outputs of the encoder and the decoder during training to alleviate these problems. Specifically, on top of the cross-entropy loss, we build a mean model by integrating the model parameters of previous training rounds, and we measure the consistency between the models by computing the KL divergence between the encoder output features and the decoder output probability distributions of the mean model and the base model, thereby imposing regularization constraints on the model's solution space. We conducted extensive experiments on machine translation tasks, and the results show that the BLEU score increased by 2.60 on average, demonstrating the effectiveness of DOCR in improving model performance and its complementarity with other regularization techniques.
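
As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch-style sketch: a mean model maintained as an exponential moving average of the base model's parameters across training rounds, plus a loss combining cross-entropy with two KL consistency terms, one on encoder output features and one on decoder output distributions. All function names, the EMA decay, and the weights alpha/beta are assumptions for illustration, not values or an implementation taken from the paper.

    import torch
    import torch.nn.functional as F

    def update_mean_model(base_model, mean_model, decay=0.999):
        # Integrate the base model's parameters into the mean model via an
        # exponential moving average over training rounds (assumed scheme;
        # decay=0.999 is a placeholder, not a value from the paper).
        with torch.no_grad():
            for p_mean, p_base in zip(mean_model.parameters(),
                                      base_model.parameters()):
                p_mean.mul_(decay).add_(p_base, alpha=1.0 - decay)

    def docr_loss(enc_base, enc_mean, dec_logits_base, dec_logits_mean,
                  targets, pad_id, alpha=1.0, beta=1.0):
        # Cross-entropy on the base model's decoder output; logits are
        # (batch, seq_len, vocab) and targets are (batch, seq_len).
        ce = F.cross_entropy(dec_logits_base.transpose(1, 2), targets,
                             ignore_index=pad_id)
        # Encoder-side consistency: KL divergence between softmax-normalised
        # encoder output features of the mean model (treated as the target,
        # no gradient) and those of the base model.
        kl_enc = F.kl_div(F.log_softmax(enc_base, dim=-1),
                          F.softmax(enc_mean.detach(), dim=-1),
                          reduction='batchmean')
        # Decoder-side consistency: KL divergence between the two models'
        # output probability distributions.
        kl_dec = F.kl_div(F.log_softmax(dec_logits_base, dim=-1),
                          F.softmax(dec_logits_mean.detach(), dim=-1),
                          reduction='batchmean')
        # Total loss: cross-entropy plus the two weighted consistency terms.
        return ce + alpha * kl_enc + beta * kl_dec

In a training loop, one would presumably initialise the mean model as a deep copy of the base model, run both models on the same (optionally perturbed) batch, backpropagate docr_loss through the base model only, and then call update_mean_model after each step.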

Bibliographic Details
Main Authors: Yuxian Wan, Wenlin Zhang, Zhen Li
Affiliation: School of Information System Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
Format: Article
Language: English
Published: MDPI AG, 2023-10-01
Series: Electronics, Vol. 12, Issue 20, Article 4357
ISSN: 2079-9292
DOI: 10.3390/electronics12204357
Subjects: cross-entropy loss; deep neural network; KL divergence; overfitting; transformer; regularization
Online Access: https://www.mdpi.com/2079-9292/12/20/4357