Double Consistency Regularization for Transformer Networks

Large, deep neural networks based on the Transformer model are very powerful on sequence tasks, but they are prone to overfitting on small-scale training data. Moreover, the model's prediction quality degrades significantly when a small perturbation is applied to the input. In this work, we propose a double consistency regularization (DOCR) method for end-to-end model structures that separately constrains the outputs of the encoder and the decoder during training to alleviate these problems. Specifically, on top of the cross-entropy loss, we build a mean model by integrating the model parameters of previous training rounds, and we measure the consistency between the models by computing the KL divergence between the encoder output features and the decoder output probability distributions of the mean model and the base model, thereby imposing regularization constraints on the model's solution space. We conducted extensive experiments on machine translation tasks, and the results show that the BLEU score increased by 2.60 on average, demonstrating the effectiveness of DOCR in improving model performance and its complementarity with other regularization techniques.
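
As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch-style sketch: a mean model maintained as an exponential moving average of the base model's parameters across training rounds, plus a loss combining cross-entropy with two KL consistency terms, one on encoder output features and one on decoder output distributions. All function names, the EMA decay, and the weights alpha/beta are assumptions for illustration, not values or an implementation taken from the paper.

    import torch
    import torch.nn.functional as F

    def update_mean_model(base_model, mean_model, decay=0.999):
        # Integrate the base model's parameters into the mean model via an
        # exponential moving average over training rounds (assumed scheme;
        # decay=0.999 is a placeholder, not a value from the paper).
        with torch.no_grad():
            for p_mean, p_base in zip(mean_model.parameters(),
                                      base_model.parameters()):
                p_mean.mul_(decay).add_(p_base, alpha=1.0 - decay)

    def docr_loss(enc_base, enc_mean, dec_logits_base, dec_logits_mean,
                  targets, pad_id, alpha=1.0, beta=1.0):
        # Cross-entropy on the base model's decoder output; logits are
        # (batch, seq_len, vocab) and targets are (batch, seq_len).
        ce = F.cross_entropy(dec_logits_base.transpose(1, 2), targets,
                             ignore_index=pad_id)
        # Encoder-side consistency: KL divergence between softmax-normalised
        # encoder output features of the mean model (treated as the target,
        # no gradient) and those of the base model.
        kl_enc = F.kl_div(F.log_softmax(enc_base, dim=-1),
                          F.softmax(enc_mean.detach(), dim=-1),
                          reduction='batchmean')
        # Decoder-side consistency: KL divergence between the two models'
        # output probability distributions.
        kl_dec = F.kl_div(F.log_softmax(dec_logits_base, dim=-1),
                          F.softmax(dec_logits_mean.detach(), dim=-1),
                          reduction='batchmean')
        # Total loss: cross-entropy plus the two weighted consistency terms.
        return ce + alpha * kl_enc + beta * kl_dec

In a training loop, one would presumably initialise the mean model as a deep copy of the base model, run both models on the same (optionally perturbed) batch, backpropagate docr_loss through the base model only, and then call update_mean_model after each step.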

Bibliographic Details
Main Authors: Yuxian Wan, Wenlin Zhang, Zhen Li
Affiliation: School of Information System Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
Format: Article
Language: English
Published: MDPI AG, 2023-10-01
Series: Electronics, Vol. 12, Issue 20, Article 4357
ISSN: 2079-9292
DOI: 10.3390/electronics12204357
Subjects: cross-entropy loss; deep neural network; KL divergence; overfitting; transformer; regularization
Online Access: https://www.mdpi.com/2079-9292/12/20/4357