Sentence‐Chain Based Seq2seq Model for Corpus Expansion

This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied for addressing language generation issues; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence‐chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used for the encoder of the seq2seq model, while the last sentence becomes a target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% relative perplexity over a baseline language model of Korean text. Additionally, from a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4‐times the number of n‐grams with superior performance for English text.


Bibliographic Details
Main Authors: Euisok Chung, Jeon Gue Park
Format: Article
Language: English
Published: Electronics and Telecommunications Research Institute (ETRI), 2017-08-01
Series: ETRI Journal
Subjects: Sentence chain; Lexical chain; Seq2seq model; Corpus expansion
Online Access: https://doi.org/10.4218/etrij.17.0116.0074
description This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied for addressing language generation issues; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence‐chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used for the encoder of the seq2seq model, while the last sentence becomes a target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% relative perplexity over a baseline language model of Korean text. Additionally, from a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4‐times the number of n‐grams with superior performance for English text.
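The description above states that training pairs are built from sentence chains treated as triples: the first two sentences feed the encoder, and the third becomes the decoder target. A minimal sketch of how such pairs might be assembled is shown below. Note the assumptions: the paper derives sentence chains from lexical chains, whereas this illustration simply takes consecutive sentences as a stand-in, and the function name `make_triples` is illustrative, not from the paper.

```python
# Sketch (not the authors' code): building seq2seq training pairs from
# sentence-chain triples. The paper selects related sentences via lexical
# chains; here, consecutive sentences serve as a simplified stand-in.

from typing import List, Tuple

def make_triples(sentences: List[str]) -> List[Tuple[str, str]]:
    """Slide a window of three sentences over the text.

    The first two sentences are concatenated as the encoder input;
    the third is the decoder target, mirroring the training setup
    described in the abstract.
    """
    pairs = []
    for i in range(len(sentences) - 2):
        encoder_input = sentences[i] + " " + sentences[i + 1]
        decoder_target = sentences[i + 2]
        pairs.append((encoder_input, decoder_target))
    return pairs

doc = ["The cat sat.", "It purred.", "Then it slept.", "Night fell."]
for src, tgt in make_triples(doc):
    print(src, "->", tgt)
```

Each resulting pair can then be fed to any encoder–decoder (seq2seq) model; the decoder's generated third sentences are what expand the corpus.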
id doaj.art-2783d2c5ac17465caa8a35ab86efb419
institution Directory of Open Access Journals (DOAJ)
issn 1225-6463
2233-7326
citation ETRI Journal, vol. 39, no. 4, pp. 455–466, 2017-08-01; doi:10.4218/etrij.17.0116.0074
topic Sentence chain
Lexical chain
Seq2seq model
Corpus expansion