Sentence‐Chain Based Seq2seq Model for Corpus Expansion
This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied for addressing language generation issues; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence‐chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used for the encoder of the seq2seq model, while the last sentence becomes a target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% relative perplexity over a baseline language model of Korean text. Additionally, from a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4‐times the number of n‐grams with superior performance for English text.
Main Authors: | Euisok Chung, Jeon Gue Park |
---|---|
Format: | Article |
Language: | English |
Published: | Electronics and Telecommunications Research Institute (ETRI), 2017-08-01 |
Series: | ETRI Journal |
Subjects: | Sentence chain; Lexical chain; Seq2seq model; Corpus expansion |
Online Access: | https://doi.org/10.4218/etrij.17.0116.0074 |
author | Euisok Chung; Jeon Gue Park |
collection | DOAJ |
description | This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied for addressing language generation issues; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence‐chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used for the encoder of the seq2seq model, while the last sentence becomes a target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% relative perplexity over a baseline language model of Korean text. Additionally, from a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4‐times the number of n‐grams with superior performance for English text. |
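The triple construction described in the abstract (first two sentences of a chain feed the encoder, the third becomes the decoder target) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper builds sentence chains via lexical chains, whereas this sketch simply slides a window over a given chain; the function and variable names are hypothetical.

```python
def make_triples(sentences):
    """Turn a sentence chain into (encoder_input, decoder_target) pairs.

    A window of three consecutive sentences is used per triple: the first
    two sentences are concatenated as the encoder input, and the third
    sentence is the decoder's target sequence.
    """
    triples = []
    for i in range(len(sentences) - 2):
        encoder_input = sentences[i] + " " + sentences[i + 1]  # first two sentences
        decoder_target = sentences[i + 2]                      # last sentence of the triple
        triples.append((encoder_input, decoder_target))
    return triples

# Example: a four-sentence chain yields two training triples.
chain = ["S1.", "S2.", "S3.", "S4."]
pairs = make_triples(chain)
```

Each resulting pair would then be tokenized and fed to a standard encoder–decoder (seq2seq) model; generated decoder outputs serve as new sentences for expanding the corpus.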
format | Article |
id | doaj.art-2783d2c5ac17465caa8a35ab86efb419 |
institution | Directory Open Access Journal |
issn | 1225-6463; 2233-7326 |
language | English |
publishDate | 2017-08-01 |
publisher | Electronics and Telecommunications Research Institute (ETRI) |
series | ETRI Journal |
title | Sentence‐Chain Based Seq2seq Model for Corpus Expansion |
topic | Sentence chain; Lexical chain; Seq2seq model; Corpus expansion |
url | https://doi.org/10.4218/etrij.17.0116.0074 |