Paragraph Boundary Recognition in Novels for Story Understanding

The understanding of narrative stories by computer is an important task for their automatic generation. To date, high-performance neural-network technologies such as BERT have been applied to tasks such as the Story Cloze Test and Story Completion. In this study, we focus on the text segmentation of...


Bibliographic Details
Main Authors: Riku Iikura, Makoto Okada, Naoki Mori
Format: Article
Language: English
Published: MDPI AG 2021-06-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/11/12/5632
_version_ 1797529679526100992
author Riku Iikura
Makoto Okada
Naoki Mori
author_facet Riku Iikura
Makoto Okada
Naoki Mori
author_sort Riku Iikura
collection DOAJ
description The understanding of narrative stories by computer is an important task for their automatic generation. To date, high-performance neural-network technologies such as BERT have been applied to tasks such as the Story Cloze Test and Story Completion. In this study, we focus on the text segmentation of novels into paragraphs, which is an important writing technique for readers to deepen their understanding of the texts. This type of segmentation, which we call “paragraph boundary recognition”, can be considered a binary classification problem: whether or not a paragraph boundary is present between the target sentences. However, in this case, the data imbalance becomes a bottleneck because the number of paragraphs is generally smaller than the number of sentences. To deal with this problem, we introduced several cost-sensitive loss functions, namely, focal loss, dice loss, and anchor loss, which were robust for imbalanced classification in BERT. In addition, introducing the threshold-moving technique into the model was effective in estimating paragraph boundaries. In experiments on three newly created datasets, BERT with dice loss and threshold moving obtained a higher F1 score than the original BERT trained with cross-entropy loss (76% to 80%, 50% to 54%, 59% to 63%).
first_indexed 2024-03-10T10:17:12Z
format Article
id doaj.art-0e5797fc59404588af49350884a2702a
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T10:17:12Z
publishDate 2021-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-0e5797fc59404588af49350884a2702a
2023-11-22T00:39:59Z
eng
MDPI AG
Applied Sciences
2076-3417
2021-06-01
11
12
5632
10.3390/app11125632
Paragraph Boundary Recognition in Novels for Story Understanding
Riku Iikura (0)
Makoto Okada (1)
Naoki Mori (2)
(0) Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Naka-ku, Sakai, Osaka 599-8231, Japan
(1) Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Naka-ku, Sakai, Osaka 599-8231, Japan
(2) Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Naka-ku, Sakai, Osaka 599-8231, Japan
The understanding of narrative stories by computer is an important task for their automatic generation. To date, high-performance neural-network technologies such as BERT have been applied to tasks such as the Story Cloze Test and Story Completion. In this study, we focus on the text segmentation of novels into paragraphs, which is an important writing technique for readers to deepen their understanding of the texts. This type of segmentation, which we call “paragraph boundary recognition”, can be considered a binary classification problem: whether or not a paragraph boundary is present between the target sentences. However, in this case, the data imbalance becomes a bottleneck because the number of paragraphs is generally smaller than the number of sentences. To deal with this problem, we introduced several cost-sensitive loss functions, namely, focal loss, dice loss, and anchor loss, which were robust for imbalanced classification in BERT. In addition, introducing the threshold-moving technique into the model was effective in estimating paragraph boundaries. In experiments on three newly created datasets, BERT with dice loss and threshold moving obtained a higher F1 score than the original BERT trained with cross-entropy loss (76% to 80%, 50% to 54%, 59% to 63%).
https://www.mdpi.com/2076-3417/11/12/5632
natural-language processing
story understanding
text segmentation
imbalanced classification
BERT
cost-sensitive loss
spellingShingle Riku Iikura
Makoto Okada
Naoki Mori
Paragraph Boundary Recognition in Novels for Story Understanding
Applied Sciences
natural-language processing
story understanding
text segmentation
imbalanced classification
BERT
cost-sensitive loss
title Paragraph Boundary Recognition in Novels for Story Understanding
title_full Paragraph Boundary Recognition in Novels for Story Understanding
title_fullStr Paragraph Boundary Recognition in Novels for Story Understanding
title_full_unstemmed Paragraph Boundary Recognition in Novels for Story Understanding
title_short Paragraph Boundary Recognition in Novels for Story Understanding
title_sort paragraph boundary recognition in novels for story understanding
topic natural-language processing
story understanding
text segmentation
imbalanced classification
BERT
cost-sensitive loss
url https://www.mdpi.com/2076-3417/11/12/5632
work_keys_str_mv AT rikuiikura paragraphboundaryrecognitioninnovelsforstoryunderstanding
AT makotookada paragraphboundaryrecognitioninnovelsforstoryunderstanding
AT naokimori paragraphboundaryrecognitioninnovelsforstoryunderstanding
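
Illustrative note on the abstract: the description above frames paragraph boundary recognition as an imbalanced binary classification problem and names focal loss, dice loss, and threshold moving as remedies. The sketch below shows, in plain Python, what those three ideas can look like in their commonly used textbook forms. It is not the authors' implementation: the function names, the `gamma` and `smooth` parameters, the candidate thresholds, and the toy probabilities are all assumptions made for illustration, and the dice loss variant used in the article may differ from the soft dice form shown here.

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sentence pair (textbook form, assumed here).

    p is the predicted probability that a paragraph boundary is present,
    y is the gold label (1 = boundary, 0 = no boundary).
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    # The (1 - p_t)^gamma factor down-weights examples that are already easy.
    return -((1.0 - p_t) ** gamma) * math.log(max(p_t, eps))


def soft_dice_loss(probs, labels, smooth=1.0):
    """One common soft dice loss over a batch (assumed variant)."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(labels) + smooth)


def f1_score(probs, labels, threshold):
    """F1 of the boundary class when predicting 1 for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for pr, y in zip(preds, labels) if pr == 1 and y == 1)
    fp = sum(1 for pr, y in zip(preds, labels) if pr == 1 and y == 0)
    fn = sum(1 for pr, y in zip(preds, labels) if pr == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def best_threshold(probs, labels, candidates):
    """Threshold moving: pick the candidate with the highest F1 on held-out data."""
    return max(candidates, key=lambda t: f1_score(probs, labels, t))


# Toy validation data (hypothetical): boundaries are rare, and the classifier
# gives them lower probabilities than the default 0.5 cut-off expects.
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
probs = [0.1, 0.2, 0.05, 0.4, 0.3, 0.1, 0.6, 0.2, 0.15, 0.1]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

chosen = best_threshold(probs, labels, candidates)
print("F1 at 0.5:", f1_score(probs, labels, 0.5))
print("chosen threshold:", chosen, "F1:", f1_score(probs, labels, chosen))
```

On this toy data the default threshold of 0.5 misses one of the two boundaries, while a lower threshold chosen on held-out data recovers both. That is the general motivation for threshold moving when the positive class is rare.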