Practical Text Phylogeny for Real-World Settings
The ease with which one can edit and redistribute digital documents on the Internet is one of modernity's great achievements, but it also leads to some vexing problems. With growing academic interest in the study of the evolution of digital writing on the one hand and the rise of disinformation...
Main Authors: | Bingyu Shen, Christopher W. Forstall, Anderson De Rezende Rocha, Walter J. Scheirer |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2018-01-01 |
Series: | IEEE Access |
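Subjects: | Text phylogeny, word embeddings, machine learning, natural language processing, forensics, digital humanities |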
Online Access: | https://ieeexplore.ieee.org/document/8412174/ |
_version_ | 1818618298226442240 |
---|---|
author | Bingyu Shen; Christopher W. Forstall; Anderson De Rezende Rocha; Walter J. Scheirer |
author_facet | Bingyu Shen; Christopher W. Forstall; Anderson De Rezende Rocha; Walter J. Scheirer |
author_sort | Bingyu Shen |
collection | DOAJ |
description | The ease with which one can edit and redistribute digital documents on the Internet is one of modernity's great achievements, but it also leads to some vexing problems. With growing academic interest in the study of the evolution of digital writing on the one hand and the rise of disinformation on the other, the problem of identifying the relationship between texts with similar content is becoming more important. Traditional vector space representations of texts have made progress in solving this problem when it is cast as a reconstruction task that organizes related texts into a tree expressing their relationships; this task is dubbed text phylogeny in the information forensics literature. However, as new text representation methods have been successfully applied to many other text analysis problems, it is worth investigating whether they too can be used in text phylogeny tree reconstruction. In this paper, we explore the use of word embeddings as a text representation method, with the aim of improving the accuracy of reconstructed phylogeny trees for real-world data, and compare them with other widely used text representation methods. We evaluate performance on established benchmarks for this task: a synthetic data set and data collected from Wikipedia. We also apply our framework to a new data set of fan fiction based on some famous fairy tales. Experimental results show that word embeddings are competitive with other feature sets for the published benchmarks, and are highly effective for creative writing. |
first_indexed | 2024-12-16T17:19:22Z |
format | Article |
id | doaj.art-a86d24e7a9314693b87aca97a60a21b0 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-16T17:19:22Z |
publishDate | 2018-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-a86d24e7a9314693b87aca97a60a21b0; 2022-12-21T22:23:12Z; eng; IEEE; IEEE Access; 2169-3536; 2018-01-01; vol. 6, pp. 41002-41012; 10.1109/ACCESS.2018.2856865; 8412174; Practical Text Phylogeny for Real-World Settings; Bingyu Shen (https://orcid.org/0000-0002-0792-7904; Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA); Christopher W. Forstall (Department of Classics, Mount Allison University, Sackville, Canada); Anderson De Rezende Rocha (Institute of Computing, University of Campinas, Campinas, Brazil); Walter J. Scheirer (Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA); https://ieeexplore.ieee.org/document/8412174/; Text phylogeny; word embeddings; machine learning; natural language processing; forensics; digital humanities |
spellingShingle | Bingyu Shen; Christopher W. Forstall; Anderson De Rezende Rocha; Walter J. Scheirer; Practical Text Phylogeny for Real-World Settings; IEEE Access; Text phylogeny; word embeddings; machine learning; natural language processing; forensics; digital humanities |
title | Practical Text Phylogeny for Real-World Settings |
title_full | Practical Text Phylogeny for Real-World Settings |
title_fullStr | Practical Text Phylogeny for Real-World Settings |
title_full_unstemmed | Practical Text Phylogeny for Real-World Settings |
title_short | Practical Text Phylogeny for Real-World Settings |
title_sort | practical text phylogeny for real world settings |
topic | Text phylogeny; word embeddings; machine learning; natural language processing; forensics; digital humanities |
url | https://ieeexplore.ieee.org/document/8412174/ |
work_keys_str_mv | AT bingyushen practicaltextphylogenyforrealworldsettings AT christopherwforstall practicaltextphylogenyforrealworldsettings AT andersonderezenderocha practicaltextphylogenyforrealworldsettings AT walterjscheirer practicaltextphylogenyforrealworldsettings |