Semantic Coherence Dataset: Speech transcripts

The Semantic Coherence Dataset has been designed to experiment with semantic coherence metrics. More specifically, the dataset was built to test whether probabilistic measures, such as perplexity, provide stable scores for analyzing spoken language. Perplexity, which was originally...

Bibliographic Details
Main Authors:	Davide Colla, Matteo Delsanto, Daniele P. Radicioni
Format:	Article
Language:	English
Published:	Elsevier 2023-02-01
Series:	Data in Brief
Subjects:	Perplexity metrics; Intra-subject semantic reliability; Inter-subject semantic reliability; Language models; Speech transcripts; Spoken language analysis
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340922010022

_version_	1797938128918413312
author	Davide Colla, Matteo Delsanto, Daniele P. Radicioni
author_sort	Davide Colla
collection	DOAJ
description	The Semantic Coherence Dataset has been designed to experiment with semantic coherence metrics. More specifically, the dataset was built to test whether probabilistic measures, such as perplexity, provide stable scores for analyzing spoken language. Perplexity, which was originally conceived as an information-theoretic measure to assess the probabilistic inference properties of language models, has recently been shown to be an appropriate tool for categorizing speech transcripts based on semantic coherence accounts. In particular, perplexity has been successfully employed to discriminate between subjects with Alzheimer's disease and healthy controls. The collected data consist of speech transcripts intended to investigate semantic coherence at different levels, and are arranged into two classes: one for intra-subject semantic coherence and one for inter-subject semantic coherence. In the former case, transcripts from a single speaker can be used to train and test language models and to explore whether the perplexity metric provides stable scores in assessing talks from that speaker, while still distinguishing between two different forms of speech, political rallies and interviews. In the latter case, models can be trained on transcripts from a given speaker and then used to measure how stable the perplexity metric is when computed using the model from that speaker and transcripts from different speakers. Transcripts were extracted from talks lasting almost 13 hours (overall 12:45:17 and 120,326 tokens) for the former class, and almost 30 hours (29:47:34 and 252,270 tokens) for the latter. The data can be reused to perform analyses on measures built on top of language models and, more generally, on measures aimed at exploring the linguistic features of text documents.
first_indexed	2024-04-10T18:54:52Z
format	Article
id	doaj.art-9680c9011a74489aaf083c8708da75dd
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2024-04-10T18:54:52Z
publishDate	2023-02-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-9680c9011a74489aaf083c8708da75dd | 2023-02-01T04:26:07Z | eng | Elsevier | Data in Brief | 2352-3409 | 2023-02-01 | vol. 46, art. 108799 | Semantic Coherence Dataset: Speech transcripts | Davide Colla (University of Turin, Italy) | Matteo Delsanto (University of Turin, Italy) | Daniele P. Radicioni (corresponding author; University of Turin, Italy) | http://www.sciencedirect.com/science/article/pii/S2352340922010022 | Perplexity metrics; Intra-subject semantic reliability; Inter-subject semantic reliability; Language models; Speech transcripts; Spoken language analysis
title	Semantic Coherence Dataset: Speech transcripts
topic	Perplexity metrics; Intra-subject semantic reliability; Inter-subject semantic reliability; Language models; Speech transcripts; Spoken language analysis
url	http://www.sciencedirect.com/science/article/pii/S2352340922010022
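
Illustrative note (not part of the record): the description field above says that a language model can be trained on one speaker's transcripts and then scored by perplexity on other transcripts. The Python sketch below shows that idea in its simplest form. It assumes a toy bigram model with add-one smoothing, which is not necessarily the kind of model the authors used; the function names train_bigram and perplexity and the example sentences are hypothetical and do not come from the dataset.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect context and bigram counts from one speaker's tokens (hypothetical helper)."""
    padded = ["<s>"] + tokens + ["</s>"]
    contexts = Counter(padded[:-1])                  # how often each token precedes another
    bigrams = Counter(zip(padded[:-1], padded[1:]))  # adjacent token pairs
    vocab = set(padded)
    return contexts, bigrams, vocab


def perplexity(model, tokens):
    """Perplexity of a token sequence under an add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot stands in for any unseen word
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        # Add-one (Laplace) smoothing keeps unseen bigrams from having zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    n = len(padded) - 1  # number of predicted tokens, including the end marker
    return math.exp(-log_prob / n)


# Toy data, invented for illustration only.
speaker_model = train_bigram("we will win we will win big".split())
ppl_same = perplexity(speaker_model, "we will win".split())
ppl_other = perplexity(speaker_model, "the weather is mild today".split())
print(ppl_same, ppl_other)
```

In this toy setting, text that resembles the training speaker receives a lower perplexity than unrelated text. The intra-subject and inter-subject classes of the dataset are meant to test how stable such scores are on real transcripts.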
