A scoping review of preprocessing methods for unstructured text data to assess data quality

Introduction Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., preproc...


Bibliographic Details
Main Authors:	Marcello Nesca, Alan Katz, Carson Leung, Lisa Lix
Format:	Article
Language:	English
Published:	Swansea University 2022-10-01
Series:	International Journal of Population Data Science
Subjects:	Review Data Quality Natural Language Processing
Online Access:	https://ijpds.org/article/view/1757

author	Marcello Nesca Alan Katz Carson Leung Lisa Lix
collection	DOAJ
description	Introduction Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. Different NLP methods are used to preprocess UTD and may affect data quality. Objective Our objective was to systematically document current research and practices about NLP preprocessing methods to describe or improve the quality of UTD, including UTD found in EMR databases. Methods A scoping review was undertaken of peer-reviewed studies published between December 2002 and January 2021. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature relevant to the study objective. Information extracted from the studies included article characteristics (i.e., year of publication, journal discipline), data characteristics, types of preprocessing methods, and data quality topics. Study data were presented using a narrative synthesis. Results A total of 41 articles were included in the scoping review; over 50% were published between 2016 and 2021. Almost 20% of the articles were published in health science journals. Common preprocessing methods included removal of extraneous text elements such as stop words, punctuation, and numbers; word tokenization; and part-of-speech tagging. Data quality topics for articles about EMR data included misspelled words, security (i.e., de-identification), word variability, sources of noise, quality of annotations, and ambiguity of abbreviations. Conclusions Multiple NLP techniques have been proposed to preprocess UTD, with some differences in techniques applied to EMR data. There are similarities in the data quality dimensions used to characterize structured data and UTD. While there are a few general-purpose measures of data quality that do not require external data, most of these focus on the measurement of noise.
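As an illustration of the common preprocessing steps named in the abstract (removal of stop words, punctuation, and numbers, followed by word tokenization), a minimal Python sketch is given here. It is not taken from the reviewed studies: the function name `preprocess` and the tiny `STOP_WORDS` list are assumptions made for the example, and part-of-speech tagging is left out because it would need an external NLP library.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (illustration only);
# real pipelines typically use a much larger, domain-reviewed list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "on", "is", "was"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, drop digits and punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)   # remove punctuation
    tokens = text.split()                    # simple whitespace word tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


if __name__ == "__main__":
    note = "Pt. seen on 2021-01-05 with a cough, and the fever of 39.5!"
    print(preprocess(note))
```

Removing numbers and punctuation is a trade-off in clinical text: dates, doses, and measurements are discarded along with noise, so whether each step is appropriate depends on the analysis.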
first_indexed	2024-03-09T09:31:28Z
format	Article
id	doaj.art-f1db3b487f284b1f8afab7d1b985fb9e
institution	Directory Open Access Journal
issn	2399-4908
language	English
last_indexed	2024-03-09T09:31:28Z
publishDate	2022-10-01
publisher	Swansea University
record_format	Article
series	International Journal of Population Data Science
spelling	doaj.art-f1db3b487f284b1f8afab7d1b985fb9e | 2023-12-02T03:57:42Z | eng | Swansea University | International Journal of Population Data Science | 2399-4908 | 2022-10-01 | 7 | 1 | 10.23889/ijpds.v7i1.1757 | A scoping review of preprocessing methods for unstructured text data to assess data quality
	Marcello Nesca (0) | Alan Katz (1) | Carson Leung (2) | Lisa Lix (3)
	(0) Department of Community Health Sciences, University of Manitoba, Winnipeg, MB, Canada; Manitoba Centre for Health Policy, University of Manitoba, Winnipeg, MB, Canada
	(1) Department of Community Health Sciences, University of Manitoba, Winnipeg, MB, Canada; Manitoba Centre for Health Policy, University of Manitoba, Winnipeg, MB, Canada; Department of Family Medicine, University of Manitoba, Winnipeg, MB, Canada
	(2) Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
	(3) Department of Community Health Sciences, University of Manitoba, Winnipeg, MB, Canada; Manitoba Centre for Health Policy, University of Manitoba, Winnipeg, MB, Canada; George & Fay Yee Centre for Healthcare Innovation, University of Manitoba, Winnipeg, MB, Canada
	https://ijpds.org/article/view/1757 | Review | Data Quality | Natural Language Processing
title	A scoping review of preprocessing methods for unstructured text data to assess data quality
topic	Review Data Quality Natural Language Processing
url	https://ijpds.org/article/view/1757

Similar Items