An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset

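This paper proposes a data augmentation method for imbalanced healthcare datasets. This method was inspired by a data augmentation method in natural language processing (NLP) that generates synthetic sentences for training by replacing some words with similar words. The proposed method generates synthetic patient records by replacing patient backgrounds with similar backgrounds. In this paper, the cosine similarity of the distributed representations was used as the similarity metric between patient backgrounds. The distributed representations of the patient backgrounds were generated by the skip-gram model. To confirm the performance improvement with the proposed data augmentation method, the prediction performance of adverse events (AEs) caused by drug administration was experimentally evaluated on a real-world medical dataset with 1,510,137 records. The combination of the proposed data augmentation method and a conventional undersampling method resulted in an 80.0% improvement in accuracy and a 40.0% improvement in the precision and F1-score. The multifaceted evaluation demonstrated that the proposed method is effective, especially for predicting AEs with positive ratios ranging from 1.0% to 2.1%, which are difficult to predict with conventional machine learning methods but should be predictable in the medical field.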
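
The replacement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the item names, the three-dimensional vectors, and the helper functions `most_similar` and `augment_record` are hypothetical, and the vectors are hard-coded toy values rather than embeddings learned by a skip-gram model. The sketch only shows how one item in a record might be swapped for its nearest neighbour under cosine similarity.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(item, embeddings):
    """Return the other item whose vector is closest to `item` by cosine similarity."""
    target = embeddings[item]
    candidates = [other for other in embeddings if other != item]
    return max(candidates, key=lambda other: cosine_similarity(target, embeddings[other]))


def augment_record(record, embeddings, n_replace=1, rng=None):
    """Copy `record` and replace up to `n_replace` embedded items with their nearest neighbour."""
    rng = rng or random.Random(0)
    # Only items that have a vector can be replaced; everything else is kept as-is.
    replaceable = [i for i, item in enumerate(record) if item in embeddings]
    chosen = rng.sample(replaceable, min(n_replace, len(replaceable)))
    synthetic = list(record)
    for i in chosen:
        synthetic[i] = most_similar(record[i], embeddings)
    return synthetic


# Toy vectors (assumption): in the paper these would come from a skip-gram model.
EMBEDDINGS = {
    "dx:hypertension": (0.9, 0.1, 0.0),
    "dx:heart_failure": (0.8, 0.2, 0.1),
    "dx:asthma": (0.1, 0.9, 0.0),
    "dx:copd": (0.2, 0.8, 0.1),
}

original = ["dx:asthma", "drug:x"]  # "drug:x" has no vector, so it is never replaced
synthetic = augment_record(original, EMBEDDINGS, n_replace=1)
print(synthetic)
```

In an imbalanced setting, synthetic records like this would presumably be generated for the minority (positive) class. How many items to replace and which fields count as a patient background are choices the abstract does not specify.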


Bibliographic Details
Main Authors:	Tomoki Ishikawa, Takahiro Yakoh, Hisashi Urushihara
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Access
Subjects:	Adverse event prediction; data augmentation; distributed representation; healthcare dataset; imbalanced dataset
Online Access:	https://ieeexplore.ieee.org/document/9845410/

_version_	1811309119408177152
author	Tomoki Ishikawa; Takahiro Yakoh; Hisashi Urushihara
author_facet	Tomoki Ishikawa; Takahiro Yakoh; Hisashi Urushihara
author_sort	Tomoki Ishikawa
collection	DOAJ
description	This paper proposes a data augmentation method for imbalanced healthcare datasets. This method was inspired by a data augmentation method in natural language processing (NLP) that generates synthetic sentences for training by replacing some words with similar words. The proposed method generates synthetic patient records by replacing patient backgrounds with similar backgrounds. In this paper, the cosine similarity of the distributed representations was used as the similarity metric between patient backgrounds. The distributed representations of the patient backgrounds were generated by the skip-gram model. To confirm the performance improvement with the proposed data augmentation method, the prediction performance of adverse events (AEs) caused by drug administration was experimentally evaluated on a real-world medical dataset with 1,510,137 records. The combination of the proposed data augmentation method and a conventional undersampling method resulted in an 80.0% improvement in accuracy and a 40.0% improvement in the precision and F1-score. The multifaceted evaluation demonstrated that the proposed method is effective, especially for predicting AEs with positive ratios ranging from 1.0% to 2.1%, which are difficult to predict with conventional machine learning methods but should be predictable in the medical field.
first_indexed	2024-04-13T09:36:34Z
format	Article
id	doaj.art-650992bc01ef471892d400c171af0c6f
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-04-13T09:36:34Z
publishDate	2022-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-650992bc01ef471892d400c171af0c6f; 2022-12-22T02:52:05Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2022-01-01; vol. 10, pp. 81166-81176; DOI 10.1109/ACCESS.2022.3195212; document 9845410; An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset; Tomoki Ishikawa (https://orcid.org/0000-0002-5811-3994; Graduate School of Science and Technology, Keio University, Yokohama, Japan); Takahiro Yakoh (https://orcid.org/0000-0002-7031-3286; Department of System Design Engineering, Keio University, Yokohama, Japan); Hisashi Urushihara (https://orcid.org/0000-0001-6913-9930; Division of Drug Development and Regulatory Science, Keio University, Tokyo, Japan); This paper proposes a data augmentation method for imbalanced healthcare datasets. This method was inspired by a data augmentation method in natural language processing (NLP) that generates synthetic sentences for training by replacing some words with similar words. The proposed method generates synthetic patient records by replacing patient backgrounds with similar backgrounds. In this paper, the cosine similarity of the distributed representations was used as the similarity metric between patient backgrounds. The distributed representations of the patient backgrounds were generated by the skip-gram model. To confirm the performance improvement with the proposed data augmentation method, the prediction performance of adverse events (AEs) caused by drug administration was experimentally evaluated on a real-world medical dataset with 1,510,137 records. The combination of the proposed data augmentation method and a conventional undersampling method resulted in an 80.0% improvement in accuracy and a 40.0% improvement in the precision and F1-score. The multifaceted evaluation demonstrated that the proposed method is effective, especially for predicting AEs with positive ratios ranging from 1.0% to 2.1%, which are difficult to predict with conventional machine learning methods but should be predictable in the medical field.; https://ieeexplore.ieee.org/document/9845410/; Adverse event prediction; data augmentation; distributed representation; healthcare dataset; imbalanced dataset
spellingShingle	Tomoki Ishikawa Takahiro Yakoh Hisashi Urushihara An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset IEEE Access Adverse event prediction data augmentation distributed representation healthcare dataset imbalanced dataset
title	An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset
title_full	An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset
title_fullStr	An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset
title_full_unstemmed	An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset
title_short	An NLP-Inspired Data Augmentation Method for Adverse Event Prediction Using an Imbalanced Healthcare Dataset
title_sort	nlp inspired data augmentation method for adverse event prediction using an imbalanced healthcare dataset
topic	Adverse event prediction; data augmentation; distributed representation; healthcare dataset; imbalanced dataset
url	https://ieeexplore.ieee.org/document/9845410/

Similar Items