Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models

Suicidal ideation detection is a vital research area that holds great potential for improving mental health support systems. However, the sensitivity surrounding suicide-related data poses challenges in accessing large-scale, annotated datasets necessary for training effective machine learning model...

Bibliographic Details
Main Authors:	Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Artificial intelligence; deep learning; large language models; suicide detection; synthetic data generation; transformer based models
Online Access:	https://ieeexplore.ieee.org/document/10413447/

_version_	1797335521608859648
author	Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman
author_facet	Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman
author_sort	Hamideh Ghanadian
collection	DOAJ
description	Suicidal ideation detection is a vital research area that holds great potential for improving mental health support systems. However, the sensitivity surrounding suicide-related data poses challenges in accessing large-scale, annotated datasets necessary for training effective machine learning models. To address this limitation, we introduce an innovative strategy that leverages the capabilities of generative AI models, such as ChatGPT, Flan-T5, and Llama, to create synthetic data for suicidal ideation detection. Our data generation approach is grounded in social factors extracted from psychology literature and aims to ensure coverage of essential information related to suicidal ideation. In our study, we benchmarked against state-of-the-art NLP classification models, specifically, those centered around the BERT family structures. When trained on the real-world dataset, UMD, these conventional models tend to yield F1-scores ranging from 0.75 to 0.87. Our synthetic data-driven method, informed by social factors, offers consistent F1-scores of 0.82 for both models, suggesting that the richness of topics in synthetic data can bridge the performance gap across different model complexities. Most impressively, when we combined a mere 30% of the UMD dataset with our synthetic data, we witnessed a substantial increase in performance, achieving an F1-score of 0.88 on the UMD test set. Such results underscore the cost-effectiveness and potential of our approach in confronting major challenges in the field, such as data scarcity and the quest for diversity in data representation.
first_indexed	2024-03-08T08:39:29Z
format	Article
id	doaj.art-9b409c1d5d9d488e9cafe8e1172c76ca
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-03-08T08:39:29Z
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-9b409c1d5d9d488e9cafe8e1172c76ca | 2024-02-02T00:04:22Z | eng | IEEE | IEEE Access | 2169-3536 | 2024-01-01 | 12 | 14350 | 14363 | 10.1109/ACCESS.2024.3358206 | 10413447 | Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models | Hamideh Ghanadian (0, https://orcid.org/0000-0002-5203-3504) | Isar Nejadgholi (1) | Hussein Al Osman (2, https://orcid.org/0000-0002-7189-5644) | Department of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada | National Research Council Canada, Ottawa, ON, Canada | Department of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada | Suicidal ideation detection is a vital research area that holds great potential for improving mental health support systems. However, the sensitivity surrounding suicide-related data poses challenges in accessing large-scale, annotated datasets necessary for training effective machine learning models. To address this limitation, we introduce an innovative strategy that leverages the capabilities of generative AI models, such as ChatGPT, Flan-T5, and Llama, to create synthetic data for suicidal ideation detection. Our data generation approach is grounded in social factors extracted from psychology literature and aims to ensure coverage of essential information related to suicidal ideation. In our study, we benchmarked against state-of-the-art NLP classification models, specifically, those centered around the BERT family structures. When trained on the real-world dataset, UMD, these conventional models tend to yield F1-scores ranging from 0.75 to 0.87. Our synthetic data-driven method, informed by social factors, offers consistent F1-scores of 0.82 for both models, suggesting that the richness of topics in synthetic data can bridge the performance gap across different model complexities. Most impressively, when we combined a mere 30% of the UMD dataset with our synthetic data, we witnessed a substantial increase in performance, achieving an F1-score of 0.88 on the UMD test set. Such results underscore the cost-effectiveness and potential of our approach in confronting major challenges in the field, such as data scarcity and the quest for diversity in data representation. | https://ieeexplore.ieee.org/document/10413447/ | Artificial intelligence; deep learning; large language models; suicide detection; synthetic data generation; transformer based models
title	Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models
topic	Artificial intelligence; deep learning; large language models; suicide detection; synthetic data generation; transformer based models
url	https://ieeexplore.ieee.org/document/10413447/
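
The description above reports an experiment in which 30% of the real UMD dataset is combined with synthetic data and the resulting classifier is scored by F1. The following is a minimal Python sketch of that data-mixing step and of a binary F1 computation, not the authors' implementation. The names mix_datasets and f1_score, the (text, label) tuple layout, the fixed random seed, and the choice of binary (rather than macro or weighted) F1 are all assumptions made for illustration; the record does not specify them.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical example layout: (text, label), where label 1 marks the positive class.
Example = Tuple[str, int]


def mix_datasets(
    real: Sequence[Example],
    synthetic: Sequence[Example],
    real_fraction: float = 0.3,
    seed: int = 0,
) -> List[Example]:
    """Sample a fraction of the real examples and combine them with all synthetic ones."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_size = round(len(real) * real_fraction)
    sampled_real = rng.sample(list(real), sample_size)
    mixed = sampled_real + list(synthetic)
    rng.shuffle(mixed)  # avoid ordering the training set by data source
    return mixed


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder items only; these are not drawn from any real dataset.
    real_data = [(f"real item {i}", i % 2) for i in range(10)]
    synthetic_data = [(f"synthetic item {i}", i % 2) for i in range(5)]
    training_set = mix_datasets(real_data, synthetic_data, real_fraction=0.3)
    print(len(training_set))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In practice the mixed set would be passed to a classifier such as a BERT-family model, and the score would be computed on a held-out test split of the real data, as the description states for the UMD test set.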
