A compressed large language model embedding dataset of ICD 10 CM descriptions

Abstract This paper presents novel datasets providing numerical representations of ICD-10-CM codes by generating description embeddings using a large language model followed by a dimension reduction via autoencoder. The embeddings serve as informative input features for machine learning models by ca...


Bibliographic Details
Main Authors:	Michael J. Kane, Casey King, Denise Esserman, Nancy K. Latham, Erich J. Greene, David A. Ganz
Format:	Article
Language:	English
Published:	BMC 2023-12-01
Series:	BMC Bioinformatics
Subjects:	Large language model Autoencoder ICD-10-CM Electronic health records EHR NLP
Online Access:	https://doi.org/10.1186/s12859-023-05597-2

author	Michael J. Kane Casey King Denise Esserman Nancy K. Latham Erich J. Greene David A. Ganz
author_sort	Michael J. Kane
collection	DOAJ
description	Abstract This paper presents novel datasets providing numerical representations of ICD-10-CM codes by generating description embeddings using a large language model followed by a dimension reduction via autoencoder. The embeddings serve as informative input features for machine learning models by capturing relationships among categories and preserving inherent context information. The model generating the data was validated in two ways. First, the dimension reduction was validated using an autoencoder, and second, a supervised model was created to estimate the ICD-10-CM hierarchical categories. Results show that the data can be reduced to as few as 10 dimensions while maintaining the ability to reproduce the original embeddings, with fidelity decreasing as the dimension of the reduced representation decreases. Multiple compression levels are provided, so users can choose the level that fits their requirements and then download and use the data without any other setup. The readily available datasets of ICD-10-CM codes are anticipated to be highly valuable for researchers in biomedical informatics, enabling more advanced analyses in the field. This approach has the potential to significantly improve the utility of ICD-10-CM codes in the biomedical domain.
first_indexed	2024-03-08T19:43:17Z
format	Article
id	doaj.art-3ee01e62b0304e67a4c9f372574dfb9f
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-03-08T19:43:17Z
publishDate	2023-12-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-3ee01e62b0304e67a4c9f372574dfb9f; 2023-12-24T12:30:57Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2023-12-01; 24; 1; 1; 13; 10.1186/s12859-023-05597-2; A compressed large language model embedding dataset of ICD 10 CM descriptions; Michael J. Kane (Department of Biostatistics, School of Public Health, Yale University); Casey King (The Jackson School of Global Affairs, Yale University); Denise Esserman (Department of Biostatistics, School of Public Health, Yale University); Nancy K. Latham (Research Program in Men’s Health: Aging and Metabolism, Boston Claude D. Pepper Older Americans Independence Center for Function Promoting Therapies, Brigham and Women’s Hospital); Erich J. Greene (Department of Biostatistics, School of Public Health, Yale University); David A. Ganz (Department of Medicine, VA Greater Los Angeles/UCLA); https://doi.org/10.1186/s12859-023-05597-2; Large language model; Autoencoder; ICD-10-CM; Electronic health records; EHR; NLP
title	A compressed large language model embedding dataset of ICD 10 CM descriptions
topic	Large language model Autoencoder ICD-10-CM Electronic health records EHR NLP
url	https://doi.org/10.1186/s12859-023-05597-2
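The abstract describes compressing description embeddings to a small number of dimensions and checking how well the original embeddings can be reproduced at each compression level. The sketch below illustrates that idea only; it is not the authors' method. It swaps in PCA computed by SVD, which is the optimal linear encoder and decoder under squared error, in place of the autoencoder used in the paper. The embeddings are synthetic random vectors, and the sizes (500 rows, 64 dimensions), the code dimensions, and all function names are assumptions made for illustration.

```python
import numpy as np


def fit_linear_codec(X, code_dim):
    """Fit the optimal linear encoder/decoder pair (PCA) for one code size."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:code_dim]  # shape: (code_dim, n_features)
    return mean, components


def encode(X, mean, components):
    """Project embeddings onto the retained directions."""
    return (X - mean) @ components.T  # shape: (n_samples, code_dim)


def decode(Z, mean, components):
    """Map compressed codes back to the original embedding space."""
    return Z @ components + mean


def reconstruction_mse(X, code_dim):
    """Mean squared error between X and its compress-then-restore copy."""
    mean, components = fit_linear_codec(X, code_dim)
    X_hat = decode(encode(X, mean, components), mean, components)
    return float(np.mean((X - X_hat) ** 2))


# Synthetic stand-in for description embeddings (NOT real ICD-10-CM data):
# 500 rows, 64 dimensions, with most variance in 20 hidden directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 64))
embeddings = latent @ mixing + 0.05 * rng.normal(size=(500, 64))

# Compare several hypothetical compression levels.
code_dims = [50, 20, 10]
errors = {k: reconstruction_mse(embeddings, k) for k in code_dims}
for k in code_dims:
    print(f"code_dim={k:3d}  reconstruction MSE={errors[k]:.4f}")
```

Reconstruction error grows as the code dimension shrinks, which mirrors the trade-off the abstract reports between compression level and fidelity. A nonlinear autoencoder can capture structure this linear stand-in cannot, so the numbers here say nothing about the published datasets.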


Similar Items