Large language models assisted multi-effect variants mining on cerebral cavernous malformation familial whole genome sequencing

Cerebral cavernous malformation (CCM) is a polygenic disease with intricate genetic interactions contributing to quantitative pathogenesis across multiple factors. The principal pathogenic genes of CCM, specifically KRIT1, CCM2, and PDCD10, have been reported, accompanied by a growing wealth of gene...

Bibliographic Details
Main Authors: Yiqi Wang, Jinmei Zuo, Chao Duan, Hao Peng, Jia Huang, Liang Zhao, Li Zhang, Zhiqiang Dong
Format: Article
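Language: English
Published: Elsevier 2024-12-01
Series: Computational and Structural Biotechnology Journal
Subjects: Whole genome sequencing; Cerebral cavernous malformation; Deep learning; Large language model; Natural language processing
Online Access: http://www.sciencedirect.com/science/article/pii/S200103702400014X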
author Yiqi Wang
Jinmei Zuo
Chao Duan
Hao Peng
Jia Huang
Liang Zhao
Li Zhang
Zhiqiang Dong
author_sort Yiqi Wang
collection DOAJ
description Cerebral cavernous malformation (CCM) is a polygenic disease with intricate genetic interactions contributing to quantitative pathogenesis across multiple factors. The principal pathogenic genes of CCM, specifically KRIT1, CCM2, and PDCD10, have been reported, accompanied by a growing wealth of genetic data related to mutations. Furthermore, numerous other molecules associated with CCM have been identified. However, handling such massive volumes of unstructured data remained challenging until the advent of advanced large language models. In this study, we developed BRLM, an automated analytical pipeline specialized in the analysis of biomedical text related to single nucleotide variants (SNVs). BioBERT was employed to vectorize the rich information on SNVs, while a deep residual network was used to discriminate between SNV classes. BRLM was initially constructed on mutations from 12 different types of TCGA cancers, achieving an accuracy exceeding 99%. It was then applied to CCM mutations in familial sequencing data, highlighting fibroblast growth factor 1 (FGF1) as an upstream master regulator gene. Multi-omics characterization and functional validation showed that FGF1 plays a significant role in the development of CCMs, which demonstrates the effectiveness of our model. The BRLM web server is available at http://1.117.230.196.
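The two-stage design named in the description (a text encoder that turns variant-related text into a vector, followed by a residual network that assigns a class) can be illustrated with a minimal sketch. This is not the BRLM implementation: a hashed bag-of-words function stands in for BioBERT, the weights are random and untrained, and `EMBED_DIM`, `NUM_CLASSES`, and the example string are hypothetical values chosen only for illustration.

```python
import hashlib
import math
import random

EMBED_DIM = 16   # hypothetical size; real BioBERT embeddings are far larger
NUM_CLASSES = 3  # hypothetical number of SNV classes


def embed(text, dim=EMBED_DIM):
    """Stand-in for BioBERT: hash each token into a fixed-size, L2-normalised vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def residual_block(x, w1, w2):
    """y = x + W2 * relu(W1 * x); the skip connection adds the input back."""
    hidden = relu(matvec(w1, x))
    return [xi + hi for xi, hi in zip(x, matvec(w2, hidden))]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def classify(text, blocks, head):
    """Embed the text, pass it through the residual blocks, return class probabilities."""
    x = embed(text)
    for w1, w2 in blocks:
        x = residual_block(x, w1, w2)
    return softmax(matvec(head, x))


rng = random.Random(0)
blocks = [
    (random_matrix(EMBED_DIM, EMBED_DIM, rng), random_matrix(EMBED_DIM, EMBED_DIM, rng))
    for _ in range(2)
]
head = random_matrix(NUM_CLASSES, EMBED_DIM, rng)

# Made-up annotation text, used only to exercise the code path.
probs = classify("FGF1 upstream single nucleotide variant", blocks, head)
print(probs)
```

Because the weights are untrained, the printed probabilities carry no biological meaning; the sketch only shows how an embedding flows through skip-connected layers into a class distribution.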
first_indexed 2024-03-08T05:54:11Z
format Article
id doaj.art-02a11133a81840af86234ca6ff76dc57
institution Directory Open Access Journal
issn 2001-0370
language English
last_indexed 2024-03-08T05:54:11Z
publishDate 2024-12-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj.art-02a11133a81840af86234ca6ff76dc57
2024-02-05T04:31:42Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2024-12-01
23
843
858
Large language models assisted multi-effect variants mining on cerebral cavernous malformation familial whole genome sequencing
Yiqi Wang: College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China; Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China; Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
Jinmei Zuo: Physical Examination Center, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
Chao Duan: College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China; Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
Hao Peng: Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China; Department of Neurosurgery, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
Jia Huang: The Second Clinical Medical College, Lanzhou University, No. 222, South Tianshui Road, Lanzhou 730030, Gansu, China
Liang Zhao: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China; Corresponding author.
Li Zhang: Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China; Department of Neurosurgery, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China; Corresponding author at: Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China.
Zhiqiang Dong: College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China; Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China; Corresponding author at: College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China.
Cerebral cavernous malformation (CCM) is a polygenic disease with intricate genetic interactions contributing to quantitative pathogenesis across multiple factors. The principal pathogenic genes of CCM, specifically KRIT1, CCM2, and PDCD10, have been reported, accompanied by a growing wealth of genetic data related to mutations. Furthermore, numerous other molecules associated with CCM have been identified. However, handling such massive volumes of unstructured data remained challenging until the advent of advanced large language models. In this study, we developed BRLM, an automated analytical pipeline specialized in the analysis of biomedical text related to single nucleotide variants (SNVs). BioBERT was employed to vectorize the rich information on SNVs, while a deep residual network was used to discriminate between SNV classes. BRLM was initially constructed on mutations from 12 different types of TCGA cancers, achieving an accuracy exceeding 99%. It was then applied to CCM mutations in familial sequencing data, highlighting fibroblast growth factor 1 (FGF1) as an upstream master regulator gene. Multi-omics characterization and functional validation showed that FGF1 plays a significant role in the development of CCMs, which demonstrates the effectiveness of our model. The BRLM web server is available at http://1.117.230.196.
http://www.sciencedirect.com/science/article/pii/S200103702400014X
Whole genome sequencing
Cerebral cavernous malformation
Deep learning
Large language model
Natural language processing
title Large language models assisted multi-effect variants mining on cerebral cavernous malformation familial whole genome sequencing
topic Whole genome sequencing
Cerebral cavernous malformation
Deep learning
Large language model
Natural language processing
url http://www.sciencedirect.com/science/article/pii/S200103702400014X