Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning

Alternative splicing (AS) events modulate certain pathways and phenotypic plasticity in cancer. Although previous studies have computationally analyzed splicing events, it remains challenging to uncover the biological functions induced by reliable AS events among the vast number of candidates. To provide essent...

Bibliographic Details
Main Authors: Kyubin Lee, Daejin Hyung, Soo Young Cho, Namhee Yu, Sewha Hong, Jihyun Kim, Sunshin Kim, Ji-Youn Han, Charny Park
Format: Article
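Language: English
Published: Elsevier 2023-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Text-mining, Machine-learning, Alternative splicing, Tumor transcriptome, Database, Gene signature
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037023000983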
collection DOAJ
description Alternative splicing (AS) events modulate certain pathways and phenotypic plasticity in cancer. Although previous studies have computationally analyzed splicing events, it remains challenging to uncover the biological functions induced by reliable AS events among the vast number of candidates. To provide essential splicing event signatures for assessing pathway regulation, we developed a database from two data sources: (i) the reported literature and (ii) cancer transcriptome profiles. The former comprises knowledge-based splicing signatures for 202 pathways, extracted from 63,229 PubMed abstracts using natural language processing. The latter comprises machine learning-based splicing signatures identified from pan-cancer transcriptomes for 16 cancer types and 42 pathways. We established six different learning models to classify pathway activities, using splicing profiles as the learning dataset. The AS events ranked highest by model feature importance became the signature for each pathway. To validate the learning results, we evaluated them using (i) performance metrics, (ii) differential AS sets acquired from external datasets, and (iii) our knowledge-based signatures. The area under the receiver operating characteristic curve (AUROC) values of the learning models did not differ drastically. However, the random-forest model clearly performed best when its signatures were compared with the AS sets identified from external datasets and with our knowledge-based signatures. Therefore, we used the signatures obtained from the random-forest model. The database provides clinical characteristics of the AS signatures, including survival tests, molecular subtypes, and the tumor microenvironment. Regulation by splicing factors was additionally investigated. The database supports retrieval and visualization of the developed signatures.
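The validation steps named in the description can be illustrated with a minimal sketch. It assumes, hypothetically, that a model outputs one score per sample and one importance value per AS event; the event identifiers, the top-k cutoff, and the use of the Jaccard index as the overlap measure are illustrative assumptions and are not taken from the article, which does not state how the overlap was quantified.

```python
from typing import Dict, Sequence, Set


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs ranked correctly, ties counting 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_k_signature(importance: Dict[str, float], k: int) -> Set[str]:
    """Return the k AS events with the highest feature importance."""
    ranked = sorted(importance, key=lambda event: importance[event], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two event sets (assumed measure, see lead-in)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy pathway-activity labels (1 = active) and model scores.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
    print("AUROC:", round(auroc(labels, scores), 3))

    # Hypothetical feature importances for five AS events.
    importance = {"EV1": 0.30, "EV2": 0.05, "EV3": 0.25, "EV4": 0.10, "EV5": 0.20}
    signature = top_k_signature(importance, k=3)

    # Hypothetical differential AS set from an external dataset.
    external = {"EV1", "EV5", "EV6"}
    print("Signature:", sorted(signature))
    print("Jaccard overlap:", jaccard(signature, external))
```

Separating the ranking metric from the overlap measure mirrors the observation in the description: models with similar AUROC values can still differ in how well their top-ranked events agree with independent AS sets.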
first_indexed 2024-03-08T21:31:19Z
format Article
id doaj.art-7988cf259f9e41028a2b65a58e3b840b
institution Directory Open Access Journal
issn 2001-0370
language English
last_indexed 2024-03-08T21:31:19Z
publishDate 2023-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj.art-7988cf259f9e41028a2b65a58e3b840b
2023-12-21T07:31:09Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2023-01-01
Volume 21, pages 1978-1988
Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning
Kyubin Lee (0), Daejin Hyung (1), Soo Young Cho (2), Namhee Yu (3), Sewha Hong (4), Jihyun Kim (5), Sunshin Kim (6), Ji-Youn Han (7), Charny Park (8)
(0) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
(1) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea
(2) Department of Molecular & Life Science, Hanyang University, 55 Hanyangdaehak-ro, Sangnok-gu, Ansan-si, Gyeonggi-do 15588, Republic of Korea
(3) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea
(4) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea
(5) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea; Department of Precision Medicine, National Institute of Health, Korea Disease Control and Prevention Agency, Osong Health Technology Administration Complex, 187, Osongsaengmyeong 2-ro, Osong-eup, Heungdeok-gu, Cheongju-si, Chungcheongbuk-do 28159, Republic of Korea
(6) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea
(7) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea
(8) Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea; Correspondence to: 323 Ilsan-ro, Ilsandonggu, Goyang-si, Gyeonggi-do 10408, Republic of Korea.
http://www.sciencedirect.com/science/article/pii/S2001037023000983
Text-mining; Machine-learning; Alternative splicing; Tumor transcriptome; Database; Gene signature
title Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning
topic Text-mining
Machine-learning
Alternative splicing
Tumor transcriptome
Database
Gene signature
url http://www.sciencedirect.com/science/article/pii/S2001037023000983