Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning
Alternative splicing (AS) events modulate certain pathways and phenotypic plasticity in cancer. Although previous studies have analyzed splicing events computationally, it remains a challenge to uncover the biological functions induced by reliable AS events among the very large number of candidates. To provide essent...
Main Authors: | Kyubin Lee, Daejin Hyung, Soo Young Cho, Namhee Yu, Sewha Hong, Jihyun Kim, Sunshin Kim, Ji-Youn Han, Charny Park |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2023-01-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | Text-mining; Machine-learning; Alternative splicing; Tumor transcriptome; Database; Gene signature |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037023000983 |
_version_ | 1797384149125824512 |
---|---|
author | Kyubin Lee Daejin Hyung Soo Young Cho Namhee Yu Sewha Hong Jihyun Kim Sunshin Kim Ji-Youn Han Charny Park |
author_sort | Kyubin Lee |
collection | DOAJ |
description | Alternative splicing (AS) events modulate certain pathways and phenotypic plasticity in cancer. Although previous studies have analyzed splicing events computationally, it remains a challenge to uncover the biological functions induced by reliable AS events among the very large number of candidates. To provide essential splicing event signatures for assessing pathway regulation, we developed a database from two data sources: (i) the published literature and (ii) cancer transcriptome profiles. The former comprises knowledge-based splicing signatures for 202 pathways, collected from 63,229 PubMed abstracts using natural language processing. The latter comprises machine learning-based splicing signatures identified from pan-cancer transcriptomes for 16 cancer types and 42 pathways. We established six different learning models to classify pathway activities, using splicing profiles as the learning dataset. For each pathway, the AS events ranked highest by model feature importance became the signature. To validate the learning results, we evaluated them by (i) performance metrics, (ii) differential AS sets acquired from external datasets, and (iii) our knowledge-based signatures. The area under the receiver operating characteristic curve (AUROC) values did not differ drastically among the learning models. However, the random-forest model clearly performed best when compared against the AS sets identified from external datasets and our knowledge-based signatures. Therefore, we used the signatures obtained from the random-forest model. The database provides clinical characteristics of the AS signatures, including survival tests, molecular subtypes, and the tumor microenvironment. Regulation by splicing factors was also investigated. The database supports retrieval and visualization of the developed signatures. |
first_indexed | 2024-03-08T21:31:19Z |
format | Article |
id | doaj.art-7988cf259f9e41028a2b65a58e3b840b |
institution | Directory Open Access Journal |
issn | 2001-0370 |
language | English |
last_indexed | 2024-03-08T21:31:19Z |
publishDate | 2023-01-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj.art-7988cf259f9e41028a2b65a58e3b840b; 2023-12-21T07:31:09Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2023-01-01; vol. 21, pp. 1978-1988. Title: Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning. Authors: Kyubin Lee [0], Daejin Hyung [1], Soo Young Cho [2], Namhee Yu [3], Sewha Hong [4], Jihyun Kim [5], Sunshin Kim [6], Ji-Youn Han [7], Charny Park [8]. Affiliations: [0] Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA. [1, 3, 4, 6, 7] Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea. [2] Department of Molecular & Life Science, Hanyang University, 55 Hanyangdaehak-ro, Sangnok-gu, Ansan-si, Gyeonggi-do 15588, Republic of Korea. [5] Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea; Department of Precision Medicine, National Institute of Health, Korea Disease Control and Prevention Agency, Osong Health Technology Administration Complex, 187, Osongsaengmyeong 2-ro, Osong-eup, Heungdeok-gu, Cheongju-si, Chungcheongbuk-do 28159, Republic of Korea. [8] Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea; Correspondence to: 323 Ilsan-ro, Ilsandonggu, Goyang-si, Gyeonggi-do 10408, Republic of Korea. URL: http://www.sciencedirect.com/science/article/pii/S2001037023000983. Keywords: Text-mining; Machine-learning; Alternative splicing; Tumor transcriptome; Database; Gene signature |
title | Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning |
topic | Text-mining Machine-learning Alternative splicing Tumor transcriptome Database Gene signature |
url | http://www.sciencedirect.com/science/article/pii/S2001037023000983 |
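
The evaluation steps named in the description (AUROC as a performance metric, ranking AS events by feature importance, and comparing the resulting signature with a knowledge-based set) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the event IDs, labels, scores, importance values, the cutoff `k`, and the use of Jaccard overlap as the comparison measure are all hypothetical assumptions chosen only for illustration.

```python
from typing import Dict, List, Sequence, Set


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative one; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def top_k_signature(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k event IDs with the highest importance, ties broken by ID."""
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [event for event, _ in ranked[:k]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two signature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: pathway-activity labels (1 = active) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.5, 0.2]

# Hypothetical per-event feature importances from some trained classifier.
importance = {"EVT_A": 0.31, "EVT_B": 0.05, "EVT_C": 0.22, "EVT_D": 0.12}

signature = top_k_signature(importance, k=2)
knowledge_based = {"EVT_A", "EVT_D"}  # hypothetical literature-derived set

print("AUROC:", auroc(labels, scores))
print("Signature:", signature)
print("Overlap:", jaccard(set(signature), knowledge_based))
```

In practice the importance values would come from a fitted model (the description reports that the random-forest signatures were retained), and the overlap measure could be replaced by whatever statistic suits the comparison.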