Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery

Artificial neural networks (ANNs) have been utilized for classification and prediction tasks with remarkable accuracy. However, their implications for unsupervised data mining using molecular data are under-explored. We found that embedding can extract biologically relevant information from The Cancer G...


Bibliographic Details
Main Authors: Chi Tung Choy, Chi Hang Wong, Stephen Lam Chan
Format: Article
Language: English
Published: Frontiers Media S.A. 2019-01-01
Series: Frontiers in Genetics
Subjects: gene embedding; TCGA data mining; biomarker discovery; machine learning; immunotherapy
Online Access: https://www.frontiersin.org/article/10.3389/fgene.2018.00682/full
_version_ 1818032455950532608
author Chi Tung Choy
Chi Hang Wong
Stephen Lam Chan
author_facet Chi Tung Choy
Chi Hang Wong
Stephen Lam Chan
author_sort Chi Tung Choy
collection DOAJ
description Artificial neural networks (ANNs) have been utilized for classification and prediction tasks with remarkable accuracy. However, their implications for unsupervised data mining using molecular data are under-explored. We found that embedding can extract biologically relevant information from The Cancer Genome Atlas (TCGA) gene expression dataset by learning a vector representation through gene co-occurrence. Ground-truth relationships, such as the cancer types of the input samples and the semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list, as in the case of immunotherapy response. Seventy-three related genes were singled out, and the relatedness of 55 genes to immune checkpoint proteins (PD-1, PD-L1, and CTLA-4) is supported by the literature. Sixteen novel genes (ACAP1, C11orf45, CD79B, CFP, CLIC2, CMPK2, CXCR2P1, CYTIP, FER, MCTO1, MMP25, RASGEF1B, SLFN12, TBC1D10C, TRAF3IP3, TTC39B) related to immune checkpoint proteins were identified. Thus, this method is feasible for mining large volumes of biological data, and embedding would be a valuable tool for discovering novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference for gene relatedness in the context of cancer; they are readily applicable to biomarker discovery for any molecular targeted therapy.
first_indexed 2024-12-10T06:07:39Z
format Article
id doaj.art-7b74e3f826544ad5b4fb5ba4e6e46e52
institution Directory Open Access Journal
issn 1664-8021
language English
last_indexed 2024-12-10T06:07:39Z
publishDate 2019-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Genetics
spelling doaj.art-7b74e3f826544ad5b4fb5ba4e6e46e52 2022-12-22T01:59:40Z eng Frontiers Media S.A. Frontiers in Genetics 1664-8021 2019-01-01 9 10.3389/fgene.2018.00682 421857
Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
Chi Tung Choy (0), Chi Hang Wong (1), Stephen Lam Chan (2, 3)
(0) State Key Laboratory of Translational Oncology, Department of Clinical Oncology, Faculty of Medicine, The Chinese University of Hong Kong, Sha Tin, Hong Kong
(1) State Key Laboratory of Translational Oncology, Department of Clinical Oncology, Faculty of Medicine, The Chinese University of Hong Kong, Sha Tin, Hong Kong
(2) State Key Laboratory of Translational Oncology, Department of Clinical Oncology, Faculty of Medicine, The Chinese University of Hong Kong, Sha Tin, Hong Kong
(3) State Key Laboratory of Digestive Disease, Institute of Digestive Disease, The Chinese University of Hong Kong, Sha Tin, Hong Kong
https://www.frontiersin.org/article/10.3389/fgene.2018.00682/full
gene embedding; TCGA data mining; biomarker discovery; machine learning; immunotherapy
spellingShingle Chi Tung Choy
Chi Hang Wong
Stephen Lam Chan
Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
Frontiers in Genetics
gene embedding
TCGA data mining
biomarker discovery
machine learning
immunotherapy
title Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
title_full Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
title_fullStr Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
title_full_unstemmed Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
title_short Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery
title_sort embedding of genes using cancer gene expression data biological relevance and potential application on biomarker discovery
topic gene embedding
TCGA data mining
biomarker discovery
machine learning
immunotherapy
url https://www.frontiersin.org/article/10.3389/fgene.2018.00682/full
work_keys_str_mv AT chitungchoy embeddingofgenesusingcancergeneexpressiondatabiologicalrelevanceandpotentialapplicationonbiomarkerdiscovery
AT chihangwong embeddingofgenesusingcancergeneexpressiondatabiologicalrelevanceandpotentialapplicationonbiomarkerdiscovery
AT stephenlamchan embeddingofgenesusingcancergeneexpressiondatabiologicalrelevanceandpotentialapplicationonbiomarkerdiscovery