Autophagy dark genes: Can we find them with machine learning?

Abstract Identifying novel autophagy (ATG) associated genes in humans remains an important task for understanding this fundamental physiological process. Machine learning (ML) can highlight potentially “missing pieces” linking core ATG genes with understudied, “dark” genes by mining functional genom...

Bibliographic Details
Main Authors: Mohsen Ranjbar, Jeremy J. Yang, Praveen Kumar, Daniel R. Byrd, Elaine L. Bearer, Tudor I. Oprea
Format: Article
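Language: English
Published: Wiley-VCH 2023-07-01
Series: Natural Sciences
Subjects: autophagy genes; extreme gradient boosting (XGBoost); knowledge‐graph; machine learning; mining functional genomic data
Online Access: https://doi.org/10.1002/ntls.20220067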
author Mohsen Ranjbar
Jeremy J. Yang
Praveen Kumar
Daniel R. Byrd
Elaine L. Bearer
Tudor I. Oprea
author_sort Mohsen Ranjbar
collection DOAJ
description Abstract Identifying novel autophagy (ATG) associated genes in humans remains an important task for understanding this fundamental physiological process. Machine learning (ML) can highlight potentially “missing pieces” linking core ATG genes with understudied, “dark” genes by mining functional genomic data. Here, a set of 103 (out of 288 genes from the Autophagy Database) was used as training set, based on ATG‐associated terms annotated from 3 secondary sources: GO (gene ontology), Kyoto Encyclopedia of Genes and Genomes pathway, and UniProt keywords, as additional confirmation of their importance in ATG. As negative labels, an OMIM list of genes associated with monogenic diseases was used (after excluding the 288 ATG‐associated genes). Data related to these genes from 17 different sources were compiled and used to derive a trained MetaPath/XGBoost (MPxgb) ML model for distinguishing ATG and non‐ATG genes (10‐fold cross‐validated, 100‐times randomized models, median area under the curve = 0.994 ± 0.008). Sixteen ATG‐relevant variables explained 64% of the total model gain. Overall, 23% of the top 251 predicted genes are annotated in the Autophagy Database, whereas 193 genes (77%) are not. In 2019, we suggested that some of these 193 genes may represent “ATG dark genes.” A literature search in 2022 for those top 20 predicted ATG dark genes found that 9 were subsequently reported as ATG genes during the intervening 3.5 years. A post‐factum evaluation of data leakage (the presence of ATG‐associated terms in the top 40 ML features) confirms that 7 out of these 9 genes and 2 out of 3 other recently validated predictions from the bottom 20 are novel. Those genes with the largest number of ATG features would be most likely to yield valuable experimental insights. Modern high‐throughput testing would be capable of spanning the full 193 ATG genes list reported here. 
Our analysis demonstrates that ML can guide genomics research to gain a more complete functional and pathway annotation of complex processes.
Key points:
– A knowledge‐graph based machine learning model was designed for predicting unknown autophagy genes via mining functional genomic data.
– Literature search validated predicted genes.
– Our machine learning models could be generalized and applied to other genomic libraries to uncover dark genes for various functions.
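The evaluation protocol named in the abstract (10‐fold cross‐validation, repeated over 100 randomized models, summarized as a median area under the curve with a spread) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the one‐feature centroid scorer is a stand‐in rather than the MetaPath/XGBoost model, and the choice to average fold AUCs within each repeat before taking the median is a guess at how such a summary could be computed, not the authors' documented procedure.

```python
import random
import statistics


def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.
    Tied scores count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Shuffle indices per class and deal them round-robin into k folds,
    so every fold keeps roughly the overall class balance."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def centroid_scorer(train_x, train_y):
    """Hypothetical stand-in model (NOT XGBoost): a sample scores higher
    the closer it is to the positive-class mean than to the negative one."""
    pos_mean = statistics.mean(x for x, y in zip(train_x, train_y) if y == 1)
    neg_mean = statistics.mean(x for x, y in zip(train_x, train_y) if y == 0)
    return lambda x: abs(x - neg_mean) - abs(x - pos_mean)


def repeated_cv_auc(xs, ys, k=10, repeats=100, seed=0):
    """Run k-fold cross-validation `repeats` times with fresh random splits.
    Returns (median, standard deviation) of the per-repeat mean fold AUC."""
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(repeats):
        fold_aucs = []
        for fold in stratified_folds(ys, k, rng):
            held_out = set(fold)
            train_x = [x for i, x in enumerate(xs) if i not in held_out]
            train_y = [y for i, y in enumerate(ys) if i not in held_out]
            score = centroid_scorer(train_x, train_y)
            fold_aucs.append(auc([ys[i] for i in fold],
                                 [score(xs[i]) for i in fold]))
        per_repeat.append(statistics.mean(fold_aucs))
    return statistics.median(per_repeat), statistics.stdev(per_repeat)


def make_toy_data(n_pos=40, n_neg=60, seed=1):
    """Synthetic one-feature data: positives centred at 2, negatives at 0."""
    rng = random.Random(seed)
    xs = [rng.gauss(2.0, 1.0) for _ in range(n_pos)]
    xs += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    ys = [1] * n_pos + [0] * n_neg
    return xs, ys


xs, ys = make_toy_data()
median_auc, spread = repeated_cv_auc(xs, ys, k=10, repeats=100)
print(f"median AUC = {median_auc:.3f} ± {spread:.3f}")
```

Reporting a median over many randomized splits, rather than a single cross‐validation run, makes the estimate less dependent on one lucky or unlucky partition of a small positive set.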
first_indexed 2024-03-13T01:12:17Z
format Article
id doaj.art-068aaccd8812402ea4614d52be570ccc
institution Directory Open Access Journal
issn 2698-6248
language English
last_indexed 2024-03-13T01:12:17Z
publishDate 2023-07-01
publisher Wiley-VCH
record_format Article
series Natural Sciences
spelling doaj.art-068aaccd8812402ea4614d52be570ccc
2023-07-05T16:05:48Z
eng
Wiley-VCH
Natural Sciences
2698-6248
2023-07-01
3
3
n/a
n/a
10.1002/ntls.20220067
Autophagy dark genes: Can we find them with machine learning?
Mohsen Ranjbar (Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, New Mexico, USA)
Jeremy J. Yang (Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA)
Praveen Kumar (Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA)
Daniel R. Byrd (Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA)
Elaine L. Bearer (Alzheimer's Disease Research Center, University of New Mexico, Albuquerque, New Mexico, USA)
Tudor I. Oprea (Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA)
https://doi.org/10.1002/ntls.20220067
autophagy genes
extreme gradient boosting (XGBoost)
knowledge‐graph
machine learning
mining functional genomic data
title Autophagy dark genes: Can we find them with machine learning?
topic autophagy genes
extreme gradient boosting (XGBoost)
knowledge‐graph
machine learning
mining functional genomic data
url https://doi.org/10.1002/ntls.20220067