Autophagy dark genes: Can we find them with machine learning?
Abstract Identifying novel autophagy (ATG) associated genes in humans remains an important task for understanding this fundamental physiological process. Machine learning (ML) can highlight potentially “missing pieces” linking core ATG genes with understudied, “dark” genes by mining functional genom...
Main Authors: | Mohsen Ranjbar, Jeremy J. Yang, Praveen Kumar, Daniel R. Byrd, Elaine L. Bearer, Tudor I. Oprea |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley-VCH, 2023-07-01 |
Series: | Natural Sciences |
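Subjects: | autophagy genes; extreme gradient boosting (XGBoost); knowledge‐graph; machine learning; mining functional genomic data |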
Online Access: | https://doi.org/10.1002/ntls.20220067 |
_version_ | 1797786748819865600 |
---|---|
author | Mohsen Ranjbar, Jeremy J. Yang, Praveen Kumar, Daniel R. Byrd, Elaine L. Bearer, Tudor I. Oprea |
author_sort | Mohsen Ranjbar |
collection | DOAJ |
description | Abstract: Identifying novel autophagy (ATG) associated genes in humans remains an important task for understanding this fundamental physiological process. Machine learning (ML) can highlight potentially “missing pieces” linking core ATG genes with understudied, “dark” genes by mining functional genomic data. Here, a set of 103 genes (out of 288 genes from the Autophagy Database) was used as the training set, based on ATG‐associated terms annotated from 3 secondary sources: GO (gene ontology), Kyoto Encyclopedia of Genes and Genomes pathway, and UniProt keywords, as additional confirmation of their importance in ATG. As negative labels, an OMIM list of genes associated with monogenic diseases was used (after excluding the 288 ATG‐associated genes). Data related to these genes from 17 different sources were compiled and used to derive a trained MetaPath/XGBoost (MPxgb) ML model for distinguishing ATG and non‐ATG genes (10‐fold cross‐validated, 100‐times randomized models, median area under the curve = 0.994 ± 0.008). Sixteen ATG‐relevant variables explained 64% of the total model gain. Overall, 23% of the top 251 predicted genes are annotated in the Autophagy Database, whereas 193 genes (77%) are not. In 2019, we suggested that some of these 193 genes may represent “ATG dark genes.” A literature search in 2022 for the top 20 predicted ATG dark genes found that 9 were subsequently reported as ATG genes during the intervening 3.5 years. A post‐factum evaluation of data leakage (the presence of ATG‐associated terms in the top 40 ML features) confirms that 7 out of these 9 genes and 2 out of 3 other recently validated predictions from the bottom 20 are novel. Those genes with the largest number of ATG features would be most likely to yield valuable experimental insights. Modern high‐throughput testing would be capable of spanning the full list of 193 predicted ATG genes reported here. Our analysis demonstrates that ML can guide genomics research to gain a more complete functional and pathway annotation of complex processes. Key points: (1) A knowledge‐graph based machine learning model was designed for predicting unknown autophagy genes via mining functional genomic data. (2) Literature search validated predicted genes. (3) Our machine learning models could be generalized and applied to other genomic libraries to uncover dark genes for various functions. |
first_indexed | 2024-03-13T01:12:17Z |
format | Article |
id | doaj.art-068aaccd8812402ea4614d52be570ccc |
institution | Directory Open Access Journal |
issn | 2698-6248 |
language | English |
last_indexed | 2024-03-13T01:12:17Z |
publishDate | 2023-07-01 |
publisher | Wiley-VCH |
record_format | Article |
series | Natural Sciences |
spelling | doaj.art-068aaccd8812402ea4614d52be570ccc; 2023-07-05T16:05:48Z; eng; Wiley-VCH; Natural Sciences; 2698-6248; 2023-07-01; 3; 3; n/a; n/a; 10.1002/ntls.20220067; Autophagy dark genes: Can we find them with machine learning?; Mohsen Ranjbar (0), Jeremy J. Yang (1), Praveen Kumar (2), Daniel R. Byrd (3), Elaine L. Bearer (4), Tudor I. Oprea (5); (0) Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, New Mexico, USA; (1) Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA; (2) Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA; (3) Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA; (4) Alzheimer's Disease Research Center, University of New Mexico, Albuquerque, New Mexico, USA; (5) Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA; https://doi.org/10.1002/ntls.20220067; autophagy genes, extreme gradient boosting (XGBoost), knowledge‐graph, machine learning, mining functional genomic data |
title | Autophagy dark genes: Can we find them with machine learning? |
topic | autophagy genes; extreme gradient boosting (XGBoost); knowledge‐graph; machine learning; mining functional genomic data |
url | https://doi.org/10.1002/ntls.20220067 |
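
The description field reports a 10‐fold cross‐validated model, randomized 100 times, summarized as a median area under the curve with a spread. The record gives no implementation details, so the following is only a minimal sketch of that evaluation protocol under several assumptions: the data are synthetic, a toy centroid scorer stands in for the MetaPath/XGBoost learner, out-of-fold scores are pooled per repeat before computing the AUC, and the reported spread is taken to be a standard deviation. All function names here are hypothetical.

```python
import random
import statistics


def auc(labels, scores):
    """ROC AUC via the pairwise (Mann-Whitney) identity; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Split indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def fit_centroid_scorer(rows, labels):
    """Toy stand-in for the real learner (NOT XGBoost): x . (mean_pos - mean_neg)."""
    dim = len(rows[0])

    def mean(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(r[d] for r in members) / len(members) for d in range(dim)]

    direction = [p - n for p, n in zip(mean(1), mean(0))]
    return lambda x: sum(w * v for w, v in zip(direction, x))


def repeated_cv_auc(rows, labels, k=10, repeats=100, seed=0):
    """Median and standard deviation of pooled out-of-fold AUC over reshuffled runs."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        out_of_fold = [0.0] * len(rows)
        for held_out in stratified_folds(labels, k, rng):
            held = set(held_out)
            train = [i for i in range(len(rows)) if i not in held]
            scorer = fit_centroid_scorer([rows[i] for i in train],
                                         [labels[i] for i in train])
            for i in held_out:
                out_of_fold[i] = scorer(rows[i])
        aucs.append(auc(labels, out_of_fold))
    return statistics.median(aucs), statistics.stdev(aucs)


def make_toy_data(n_pos=30, n_neg=90, seed=1):
    """Synthetic two-feature data; positives are shifted on the first feature."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, n, shift in ((1, n_pos, 1.5), (0, n_neg, 0.0)):
        for _ in range(n):
            rows.append([rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)])
            labels.append(label)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_toy_data()
    median_auc, spread = repeated_cv_auc(rows, labels)
    print(f"median AUC = {median_auc:.3f} +/- {spread:.3f}")
```

Reporting the median over many reshuffled runs, rather than a single cross-validation pass, reduces the dependence of the headline number on one particular fold assignment. The toy numbers printed here say nothing about the published result.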