Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci

The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignment...

Bibliographic Details
Main Authors: Jungreis, Irwin; He, Liang; Li, Yue; Kellis, Manolis
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: Cold Spring Harbor Laboratory 2020
Online Access: https://hdl.handle.net/1721.1/125496
author Jungreis, Irwin
He, Liang
Li, Yue
Kellis, Manolis
author2 Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_sort Jungreis, Irwin
collection MIT
description The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.
first_indexed 2024-09-23T15:43:27Z
format Article
id mit-1721.1/125496
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T15:43:27Z
publishDate 2020
publisher Cold Spring Harbor Laboratory
record_format dspace
spelling mit-1721.1/125496 2022-09-29T15:44:29Z
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Sponsors: National Institutes of Health (U.S.) (Award U41HG007234); Massachusetts Institute of Technology. Postdoctoral Research Fellowship; Wellcome Trust (Grant WT108749/Z/15/Z); National Science Foundation (U.S.) (Grant R01 HG004037)
2020-05-27T14:44:38Z 2020-05-27T14:44:38Z 2019-09 2020-01-15T18:23:52Z
Article http://purl.org/eprint/type/JournalArticle
ISSN: 1088-9051, 1549-5469
https://hdl.handle.net/1721.1/125496
Citation: Mudge, Jonathan M. et al. “Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.” Genome research 29 (2019): 2073-2087 © 2019 The Author(s)
en
DOI: https://dx.doi.org/10.1101/gr.246462.118
Genome research
License: Creative Commons Attribution 4.0 International license https://creativecommons.org/licenses/by/4.0/
application/pdf
Cold Spring Harbor Laboratory
Cold Spring Harbor Laboratory Press
title Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci
url https://hdl.handle.net/1721.1/125496