Modeling aspects of the language of life through transfer-learning protein sequences

Abstract

Background: Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches have combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both of these problems are addressed by the new methodology introduced here.

Results: We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1), and membrane-bound proteins were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information, and for some proteins it even beat the best. Thus, SeqVec embeddings condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, e.g. microbiome or metaproteome analysis.

Conclusion: Transfer learning succeeded in extracting information from unlabeled sequence databases that is relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available at the level of a single sequence.
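
In practical terms, the abstract describes extracting continuous vectors from a pre-trained ELMo model and pooling them per residue or per protein. A minimal sketch of that extraction step is shown below, assuming AllenNLP's ElmoEmbedder and a locally downloaded pre-trained model; the directory and file names are placeholders, not confirmed by this record:

```python
# Minimal sketch: derive SeqVec-style embeddings with AllenNLP's ElmoEmbedder.
# "seqvec_model", "options.json" and "weights.hdf5" are placeholder paths for
# a locally downloaded pre-trained model.
from pathlib import Path

import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

model_dir = Path("seqvec_model")
embedder = ElmoEmbedder(
    options_file=str(model_dir / "options.json"),
    weight_file=str(model_dir / "weights.hdf5"),
    cuda_device=-1,  # set to a GPU id to approach the reported 0.03 s/protein
)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid sequence
# ELMo treats each residue as a token; the output has shape (3 layers, L, 1024).
layers = embedder.embed_sentence(list(sequence))

per_residue = np.sum(layers, axis=0)    # (L, 1024): input for residue-level tasks
per_protein = per_residue.mean(axis=0)  # (1024,):   input for protein-level tasks
print(per_residue.shape, per_protein.shape)
```

Summing the three ELMo layers and averaging over sequence length is one plausible pooling choice for the per-residue and per-protein tasks named in the abstract; other combinations (e.g. concatenating the layers) would work as well.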

Bibliographic Details
Main Authors: Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, Burkhard Rost
Format: Article
Language: English
Published: BMC, 2019-12-01
Series: BMC Bioinformatics
ISSN: 1471-2105
Subjects: Machine Learning; Language Modeling; Sequence Embedding; Secondary structure prediction; Localization prediction; Transfer Learning
Online Access: https://doi.org/10.1186/s12859-019-3220-8

Publication Details
Journal: BMC Bioinformatics, vol. 20, no. 1, pp. 1-17 (2019-12-01)
DOI: 10.1186/s12859-019-3220-8

Author Affiliations
Michael Heinzinger, Ahmed Elnaggar, Christian Dallago, Dmitrii Nechaev, Burkhard Rost: Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich)
Yu Wang: Leibniz Supercomputing Centre
Florian Matthes: TUM Department of Informatics, Software Engineering and Business Information Systems
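
The "simple neural networks" referenced in the abstract can be illustrated with a small feed-forward head trained on frozen 1024-dimensional per-protein embeddings. The sketch below is hypothetical (layer sizes and hyperparameters are assumptions, not the authors' published architecture); its ten outputs correspond to the subcellular-localization classes behind the Q10 score:

```python
# Hypothetical sketch: a small classifier on frozen per-protein SeqVec
# embeddings for ten subcellular-localization classes. Layer sizes and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class LocalizationHead(nn.Module):
    def __init__(self, embed_dim: int = 1024, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 32),    # compress the embedding
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(32, num_classes),  # one logit per localization class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = LocalizationHead()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch standing in for pre-computed embeddings and labels.
embeddings = torch.randn(8, 1024)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = loss_fn(model(embeddings), labels)
loss.backward()
optimizer.step()
print(float(loss))
```

Because the embedder is frozen, only this small head is trained, which is part of what makes the per-protein pipeline cheap once embeddings have been computed.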