Word decoding of protein amino acid sequences with availability analysis: a linguistic approach.

The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino aci...


Bibliographic Details
Main Authors: Kenta Motomura, Tomohiro Fujita, Motosuke Tsutsumi, Satsuki Kikuzato, Morikazu Nakamura, Joji M Otaki
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2012-01-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23185527/pdf/?tool=EBI
_version_ 1818573405511745536
author Kenta Motomura
Tomohiro Fujita
Motosuke Tsutsumi
Satsuki Kikuzato
Morikazu Nakamura
Joji M Otaki
collection DOAJ
description The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or "words". We first confirmed that the English language very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and of all protein sequences, suggesting the possible functional importance of specific SCSs and their constituent amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
first_indexed 2024-12-15T00:10:41Z
format Article
id doaj.art-69640475a7e64675a31d1c645d27b861
institution Directory of Open Access Journals
issn 1932-6203
language English
last_indexed 2024-12-15T00:10:41Z
publishDate 2012-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-69640475a7e64675a31d1c645d27b861 | 2022-12-21T22:42:35Z | eng | Public Library of Science (PLoS) | PLoS ONE | ISSN 1932-6203 | 2012-01-01 | vol. 7, no. 11, e50039 | doi:10.1371/journal.pone.0050039
title Word decoding of protein amino acid sequences with availability analysis: a linguistic approach.
url https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23185527/pdf/?tool=EBI