A model of k-mer surprisal to quantify local sequence information content surrounding splice regions

Molecular sequences carry information. Analysis of sequence conservation between homologous loci is a proven approach with which to explore the information content of molecular sequences. This is often done using multiple sequence alignments to support comparisons between homologous loci. These meth...

Bibliographic Details
Main Authors:	Sam Humphrey, Alastair Kerr, Magnus Rattray, Caroline Dive, Crispin J. Miller
Format:	Article
Language:	English
Published:	PeerJ Inc. 2020-11-01
Series:	PeerJ
Subjects:	Information theory, Surprisal, Splicing, Entropy
Online Access:	https://peerj.com/articles/10063.pdf

_version_	1797419873210466304
author	Sam Humphrey, Alastair Kerr, Magnus Rattray, Caroline Dive, Crispin J. Miller
collection	DOAJ
description	Molecular sequences carry information. Analysis of sequence conservation between homologous loci is a proven approach with which to explore the information content of molecular sequences. This is often done using multiple sequence alignments to support comparisons between homologous loci. These methods therefore rely on sufficient underlying sequence similarity with which to construct a representative alignment. Here we describe a method using a formal metric of information, surprisal, to analyse biological sub-sequences without alignment constraints. We applied our model to the genomes of five different species to reveal similar patterns across a panel of eukaryotes. As the surprisal of a sub-sequence is inversely proportional to its occurrence within the genome, the optimal size of the sub-sequences was selected for each species under consideration. With the model optimized, we found a strong correlation between surprisal and CG dinucleotide usage. The utility of our model was tested by examining the sequences of genes known to undergo splicing. We demonstrate that our model can identify biological features of interest such as known donor and acceptor sites. Analysis across all annotated coding exon junctions in Homo sapiens reveals the information content of coding exons to be greater than the surrounding intron regions, a consequence of increased suppression of the CG dinucleotide in intronic space. Sequences within coding regions proximal to exon junctions exhibited novel patterns within DNA and coding mRNA that are not a function of the encoded amino acid sequence. Our findings are consistent with the presence of secondary information encoding features such as DNA and RNA binding sites, multiplexed through the coding sequence and independent of the information required to define the corresponding amino-acid sequence. 
We conclude that surprisal provides a complementary methodology with which to locate regions of interest in the genome, particularly in situations that lack an appropriate multiple sequence alignment.
first_indexed	2024-03-09T06:54:31Z
format	Article
id	doaj.art-c6f890ee67d44d59b6a444a2f4c4411e
institution	Directory Open Access Journal
issn	2167-8359
language	English
last_indexed	2024-03-09T06:54:31Z
publishDate	2020-11-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ
spelling	doaj.art-c6f890ee67d44d59b6a444a2f4c4411e; 2023-12-03T10:14:19Z; eng; PeerJ Inc.; PeerJ; ISSN 2167-8359; 2020-11-01; vol. 8; e10063; doi:10.7717/peerj.10063; Sam Humphrey (CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom); Alastair Kerr (CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom); Magnus Rattray (Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom); Caroline Dive (CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom); Crispin J. Miller (Computational Biology Group, CRUK Beatson Institute, Glasgow, United Kingdom)
title	A model of k-mer surprisal to quantify local sequence information content surrounding splice regions
topic	Information theory, Surprisal, Splicing, Entropy
url	https://peerj.com/articles/10063.pdf

A model of k-mer surprisal to quantify local sequence information content surrounding splice regions
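
The description in this record states that the surprisal of a sub-sequence depends on how often it occurs within the genome. A minimal sketch of that idea follows, assuming the standard information-theoretic definition of surprisal, -log2(p), with p estimated as the relative frequency of a k-mer among all overlapping k-mers in a background sequence. The function names (`kmer_counts`, `kmer_surprisal`, `surprisal_profile`) are hypothetical and chosen for illustration only; the sketch does not reproduce the authors' per-species selection of k or any other detail of the published model.

```python
import math
from collections import Counter


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every overlapping k-mer in the background sequence."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def kmer_surprisal(counts: Counter) -> dict:
    """Map each observed k-mer to -log2(relative frequency), in bits.

    Assumption: the probability of a k-mer is estimated as its share
    of all k-mers counted in the background sequence.
    """
    total = sum(counts.values())
    return {kmer: -math.log2(n / total) for kmer, n in counts.items()}


def surprisal_profile(seq: str, table: dict, k: int) -> list:
    """Surprisal of the k-mer starting at each position of seq.

    k-mers never seen in the background have no finite estimate here
    and are reported as None.
    """
    seq = seq.upper()
    return [table.get(seq[i:i + k]) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Toy background, not real genomic data.
    table = kmer_surprisal(kmer_counts("AACG", 1))
    print(table["A"])  # 1.0 bit: A makes up half of the 1-mers
    print(table["C"])  # 2.0 bits: C makes up a quarter of the 1-mers
    print(surprisal_profile("ACT", table, 1))  # [1.0, 2.0, None]
```

Rarer k-mers receive higher surprisal, which is the relationship the description relies on when it discusses choosing the sub-sequence size for each species. In a real analysis the background would be a whole genome and the profile would be computed over windows around annotated exon junctions; both are outside the scope of this sketch.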

Similar Items