CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers

Abstract Background: Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development, but some are inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inhe...


Bibliographic Details
Main Authors: Osamu Maruyama, Yinuo Li, Hiroki Narita, Hidehiro Toh, Wan Kin Au Yeung, Hiroyuki Sasaki
Format: Article
Language: English
Published: BMC 2022-09-01
Series: BMC Bioinformatics
Subjects: Recurrent neural network; Gated recurrent unit; Classification; Oocyte; Blastocyst; Embryo
Online Access: https://doi.org/10.1186/s12859-022-04916-3
author Osamu Maruyama
Yinuo Li
Hiroki Narita
Hidehiro Toh
Wan Kin Au Yeung
Hiroyuki Sasaki
collection DOAJ
description Abstract
Background: Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development, but some are inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inherits oocyte-derived DNA methylation from the maternal genome. Recurrent neural networks (RNNs), including those based on gated recurrent units (GRUs), have recently been employed for variable-length inputs in classification and regression analyses. One advantage of this strategy is that RNNs automatically learn latent features embedded in the inputs as their model parameters are trained. However, the available CGI dataset for predicting oocyte-derived DNA methylation inheritance is not large enough to train such neural networks.
Results: We propose a GRU-based model called CMIC (CGI Methylation Inheritance Classifier) that augments each CGI sequence by converting it into a sequence of variable-length k-mers, where each length k is randomly selected from the range $k_{\min}$ to $k_{\max}$. This conversion is repeated N times, and the resulting N sequences are used as neural network input. N is set to 1000 by default. In addition, we propose a new embedding vector generator for k-mers called splitDNA2vec, whose procedure involves more randomness than that of the earlier dna2vec.
Conclusions: We found that CMIC can predict the inheritance of oocyte-derived DNA methylation at CGIs in the maternal genome of blastocysts with a high F-measure (0.93). We also show that the F-measure can be improved by increasing the parameter N, that is, the number of sequences of variable-length k-mers derived from a single CGI sequence. This indicates that converting a DNA sequence into N sequences of variable-length k-mers is an effective way to augment input data. The approach can be applied to other DNA sequence classification and regression analyses, particularly those with a small amount of data.
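The augmentation step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact splitting procedure, so this sketch assumes a left-to-right, non-overlapping split in which each k is drawn uniformly from the range k_min to k_max and the final piece may be shorter than k_min. The function names, the default values k_min=3 and k_max=8, and the example sequence are all hypothetical and are not taken from the paper.

```python
import random


def split_variable_kmers(seq, k_min, k_max, rng):
    """Cut seq left to right into non-overlapping k-mers of random length.

    Assumption: each k is drawn uniformly from [k_min, k_max]; the last
    piece is simply whatever remains and may be shorter than k_min.
    """
    kmers = []
    pos = 0
    while pos < len(seq):
        k = rng.randint(k_min, k_max)  # inclusive on both ends
        kmers.append(seq[pos:pos + k])
        pos += k
    return kmers


def augment(seq, n, k_min=3, k_max=8, seed=0):
    """Return n random k-mer splittings of the same sequence."""
    rng = random.Random(seed)
    return [split_variable_kmers(seq, k_min, k_max, rng) for _ in range(n)]


# Hypothetical short sequence; a real CGI is far longer.
cgi = "CGCGGGCTAGCGCGATTCGCGCCGGA"
samples = augment(cgi, n=5)
for kmers in samples:
    print(kmers)
```

Each of the n splittings covers the same underlying sequence but presents a different series of tokens, which is how a single CGI could yield N distinct network inputs; in the abstract's default setting N is 1000.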
first_indexed 2024-04-11T21:11:41Z
format Article
id doaj.art-2368566fd5e84a869956acc7c60ac72e
institution Directory of Open Access Journals
issn 1471-2105
language English
last_indexed 2024-04-11T21:11:41Z
publishDate 2022-09-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-2368566fd5e84a869956acc7c60ac72e; 2022-12-22T04:02:59Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-09-01; 23(1):1-20; 10.1186/s12859-022-04916-3
Osamu Maruyama (Faculty of Design, Kyushu University)
Yinuo Li (Graduate School of Design, Kyushu University)
Hiroki Narita (School of Design, Kyushu University)
Hidehiro Toh (Division of Epigenomics and Development, Medical Institute of Bioregulation, Kyushu University)
Wan Kin Au Yeung (Division of Epigenomics and Development, Medical Institute of Bioregulation, Kyushu University)
Hiroyuki Sasaki (Division of Epigenomics and Development, Medical Institute of Bioregulation, Kyushu University)
title CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers
topic Recurrent neural network
Gated recurrent unit
Classification
Oocyte
Blastocyst
Embryo
url https://doi.org/10.1186/s12859-022-04916-3