Segmentation of genomic data through multivariate statistical approaches: comparative analysis

Segmenting a series of measurements along a genome into regions with distinct characteristics is widely used to identify functional components of a genome. The majority of the research on biological data segmentation focuses on the statistical problem of identifying break or change-points in a simu...

Bibliographic Details
Main Authors: ARFA ANJUM, SEEMA JAGGI, SHWETANK LALL, ELDHO VARGHESE, ANIL RAI, ARPAN BHOWMIK, DWIJESH CHANDRA MISHRA
Format: Article
Language: English
Published: Indian Council of Agricultural Research 2022-03-01
Series: The Indian Journal of Agricultural Sciences
Subjects: Genome; Segmentation; Multivariate analysis; Sequential clustering
Online Access: https://epubs.icar.org.in/index.php/IJAgS/article/view/118040
_version_ 1811169091930554368
author ARFA ANJUM
SEEMA JAGGI
SHWETANK LALL
ELDHO VARGHESE
ANIL RAI
ARPAN BHOWMIK
DWIJESH CHANDRA MISHRA
collection DOAJ
description Segmenting a series of measurements along a genome into regions with distinct characteristics is widely used to identify the functional components of a genome. Most research on biological data segmentation addresses the statistical problem of identifying break-points or change-points in simulated scenarios using a single variable. Although various strategies for finding change-points in a multivariate setup through simulation are available, work on segmenting actual multivariate genomic data is limited, because genomic data are very large and highly variable. Therefore, a study was carried out at the ICAR-Indian Agricultural Statistics Research Institute, New Delhi during 2021 to identify the best multivariate statistical method for segmenting sequences, which may influence the properties or function of a sequence, into homogeneous segments. Such segmentation reduces the volume of data and eases further analysis of the segments to determine their actual properties. Genomic data of rice (Oryza sativa L.) were used for a comparative analysis of several multivariate approaches, and agglomerative sequential clustering was found to be the most acceptable because of its low computational cost and feasibility.
first_indexed 2024-04-10T16:37:39Z
format Article
id doaj.art-3b83166d769f418cbd876835aba72a13
institution Directory of Open Access Journals
issn 0019-5022
2394-3319
language English
last_indexed 2024-04-10T16:37:39Z
publishDate 2022-03-01
publisher Indian Council of Agricultural Research
record_format Article
series The Indian Journal of Agricultural Sciences
spelling doaj.art-3b83166d769f418cbd876835aba72a13
2023-02-08T11:12:15Z
eng
Indian Council of Agricultural Research
The Indian Journal of Agricultural Sciences
0019-5022
2394-3319
2022-03-01
vol. 92, no. 7
10.56093/ijas.v92i7.118040
Segmentation of genomic data through multivariate statistical approaches: comparative analysis
ARFA ANJUM: Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
SEEMA JAGGI: Assistant Director General (HRD), Education Division, KAB II, ICAR, New Delhi
SHWETANK LALL: Aristocrat Technologies, New Delhi
ELDHO VARGHESE: Fishery Resources Assessment Division, ICAR-Central Marine Fisheries Research Institute, Kochi
ANIL RAI: Assistant Director General (ICT), Krishi Bhavan, ICAR, New Delhi (Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi)
ARPAN BHOWMIK: Division of Design of Experiments, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
DWIJESH CHANDRA MISHRA: Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
https://epubs.icar.org.in/index.php/IJAgS/article/view/118040
Genome
Segmentation
Multivariate analysis
Sequential clustering
title Segmentation of genomic data through multivariate statistical approaches: comparative analysis
topic Genome
Segmentation
Multivariate analysis
Sequential clustering
url https://epubs.icar.org.in/index.php/IJAgS/article/view/118040