GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics

Bibliographic Details
Main Authors: Zijie Shen, Enhui Shen, Qian-Hao Zhu, Longjiang Fan, Quan Zou, Chu-Yu Ye
Format: Article
Language: English
Published: Wiley 2023-12-01
Series: Advanced Intelligent Systems
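Subjects: genome-to-phenotype (G2P); genomic descriptor; genomic machine learning; supervised learning; variety protection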
Online Access: https://doi.org/10.1002/aisy.202300426
author Zijie Shen
Enhui Shen
Qian-Hao Zhu
Longjiang Fan
Quan Zou
Chu-Yu Ye
collection DOAJ
description Machine learning (ML) is one of the core driving forces of the next breeding stage, Breeding 4.0. A genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. The genotype matrix has an inherent defect: the feature spaces it generates for different individuals or groups are inconsistent, which hinders the application of ML. To overcome this challenge, a genome descriptor, the Genic SNPs Composition Tool (GSCtool), is developed. It counts the number of SNPs in each gene of the genome, so the dimension of the feature vectors equals the number of annotated genes in a species. Compared with the genotype matrix, GSCtool significantly decreases model training time and achieves higher phenotype prediction accuracy. GSCtool also performs well in variety identification, which is useful in crop variety protection. In general, GSCtool will help facilitate the application and study of genomic ML. The source code and test data of GSCtool are freely available at https://github.com/SZJhacker/GSCtool and https://gitee.com/shenzijie/GSCtool.
first_indexed 2024-03-08T20:12:10Z
format Article
id doaj.art-833d710b2de44a22ad0b79ff6e27be02
institution Directory Open Access Journal
issn 2640-4567
language English
last_indexed 2024-03-08T20:12:10Z
publishDate 2023-12-01
publisher Wiley
record_format Article
series Advanced Intelligent Systems
spelling doaj.art-833d710b2de44a22ad0b79ff6e27be02 | 2023-12-23T04:53:50Z | eng | Wiley | Advanced Intelligent Systems | 2640-4567 | 2023-12-01 | vol. 5, no. 12 | pages n/a | 10.1002/aisy.202300426
Zijie Shen (0), Enhui Shen (1), Qian-Hao Zhu (2), Longjiang Fan (3), Quan Zou (4), Chu-Yu Ye (5)
0: Hainan Institute, Zhejiang University, Sanya 572025, China
1: Institute of Crop Science & Institute of Bioinformatics, College of Agriculture & Biotechnology, Zhejiang University, Hangzhou 310058, China
2: CSIRO Agriculture and Food, GPO Box 1700, Canberra, ACT 2601, Australia
3: Hainan Institute, Zhejiang University, Sanya 572025, China
4: Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
5: Institute of Crop Science & Institute of Bioinformatics, College of Agriculture & Biotechnology, Zhejiang University, Hangzhou 310058, China
title GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics
topic genome-to-phenotype (G2P)
genomic descriptor
genomic machine learning
supervised learning
variety protection
url https://doi.org/10.1002/aisy.202300426
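
The description field of this record says the descriptor counts the SNPs falling in each annotated gene, so every individual yields a feature vector whose length equals the number of genes. The following is a minimal sketch of that counting idea only, not GSCtool's actual code or interface. The function name `genic_snp_counts`, the tuple layouts, and the toy coordinates are all hypothetical, and gene intervals are assumed to be 1-based and inclusive.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def genic_snp_counts(genes, snps):
    """Count SNPs inside each gene interval (hypothetical helper).

    genes: list of (gene_id, chrom, start, end), 1-based inclusive.
    snps:  iterable of (chrom, pos) for one individual.
    Returns one count per gene, in the order the genes are listed.
    """
    # Group SNP positions by chromosome and sort them for binary search.
    positions = defaultdict(list)
    for chrom, pos in snps:
        positions[chrom].append(pos)
    for plist in positions.values():
        plist.sort()

    counts = []
    for _gene_id, chrom, start, end in genes:
        plist = positions.get(chrom, [])
        # Number of sorted positions p with start <= p <= end.
        counts.append(bisect_right(plist, end) - bisect_left(plist, start))
    return counts


# Toy annotation (assumed values): two overlapping genes on chr1, one on chr2.
genes = [
    ("geneA", "chr1", 100, 200),
    ("geneB", "chr1", 150, 300),
    ("geneC", "chr2", 1, 50),
]

# Two individuals carrying different SNP sets.
sample_1 = [("chr1", 120), ("chr1", 180), ("chr1", 250), ("chr2", 60)]
sample_2 = [("chr1", 199), ("chr2", 10), ("chr2", 50), ("chr1", 500)]

vec_1 = genic_snp_counts(genes, sample_1)
vec_2 = genic_snp_counts(genes, sample_2)
print(vec_1, vec_2)
```

In this sketch both vectors have length three even though the two samples share no SNP position, which is the consistency property the description contrasts with the genotype matrix. A SNP inside two overlapping genes is counted once for each gene here. That is a choice made for this example, and the source does not say how GSCtool handles overlaps.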