GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics
Machine learning (ML) is one of the core driving forces for the next breeding stage, Breeding 4.0. A genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. The genotype matrix has an inherent defect, as the feature spaces it generates ac...
Main Authors: | Zijie Shen, Enhui Shen, Qian-Hao Zhu, Longjiang Fan, Quan Zou, Chu-Yu Ye |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley 2023-12-01 |
Series: | Advanced Intelligent Systems |
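Subjects: | genome-to-phenotype (G2P); genomic descriptor; genomic machine learning; supervised learning; variety protection |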
Online Access: | https://doi.org/10.1002/aisy.202300426 |
_version_ | 1827400952539250688 |
---|---|
author | Zijie Shen; Enhui Shen; Qian-Hao Zhu; Longjiang Fan; Quan Zou; Chu-Yu Ye |
author_sort | Zijie Shen |
collection | DOAJ |
description | Machine learning (ML) is one of the core driving forces for the next breeding stage, Breeding 4.0. A genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. The genotype matrix has an inherent defect: the feature spaces it generates across different individuals or groups are inconsistent, which hinders the application of ML. To overcome this challenge, a genome descriptor, the Genic SNPs Composition Tool (GSCtool), is developed. It counts the number of SNPs in each gene of the genome, so the dimension of the feature vectors equals the number of annotated genes in a species. Compared to using the genotype matrix, using GSCtool significantly decreases model training time and yields higher phenotype prediction accuracy. GSCtool also achieves good performance in variety identification, which is useful in crop variety protection. In general, GSCtool will help facilitate the application and study of genomic ML. The source code and test data of GSCtool are freely available at https://github.com/SZJhacker/GSCtool and https://gitee.com/shenzijie/GSCtool. |
first_indexed | 2024-03-08T20:12:10Z |
format | Article |
id | doaj.art-833d710b2de44a22ad0b79ff6e27be02 |
institution | Directory of Open Access Journals |
issn | 2640-4567 |
language | English |
last_indexed | 2024-03-08T20:12:10Z |
publishDate | 2023-12-01 |
publisher | Wiley |
record_format | Article |
series | Advanced Intelligent Systems |
spelling | doaj.art-833d710b2de44a22ad0b79ff6e27be02; 2023-12-23T04:53:50Z; eng; Wiley; Advanced Intelligent Systems; 2640-4567; 2023-12-01; 5; 12; n/a; n/a; 10.1002/aisy.202300426; GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics; Zijie Shen (0), Enhui Shen (1), Qian-Hao Zhu (2), Longjiang Fan (3), Quan Zou (4), Chu-Yu Ye (5); (0) Hainan Institute, Zhejiang University, Sanya 572025, China; (1) Institute of Crop Science & Institute of Bioinformatics, College of Agriculture & Biotechnology, Zhejiang University, Hangzhou 310058, China; (2) CSIRO Agriculture and Food, GPO Box 1700, Canberra ACT 2601, Australia; (3) Hainan Institute, Zhejiang University, Sanya 572025, China; (4) Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China; (5) Institute of Crop Science & Institute of Bioinformatics, College of Agriculture & Biotechnology, Zhejiang University, Hangzhou 310058, China; Machine learning (ML) is one of the core driving forces for the next breeding stage, and Breeding 4.0. Genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. Genotype matrix has an inherent defect, as the feature spaces it generates across different individuals or groups are inconsistent, and this hinders the application of ML. To overcome the challenge, a genome descriptor, Genic SNPs Composition Tool (GSCtool) is developed, which counts the number of SNPs in each gene of the genome so the dimension of the feature vectors equals the number of annotated genes in a species. Compared to using the genotype matrix, using GSCtool significantly decreases the model training time and has a higher accuracy of phenotype prediction. GSCtool also achieves good performance in variety identification, which is useful in crop variety protection. In general, GSCtool will help facilitate the application and study of genomic ML. The source code and test data of GSCtool are freely available at https://github.com/SZJhacker/GSCtool and https://gitee.com/shenzijie/GSCtool.; https://doi.org/10.1002/aisy.202300426; genome-to-phenotype (G2P); genomic descriptor; genomic machine learning; supervised learning; variety protection |
title | GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics |
topic | genome-to-phenotype (G2P); genomic descriptor; genomic machine learning; supervised learning; variety protection |
url | https://doi.org/10.1002/aisy.202300426 |
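
The description in this record says the descriptor counts the SNPs that fall in each annotated gene, so every individual gets a feature vector whose length equals the number of genes. The following is a minimal sketch of that idea only, not GSCtool's actual implementation: the `GENES` annotation, the `snp_counts_per_gene` function, and the sample data are all hypothetical, and real input would typically come from variant and annotation files.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical annotation: (gene_id, chromosome, start, end), 1-based inclusive.
GENES = [
    ("geneA", "chr1", 100, 500),
    ("geneB", "chr1", 800, 1200),
    ("geneC", "chr2", 50, 300),
]


def snp_counts_per_gene(snps, genes=GENES):
    """Return one SNP count per annotated gene, in annotation order.

    snps: iterable of (chromosome, position) pairs for one individual.
    """
    # Group SNP positions by chromosome and sort them for binary search.
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    vector = []
    for _gene_id, chrom, start, end in genes:
        positions = by_chrom.get(chrom, [])
        # Count the sorted positions that lie inside [start, end].
        count = bisect_right(positions, end) - bisect_left(positions, start)
        vector.append(count)
    return vector


# Two individuals with different SNP sets (made-up positions).
sample_1 = [("chr1", 150), ("chr1", 499), ("chr1", 900), ("chr2", 10)]
sample_2 = [("chr1", 1000), ("chr2", 60), ("chr2", 299), ("chr2", 300)]

vec_1 = snp_counts_per_gene(sample_1)
vec_2 = snp_counts_per_gene(sample_2)
print(vec_1, vec_2)
```

Although the two hypothetical samples share no SNP positions, both vectors have one entry per gene in `GENES`, which illustrates why a gene-level count gives a consistent feature space across individuals or groups where a SNP-indexed genotype matrix does not.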