Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength

The application of machine learning techniques in biological research, especially when dealing with limited data availability, poses significant challenges. In this study, we leveraged advancements in method development for predicting protein-protein binding strength to conduct a systematic investig...

Bibliographic Details
Main Authors: Feifan Zheng, Xin Jiang, Yuhao Wen, Yan Yang, Minghui Li
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: Computational and Structural Biotechnology Journal
Subjects: Protein-protein binding affinity; Machine learning methods; Tools
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037023004920
_version_ 1797371453629267968
author Feifan Zheng
Xin Jiang
Yuhao Wen
Yan Yang
Minghui Li
author_facet Feifan Zheng
Xin Jiang
Yuhao Wen
Yan Yang
Minghui Li
author_sort Feifan Zheng
collection DOAJ
description The application of machine learning techniques in biological research, especially when dealing with limited data availability, poses significant challenges. In this study, we leveraged advancements in method development for predicting protein-protein binding strength to conduct a systematic investigation into the application of machine learning on limited data. The binding strength, quantitatively measured as binding affinity, is vital for understanding the processes of recognition, association, and dysfunction that occur within protein complexes. By incorporating transfer learning, integrating domain knowledge, and employing both deep learning and traditional machine learning algorithms, we mitigated the impact of data limitations and made significant advancements in predicting protein-protein binding affinity. In particular, we developed over 20 models, ultimately selecting three representative best-performing ones that belong to distinct categories. The first model is structure-based, consisting of a random forest regression and thirteen handcrafted features. The second model is sequence-based, employing an architecture that combines transferred embedding features with a multilayer perceptron. Finally, we created an ensemble model by averaging the predictions of the two aforementioned models. The comparison with other predictors on three independent datasets confirms the significant improvements achieved by our models in predicting protein-protein binding affinity. The programs for running these three models are available at https://github.com/minghuilab/BindPPI.
first_indexed 2024-03-08T18:19:04Z
format Article
id doaj.art-ec75bc088f3c43d29f0503f210929d2f
institution Directory Open Access Journal
issn 2001-0370
language English
last_indexed 2024-03-08T18:19:04Z
publishDate 2024-12-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj.art-ec75bc088f3c43d29f0503f210929d2f
2023-12-31T04:26:10Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2024-12-01
23
460-472
Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
Feifan Zheng (0)
Xin Jiang (1)
Yuhao Wen (2)
Yan Yang (3)
Minghui Li (4)
(0) MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China
(1) MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China
(2) MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China
(3) MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China
(4) Corresponding author. MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China
The application of machine learning techniques in biological research, especially when dealing with limited data availability, poses significant challenges. In this study, we leveraged advancements in method development for predicting protein-protein binding strength to conduct a systematic investigation into the application of machine learning on limited data. The binding strength, quantitatively measured as binding affinity, is vital for understanding the processes of recognition, association, and dysfunction that occur within protein complexes. By incorporating transfer learning, integrating domain knowledge, and employing both deep learning and traditional machine learning algorithms, we mitigated the impact of data limitations and made significant advancements in predicting protein-protein binding affinity. In particular, we developed over 20 models, ultimately selecting three representative best-performing ones that belong to distinct categories. The first model is structure-based, consisting of a random forest regression and thirteen handcrafted features. The second model is sequence-based, employing an architecture that combines transferred embedding features with a multilayer perceptron. Finally, we created an ensemble model by averaging the predictions of the two aforementioned models. The comparison with other predictors on three independent datasets confirms the significant improvements achieved by our models in predicting protein-protein binding affinity. The programs for running these three models are available at https://github.com/minghuilab/BindPPI.
http://www.sciencedirect.com/science/article/pii/S2001037023004920
Protein-protein binding affinity
Machine learning methods
Tools
spellingShingle Feifan Zheng
Xin Jiang
Yuhao Wen
Yan Yang
Minghui Li
Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
Computational and Structural Biotechnology Journal
Protein-protein binding affinity
Machine learning methods
Tools
title Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
title_full Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
title_fullStr Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
title_full_unstemmed Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
title_short Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
title_sort systematic investigation of machine learning on limited data a study on predicting protein protein binding strength
topic Protein-protein binding affinity
Machine learning methods
Tools
url http://www.sciencedirect.com/science/article/pii/S2001037023004920
work_keys_str_mv AT feifanzheng systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength
AT xinjiang systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength
AT yuhaowen systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength
AT yanyang systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength
AT minghuili systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength
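
Note on the description field: the abstract states that the ensemble model averages the predictions of the structure-based and sequence-based models. The following is a minimal sketch of that averaging step only, assuming an unweighted mean; the function name `ensemble_average`, the variable names, and the numeric values are hypothetical illustrations and are not taken from the BindPPI repository.

```python
from statistics import fmean
from typing import List, Sequence


def ensemble_average(*model_predictions: Sequence[float]) -> List[float]:
    """Return the unweighted mean of several models' predictions, item by item."""
    if not model_predictions:
        raise ValueError("need predictions from at least one model")
    n_items = len(model_predictions[0])
    if any(len(p) != n_items for p in model_predictions):
        raise ValueError("all models must predict the same set of complexes")
    # zip(*...) groups the predictions made for the same complex together.
    return [fmean(values) for values in zip(*model_predictions)]


# Hypothetical binding-affinity predictions for three complexes (made-up values).
structure_based = [-10.0, -8.0, -12.5]   # e.g. output of a random forest model
sequence_based = [-9.0, -8.5, -11.5]     # e.g. output of an MLP on embeddings

ensemble = ensemble_average(structure_based, sequence_based)
print(ensemble)  # [-9.5, -8.25, -12.0]
```

Whether the published ensemble weights the two models equally is an assumption here; the abstract only says the predictions are averaged.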