Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength
The application of machine learning techniques in biological research, especially when dealing with limited data availability, poses significant challenges. In this study, we leveraged advancements in method development for predicting protein-protein binding strength to conduct a systematic investig...
Main Authors: | Feifan Zheng, Xin Jiang, Yuhao Wen, Yan Yang, Minghui Li |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2024-12-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | Protein-protein binding affinity, Machine learning methods, Tools |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037023004920 |
_version_ | 1797371453629267968 |
---|---|
author | Feifan Zheng, Xin Jiang, Yuhao Wen, Yan Yang, Minghui Li |
author_facet | Feifan Zheng, Xin Jiang, Yuhao Wen, Yan Yang, Minghui Li |
author_sort | Feifan Zheng |
collection | DOAJ |
description | The application of machine learning techniques in biological research, especially when dealing with limited data availability, poses significant challenges. In this study, we leveraged advancements in method development for predicting protein-protein binding strength to conduct a systematic investigation into the application of machine learning on limited data. The binding strength, quantitatively measured as binding affinity, is vital for understanding the processes of recognition, association, and dysfunction that occur within protein complexes. By incorporating transfer learning, integrating domain knowledge, and employing both deep learning and traditional machine learning algorithms, we mitigated the impact of data limitations and made significant advancements in predicting protein-protein binding affinity. In particular, we developed over 20 models, ultimately selecting three representative best-performing ones that belong to distinct categories. The first model is structure-based, consisting of a random forest regression and thirteen handcrafted features. The second model is sequence-based, employing an architecture that combines transferred embedding features with a multilayer perceptron. Finally, we created an ensemble model by averaging the predictions of the two aforementioned models. The comparison with other predictors on three independent datasets confirms the significant improvements achieved by our models in predicting protein-protein binding affinity. The programs for running these three models are available at https://github.com/minghuilab/BindPPI. |
first_indexed | 2024-03-08T18:19:04Z |
format | Article |
id | doaj.art-ec75bc088f3c43d29f0503f210929d2f |
institution | Directory Open Access Journal |
issn | 2001-0370 |
language | English |
last_indexed | 2024-03-08T18:19:04Z |
publishDate | 2024-12-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj.art-ec75bc088f3c43d29f0503f210929d2f; 2023-12-31T04:26:10Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2024-12-01; 23; 460; 472; Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength; Feifan Zheng (0), Xin Jiang (1), Yuhao Wen (2), Yan Yang (3), Minghui Li (4); (0-3) MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China; (4) Corresponding author, MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China; The application of machine learning techniques in biological research, especially when dealing with limited data availability, poses significant challenges. In this study, we leveraged advancements in method development for predicting protein-protein binding strength to conduct a systematic investigation into the application of machine learning on limited data. The binding strength, quantitatively measured as binding affinity, is vital for understanding the processes of recognition, association, and dysfunction that occur within protein complexes. By incorporating transfer learning, integrating domain knowledge, and employing both deep learning and traditional machine learning algorithms, we mitigated the impact of data limitations and made significant advancements in predicting protein-protein binding affinity. In particular, we developed over 20 models, ultimately selecting three representative best-performing ones that belong to distinct categories. The first model is structure-based, consisting of a random forest regression and thirteen handcrafted features. The second model is sequence-based, employing an architecture that combines transferred embedding features with a multilayer perceptron. Finally, we created an ensemble model by averaging the predictions of the two aforementioned models. The comparison with other predictors on three independent datasets confirms the significant improvements achieved by our models in predicting protein-protein binding affinity. The programs for running these three models are available at https://github.com/minghuilab/BindPPI.; http://www.sciencedirect.com/science/article/pii/S2001037023004920; Protein-protein binding affinity; Machine learning methods; Tools |
spellingShingle | Feifan Zheng Xin Jiang Yuhao Wen Yan Yang Minghui Li Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength Computational and Structural Biotechnology Journal Protein-protein binding affinity Machine learning methods Tools |
title | Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength |
title_full | Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength |
title_fullStr | Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength |
title_full_unstemmed | Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength |
title_short | Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength |
title_sort | systematic investigation of machine learning on limited data a study on predicting protein protein binding strength |
topic | Protein-protein binding affinity; Machine learning methods; Tools |
url | http://www.sciencedirect.com/science/article/pii/S2001037023004920 |
work_keys_str_mv | AT feifanzheng systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength AT xinjiang systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength AT yuhaowen systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength AT yanyang systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength AT minghuili systematicinvestigationofmachinelearningonlimiteddataastudyonpredictingproteinproteinbindingstrength |
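
The description field above states that the third model is an ensemble formed by averaging the predictions of the structure-based and sequence-based models. A minimal sketch of that averaging step follows, assuming each model has already produced one predicted binding affinity per complex, in the same order. The function name `average_predictions` and the input numbers are hypothetical placeholders for illustration; they are not taken from the article or from the BindPPI repository.

```python
from typing import List, Sequence


def average_predictions(
    structure_preds: Sequence[float],
    sequence_preds: Sequence[float],
) -> List[float]:
    """Average two per-complex prediction lists element by element.

    Hypothetical helper: both inputs are assumed to be aligned, i.e.
    position i in each list refers to the same protein-protein complex.
    """
    if len(structure_preds) != len(sequence_preds):
        raise ValueError("both models must predict the same set of complexes")
    # Unweighted mean of the two models' outputs for each complex.
    return [(a + b) / 2.0 for a, b in zip(structure_preds, sequence_preds)]


if __name__ == "__main__":
    # Placeholder values only, not results from the study.
    structure_based = [-10.0, -8.0, -12.5]
    sequence_based = [-9.0, -9.0, -11.5]
    print(average_predictions(structure_based, sequence_based))
```

An unweighted mean is the simplest reading of "averaging"; whether the published ensemble applies any weighting or rescaling is not stated in this record.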