An open source chemical structure curation pipeline using RDKit

Abstract. Background: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final databa...

Bibliographic Details
Main Authors: A. Patrícia Bento, Anne Hersey, Eloy Félix, Greg Landrum, Anna Gaulton, Francis Atkinson, Louisa J. Bellis, Marleen De Veij, Andrew R. Leach
Format: Article
Language: English
Published: BMC 2020-09-01
Series: Journal of Cheminformatics
Subjects: Chemistry, Curation, ChEMBL, RDKit, Open source, Standardisation
Online Access: http://link.springer.com/article/10.1186/s13321-020-00456-1
author A. Patrícia Bento
Anne Hersey
Eloy Félix
Greg Landrum
Anna Gaulton
Francis Atkinson
Louisa J. Bellis
Marleen De Veij
Andrew R. Leach
collection DOAJ
description Abstract. Background: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised. Results: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures. Conclusion: All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
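The three-stage flow described in the abstract (check, standardize, get parent) can be illustrated with a toy sketch. This is not the published code and does not use RDKit: it works on plain SMILES strings, and every name in it (`check`, `standardize`, `get_parent`, `curate`, `SALTS_AND_SOLVENTS`), the severity numbers, and the rules themselves are assumptions made up for illustration. The real pipeline operates on parsed molecular structures and applies far richer rules.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical, deliberately tiny list of salt/solvent fragments (as SMILES).
SALTS_AND_SOLVENTS = frozenset({"[Na+]", "[K+]", "[Cl-]", "Cl", "O"})


def check(smiles: str) -> List[Tuple[int, str]]:
    """Flag problems as (severity, message); severity values are made up."""
    issues = []
    if not smiles.strip():
        issues.append((2, "empty structure"))
    if smiles.count("(") != smiles.count(")"):
        issues.append((1, "unbalanced branch parentheses"))
    # Most serious issue first, so it can be prioritised for manual review.
    return sorted(issues, reverse=True)


def standardize(smiles: str) -> str:
    """Toy convention: trim whitespace and order fragments deterministically."""
    fragments = [f for f in smiles.strip().split(".") if f]
    return ".".join(sorted(fragments))


def get_parent(smiles: str) -> str:
    """Drop fragments found in the salt/solvent list to obtain the parent."""
    kept = [f for f in smiles.split(".") if f not in SALTS_AND_SOLVENTS]
    # If every fragment would be stripped, keep the input unchanged.
    return ".".join(kept) if kept else smiles


@dataclass
class Curated:
    issues: List[Tuple[int, str]]
    standardized: str
    parent: str


def curate(smiles: str) -> Curated:
    """Run the three stages in order on a single input string."""
    standardized = standardize(smiles)
    return Curated(check(smiles), standardized, get_parent(standardized))


if __name__ == "__main__":
    # Ethylamine hydrochloride written as two dot-separated fragments.
    print(curate(" Cl.CCN "))
```

Running the parent step on the standardized form, rather than on the raw input, means salt stripping always sees fragments in one consistent order. That ordering is a design choice of this sketch, not a statement about the published pipeline.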
first_indexed 2024-12-11T21:26:00Z
format Article
id doaj.art-bbbf7ee22f154d9a97199854b9ea5039
institution Directory Open Access Journal
issn 1758-2946
language English
last_indexed 2024-12-11T21:26:00Z
publishDate 2020-09-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj.art-bbbf7ee22f154d9a97199854b9ea5039; 2022-12-22T00:50:20Z; eng; BMC; Journal of Cheminformatics; ISSN 1758-2946; 2020-09-01; vol. 12, no. 1, pp. 1-16; 10.1186/s13321-020-00456-1
An open source chemical structure curation pipeline using RDKit
A. Patrícia Bento (European Molecular Biology Laboratory, European Bioinformatics Institute)
Anne Hersey (European Molecular Biology Laboratory, European Bioinformatics Institute)
Eloy Félix (European Molecular Biology Laboratory, European Bioinformatics Institute)
Greg Landrum (T5 Informatics GmbH)
Anna Gaulton (European Molecular Biology Laboratory, European Bioinformatics Institute)
Francis Atkinson (European Molecular Biology Laboratory, European Bioinformatics Institute)
Louisa J. Bellis (European Molecular Biology Laboratory, European Bioinformatics Institute)
Marleen De Veij (European Molecular Biology Laboratory, European Bioinformatics Institute)
Andrew R. Leach (European Molecular Biology Laboratory, European Bioinformatics Institute)
http://link.springer.com/article/10.1186/s13321-020-00456-1
Keywords: Chemistry, Curation, ChEMBL, RDKit, Open source, Standardisation
title An open source chemical structure curation pipeline using RDKit
topic Chemistry
Curation
ChEMBL
RDKit
Open source
Standardisation
url http://link.springer.com/article/10.1186/s13321-020-00456-1