DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature

Abstract Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual...

Bibliographic Details
Main Authors: Kohulan Rajan, Henning Otto Brinkhaus, Maria Sorokina, Achim Zielesny, Christoph Steinbeck
Format: Article
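Language: English
Published: BMC 2021-03-01
Series: Journal of Cheminformatics
Subjects: Deep learning, Image Segmentation, Optical Chemical Structure Recognition, Neural Networks, Chemical data extraction
Online Access: https://doi.org/10.1186/s13321-021-00496-1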
author Kohulan Rajan
Henning Otto Brinkhaus
Maria Sorokina
Achim Zielesny
Christoph Steinbeck
collection DOAJ
description Abstract Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
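The second stage described in the abstract, expanding a potentially incomplete mask, can be illustrated with a small sketch. This is not the DECIMER Segmentation implementation: it assumes the page has already been binarized into a grid of 0/1 "ink" pixels, and it swaps in a plain breadth-first flood fill that grows the mask over every ink pixel connected to ink the mask already covers. The function name `expand_mask` and the toy data are hypothetical.

```python
from collections import deque


def expand_mask(ink, mask):
    """Grow `mask` over all ink pixels 8-connected to ink already inside it.

    ink, mask: equally sized lists of lists of 0/1
               (1 = dark pixel / pixel covered by the mask).
    Returns a new mask; the inputs are left unchanged.
    Illustrative sketch only, not the published algorithm.
    """
    rows, cols = len(ink), len(ink[0])
    expanded = [row[:] for row in mask]
    # Seed the search with ink pixels the detector already covered.
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if mask[r][c] and ink[r][c]
    )
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (
                    0 <= nr < rows
                    and 0 <= nc < cols
                    and ink[nr][nc]
                    and not expanded[nr][nc]
                ):
                    expanded[nr][nc] = 1
                    queue.append((nr, nc))
    return expanded


# Toy page: a horizontal stroke in row 1 and an unrelated mark at bottom right.
ink = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
# Incomplete mask: only the left half of the stroke was detected.
mask = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
expanded = expand_mask(ink, mask)
```

In this toy case the expanded mask covers the whole stroke in row 1, while the disconnected mark in the corner stays outside it. A real page would also need a rule for stopping at neighbouring text or table lines, which this sketch does not attempt.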
first_indexed 2024-12-17T01:28:24Z
format Article
id doaj.art-19f62439eb544a07b3b5faebf5ea6bf7
institution Directory Open Access Journal
issn 1758-2946
language English
last_indexed 2024-12-17T01:28:24Z
publishDate 2021-03-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj.art-19f62439eb544a07b3b5faebf5ea6bf7 2022-12-21T22:08:38Z eng BMC Journal of Cheminformatics 1758-2946 2021-03-01 13119 10.1186/s13321-021-00496-1
DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature
Kohulan Rajan, Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena
Henning Otto Brinkhaus, Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena
Maria Sorokina, Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena
Achim Zielesny, Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences
Christoph Steinbeck, Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena
title DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature
topic Deep learning
Image Segmentation
Optical Chemical Structure Recognition
Neural Networks
Chemical data extraction
url https://doi.org/10.1186/s13321-021-00496-1