EXSCLAIM!: Harnessing materials science literature for self-labeled microscopy datasets

Summary: This work introduces the EXSCLAIM! toolkit for the automatic extraction, separation, and caption-based natural language annotation of images from scientific literature. EXSCLAIM! is used to show how rule-based natural language processing and image recognition can be leveraged to construct a...

Bibliographic Details
Main Authors: Eric Schwenker, Weixin Jiang, Trevor Spreadbury, Nicola Ferrier, Oliver Cossairt, Maria K.Y. Chan
Format: Article
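Language: English
Published: Elsevier 2023-11-01
Series: Patterns
Subjects: DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
Online Access: http://www.sciencedirect.com/science/article/pii/S2666389923002222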
author Eric Schwenker
Weixin Jiang
Trevor Spreadbury
Nicola Ferrier
Oliver Cossairt
Maria K.Y. Chan
author_sort Eric Schwenker
collection DOAJ
description Summary: This work introduces the EXSCLAIM! toolkit for the automatic extraction, separation, and caption-based natural language annotation of images from scientific literature. EXSCLAIM! is used to show how rule-based natural language processing and image recognition can be leveraged to construct an electron microscopy dataset containing thousands of keyword-annotated nanostructure images. Moreover, it is demonstrated how a combination of statistical topic modeling and semantic word similarity comparisons can be used to increase the number and variety of keyword annotations on top of the standard annotations from EXSCLAIM! With large-scale imaging datasets constructed from scientific literature, users are well positioned to train neural networks for classification and recognition tasks specific to microscopy, tasks often otherwise inhibited by a lack of sufficient annotated training data.
The bigger picture: Due to recent improvements in image resolution and acquisition speed, materials microscopy is experiencing an explosion of published imaging data. The standard publication format, while sufficient for data ingestion scenarios where a selection of images can be critically examined and curated manually, is not conducive to large-scale data aggregation or analysis, hindering data sharing and reuse. Most images in publications are part of a larger figure, with their explicit context buried in the main body or caption text; so even if aggregated, collections of images with weak or no digitized contextual labels have limited value. The tool developed in this work establishes a scalable pipeline for meaningful image-/language-based information curation from scientific literature.
first_indexed 2024-03-11T11:09:12Z
format Article
id doaj.art-2328be418338488a8af12a93d4bd4123
institution Directory Open Access Journal
issn 2666-3899
language English
last_indexed 2024-03-11T11:09:12Z
publishDate 2023-11-01
publisher Elsevier
record_format Article
series Patterns
spelling doaj.art-2328be418338488a8af12a93d4bd4123
2023-11-12T04:41:03Z
eng
Elsevier
Patterns
2666-3899
2023-11-01
Volume 4, Issue 11, Article 100843
EXSCLAIM!: Harnessing materials science literature for self-labeled microscopy datasets
Eric Schwenker: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Department of Materials Science and Engineering, Northwestern University, Evanston, IL 60208, USA; Corresponding author
Weixin Jiang: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Trevor Spreadbury: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Nicola Ferrier: Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL 60439, USA
Oliver Cossairt: Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Maria K.Y. Chan: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Corresponding author
title EXSCLAIM!: Harnessing materials science literature for self-labeled microscopy datasets
title_sort exsclaim harnessing materials science literature for self labeled microscopy datasets
topic DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
url http://www.sciencedirect.com/science/article/pii/S2666389923002222