Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures

Bibliographic Details
Main Authors: Xueqing Chen, Yang Gao, Ludi Wang, Wenjuan Cui, Jiamin Huang, Yi Du, Bin Wang
Format: Article
Language: English
Published: Nature Portfolio 2024-04-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-024-03180-9
collection DOAJ
description Abstract CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learning, natural language processing techniques and large language models (LLMs) approaches to extract relevant information about the CO2 electrocatalytic reduction process from scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, which is a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLMs techniques. The Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several domain literature fine-tuned LLMs were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from domain literature using cutting-edge computer techniques.
first_indexed 2024-04-24T12:42:47Z
format Article
id doaj.art-9956e1ef7b8840b9a895372b2cbe8d8f
institution Directory of Open Access Journals
issn 2052-4463
language English
last_indexed 2024-04-24T12:42:47Z
publishDate 2024-04-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj.art-9956e1ef7b8840b9a895372b2cbe8d8f
2024-04-07T11:08:03Z
eng
Nature Portfolio
Scientific Data
2052-4463
2024-04-01
111112
10.1038/s41597-024-03180-9
Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures
Xueqing Chen [0]: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences
Yang Gao [1]: CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST)
Ludi Wang [2]: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences
Wenjuan Cui [3]: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences
Jiamin Huang [4]: CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST)
Yi Du [5]: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences
Bin Wang [6]: CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST)
Abstract CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learning, natural language processing techniques and large language models (LLMs) approaches to extract relevant information about the CO2 electrocatalytic reduction process from scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, which is a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLMs techniques. The Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several domain literature fine-tuned LLMs were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from domain literature using cutting-edge computer techniques.
https://doi.org/10.1038/s41597-024-03180-9
title Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures
url https://doi.org/10.1038/s41597-024-03180-9