Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures
Abstract CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learnin...
Main Authors: | Xueqing Chen, Yang Gao, Ludi Wang, Wenjuan Cui, Jiamin Huang, Yi Du, Bin Wang |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio 2024-04-01 |
Series: | Scientific Data |
Online Access: | https://doi.org/10.1038/s41597-024-03180-9 |
_version_ | 1797220013111771136 |
---|---|
author | Xueqing Chen Yang Gao Ludi Wang Wenjuan Cui Jiamin Huang Yi Du Bin Wang |
author_facet | Xueqing Chen Yang Gao Ludi Wang Wenjuan Cui Jiamin Huang Yi Du Bin Wang |
author_sort | Xueqing Chen |
collection | DOAJ |
description | Abstract CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learning, natural language processing techniques and large language models (LLMs) approaches to extract relevant information about the CO2 electrocatalytic reduction process from scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, which is a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLMs techniques. The Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several domain literature fine-tuned LLMs were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from domain literature using cutting-edge computer techniques. |
first_indexed | 2024-04-24T12:42:47Z |
format | Article |
id | doaj.art-9956e1ef7b8840b9a895372b2cbe8d8f |
institution | Directory Open Access Journal |
issn | 2052-4463 |
language | English |
last_indexed | 2024-04-24T12:42:47Z |
publishDate | 2024-04-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Data |
spelling | doaj.art-9956e1ef7b8840b9a895372b2cbe8d8f; 2024-04-07T11:08:03Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2024-04-01; vol. 11, no. 1, pp. 1-12; 10.1038/s41597-024-03180-9; Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures; Xueqing Chen (0), Yang Gao (1), Ludi Wang (2), Wenjuan Cui (3), Jiamin Huang (4), Yi Du (5), Bin Wang (6); (0) Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences; (1) CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST); (2) Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences; (3) Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences; (4) CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST); (5) Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences; (6) CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST); Abstract CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learning, natural language processing techniques and large language models (LLMs) approaches to extract relevant information about the CO2 electrocatalytic reduction process from scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, which is a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLMs techniques. The Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several domain literature fine-tuned LLMs were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from domain literature using cutting-edge computer techniques.; https://doi.org/10.1038/s41597-024-03180-9 |
spellingShingle | Xueqing Chen Yang Gao Ludi Wang Wenjuan Cui Jiamin Huang Yi Du Bin Wang Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures Scientific Data |
title | Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures |
title_full | Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures |
title_fullStr | Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures |
title_full_unstemmed | Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures |
title_short | Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures |
title_sort | large language model enhanced corpus of co2 reduction electrocatalysts and synthesis procedures |
url | https://doi.org/10.1038/s41597-024-03180-9 |
work_keys_str_mv | AT xueqingchen largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures AT yanggao largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures AT ludiwang largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures AT wenjuancui largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures AT jiaminhuang largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures AT yidu largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures AT binwang largelanguagemodelenhancedcorpusofco2reductionelectrocatalystsandsynthesisprocedures |