Accelerating materials language processing with large language models

Abstract Materials language processing (MLP) can facilitate materials science research by automating the extraction of structured data from research papers. Although deep learning models exist for MLP tasks, practical issues persist around complex model architectures, extensive fine-tuning, and the need for substantial human-labelled datasets. Here, we introduce the use of large language models, such as the generative pretrained transformer (GPT), to replace the complex architectures of prior MLP models with strategically designed prompt engineering. We find that in-context learning of GPT models with few- or zero-shot examples can provide high-performance text classification, named entity recognition, and extractive question answering with limited datasets, demonstrated for various classes of materials. These generative models can also help identify incorrectly annotated data. Our GPT-based approach can assist materials scientists in solving knowledge-intensive MLP tasks, even if they lack relevant expertise, by offering MLP guidelines applicable to any materials science domain. In addition, the outputs of GPT models are expected to reduce researchers' workload, such as manual labelling, by producing an initial labelling set and verifying human annotations.

Bibliographic Details
Main Authors: Jaewoong Choi, Byungju Lee (Computational Science Research Center, Korea Institute of Science and Technology)
Format: Article
Language: English
Published: Nature Portfolio, 2024-02-01
Series: Communications Materials
ISSN: 2662-4443
Online Access: https://doi.org/10.1038/s43246-024-00449-9
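
The abstract describes replacing task-specific MLP architectures with few- or zero-shot in-context learning via prompt engineering. As an illustration only, the Python sketch below shows what such a few-shot named entity recognition prompt for materials text might look like; the OpenAI client usage is standard, but the model name, the MAT/PRO entity schema, and the demonstration sentence are hypothetical assumptions, not the paper's actual prompts.

# Minimal sketch of few-shot in-context learning for materials NER.
# Assumptions (not from the paper): the gpt-4o-mini model, the MAT/PRO
# entity schema, and the demonstration example are all illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_PROMPT = """Label materials-science entities as MAT (material) or PRO (property).
Return one "entity -> label" pair per line.

Sentence: LiFePO4 cathodes show high thermal stability.
Entities:
LiFePO4 -> MAT
thermal stability -> PRO

Sentence: {sentence}
Entities:
"""

def extract_entities(sentence: str) -> str:
    """Tag entities in one sentence by completing the few-shot prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(sentence=sentence)}],
        temperature=0,  # deterministic output simplifies checking against human labels
    )
    return response.choices[0].message.content

print(extract_entities("The ionic conductivity of Li7La3Zr2O12 exceeds 1 mS/cm."))

The same pattern extends to the other tasks the abstract mentions: dropping the in-context example yields a zero-shot prompt, and comparing model output with existing human labels supports the annotation-verification use the authors describe.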