Accelerating materials language processing with large language models

Abstract Materials language processing (MLP) can facilitate materials science research by automating the extraction of structured data from research papers. Although deep learning models exist for MLP tasks, practical issues persist around complex model architectures, extensive fine-tuning, and the need for substantial human-labelled datasets. Here, we introduce the use of large language models, such as the generative pretrained transformer (GPT), to replace the complex architectures of prior MLP models with strategically designed prompt engineering. We find that in-context learning of GPT models with few- or zero-shot examples can provide high-performance text classification, named entity recognition, and extractive question answering with limited datasets, demonstrated for various classes of materials. These generative models can also help identify incorrectly annotated data. Our GPT-based approach can assist materials scientists in solving knowledge-intensive MLP tasks, even if they lack relevant expertise, by offering MLP guidelines applicable to any materials science domain. In addition, the outputs of GPT models are expected to reduce researchers' workload, such as manual labelling, by producing an initial labelling set and verifying human annotations.

Bibliographic Details
Main Authors: Jaewoong Choi, Byungju Lee (Computational Science Research Center, Korea Institute of Science and Technology)
Format: Article
Language: English
Published: Nature Portfolio, 2024-02-01
Series: Communications Materials
ISSN: 2662-4443
Online Access: https://doi.org/10.1038/s43246-024-00449-9
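
The abstract describes replacing task-specific MLP architectures with few- or zero-shot in-context learning via prompt engineering. As an illustration only, the Python sketch below shows what such a few-shot named entity recognition prompt for materials text might look like; the OpenAI client usage is standard, but the model name, the MAT/PRO entity schema, and the demonstration sentence are hypothetical assumptions, not the paper's actual prompts.

# Minimal sketch of few-shot in-context learning for materials NER.
# Assumptions (not from the paper): the gpt-4o-mini model, the MAT/PRO
# entity schema, and the demonstration example are all illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_PROMPT = """Label materials-science entities as MAT (material) or PRO (property).
Return one "entity -> label" pair per line.

Sentence: LiFePO4 cathodes show high thermal stability.
Entities:
LiFePO4 -> MAT
thermal stability -> PRO

Sentence: {sentence}
Entities:
"""

def extract_entities(sentence: str) -> str:
    """Tag entities in one sentence by completing the few-shot prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(sentence=sentence)}],
        temperature=0,  # deterministic output simplifies checking against human labels
    )
    return response.choices[0].message.content

print(extract_entities("The ionic conductivity of Li7La3Zr2O12 exceeds 1 mS/cm."))

The same pattern extends to the other tasks the abstract mentions: dropping the in-context example yields a zero-shot prompt, and comparing model output with existing human labels supports the annotation-verification use the authors describe.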