Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method

Bibliographic Details
Main Authors: Suehyun Chang, Geun-Jin Ahn, Sungbum Park
Format: Article
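Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Information retrieval; natural language processing; deep neural network; best match 25 (BM25); bidirectional encoder representations from transformers (BERT)
Online Access: https://ieeexplore.ieee.org/document/10480707/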
author Suehyun Chang
Geun-Jin Ahn
Sungbum Park
collection DOAJ
description Recently, the efficiency of neural information retrieval (IR) models has improved significantly. However, technical challenges remain, notably the data bottleneck problem: in real-world scenarios, only documents without related queries are available for training neural IR models. Existing studies propose synthetic queries derived from target passages using trained query generation models, which require query-document (q-d) pair data from other domains for their training. Our research introduces a keyword-extraction-driven data augmentation method integrated with weakly supervised learning. We derived keywords from passages in a corpus to generate pseudo-queries. Using established weakly supervised learning methods, we then estimated the relevance between these pseudo-queries and passages to produce pseudo-labels. Our approach demonstrates that keyword extraction techniques can efficiently formulate queries and train neural IR systems, outperforming the existing synthetic query generation method. Specifically, the performance of models trained with pseudo-labels closely approximates that of models trained with ground truth data, underscoring the potential of pseudo-labeling approaches as effective alternatives in the absence of extensive ground truth data. Code and related materials are available on GitHub at https://github.com/guenjinahn/hoseo-cedr.
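The pipeline summarized in the description (keywords extracted from each passage become a pseudo-query, and a weak-supervision scorer then assigns relevance between pseudo-queries and passages) can be illustrated with a minimal sketch. This record does not say which keyword extractor or scoring function the authors used, so everything here is an assumption made for illustration: a simple tf-idf term ranking stands in for the keyword extractor, a hand-written BM25 function stands in for the weak labeler (BM25 appears only in the record's subject list), and all function names and parameter values are hypothetical. It is not the authors' implementation, which is in the GitHub repository named above.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "with", "by"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def idf_table(tokenized_docs):
    """BM25-style inverse document frequency for every term in the corpus."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}


def extract_keywords(doc_tokens, idf, k=3):
    """Assumed extractor: the k terms with the highest tf * idf form the pseudo-query."""
    tf = Counter(doc_tokens)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf[t], t))
    return ranked[:k]


def bm25_score(query_tokens, doc_tokens, idf, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one passage for one query (assumed weak labeler)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query_tokens:
        if t not in tf:
            continue
        norm = tf[t] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf.get(t, 0.0) * tf[t] * (k1 + 1) / norm
    return score


def build_pseudo_labels(passages, k=3):
    """Create one pseudo-query per passage and score it against every passage."""
    docs = [tokenize(p) for p in passages]
    idf = idf_table(docs)
    avgdl = sum(len(d) for d in docs) / len(docs)
    labels = []
    for source_id, doc in enumerate(docs):
        query = extract_keywords(doc, idf, k)
        for passage_id, candidate in enumerate(docs):
            labels.append({
                "query": " ".join(query),
                "source": source_id,
                "passage": passage_id,
                "score": bm25_score(query, candidate, idf, avgdl),
            })
    return labels


if __name__ == "__main__":
    corpus = [
        "BM25 ranks passages by term frequency and inverse document frequency.",
        "Transformers encode queries and passages into dense vectors.",
        "Keyword extraction selects salient terms from a passage.",
    ]
    for row in build_pseudo_labels(corpus):
        print(row)
```

In a setup like this, the scored (pseudo-query, passage) pairs would serve as pseudo-labels for training a neural ranker, for example by treating the highest-scoring passages as positives and low-scoring ones as negatives. How scores are thresholded or sampled is a design choice this record does not describe.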
first_indexed 2024-04-24T13:14:29Z
format Article
id doaj.art-ebe67daf5f2f4d338a88580f2e8df504
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-24T13:14:29Z
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2024.3382190
volume 12
pages 46851-46863
affiliation Suehyun Chang: Duriam IP Law Firm, Seocho-gu, Seoul, Republic of Korea
Geun-Jin Ahn: Department of Management of IT, Hoseo University, Asan, Republic of Korea
Sungbum Park: Department of Management of IT, Hoseo University, Asan, Republic of Korea
orcid https://orcid.org/0000-0002-5176-3003
title Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method
topic Information retrieval
natural language processing
deep neural network
best match 25 (BM25)
bidirectional encoder representations from transformers (BERT)
url https://ieeexplore.ieee.org/document/10480707/