Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method
Recently, the efficiency of neural information retrieval (IR) models has been significantly improved. However, there are technical challenges such as the data bottleneck problem. In real-world scenarios, only documents without related queries are available for training neural IR models. Existing stud...
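Main Authors: | Suehyun Chang, Geun-Jin Ahn, Sungbum Park |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Access |
Subjects: | Information retrieval; natural language processing; deep neural network; best match 25 (BM25); bidirectional encoder representations from transformers (BERT) |
Online Access: | https://ieeexplore.ieee.org/document/10480707/ |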
_version_ | 1797222007229644800 |
---|---|
author | Suehyun Chang; Geun-Jin Ahn; Sungbum Park |
author_facet | Suehyun Chang; Geun-Jin Ahn; Sungbum Park |
author_sort | Suehyun Chang |
collection | DOAJ |
description | Recently, the efficiency of neural information retrieval (IR) models has been significantly improved. However, there are technical challenges such as the data bottleneck problem. In real-world scenarios, only documents without related queries are available for training neural IR models. Existing studies propose synthetic queries derived from targeted passages using trained query generation models, which require query-document (q-d) pair data from other domains for their training. Our research introduces an integrated keyword-extraction-driven data augmentation method with weakly supervised learning. We derived keywords from passages in a corpus to generate pseudo-queries. Using established weakly supervised learning methods, we then generated relevance estimates between these pseudo-queries and passages to produce pseudo-labels. Our approach demonstrates that keyword extraction techniques can efficiently formulate queries and train neural IR systems, outperforming the existing synthetic query generation method. Specifically, the performance of models utilizing pseudo-labels closely approximates that of models trained with ground truth data, underscoring the potential of pseudo-labeling approaches as effective alternatives in the absence of extensive ground truth data. Code and related materials are available on GitHub at https://github.com/guenjinahn/hoseo-cedr. |
first_indexed | 2024-04-24T13:14:29Z |
format | Article |
id | doaj.art-ebe67daf5f2f4d338a88580f2e8df504 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-24T13:14:29Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-ebe67daf5f2f4d338a88580f2e8df504; 2024-04-04T23:00:23Z; eng; IEEE; IEEE Access; 2169-3536; 2024-01-01; 12; 46851; 46863; 10.1109/ACCESS.2024.3382190; 10480707; Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method; Suehyun Chang (0); Geun-Jin Ahn (1); Sungbum Park (2) https://orcid.org/0000-0002-5176-3003; (0) Duriam IP Law Firm, Seocho-gu, Seoul, Republic of Korea; (1) Department of Management of IT, Hoseo University, Asan, Republic of Korea; (2) Department of Management of IT, Hoseo University, Asan, Republic of Korea; Recently, the efficiency of neural information retrieval (IR) models has been significantly improved. However, there are technical challenges such as the data bottleneck problem. In real-world scenarios, only documents without related queries are available for training neural IR models. Existing studies propose synthetic queries derived from targeted passages using trained query generation models, which require query-document (q-d) pair data from other domains for their training. Our research introduces an integrated keyword-extraction-driven data augmentation method with weakly supervised learning. We derived keywords from passages in a corpus to generate pseudo-queries. Using established weakly supervised learning methods, we then generated relevance estimates between these pseudo-queries and passages to produce pseudo-labels. Our approach demonstrates that keyword extraction techniques can efficiently formulate queries and train neural IR systems, outperforming the existing synthetic query generation method. Specifically, the performance of models utilizing pseudo-labels closely approximates that of models trained with ground truth data, underscoring the potential of pseudo-labeling approaches as effective alternatives in the absence of extensive ground truth data. Code and related materials are available on GitHub at https://github.com/guenjinahn/hoseo-cedr.; https://ieeexplore.ieee.org/document/10480707/; Information retrieval; natural language processing; deep neural network; best match 25 (BM25); bidirectional encoder representations from transformers (BERT) |
spellingShingle | Suehyun Chang; Geun-Jin Ahn; Sungbum Park; Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method; IEEE Access; Information retrieval; natural language processing; deep neural network; best match 25 (BM25); bidirectional encoder representations from transformers (BERT) |
title | Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method |
title_sort | improving performance of neural ir models by using a keyword extraction based weak supervision method |
topic | Information retrieval; natural language processing; deep neural network; best match 25 (BM25); bidirectional encoder representations from transformers (BERT) |
url | https://ieeexplore.ieee.org/document/10480707/ |
work_keys_str_mv | AT suehyunchang improvingperformanceofneuralirmodelsbyusingakeywordextractionbasedweaksupervisionmethod AT geunjinahn improvingperformanceofneuralirmodelsbyusingakeywordextractionbasedweaksupervisionmethod AT sungbumpark improvingperformanceofneuralirmodelsbyusingakeywordextractionbasedweaksupervisionmethod |
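
The description field outlines a two-step pipeline: derive keywords from each passage to form a pseudo-query, then use a weak-supervision signal to estimate relevance between pseudo-queries and passages. The following is a minimal sketch of that idea, not the authors' implementation. It assumes TF-IDF ranking as the keyword extractor and Okapi BM25 as the weak labeler (BM25 appears among the record's subject terms, but the record does not say how it is used). The toy corpus, the stopword list, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; the paper's actual data is not described in this record.
PASSAGES = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "BERT encodes queries and passages with bidirectional transformer layers.",
    "Keyword extraction selects salient terms that summarize a passage.",
]

# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"a", "and", "by", "that", "the", "with"}


def tokenize(text):
    """Lowercase, keep alphanumeric runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def idf_table(docs):
    """Smoothed BM25-style IDF for every term in the tokenized corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}


def extract_keywords(doc, idf, k=3):
    """Pick the k highest TF-IDF terms of a passage as its pseudo-query."""
    tf = Counter(doc)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf[t], t))
    return ranked[:k]


def bm25(query, doc, idf, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized passage for a tokenized query."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf[term] * tf[term] * (k1 + 1) / norm
    return score


def build_pseudo_labels(passages, k=3):
    """Return (pseudo_query, passage_index, score) triples, best first per query."""
    docs = [tokenize(p) for p in passages]
    idf = idf_table(docs)
    avgdl = sum(len(d) for d in docs) / len(docs)
    triples = []
    for doc in docs:
        query = extract_keywords(doc, idf, k)
        scored = [(bm25(query, d, idf, avgdl), i) for i, d in enumerate(docs)]
        for score, i in sorted(scored, key=lambda s: (-s[0], s[1])):
            triples.append((" ".join(query), i, score))
    return triples


if __name__ == "__main__":
    for query, idx, score in build_pseudo_labels(PASSAGES):
        print(f"{query!r} -> passage {idx}: {score:.3f}")
```

In a setup like this, the highest-scoring passages for each pseudo-query could serve as pseudo-positives and low-scoring ones as pseudo-negatives when training a neural ranker; how the paper actually thresholds or samples these labels is not stated in this record.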