Keyword Pool Generation for Web Text Collecting: A Framework Integrating Sample and Semantic Information

Keyword pools are used as search queries to collect web texts; they largely determine the size and coverage of the samples and provide a database for subsequent text mining. However, generating a refined keyword pool with high similarity and some expandability remains a challenge. Currently, keyword p...

Bibliographic Details
Main Authors: Xiaolong Wu, Chong Feng, Qiyuan Li, Jianping Zhu
Format: Article
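Language: English
Published: MDPI AG 2024-01-01
Series: Mathematics
Subjects: keyword pool generation; web text collecting; search query; word embedding; feature ranking; feature selection
Online Access: https://www.mdpi.com/2227-7390/12/3/405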
_version_ 1797318447258927104
author Xiaolong Wu
Chong Feng
Qiyuan Li
Jianping Zhu
collection DOAJ
description Keyword pools are used as search queries to collect web texts; they largely determine the size and coverage of the samples and provide a database for subsequent text mining. However, generating a refined keyword pool with high similarity and some expandability remains a challenge. Currently, keyword pools for search queries aimed at collecting web texts either lack an objective generation method and evaluation system, or make little use of the semantic information in the samples. Therefore, this paper proposes a keyword generation framework that integrates sample and semantic information to construct a complete and objective keyword pool generation and evaluation system. The framework includes a data phase and a modeling phase; its core is the modeling phase, where both feature ranking and model performance are considered. A regression model relating a topic vector and word vectors is constructed for the first time based on word embedding, and keyword pools are generated from the perspective of model performance. In addition, two keyword generation methods, Recursive Feature Introduction (RFI) and Recursive Feature Introduction and Elimination (RFIE), are proposed. Different feature ranking algorithms, keyword generation methods, and regression models are compared in the experiments. The results show that: (1) When RFI is used to generate keywords, the regression model using ranked features has better prediction performance than the baseline model and yields a more refined set of keywords, and the regression model using tree-based ranked features performs significantly better than the one using SHAP-based ranked features. (2) The prediction performance of the regression model using RFI with tree-based ranked features is significantly better than that of the model using Recursive Feature Elimination (RFE) with tree-based ranked features. (3) All four regression models using RFI/RFE with SHAP-based/tree-based ranked features have significantly higher average similarity scores and cumulative advantages than the baseline model (the model using RFI with unranked features). (4) Light Gradient Boosting Machine (LGBM) using RFI with SHAP-based ranked features has significantly better prediction performance, higher average similarity scores, and cumulative advantages. In conclusion, the framework can generate a keyword pool that is more similar to the topic, more refined, and more expandable, which offers research ideas for expanding the research sample size while ensuring topic coverage in web text collecting.
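The description names Recursive Feature Introduction (RFI) but this record does not give its steps. As a rough illustration only, the Python sketch below shows one plausible reading: features are tried in ranked order, and a feature is kept only when it improves a held-out regression score. The function name `forward_introduce`, the least-squares regressor, the R^2 criterion, the `min_gain` threshold, and the toy data are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


def fit_predict(X_train, y_train, X_val):
    """Ordinary least squares with an intercept (stand-in for any regressor)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_val @ coef


def forward_introduce(X_train, y_train, X_val, y_val, ranking, min_gain=1e-3):
    """Hypothetical forward-introduction loop: try features in ranked order and
    keep a feature only if it raises validation R^2 by at least min_gain."""
    selected, best = [], -np.inf
    for j in ranking:
        trial = selected + [j]
        pred = fit_predict(X_train[:, trial], y_train, X_val[:, trial])
        score = r2_score(y_val, pred)
        if score - best >= min_gain:
            selected, best = trial, score
    return selected, best


# Toy data (assumed): the target depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)
ranking = [0, 2, 1, 3, 4]  # assumed output of some feature-ranking step
selected, score = forward_introduce(X[:150], y[:150], X[150:], y[150:], ranking)
```

In this sketch the ranking is supplied by hand; in practice it would come from whichever feature-ranking step is used (the description mentions tree-based and SHAP-based rankings), and the least-squares fit could be swapped for any regressor.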
first_indexed 2024-03-08T03:52:32Z
format Article
id doaj.art-38d3b076408e4751892cd85c710ebfd1
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-03-08T03:52:32Z
publishDate 2024-01-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-38d3b076408e4751892cd85c710ebfd1
2024-02-09T15:18:14Z
eng
MDPI AG
Mathematics, ISSN 2227-7390
2024-01-01, volume 12, issue 3, article 405
DOI 10.3390/math12030405
Xiaolong Wu: School of Medicine, Xiamen University, Xiamen 361105, China
Chong Feng: Data Mining Research Center, Xiamen University, Xiamen 361005, China
Qiyuan Li: School of Medicine, Xiamen University, Xiamen 361105, China
Jianping Zhu: National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361105, China
title Keyword Pool Generation for Web Text Collecting: A Framework Integrating Sample and Semantic Information
topic keyword pool generation
web text collecting
search query
word embedding
feature ranking
feature selection
url https://www.mdpi.com/2227-7390/12/3/405