Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services


Bibliographic Details
Main Authors: Yongwoo Jeong, Jiseon Yang, In Ho Choi, Juyeon Lee (Rowan Inc., Seoul, Republic of Korea)
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Access
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3373470
Subjects: Data diversity; DistilBERT; embedded; GPT2; KoBERT; Korean
Online Access: https://ieeexplore.ieee.org/document/10459082/

Description: The fairness and bias that result from AI's narrow coverage have become another challenge for AI researchers. If a commercial AI system is trained on a biased dataset, severe gender or racial fairness issues can follow. Because researchers train AI primarily on datasets in major languages, a broad audience cannot be served when a novel LLM (Large Language Model) shows limited knowledge or creativity in their particular spoken language. Narrow coverage of LLMs can also lead audiences into misinterpretation and confusion when the service involves STT (Speech-To-Text). In this paper, to overcome this data diversity issue, we propose that the extracted embedding features already capture semantic proximity information that can be used to mitigate diversity problems. The project focuses on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline model, GPT2, on the 2022 Korean Wikipedia dataset, then employed DistilBERT and KoBERT for comparison. The hidden_state_output features extracted from each model were used to build feature-extraction-based text search engines. We followed the idea of Locality-Sensitive Hashing (LSH) but located similar hashes efficiently by applying transposed weights. We also present conventional classification benchmarks for performance comparison using top-k measurements, training time, and memory and disk consumption. In the discussion, we show that this approach can mitigate the diversity problem without re-training the model or tokenizer.
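
The description outlines a pipeline of hidden-state feature extraction from a pre-trained model followed by LSH-style hashing for top-k semantic search. Below is a minimal Python sketch of that general technique (requires torch, transformers, numpy), not the authors' implementation: the checkpoint name, mean pooling over the last hidden state, and plain sign-bit random projections (rather than the paper's transposed-weight variant) are all assumptions.

```python
# Sketch: feature-based text search with random-projection LSH.
# Assumptions (not from the paper): mean pooling, a multilingual
# DistilBERT checkpoint as a stand-in, sign-bit hyperplane hashing.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilbert-base-multilingual-cased"  # hypothetical stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool the last hidden state into one feature vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)               # zero out padding tokens
    return (summed / mask.sum(dim=1)).numpy()         # (B, H)

class LSHIndex:
    """Sign-bit random-projection LSH with cosine re-ranking inside a bucket."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(dim, n_bits))  # random hyperplanes
        self.buckets = {}                             # hash key -> [(vec, text)]

    def _hash(self, vecs):
        bits = (vecs @ self.planes) > 0               # (N, n_bits) sign bits
        return [tuple(row) for row in bits]

    def add(self, texts):
        vecs = embed(texts)
        for vec, text, key in zip(vecs, texts, self._hash(vecs)):
            self.buckets.setdefault(key, []).append((vec, text))

    def topk(self, query, k=3):
        qvec = embed([query])[0]
        key = self._hash(qvec[None, :])[0]
        candidates = self.buckets.get(key, [])
        if not candidates:  # fall back to a full scan if the bucket is empty
            candidates = [it for items in self.buckets.values() for it in items]
        scored = sorted(
            ((np.dot(qvec, v) / (np.linalg.norm(qvec) * np.linalg.norm(v)), t)
             for v, t in candidates),
            reverse=True,
        )
        return scored[:k]

index = LSHIndex(dim=model.config.dim)
index.add(["kimchi stew", "grilled pork belly", "cold buckwheat noodles"])
print(index.topk("spicy fermented cabbage soup", k=2))
```

Because nearby vectors tend to fall on the same side of the random hyperplanes, they share a hash key, so a query only scans one small bucket before exact cosine re-ranking; this is the speed-for-recall trade that makes such a search engine deployable without re-training the model or tokenizer.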