Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services


Bibliographic Details
Main Authors: Yongwoo Jeong, Jiseon Yang, In Ho Choi, Juyeon Lee (Rowan Inc., Seoul, Republic of Korea)
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Access
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3373470
Subjects: Data diversity; DistilBERT; embedded; GPT2; KoBERT; Korean
Online Access: https://ieeexplore.ieee.org/document/10459082/

Description: The fairness and bias that result from AI's narrow coverage have become another challenge for AI researchers. If a commercial AI system is trained on a biased dataset, severe gender or racial fairness issues can follow. Because researchers train AI primarily on datasets in major languages, a broad audience cannot be served when a novel LLM (Large Language Model) shows limited knowledge or creativity in their particular spoken language. Narrow coverage of LLMs can also lead audiences into misinterpretation and confusion when the service involves STT (Speech-To-Text). In this paper, to overcome this data diversity issue, we propose that the extracted embedding features already capture semantic proximity information that can be used to mitigate diversity problems. The project focuses on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline model, GPT2, on the 2022 Korean Wikipedia dataset, then employed DistilBERT and KoBERT for comparison. The hidden_state_output features extracted from each model were used to build feature-extraction-based text search engines. We followed the idea of Locality-Sensitive Hashing (LSH) but located similar hashes efficiently by applying transposed weights. We also present conventional classification benchmarks for performance comparison using top-k measurements, training time, and memory and disk consumption. In the discussion, we show that this approach can mitigate the diversity problem without re-training the model or tokenizer.
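
The description outlines a pipeline of hidden-state feature extraction from a pre-trained model followed by LSH-style hashing for top-k semantic search. Below is a minimal Python sketch of that general technique (requires torch, transformers, numpy), not the authors' implementation: the checkpoint name, mean pooling over the last hidden state, and plain sign-bit random projections (rather than the paper's transposed-weight variant) are all assumptions.

```python
# Sketch: feature-based text search with random-projection LSH.
# Assumptions (not from the paper): mean pooling, a multilingual
# DistilBERT checkpoint as a stand-in, sign-bit hyperplane hashing.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilbert-base-multilingual-cased"  # hypothetical stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool the last hidden state into one feature vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)               # zero out padding tokens
    return (summed / mask.sum(dim=1)).numpy()         # (B, H)

class LSHIndex:
    """Sign-bit random-projection LSH with cosine re-ranking inside a bucket."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(dim, n_bits))  # random hyperplanes
        self.buckets = {}                             # hash key -> [(vec, text)]

    def _hash(self, vecs):
        bits = (vecs @ self.planes) > 0               # (N, n_bits) sign bits
        return [tuple(row) for row in bits]

    def add(self, texts):
        vecs = embed(texts)
        for vec, text, key in zip(vecs, texts, self._hash(vecs)):
            self.buckets.setdefault(key, []).append((vec, text))

    def topk(self, query, k=3):
        qvec = embed([query])[0]
        key = self._hash(qvec[None, :])[0]
        candidates = self.buckets.get(key, [])
        if not candidates:  # fall back to a full scan if the bucket is empty
            candidates = [it for items in self.buckets.values() for it in items]
        scored = sorted(
            ((np.dot(qvec, v) / (np.linalg.norm(qvec) * np.linalg.norm(v)), t)
             for v, t in candidates),
            reverse=True,
        )
        return scored[:k]

index = LSHIndex(dim=model.config.dim)
index.add(["kimchi stew", "grilled pork belly", "cold buckwheat noodles"])
print(index.topk("spicy fermented cabbage soup", k=2))
```

Because nearby vectors tend to fall on the same side of the random hyperplanes, they share a hash key, so a query only scans one small bucket before exact cosine re-ranking; this is the speed-for-recall trade that makes such a search engine deployable without re-training the model or tokenizer.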