Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services
Fairness and bias arising from the narrow coverage of AI have become another challenge for AI researchers. If a commercial AI model is trained on a biased dataset, it can exhibit severe gender or racial bias. Because researchers typically train AI on datasets in a primary language, a novel LLM (Large Language Model) that shows limited knowledge or creativity in a user's specific spoken language cannot satisfy a broad audience. Narrow coverage of LLMs can also lead users to misinterpretation and confusion when the service involves STT (Speech-To-Text). In this paper, to overcome this data diversity issue, we propose that the extracted embedding features capture semantic proximity information that can be used to mitigate diversity problems. This project focused on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline GPT2 model on the 2022 Korean Wikipedia dataset, then employed DistilBERT and KoBERT for comparison. The extracted hidden_state_output features from each model were used to build feature-extraction-based text search engines. We applied the idea of Locality-Sensitive Hashing (LSH), but located similar hashes efficiently by applying transposed weights. We also present conventional classification benchmarks for performance comparison, using top-k measurements, training time, and memory and disk consumption. In the discussion, we show that our idea can mitigate the diversity problem without re-training the model or tokenizer.
Main Authors: | Yongwoo Jeong, Jiseon Yang, In Ho Choi, Juyeon Lee |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Access |
Subjects: | Data diversity; DistilBERT; embedded; GPT2; KoBERT; Korean |
Online Access: | https://ieeexplore.ieee.org/document/10459082/ |
_version_ | 1827290008312086528 |
---|---|
author | Yongwoo Jeong, Jiseon Yang, In Ho Choi, Juyeon Lee |
author_facet | Yongwoo Jeong, Jiseon Yang, In Ho Choi, Juyeon Lee |
author_sort | Yongwoo Jeong |
collection | DOAJ |
description | Fairness and bias arising from the narrow coverage of AI have become another challenge for AI researchers. If a commercial AI model is trained on a biased dataset, it can exhibit severe gender or racial bias. Because researchers typically train AI on datasets in a primary language, a novel LLM (Large Language Model) that shows limited knowledge or creativity in a user's specific spoken language cannot satisfy a broad audience. Narrow coverage of LLMs can also lead users to misinterpretation and confusion when the service involves STT (Speech-To-Text). In this paper, to overcome this data diversity issue, we propose that the extracted embedding features capture semantic proximity information that can be used to mitigate diversity problems. This project focused on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline GPT2 model on the 2022 Korean Wikipedia dataset, then employed DistilBERT and KoBERT for comparison. The extracted hidden_state_output features from each model were used to build feature-extraction-based text search engines. We applied the idea of Locality-Sensitive Hashing (LSH), but located similar hashes efficiently by applying transposed weights. We also present conventional classification benchmarks for performance comparison, using top-k measurements, training time, and memory and disk consumption. In the discussion, we show that our idea can mitigate the diversity problem without re-training the model or tokenizer. |
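The pipeline the abstract describes — extract hidden-state features from a pre-trained model, then bucket them with a locality-sensitive hash for fast similar-text lookup — can be sketched as follows. This is a minimal illustration using classic random-hyperplane LSH over toy vectors standing in for LLM hidden states; the dimensions, synthetic corpus, and hashing scheme are assumptions for demonstration, not the authors' exact transposed-weight construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pooled hidden-state features extracted from a
# pre-trained encoder (e.g. DistilBERT/KoBERT); sizes are illustrative.
dim, n_docs = 64, 1000
docs = rng.normal(size=(n_docs, dim))

# Random hyperplanes: the sign pattern of vecs @ planes.T is the hash.
n_bits = 6
planes = rng.normal(size=(n_bits, dim))

def lsh_hash(vecs):
    """Pack random-hyperplane sign bits into integer bucket ids."""
    bits = (vecs @ planes.T) > 0                        # (n, n_bits)
    return bits.astype(int) @ (1 << np.arange(n_bits))  # (n,)

# Index: group document ids by bucket.
buckets = {}
for i, h in enumerate(lsh_hash(docs)):
    buckets.setdefault(int(h), []).append(i)

def search(query, k=5):
    """Return up to k doc ids from the query's bucket, best-first."""
    cand = np.fromiter(
        buckets.get(int(lsh_hash(query[None])[0]), range(n_docs)),
        dtype=int)
    # Rank the (small) candidate set by cosine similarity.
    sims = (docs[cand] @ query) / (
        np.linalg.norm(docs[cand], axis=1) * np.linalg.norm(query))
    return cand[np.argsort(-sims)[:k]]

top = search(docs[42])
assert top[0] == 42  # an indexed vector retrieves itself first
```

With real text, `docs` would instead hold pooled hidden_state_output vectors from the chosen encoder, and the random projection could be replaced by the transposed weight matrix the paper proposes for locating similar hashes.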
first_indexed | 2024-04-24T12:00:48Z |
format | Article |
id | doaj.art-5d4fd406345c44cbb5dda891c9f791b5 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-24T12:00:48Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-5d4fd406345c44cbb5dda891c9f791b52024-04-08T23:00:46ZengIEEEIEEE Access2169-35362024-01-0112481454815710.1109/ACCESS.2024.337347010459082Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment ServicesYongwoo Jeong0https://orcid.org/0009-0006-5746-6380Jiseon Yang1https://orcid.org/0009-0001-0694-4833In Ho Choi2Juyeon Lee3Rowan Inc., Seoul, Republic of KoreaRowan Inc., Seoul, Republic of KoreaRowan Inc., Seoul, Republic of KoreaRowan Inc., Seoul, Republic of KoreaFairness and bias arising from the narrow coverage of AI have become another challenge for AI researchers. If a commercial AI model is trained on a biased dataset, it can exhibit severe gender or racial bias. Because researchers typically train AI on datasets in a primary language, a novel LLM (Large Language Model) that shows limited knowledge or creativity in a user's specific spoken language cannot satisfy a broad audience. Narrow coverage of LLMs can also lead users to misinterpretation and confusion when the service involves STT (Speech-To-Text). In this paper, to overcome this data diversity issue, we propose that the extracted embedding features capture semantic proximity information that can be used to mitigate diversity problems. This project focused on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline GPT2 model on the 2022 Korean Wikipedia dataset, then employed DistilBERT and KoBERT for comparison. The extracted hidden_state_output features from each model were used to build feature-extraction-based text search engines. We applied the idea of Locality-Sensitive Hashing (LSH), but located similar hashes efficiently by applying transposed weights.
We also present conventional classification benchmarks for performance comparison, using top-k measurements, training time, and memory and disk consumption. In the discussion, we show that our idea can mitigate the diversity problem without re-training the model or tokenizer.https://ieeexplore.ieee.org/document/10459082/Data diversityDistilBERTembeddedGPT2KoBERTKorean |
spellingShingle | Yongwoo Jeong Jiseon Yang In Ho Choi Juyeon Lee Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services IEEE Access Data diversity DistilBERT embedded GPT2 KoBERT Korean |
title | Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services |
title_full | Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services |
title_fullStr | Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services |
title_full_unstemmed | Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services |
title_short | Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services |
title_sort | feature based text search engine mitigating data diversity problem using pre trained large language model for fast deployment services |
topic | Data diversity DistilBERT embedded GPT2 KoBERT Korean |
url | https://ieeexplore.ieee.org/document/10459082/ |
work_keys_str_mv | AT yongwoojeong featurebasedtextsearchenginemitigatingdatadiversityproblemusingpretrainedlargelanguagemodelforfastdeploymentservices AT jiseonyang featurebasedtextsearchenginemitigatingdatadiversityproblemusingpretrainedlargelanguagemodelforfastdeploymentservices AT inhochoi featurebasedtextsearchenginemitigatingdatadiversityproblemusingpretrainedlargelanguagemodelforfastdeploymentservices AT juyeonlee featurebasedtextsearchenginemitigatingdatadiversityproblemusingpretrainedlargelanguagemodelforfastdeploymentservices |