Solving the Imbalanced and Limited Data Labeled for Automated Essay Scoring using Cost Sensitive XGBoost and Pseudo-Labeling

Building an automated essay scoring model faces two main problems: datasets with an imbalanced number of right and wrong answers, and the limited amount of labeled data available for model training. The model construction based on these problems is divided into three main parts, namely...

Bibliographic Details
Main Authors:	Pramularsih, Marvina; Riasetiawan, Mardhani
Format:	Other
Language:	English
Published:	International Journal of Advanced Computer Science and Applications 2022
Subjects:	Distributed Computing
Online Access:	https://repository.ugm.ac.id/284295/1/Solving-the-Imbalanced-and-Limited-Data-Labeled-for-Automated-Essay-Scoring-using-Cost-Sensitive-XGBoost-and-PseudoLabelingInternational-Journal-of-Advanced-Computer-Science-and-Applications.pdf

_version_	1826050778072088576
author	Pramularsih, Marvina; Riasetiawan, Mardhani
author_sort	Pramularsih, Marvina
collection	UGM
description	Building an automated essay scoring model faces two main problems: datasets with an imbalanced number of right and wrong answers, and the limited amount of labeled data available for model training. The proposed model construction has three main parts: word representation, Cost-Sensitive XGBoost classification, and the addition of unlabeled data through the Pseudo-Labeling technique. The essay answers are converted into vectors using pre-trained fastText word vectors. The unlabeled data is then classified using the Cost-Sensitive XGBoost method, and the data labeled by the classification model is added to the training data for a new classification model. This process is carried out iteratively. The research combines Cost-Sensitive XGBoost classification with Pseudo-Labeling, which is expected to solve both problems. At the 0th iteration, a dataset in which the ratio of "right"-labeled data to "wrong"-labeled data is close to 1 (a balanced dataset) or greater than 1 produces a better-performing model, so the selection of training data at the early stage must take this ratio into account. In addition, using the Hybrid Method on these datasets can reduce the amount of labeled data needed by a factor of 56 compared to the AdaBoost Method. The hybrid model achieves an F1-Measure of more than 95.6%, so it can be concluded that the Hybrid Method, which combines Cost-Sensitive XGBoost classification and Pseudo-Labeling with self-training, is able to overcome the problems of imbalanced datasets and limited labeled data.
first_indexed	2024-03-14T00:09:56Z
format	Other
id	oai:generic.eprints.org:284295
institution	Universitas Gadjah Mada
language	English
last_indexed	2024-03-14T00:09:56Z
publishDate	2022
publisher	International Journal of Advanced Computer Science and Applications
record_format	dspace
spelling	oai:generic.eprints.org:284295 2023-12-08T06:06:55Z https://repository.ugm.ac.id/284295/ NonPeerReviewed application/pdf en Pramularsih, Marvina and Riasetiawan, Mardhani (2022) Solving the Imbalanced and Limited Data Labeled for Automated Essay Scoring using Cost Sensitive XGBoost and Pseudo-Labeling. International Journal of Advanced Computer Science and Applications. https://thesai.org/Publications/ViewPaper?Volume=13&Issue=7&Code=IJACSA&SerialNo=10 10.14569/IJACSA.2022.0130710
title	Solving the Imbalanced and Limited Data Labeled for Automated Essay Scoring using Cost Sensitive XGBoost and Pseudo-Labeling
topic	Distributed Computing
url	https://repository.ugm.ac.id/284295/1/Solving-the-Imbalanced-and-Limited-Data-Labeled-for-Automated-Essay-Scoring-using-Cost-Sensitive-XGBoost-and-PseudoLabelingInternational-Journal-of-Advanced-Computer-Science-and-Applications.pdf
work_keys_str_mv	AT pramularsihmarvina solvingtheimbalancedandlimiteddatalabeledforautomatedessayscoringusingcostsensitivexgboostandpseudolabeling AT riasetiawanmardhani solvingtheimbalancedandlimiteddatalabeledforautomatedessayscoringusingcostsensitivexgboostandpseudolabeling
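
The iterative procedure summarized in the description field (classify unlabeled answers, add the confidently labeled ones to the training data, retrain) can be illustrated with a minimal self-training sketch. Everything in it is an assumption made for illustration and is not taken from the paper: toy 2-D vectors stand in for fastText essay vectors, a class-weighted logistic regression stands in for Cost-Sensitive XGBoost, the class cost is set with the common "wrong count / right count" heuristic, and the 0.9 confidence threshold and the helper names (fit, prob_right, self_train) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, pos_weight, epochs=300, lr=0.1):
    """Class-weighted logistic regression trained by gradient descent.
    Stand-in (assumption) for a cost-sensitive boosted classifier."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Errors on "right" (1) answers cost pos_weight times more.
            err = (pos_weight if yi == 1 else 1.0) * (p - yi)
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def prob_right(model, x):
    """Predicted probability that answer vector x is 'right'."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def class_cost(y):
    """Heuristic cost for the 'right' class: n_wrong / n_right."""
    n_right = sum(y)
    n_wrong = len(y) - n_right
    if n_right == 0 or n_wrong == 0:
        return 1.0
    return n_wrong / n_right

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=5):
    """Pseudo-labeling loop: label confident pool items, retrain, repeat."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = fit(X_lab, y_lab, class_cost(y_lab))  # 0th iteration
    for _ in range(max_iter):
        undecided, added = [], 0
        for x in pool:
            p = prob_right(model, x)
            if p >= threshold or p <= 1.0 - threshold:
                X_lab.append(x)
                y_lab.append(1 if p >= threshold else 0)
                added += 1
            else:
                undecided.append(x)  # stays unlabeled
        pool = undecided
        if added == 0:
            break
        model = fit(X_lab, y_lab, class_cost(y_lab))
    return model, X_lab, y_lab, pool

# Toy data (hypothetical): 2 "right" and 4 "wrong" labeled answers.
X_labeled = [[2.0, 2.0], [2.5, 1.5],
             [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5], [-2.0, -1.0]]
y_labeled = [1, 1, 0, 0, 0, 0]
X_unlabeled = [[3.0, 3.0], [-3.0, -3.0], [2.0, 3.0], [-3.0, -2.0], [0.0, 0.2]]

model, X_all, y_all, remaining = self_train(X_labeled, y_labeled, X_unlabeled)
print(len(X_all), "training items,", len(remaining), "left unlabeled")
```

In this toy run the four clearly separated pool items receive pseudo-labels, while the item near the decision boundary stays unlabeled because the model never becomes confident about it. With a real cost-sensitive booster the same loop applies, with only fit and prob_right replaced.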

Similar Items