Summary: | There are two main problems in building an Automatic Essay Scoring model: datasets with an imbalanced number of right and wrong answers, and the limited amount of labeled data available for model training. The model built to address these problems consists of three main components, namely word representation, Cost-Sensitive XGBoost Classification, and the addition of unlabeled data with the Pseudo-Labeling Technique. Essay answer data is converted into vectors using pretrained fastText word vectors. The unlabeled data is then classified using the Cost-Sensitive XGBoost Method, and the data labeled by this classification model is added as training data for building a new classification model. This process is carried out iteratively. This research uses the combination of Cost-Sensitive XGBoost Classification and Pseudo-Labeling, which is expected to solve both problems.
For the 0th iteration, a dataset whose ratio of the amount of "right" labeled data to the amount of "wrong" labeled data is close to 1 (in other words, a balanced dataset), or greater than 1, produces a model with better performance. Thus, the selection of training data at an early
stage must pay attention to this ratio. In addition, the use of the
Hybrid Method on these datasets can save labeled data by a factor of 56 compared to the AdaBoost Method. The Hybrid model is able to produce an F1-Measure above 95.6%, so it can be concluded that the Hybrid Method, which combines Cost-Sensitive XGBoost Classification and Pseudo-Labeling with Self-Training, is able to overcome the problems of imbalanced datasets and limited labeled data.
|
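
As an illustration of the pipeline described in the summary, the following is a minimal sketch in Python, assuming the fasttext and xgboost packages and a locally available pretrained fastText model. The model file name, confidence threshold, and iteration count are illustrative assumptions rather than the settings used in this research, and cost sensitivity is expressed here through XGBoost's scale_pos_weight parameter.

```python
import numpy as np
import fasttext                     # pip install fasttext
from xgboost import XGBClassifier   # pip install xgboost

# Assumption: a pretrained fastText binary model is available locally;
# the file name below is illustrative, not taken from the paper.
ft_model = fasttext.load_model("cc.id.300.bin")

def embed(answers):
    """Convert each essay answer (a string) into a fastText sentence vector."""
    return np.vstack([ft_model.get_sentence_vector(a) for a in answers])

def train_cost_sensitive_xgb(X, y):
    """Fit XGBoost, weighting the minority class via scale_pos_weight."""
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    clf = XGBClassifier(scale_pos_weight=n_neg / max(n_pos, 1))
    clf.fit(X, y)
    return clf

def pseudo_label_self_training(X_lab, y_lab, X_unlab, n_iters=5, threshold=0.9):
    """Iteratively pseudo-label confident unlabeled answers and retrain."""
    X_train, y_train = X_lab, y_lab
    clf = train_cost_sensitive_xgb(X_train, y_train)   # 0th-iteration model
    for _ in range(n_iters):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Add confidently pseudo-labeled answers to the training pool.
        X_train = np.vstack([X_train, X_unlab[confident]])
        y_train = np.concatenate([y_train, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]
        clf = train_cost_sensitive_xgb(X_train, y_train)  # retrain on enlarged set
    return clf
```

In this sketch, each iteration mirrors the self-training loop described above: the current model classifies the unlabeled answers, only the predictions above the confidence threshold are kept as pseudo-labels, and the classifier is retrained on the enlarged training set.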