Timely dropout prediction with learner behavior data

Bibliographic Details
Main Author:	Liu, Kai
Other Authors:	Andy Khong W H
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2022
Subjects:	Engineering::Electrical and electronic engineering::Computer hardware, software and systems; Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Online Access:	https://hdl.handle.net/10356/155739

Description
Summary:	This thesis addresses the accuracy and timeliness of early dropout prediction from learner behavioral data. Analyzing such data poses two challenges: (1) variations in learner behavior that can degrade prediction performance, and (2) complex data manipulations that add latency to on-the-fly computation. On the prediction side, the thesis targets the inter- and intra-learner variations in behavior that affect the performance of dropout prediction algorithms. It proposes a feature generation approach that allows dropout prediction models to exploit current learner behavioral data. The proposed algorithm analyzes each learner's behavior over time and determines weightings for the behavioral features in each time slice based on both recency and correlation. This, in turn, allows existing machine learning models in the second stage to extract patterns from each learner for dropout prediction. The feature generation approach was evaluated with various machine learning algorithms that use the generated features. The experimental results show that it outperformed the baseline approaches in terms of accuracy (F1 score) and false-positive rate, especially in the early weeks. For example, in Week 1 the proposed approach achieved an average F1 score of 0.854, which is 2.2% to 4.4% higher than the baseline approaches, with a 14.2% lower false-positive rate. In addition, the area under the ROC curve (AUC) improved by an average of 4.8% to 15.6% across different machine learning algorithms.
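The recency-and-correlation weighting described above can be illustrated with a minimal sketch. This is not the thesis's algorithm: the exponential decay, the use of Pearson correlation with the dropout label, and the names `slice_weights`, `weighted_feature`, and `decay` are all assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def slice_weights(history, labels, decay=0.5):
    """Weight each time slice by recency times |correlation with dropout|.

    history[i][t] is one behavioral feature (e.g. activity count) of
    learner i in time slice t; labels[i] is 1 if learner i dropped out.
    Both the decay form and the correlation target are assumptions.
    """
    n_slices = len(history[0])
    raw = []
    for t in range(n_slices):
        column = [row[t] for row in history]
        recency = decay ** (n_slices - 1 - t)   # newest slice gets 1.0
        relevance = abs(pearson(column, labels))
        raw.append(recency * relevance)
    total = sum(raw)
    if total == 0:
        # No slice is informative: fall back to uniform weights.
        return [1.0 / n_slices] * n_slices
    return [w / total for w in raw]

def weighted_feature(learner_history, weights):
    """Collapse one learner's per-slice values into a single feature."""
    return sum(w * x for w, x in zip(weights, learner_history))

# Hypothetical data: four learners, three weekly activity counts each.
history = [[5, 4, 0], [6, 5, 1], [5, 5, 6], [6, 4, 7]]
labels = [1, 1, 0, 0]
weights = slice_weights(history, labels)
features = [weighted_feature(row, weights) for row in history]
```

The resulting per-learner features could then be passed to any second-stage classifier, which is the two-stage arrangement the summary describes.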
This thesis also presents a second feature generation algorithm that tracks the feature representation of learning behaviors by placing an adaptive filter ahead of the machine learning models. Unlike existing approaches, the adaptive filter in the proposed framework learns variations in learning behavior as a course progresses and dynamically allocates weights to past learner behaviors to generate an indicative feature representation. The experimental results show that the proposed approach achieved the highest AUC of 0.8485, which is 2.9% to 7.3% higher than the baseline approaches. Performance of the proposed framework is validated on a corporate training dataset.
On the timeliness of presenting learning analytics outcomes, this thesis focuses on the learning analytics cycle of learners, data, metrics, and intervention. Within a given cycle, the heterogeneity and multi-granularity of the data require complex on-the-fly computations when responding to queries at a specific granularity of interest. In addition, existing data models do not optimize storage of the computed results, which are used for follow-up analysis. To address these challenges, a context-based data model that standardizes multi-granular data for swift data retrieval is proposed. The proposed context-based data model for learning analytics (cDMLA) defines an adaptive ontology by incorporating a multi-level hierarchical temporal model (HTM) that organizes data into a learning-activity tree. To further reduce the computational complexity of summative analysis, cDMLA incorporates a contextual-summary model (CSM) that updates auxiliary information about learner(s) via a learning-context table as the learning journey unfolds. Advantages of the proposed model are demonstrated by analysis and visualization on the existing XuetangX and ASSISTments datasets.
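As a rough illustration of why organizing multi-granular data into a tree with incrementally updated summaries avoids on-the-fly scans, consider the sketch below. It is not the cDMLA, HTM, or CSM design from the thesis: the class names `ActivityNode` and `ActivityTree`, the single `count` summary, and the week/day path keys are hypothetical.

```python
from collections import defaultdict

class ActivityNode:
    """One node of a hypothetical learning-activity tree."""
    def __init__(self):
        self.count = 0                              # running summary for this subtree
        self.children = defaultdict(ActivityNode)   # finer-grained time slices

class ActivityTree:
    """Keeps per-granularity totals up to date as events arrive."""
    def __init__(self):
        self.root = ActivityNode()

    def record(self, path, n=1):
        """Add n events at the given path, e.g. ("week1", "mon")."""
        node = self.root
        node.count += n
        for key in path:
            node = node.children[key]
            node.count += n          # update every ancestor's summary once

    def total(self, path=()):
        """Return the stored total at any granularity without rescanning events."""
        node = self.root
        for key in path:
            if key not in node.children:
                return 0
            node = node.children[key]
        return node.count

# Hypothetical usage: events logged at day level, queried at any level.
tree = ActivityTree()
tree.record(("week1", "mon"))
tree.record(("week1", "tue"), 2)
tree.record(("week2", "mon"))
week1_total = tree.total(("week1",))
course_total = tree.total()
```

The trade-off in this sketch is a small amount of extra work per recorded event in exchange for queries whose cost depends on the depth of the tree rather than on the number of stored events.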
In the ideal scenario, the proposed data model achieves a performance improvement of 76.5% to 99.5%, depending on the complexity of the task; in the practical scenario, the improvement ranges from 76.6% to 98.8%.
Lastly, while cDMLA constructs the HTM and CSM for analysis and visualization, it relies on a series of transformations and algorithms to compute the attributes associated with each data point (a node in the HTM or a cell in the CSM). A data-driven, dependency-based aggregation algorithm is therefore proposed to facilitate the construction of learning behavior features from cDMLA. It analyzes the dependencies among the attributes and optimizes the sequence of the computation process. Advantages of the proposed algorithm are demonstrated via a dropout prediction system. Together, cDMLA and the data-driven, dependency-based aggregation algorithm form a closed-loop system for real-time feedback.
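One common way to sequence computations according to attribute dependencies is a topological sort, sketched here with Python's standard `graphlib` module. The thesis does not state that it uses this technique; the function `compute_attributes` and the attribute names below are hypothetical.

```python
from graphlib import TopologicalSorter

def compute_attributes(base, derived):
    """Compute each derived attribute once, after everything it depends on.

    base:    dict of raw attribute values.
    derived: dict mapping name -> (dependency names, function of those values).
    """
    # TopologicalSorter expects node -> set of predecessors.
    graph = {name: set(deps) for name, (deps, _) in derived.items()}
    values = dict(base)
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue                      # raw attribute, already available
        deps, fn = derived[name]
        values[name] = fn(*(values[d] for d in deps))
    return values

# Hypothetical attributes for one learner.
base = {"clicks": 40, "sessions": 8, "videos_watched": 6}
derived = {
    # Listed out of order on purpose; the sort fixes the sequence.
    "engagement": (("clicks_per_session", "videos_watched"), lambda r, v: r + v),
    "clicks_per_session": (("clicks", "sessions"), lambda c, s: c / s),
}
result = compute_attributes(base, derived)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which surfaces a malformed attribute definition before any value is computed.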

Similar Items