Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling l...

Full description

Bibliographic Details
Main Authors:	Samin Poudel, Marwan Bikdash
Format:	Article
Language:	English
Published:	Tsinghua University Press 2022-09-01
Series:	Big Data Mining and Analytics
Subjects:	collaborative filtering (cf) subsampling training time improvement (tti) performance loss recommendation system (rs) collaborative filtering optimal solutions rating matrix
Online Access:	https://www.sciopen.com/article/10.26599/BDMA.2021.9020032

_version_	1798029385011298304
author	Samin Poudel Marwan Bikdash
author_facet	Samin Poudel Marwan Bikdash
author_sort	Samin Poudel
collection	DOAJ
description	Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density during subsampling is maintained. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms as applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bi-linear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably with FUD and FID only, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the levels of subsampling addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, like optimal FID/(1-FID), are proportional to the square root of TTRF.
first_indexed	2024-04-11T19:23:31Z
format	Article
id	doaj.art-bdb6b5fe54ab481eabd106dec2fc3105
institution	Directory Open Access Journal
issn	2096-0654
language	English
last_indexed	2024-04-11T19:23:31Z
publishDate	2022-09-01
publisher	Tsinghua University Press
record_format	Article
series	Big Data Mining and Analytics
spelling	doaj.art-bdb6b5fe54ab481eabd106dec2fc31052022-12-22T04:07:14ZengTsinghua University PressBig Data Mining and Analytics2096-06542022-09-015319220510.26599/BDMA.2021.9020032Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified SubsamplingSamin Poudel0Marwan Bikdash1Department of Computational Data Science and Engineering, North Carolina A&T State University, Greensboro, NC 27401, USADepartment of Computational Data Science and Engineering, North Carolina A&T State University, Greensboro, NC 27401, USADropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density during subsampling is maintained. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms as applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bi-linear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably with FUD and FID only, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the levels of subsampling addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, like optimal FID/(1-FID), are proportional to the square root of TTRF.https://www.sciopen.com/article/10.26599/BDMA.2021.9020032collaborative filtering (cf)subsamplingtraining time improvement (tti)performance lossrecommendation system (rs)collaborative filtering optimal solutionsrating matrix
spellingShingle	Samin Poudel Marwan Bikdash Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling Big Data Mining and Analytics collaborative filtering (cf) subsampling training time improvement (tti) performance loss recommendation system (rs) collaborative filtering optimal solutions rating matrix
title	Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling
title_full	Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling
title_fullStr	Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling
title_full_unstemmed	Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling
title_short	Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling
title_sort	optimal dependence of performance and efficiency of collaborative filtering on random stratified subsampling
topic	collaborative filtering (cf) subsampling training time improvement (tti) performance loss recommendation system (rs) collaborative filtering optimal solutions rating matrix
url	https://www.sciopen.com/article/10.26599/BDMA.2021.9020032
work_keys_str_mv	AT saminpoudel optimaldependenceofperformanceandefficiencyofcollaborativefilteringonrandomstratifiedsubsampling AT marwanbikdash optimaldependenceofperformanceandefficiencyofcollaborativefilteringonrandomstratifiedsubsampling

Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

Similar Items