Distributed ItemCF Recommendation Algorithm Based on the Combination of MapReduce and Hive

The ItemCF algorithm is currently the most widely used recommendation algorithm in commercial applications. In the early days of recommender systems, most recommendation algorithms were run on a single machine rather than in parallel. This approach, coupled with the rapid growth of massive user beha...

Bibliographic Details
Main Authors:	Yijia Feng, Lei Wang
Format:	Article
Language:	English
Published:	MDPI AG 2023-08-01
Series:	Electronics
Subjects:	ItemCF algorithm; speedup; MapReduce framework; Hive data warehouse; big data
Online Access:	https://www.mdpi.com/2079-9292/12/16/3398

author	Yijia Feng; Lei Wang
collection	DOAJ
description	The ItemCF algorithm is currently the most widely used recommendation algorithm in commercial applications. In the early days of recommender systems, most recommendation algorithms ran on a single machine rather than in parallel. Combined with the rapid growth of user behavior data in the big data era, this has made execution efficiency a bottleneck for recommender systems. As distributed technology has matured, distributed ItemCF algorithms have become a research hotspot. Hadoop is a very popular distributed system infrastructure. MapReduce, which provides massive data computing, and Hive, a data warehousing tool, are the two core components of Hadoop, each with its own advantages and applicable scenarios. Researchers have already used MapReduce and Hive to parallelize the ItemCF algorithm. However, these studies use either MapReduce or Hive alone, without fully leveraging the strengths of both. As a result, parallel ItemCF recommendation algorithms have struggled to combine a simple implementation with high running efficiency. To address this issue, we propose a distributed ItemCF recommendation algorithm based on the combination of MapReduce and Hive, named HiMRItemCF. The algorithm divides ItemCF into six steps: (1) deduplication; (2) obtaining the preference matrices of all users; (3) obtaining the co-occurrence matrices of all items; (4) multiplying the two matrices to generate a three-dimensional matrix; (5) aggregating the data of the three-dimensional matrix to obtain the recommendation scores of all users for all items; and (6) sorting the scores in descending order. Hive carries out steps 1 and 6, and MapReduce carries out the other four steps, which involve more complex calculations and operations. The Hive jobs and MapReduce jobs are linked through Hive’s external tables.
After implementing the proposed algorithm in Java and running the program on three publicly available user shopping behavior datasets, we found that, compared with algorithms that use only MapReduce jobs, the program implementing the proposed algorithm has fewer lines of source code and lower cyclomatic and Halstead complexity, and achieves a higher speedup ratio and parallel computing efficiency on all datasets. These experimental results indicate that the parallel and distributed ItemCF algorithm proposed in this paper, which combines MapReduce and Hive, offers both concise, easy-to-understand code and high time efficiency.
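The six steps named in the description can be illustrated with a small single-machine sketch. This is not the authors' HiMRItemCF implementation (which the description says is written in Java and split across Hive and MapReduce jobs); it only mirrors the order of the steps. The function name `item_cf_scores`, the `(user, item)` input shape, the implicit preference value of 1, and the choice to count an item as co-occurring with itself are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def item_cf_scores(records):
    """Single-machine sketch of the six ItemCF steps (hypothetical helper).

    `records` is an iterable of (user, item) pairs; each distinct pair is
    treated as an implicit preference of 1 (an assumption, not from the paper).
    """
    # Step 1: deduplication of raw behavior records.
    pairs = set(records)

    # Step 2: per-user preference vectors (user -> {item: preference}).
    prefs = defaultdict(dict)
    for user, item in pairs:
        prefs[user][item] = 1

    # Step 3: item co-occurrence counts over each user's item set.
    # Assumption: the diagonal (an item with itself) is counted too.
    cooc = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in product(items, repeat=2):
            cooc[a][b] += 1

    # Step 4: multiply preferences by co-occurrence counts, keeping every
    # partial product as a (user, target item, source item, value) cell,
    # i.e. the "three-dimensional matrix" in sparse form.
    partial = []
    for user, items in prefs.items():
        for src, pref in items.items():
            for target, count in cooc[src].items():
                partial.append((user, target, src, pref * count))

    # Step 5: aggregate over the source-item axis to get one score
    # per (user, target item).
    scores = defaultdict(lambda: defaultdict(int))
    for user, target, _src, value in partial:
        scores[user][target] += value

    # Step 6: sort each user's scores in descending order
    # (ties broken by item id so the output is deterministic).
    return {
        user: sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))
        for user, s in scores.items()
    }


if __name__ == "__main__":
    demo = [("u1", "A"), ("u1", "B"), ("u1", "A"), ("u2", "B"), ("u2", "C")]
    for user, ranked in sorted(item_cf_scores(demo).items()):
        print(user, ranked)
```

In this sketch, steps 1 and 6 are the set-style operations the description assigns to Hive, while steps 2 to 5 are the join-and-aggregate work it assigns to MapReduce. Items a user has already interacted with are not filtered out here, since the description does not say whether the algorithm does so.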
first_indexed	2024-03-10T23:59:47Z
format	Article
id	doaj.art-f7cc0a0ce2414d77843852cf2f4583f2
institution	Directory of Open Access Journals
issn	2079-9292
language	English
last_indexed	2024-03-10T23:59:47Z
publishDate	2023-08-01
publisher	MDPI AG
record_format	Article
series	Electronics
spelling	doaj.art-f7cc0a0ce2414d77843852cf2f4583f2; 2023-11-19T00:53:01Z; eng; MDPI AG; Electronics; ISSN 2079-9292; published 2023-08-01; vol. 12, no. 16, art. 3398; DOI 10.3390/electronics12163398; authors: Yijia Feng, Lei Wang (both College of Mathematics and Computer Science, Yan’an University, Yan’an 716000, China); https://www.mdpi.com/2079-9292/12/16/3398
title	Distributed ItemCF Recommendation Algorithm Based on the Combination of MapReduce and Hive
topic	ItemCF algorithm; speedup; MapReduce framework; Hive data warehouse; big data
url	https://www.mdpi.com/2079-9292/12/16/3398

Similar Items