Distributed ItemCF Recommendation Algorithm Based on the Combination of MapReduce and Hive


Bibliographic Details
Main Authors: Yijia Feng, Lei Wang
Format: Article
Language: English
Published: MDPI AG 2023-08-01
Series: Electronics
Subjects: ItemCF algorithm; speedup; MapReduce framework; Hive data warehouse; big data
Online Access: https://www.mdpi.com/2079-9292/12/16/3398
collection DOAJ
description The ItemCF algorithm is currently the most widely used recommendation algorithm in commercial applications. Early recommender systems mostly ran their algorithms on a single machine rather than in parallel. Combined with the rapid growth of user behavior data in the big data era, this has made execution efficiency a bottleneck for recommender systems. With the development of distributed technology, distributed ItemCF algorithms have become a research hotspot. Hadoop is a very popular distributed system infrastructure. MapReduce, which provides large-scale data computing, and Hive, a data warehousing tool, are two core components of the Hadoop ecosystem, each with its own advantages and applicable scenarios. Previous studies have parallelized ItemCF with either MapReduce or Hive alone, without fully leveraging the strengths of both. As a result, parallel ItemCF implementations have struggled to be both simple to implement and efficient to run. To address this issue, we propose HiMRItemCF, a distributed ItemCF recommendation algorithm that combines MapReduce and Hive. The algorithm divides ItemCF into six steps: (1) deduplication; (2) obtaining the preference matrices of all users; (3) obtaining the co-occurrence matrices of all items; (4) multiplying the two matrices to generate a three-dimensional matrix; (5) aggregating the data of the three-dimensional matrix to obtain the recommendation scores of all users for all items; and (6) sorting the scores in descending order. Hive carries out steps 1 and 6, and MapReduce carries out the other four steps, which involve more complex calculations and operations. The Hive jobs and MapReduce jobs are linked through Hive’s external tables. We implemented the proposed algorithm in Java and ran it on three publicly available user shopping behavior datasets. Compared with algorithms that use only MapReduce jobs, the implementation has fewer lines of source code and lower cyclomatic and Halstead complexity, and it achieves a higher speedup ratio and parallel computing efficiency on all datasets. These results indicate that the proposed algorithm offers both concise, easy-to-understand code and high time efficiency.
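The six steps named in the description can be illustrated with a minimal single-machine sketch in Python. It is not the authors' Java, Hive, or MapReduce code: the function names, the input layout of (user, item) pairs, the implicit preference value of 1.0, and the exclusion of self-pairs from the co-occurrence counts are all assumptions made for illustration. The two helper functions use the conventional definitions of speedup (T1 / Tp) and parallel efficiency (speedup / p), which the record does not spell out.

```python
from collections import defaultdict
from itertools import permutations


def item_cf(records):
    """Single-machine sketch of the six ItemCF steps (hypothetical helper)."""
    # Step 1: deduplicate the raw (user, item) behavior records.
    unique = set(records)

    # Step 2: user preference matrix, user -> {item: preference}.
    # Assumption: every distinct interaction counts as preference 1.0.
    prefs = defaultdict(dict)
    for user, item in unique:
        prefs[user][item] = 1.0

    # Step 3: item co-occurrence matrix, item -> {other item: count}.
    # Assumption: self-pairs (i, i) are not counted.
    cooc = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for i, j in permutations(sorted(items), 2):
            cooc[i][j] += 1

    # Step 4: multiply the two matrices, keeping each partial product
    # indexed by (user, candidate item, contributing item).
    partial = {}
    for user, items in prefs.items():
        for i, pref in items.items():
            for j, count in cooc[i].items():
                partial[(user, j, i)] = pref * count

    # Step 5: aggregate over the contributing item to get one score
    # per (user, candidate item).
    scores = defaultdict(lambda: defaultdict(float))
    for (user, j, _i), value in partial.items():
        scores[user][j] += value

    # Step 6: sort each user's scores in descending order
    # (ties broken by item id so the output is deterministic).
    return {
        user: sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))
        for user, s in scores.items()
    }


def speedup(t_single, t_parallel):
    """Conventional speedup ratio: single-node time over parallel time."""
    return t_single / t_parallel


def parallel_efficiency(t_single, t_parallel, nodes):
    """Conventional parallel efficiency: speedup divided by node count."""
    return speedup(t_single, t_parallel) / nodes


if __name__ == "__main__":
    demo = [("u1", "a"), ("u1", "b"), ("u1", "a"),
            ("u2", "a"), ("u2", "b"), ("u2", "c"),
            ("u3", "b"), ("u3", "c")]
    for user, ranked in sorted(item_cf(demo).items()):
        print(user, ranked)
```

In this sketch, items a user has already interacted with stay in the ranking, matching the description's "scores of all users for all items"; a production system would typically filter them out. In the combination the record describes, steps 1 and 6 would be Hive queries and steps 2 to 5 MapReduce jobs exchanging data through Hive external tables.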
first_indexed 2024-03-10T23:59:47Z
id doaj.art-f7cc0a0ce2414d77843852cf2f4583f2
institution Directory Open Access Journal
issn 2079-9292
last_indexed 2024-03-10T23:59:47Z
spelling doaj.art-f7cc0a0ce2414d77843852cf2f4583f2 | 2023-11-19T00:53:01Z | eng | MDPI AG | Electronics | 2079-9292 | 2023-08-01 | 12 | 16 | 3398 | 10.3390/electronics12163398 | Distributed ItemCF Recommendation Algorithm Based on the Combination of MapReduce and Hive | Yijia Feng (0), Lei Wang (1) | (0) College of Mathematics and Computer Science, Yan’an University, Yan’an 716000, China; (1) College of Mathematics and Computer Science, Yan’an University, Yan’an 716000, China | https://www.mdpi.com/2079-9292/12/16/3398 | ItemCF algorithm; speedup; MapReduce framework; hive data warehouse; big data
title Distributed ItemCF Recommendation Algorithm Based on the Combination of MapReduce and Hive
topic ItemCF algorithm
speedup
MapReduce framework
Hive data warehouse
big data