Performance evaluation of distributed indexing using Solr and Terrier information retrievals

The continuous growth of datasets and the emergence of terabyte-scale data pose great challenges to Information Retrieval (IR) systems. A tremendous amount of data from various sources is collected every day, making the volume of raw data extremely large. As a result, indexing a large volume of d...

Bibliographic Details
Main Authors: Aldailamy, Ali Y., Abdul Hamid, Nor Asila Wati, Al-Mekhlafi, Mohammed Abdulkarem
Format: Conference or Workshop Item
Language: English
Published: IEEE 2018
Online Access: http://psasir.upm.edu.my/id/eprint/69482/1/Performance%20evaluation%20of%20distributed%20indexing%20using%20Solr%20and%20Terrier%20information%20retrievals.pdf
author Aldailamy, Ali Y.
Abdul Hamid, Nor Asila Wati
Al-Mekhlafi, Mohammed Abdulkarem
collection UPM
description The continuous growth of datasets and the emergence of terabyte-scale data pose great challenges to Information Retrieval (IR) systems. A tremendous amount of data from various sources is collected every day, making the volume of raw data extremely large. As a result, indexing a large volume of data is time-consuming, and efficient indexing of large collections is becoming more challenging. MapReduce is a programming model for processing large document collections by distributing data and processing tasks over multiple machines. In this study, the distributed indexing of Solr and Terrier is evaluated, as they are the most popular information retrieval frameworks among researchers and enterprises. Specifically, this paper compares and analyzes the distributed indexing performance over MapReduce of the Solr and Terrier indexing strategies using 1GB, 3GB, 6GB, and 9GB datasets. In the experiments, the average indexing time, speedup, and throughput of both frameworks are observed as the number of machines increases. The experimental results show that Terrier is more efficient with large datasets when processing resources can be scaled up. Solr, on the other hand, performed better with small datasets and limited computing resources.
first_indexed 2024-03-06T10:01:49Z
format Conference or Workshop Item
id upm.eprints-69482
institution Universiti Putra Malaysia
language English
last_indexed 2024-03-06T10:01:49Z
publishDate 2018
publisher IEEE
record_format dspace
spelling upm.eprints-69482 2020-05-25T01:46:20Z http://psasir.upm.edu.my/id/eprint/69482/ PeerReviewed text en
Aldailamy, Ali Y. and Abdul Hamid, Nor Asila Wati and Al-Mekhlafi, Mohammed Abdulkarem (2018) Performance evaluation of distributed indexing using Solr and Terrier information retrievals. In: 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP'18), 26-28 Mar. 2018, Le Méridien Kota Kinabalu, Sabah, Malaysia. (pp. 142-149). DOI: 10.1109/INFRKM.2018.8464814
title Performance evaluation of distributed indexing using Solr and Terrier information retrievals