Multi-level Persian Dataset for Information Retrieval

Information retrieval systems are an essential part of many smart systems. The applications of this research field include search engines such as Google and Bing, question-answe...

Bibliographic Details
Main Authors: Ali Abedzadeh, Reza Ramezani, Afsaneh Fatemi Khorasgani
Format: Article
Language: fas
Published: Iranian Research Institute for Information and Technology 2024-03-01
Series: Iranian Journal of Information Processing & Management
Subjects: information retrieval, language models, information retrieval dataset, persian dataset
Online Access: https://jipm.irandoc.ac.ir/article_710246_5b937c81f2c10ac4508ecee230e3beae.pdf
_version_ 1826533289406496768
author Ali Abedzadeh
Reza Ramezani
Afsaneh Fatemi Khorasgani
collection DOAJ
description Information retrieval systems are an essential part of many smart systems. Applications of this research field include search engines such as Google and Bing, question-answering systems, and modern databases. An information retrieval system tries to retrieve documents related to a question or query. Retrieval is done from a large collection of documents, whose size can range from a few thousand to millions of documents. In recent years, much research has been devoted to developing information retrieval systems using language models. However, no such research has been done for the Persian language. One of the main reasons is the lack of a suitable Persian dataset for training language models. In this research, a Persian dataset for information retrieval is first presented. Methods for enriching this dataset are then investigated. The enrichment is done by defining multi-level relationships between a document and a question: the new dataset expresses the relationship between question and document at four levels (unrelated, related, highly related, completely related) instead of two (completely unrelated, completely related). The generated dataset is named PersianMLIR. Experiments show that using multi-level relationships improves system performance for both Persian and English, with an improvement of 1.87% for Persian. The results indicate that enriching information retrieval datasets by increasing the number of relation levels between query and document leads to improved performance of information retrieval systems.
first_indexed 2025-03-14T02:04:46Z
format Article
id doaj.art-181ef441a3c140388f885bbd90d5c32e
institution Directory Open Access Journal
issn 2251-8223
2251-8231
language fas
last_indexed 2025-03-14T02:04:46Z
publishDate 2024-03-01
publisher Iranian Research Institute for Information and Technology
record_format Article
series Iranian Journal of Information Processing & Management
spelling doaj.art-181ef441a3c140388f885bbd90d5c32e | 2025-03-12T06:08:37Z | fas | Iranian Research Institute for Information and Technology | Iranian Journal of Information Processing & Management | 2251-8223 | 2251-8231 | 2024-03-01 | 39 | 3 | 1109 | 1137 | 10.22034/jipm.2024.710246 | 710246 | Multi-level Persian Dataset for Information Retrieval | Ali Abedzadeh (Master of Software Engineering; Faculty of Computer Engineering; University of Isfahan) | Reza Ramezani (Ph.D. in Computer Engineering; Associate Professor; Faculty of Computer Engineering; University of Isfahan) | Afsaneh Fatemi Khorasgani (Ph.D. in Computer Engineering; Associate Professor; Faculty of Computer Engineering; University of Isfahan) | https://jipm.irandoc.ac.ir/article_710246_5b937c81f2c10ac4508ecee230e3beae.pdf | information retrieval | language models | information retrieval dataset | persian dataset
title Multi-level Persian Dataset for Information Retrieval
topic information retrieval
language models
information retrieval dataset
persian dataset
url https://jipm.irandoc.ac.ir/article_710246_5b937c81f2c10ac4508ecee230e3beae.pdf
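
The abstract in this record contrasts four relevance levels (unrelated, related, highly related, completely related) with two-level labels. The following is a minimal sketch of why graded labels can carry more ranking information. It rests on assumptions that are not in the record: the numeric encoding 0 to 3, the rule that collapses grades to binary, and the use of nDCG (a common graded-relevance metric) are all illustrative choices, and the record does not say which metric or encoding the authors used.

```python
import math

# Hypothetical numeric encoding of the four levels named in the abstract.
LEVELS = {
    "unrelated": 0,
    "related": 1,
    "highly related": 2,
    "completely related": 3,
}


def to_binary(grade: int) -> int:
    """Collapse a grade to two levels (assumption: only the top grade counts)."""
    return 1 if grade == LEVELS["completely related"] else 0


def dcg(grades):
    """Discounted cumulative gain with exponential gain 2^g - 1."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg(grades):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Two rankings of the same three documents for one query (made-up example).
ranking_a = [LEVELS[x] for x in ("completely related", "highly related", "unrelated")]
ranking_b = [LEVELS[x] for x in ("completely related", "unrelated", "highly related")]

graded_a, graded_b = ndcg(ranking_a), ndcg(ranking_b)
binary_a = ndcg([to_binary(g) for g in ranking_a])
binary_b = ndcg([to_binary(g) for g in ranking_b])

print(f"graded: a={graded_a:.4f} b={graded_b:.4f}")
print(f"binary: a={binary_a:.4f} b={binary_b:.4f}")
```

Under these assumptions, the binary labels score both rankings identically, while the graded labels penalise the ranking that places the unrelated document above the highly related one.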