An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles

Bibliographic Details
Main Authors: Joaquin Gómez, Pere-Pau Vázquez
Format: Article
Language: English
Published: MDPI AG 2022-06-01
Series: Applied Sciences
Subjects: document similarity; similarity measures; word embeddings; natural language processing
Online Access: https://www.mdpi.com/2076-3417/12/11/5664
author Joaquin Gómez
Pere-Pau Vázquez
collection DOAJ
description The comparison of documents has a wide range of applications in several fields, including article and patent search, bibliography recommendation systems, and visualization of document collections. One of the key tasks that such problems have in common is the evaluation of a similarity metric. Many such metrics have been proposed in the literature, and deep learning techniques have recently gained considerable popularity. However, it is difficult to analyze how those metrics perform against each other. In this paper, we present a systematic empirical evaluation of several of the most popular similarity metrics when applied to research articles. We analyze the results of those metrics in two ways: with a synthetic test that uses scientific papers and Ph.D. theses, and in a real-world scenario where we evaluate their ability to cluster papers from different areas of research.
first_indexed 2024-03-10T01:28:37Z
format Article
id doaj.art-4de93f962d5547fca9a05df362d4c7a3
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T01:28:37Z
publishDate 2022-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-4de93f962d5547fca9a05df362d4c7a3; 2023-11-23T13:45:44Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2022-06-01; vol. 12, no. 11, art. 5664; doi:10.3390/app12115664
Joaquin Gómez (Department of Computer Science, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain)
Pere-Pau Vázquez (ViRVIG Group, Department of Computer Science, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain)
title An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles
topic document similarity
similarity measures
word embeddings
natural language processing
url https://www.mdpi.com/2076-3417/12/11/5664