An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles
The comparison of documents—such as articles or patents search, bibliography recommendations systems, visualization of document collections, etc.—has a wide range of applications in several fields. One of the key tasks that such problems have in common is the evaluation of a similarity metric. Many...
Main Authors: | Joaquin Gómez, Pere-Pau Vázquez |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-06-01 |
Series: | Applied Sciences |
Subjects: | document similarity, similarity measures, word embeddings, natural language processing |
Online Access: | https://www.mdpi.com/2076-3417/12/11/5664 |
_version_ | 1797494041871384576 |
---|---|
author | Joaquin Gómez; Pere-Pau Vázquez |
author_facet | Joaquin Gómez; Pere-Pau Vázquez |
author_sort | Joaquin Gómez |
collection | DOAJ |
description | The comparison of documents—such as articles or patents search, bibliography recommendations systems, visualization of document collections, etc.—has a wide range of applications in several fields. One of the key tasks that such problems have in common is the evaluation of a similarity metric. Many such metrics have been proposed in the literature. Lately, deep learning techniques have gained a lot of popularity. However, it is difficult to analyze how those metrics perform against each other. In this paper, we present a systematic empirical evaluation of several of the most popular similarity metrics when applied to research articles. We analyze the results of those metrics in two ways, with a synthetic test that uses scientific papers and Ph.D. theses, and in a real-world scenario where we evaluate their ability to cluster papers from different areas of research. |
first_indexed | 2024-03-10T01:28:37Z |
format | Article |
id | doaj.art-4de93f962d5547fca9a05df362d4c7a3 |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T01:28:37Z |
publishDate | 2022-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-4de93f962d5547fca9a05df362d4c7a3; 2023-11-23T13:45:44Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2022-06-01; 12(11), 5664; 10.3390/app12115664; An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles; Joaquin Gómez [0]; Pere-Pau Vázquez [1]; Department of Computer Science, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain; ViRVIG Group, Department of Computer Science, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain; https://www.mdpi.com/2076-3417/12/11/5664; document similarity; similarity measures; word embeddings; natural language processing |
spellingShingle | Joaquin Gómez; Pere-Pau Vázquez; An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles; Applied Sciences; document similarity; similarity measures; word embeddings; natural language processing |
title | An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles |
title_sort | empirical evaluation of document embeddings and similarity metrics for scientific articles |
topic | document similarity; similarity measures; word embeddings; natural language processing |
url | https://www.mdpi.com/2076-3417/12/11/5664 |