Analysing Document Similarity Measures

Description

Bibliographic Details
Main Author: Grefenstette, E
Format: Thesis
Published: 2009
_version_ 1826312177166843904
author Grefenstette, E
author_facet Grefenstette, E
author_sort Grefenstette, E
collection OXFORD
description Supervised by Professor Stephen Pulman. Obtained distinction on MSc.

The observation that document similarity measures are systems which perform the same abstract task while drawing on very different aspects of documents, depending on the goal, raises a number of questions about their nature. What is the common thread in document similarity measure design? Is it a software engineering problem, or are there general principles guiding their construction? Are metrics designed for one purpose suitable for another? How would we determine whether they were? How do they deal with different kinds of input (words, sentences, sets of paragraphs)? On what grounds can we compare metrics? How do we choose a 'better' metric relative to a task?

This jumble of questions justifies further work, but leaves us with little clue as to where to begin. Attempts have been made in the computational linguistics literature to answer some of these questions for small groups of metrics, particularly by comparing two specific types of metric; however, we found no attempt in the literature to give a general theory of metric design and analysis, and have resolved to approach the problem ourselves.

The common theme of the above questions can thus be synthesised into the three questions which form the basis of our investigation. Firstly, how are common document similarity measures designed and implemented? In answering this question, we wish to learn more about the kinds of metrics commonly used in text processing, and the difficulties that arise when using them in practice. Secondly, how can we analyse document similarity? In answering this question, which forms the bulk of our project, we discuss how metrics can be compared and ranked relative to different types of document similarity, giving us insight into their performance across a variety of text processing tasks. Thirdly, how can the results of such analysis be leveraged to improve existing metrics or produce better ones?

In this dissertation, we describe both an experiment and the construction of an extensible metric analysis framework in an attempt to answer these key questions.
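To make the abstract's first two questions concrete, below is a minimal, self-contained sketch (illustrative only, not taken from the dissertation) of two common lexical similarity measures, Jaccard overlap and cosine similarity over term-frequency vectors, compared against invented gold similarity judgements using Spearman rank correlation. The document pairs, the gold scores, and the choice of correlation statistic are all assumptions made for this example.

from collections import Counter
from math import sqrt

def tokenize(doc):
    # Naive whitespace tokenization; real metrics would normalize further.
    return doc.lower().split()

def jaccard(a, b):
    # Set-overlap similarity: shared vocabulary over combined vocabulary.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    # Cosine of the angle between term-frequency vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def spearman(xs, ys):
    # Spearman rank correlation (no tie handling; illustration only).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical evaluation data: document pairs with invented gold scores.
pairs = [
    ("the cat sat on the mat", "a cat sat on a mat"),
    ("stock prices fell sharply today", "stock prices rose today"),
    ("rain is expected tomorrow", "the cat chased a mouse"),
]
gold = [0.9, 0.6, 0.1]

for name, metric in [("jaccard", jaccard), ("cosine", cosine)]:
    scores = [metric(a, b) for a, b in pairs]
    rho = spearman(scores, gold)
    print(f"{name}: scores={[round(s, 2) for s in scores]}, spearman vs gold={rho:.2f}")

Under these invented judgements the two measures rank the pairs differently, which is exactly the kind of task-relative comparison the second question asks about: the "better" metric is the one whose ranking best agrees with the similarity judgements the task cares about.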
first_indexed 2024-03-07T08:25:08Z
format Thesis
id oxford-uuid:b54ce6e8-3a1c-405a-aadd-6d15f6ac12a5
institution University of Oxford
last_indexed 2024-03-07T08:25:08Z
publishDate 2009
record_format dspace
spellingShingle Grefenstette, E
Analysing Document Similarity Measures
title Analysing Document Similarity Measures
title_full Analysing Document Similarity Measures
title_fullStr Analysing Document Similarity Measures
title_full_unstemmed Analysing Document Similarity Measures
title_short Analysing Document Similarity Measures
title_sort analysing document similarity measures
work_keys_str_mv AT grefenstettee analysingdocumentsimilaritymeasures