Analysing Document Similarity Measures

Description

Bibliographic Details
Main Author: Grefenstette, E
Format: Thesis
Published: 2009
_version_ 1826312177166843904
author Grefenstette, E
author_facet Grefenstette, E
author_sort Grefenstette, E
collection OXFORD
description Supervised by Professor Stephen Pulman. Obtained distinction on MSc.

The observation that document similarity measures are systems which perform the same abstract task while drawing on very different aspects of documents, depending on the goal, raises a number of questions about their nature. What is the common thread in document similarity measure design? Is it a software engineering problem, or are there general principles guiding their construction? Are metrics designed for one purpose suitable for another? How would we determine whether they were? How do they deal with different kinds of input (words, sentences, sets of paragraphs)? On what grounds can we compare metrics? How do we choose a 'better' metric relative to a task?

This jumble of questions justifies further work, but leaves us with little clue as to where to begin. Attempts have been made in the computational linguistics literature to answer some of these questions for small groups of metrics, particularly by comparing two specific types of metric; however, we found no attempt in the literature to give a general theory of metric design and analysis, and have resolved to approach the problem ourselves.

The common theme of the above questions can thus be synthesised into the three questions which form the basis of our investigation. Firstly, how are common document similarity measures designed and implemented? In answering this question, we wish to learn more about the kinds of metrics commonly used in text processing, and the difficulties that arise when using them in practice. Secondly, how can we analyse document similarity? In answering this question, which forms the bulk of our project, we discuss how metrics can be compared and ranked relative to different types of document similarity, giving us insight into their performance across a variety of text processing tasks. Thirdly, how can the results of such analysis be leveraged to improve existing metrics or produce better ones?

In this dissertation, we describe both an experiment and the construction of an extensible metric analysis framework in an attempt to answer these key questions.
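To make the abstract's first two questions concrete, below is a minimal, self-contained sketch (illustrative only, not taken from the dissertation) of two common lexical similarity measures, Jaccard overlap and cosine similarity over term-frequency vectors, compared against invented gold similarity judgements using Spearman rank correlation. The document pairs, the gold scores, and the choice of correlation statistic are all assumptions made for this example.

from collections import Counter
from math import sqrt

def tokenize(doc):
    # Naive whitespace tokenization; real metrics would normalize further.
    return doc.lower().split()

def jaccard(a, b):
    # Set-overlap similarity: shared vocabulary over combined vocabulary.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    # Cosine of the angle between term-frequency vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def spearman(xs, ys):
    # Spearman rank correlation (no tie handling; illustration only).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical evaluation data: document pairs with invented gold scores.
pairs = [
    ("the cat sat on the mat", "a cat sat on a mat"),
    ("stock prices fell sharply today", "stock prices rose today"),
    ("rain is expected tomorrow", "the cat chased a mouse"),
]
gold = [0.9, 0.6, 0.1]

for name, metric in [("jaccard", jaccard), ("cosine", cosine)]:
    scores = [metric(a, b) for a, b in pairs]
    rho = spearman(scores, gold)
    print(f"{name}: scores={[round(s, 2) for s in scores]}, spearman vs gold={rho:.2f}")

Under these invented judgements the two measures rank the pairs differently, which is exactly the kind of task-relative comparison the second question asks about: the "better" metric is the one whose ranking best agrees with the similarity judgements the task cares about.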
first_indexed 2024-03-07T08:25:08Z
format Thesis
id oxford-uuid:b54ce6e8-3a1c-405a-aadd-6d15f6ac12a5
institution University of Oxford
last_indexed 2024-03-07T08:25:08Z
publishDate 2009
record_format dspace
spellingShingle Grefenstette, E
Analysing Document Similarity Measures
title Analysing Document Similarity Measures
title_full Analysing Document Similarity Measures
title_fullStr Analysing Document Similarity Measures
title_full_unstemmed Analysing Document Similarity Measures
title_short Analysing Document Similarity Measures
title_sort analysing document similarity measures
work_keys_str_mv AT grefenstettee analysingdocumentsimilaritymeasures