Information Retrieval with Dense and Sparse Representations

Bibliographic Details
Main Author: Chuang, Yung-Sung
Other Authors: Glass, James R.
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/153774
collection MIT
description Information retrieval, at the core of numerous applications such as search engines and open-domain question-answering systems, relies on effective textual representation and semantic matching. However, current approaches either lose fine-grained lexical detail because of the information bottleneck in dense retrieval, or rely on exact lexical matching in sparse retrieval and thus overlook broader contextual relevance. This thesis investigates how to improve both dense and sparse retrieval systems with advanced language models and training strategies. We first introduce DiffCSE, a difference-based contrastive learning framework for unsupervised sentence embedding and dense retrieval that effectively captures minor differences between sentences and improves performance on semantic tasks and on retrieval for open-domain question answering. We then address sparse retrieval's limitations with a query expansion and reranking pipeline built on pre-trained language models, which achieves superior retrieval results both in-domain and out-of-domain while retaining sparse retrieval's computational efficiency. In summary, this thesis provides a comprehensive exploration of advancing information retrieval in the era of large language models.
first_indexed 2024-09-23T12:54:50Z
format Thesis
id mit-1721.1/153774
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T12:54:50Z
publishDate 2024
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/153774 2024-03-16T03:33:33Z Information Retrieval with Dense and Sparse Representations Chuang, Yung-Sung Glass, James R. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science S.M. 2024-03-15T19:23:06Z 2024-03-15T19:23:06Z 2024-02 2024-02-21T17:10:06.811Z Thesis https://hdl.handle.net/1721.1/153774 0000-0002-1723-5063 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology