Information Retrieval with Dense and Sparse Representations
Information retrieval, at the core of numerous applications such as search engines and open-domain question-answering systems, relies on effective textual representation and semantic matching. However, current approaches can lose nuanced lexical detail information due to an information bottleneck in...
Main Author: | Chuang, Yung-Sung |
---|---|
Other Authors: | Glass, James R. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology 2024 |
Online Access: | https://hdl.handle.net/1721.1/153774 |
_version_ | 1826204423716601856 |
---|---|
author | Chuang, Yung-Sung |
author2 | Glass, James R. |
author_facet | Glass, James R. Chuang, Yung-Sung |
author_sort | Chuang, Yung-Sung |
collection | MIT |
description | Information retrieval, at the core of numerous applications such as search engines and open-domain question-answering systems, relies on effective textual representation and semantic matching. However, current approaches can lose nuanced lexical detail information due to an information bottleneck in dense retrieval, or rely on exact lexical matching and thus overlook the broader contextual relevance when using sparse retrieval. This thesis delves into improving both dense and sparse retrieval systems with advanced language models and training strategies. We first introduce DiffCSE, a difference-based contrastive learning framework for unsupervised sentence embedding and dense retrieval that can effectively capture minor differences in sentences, showcasing improved performance in semantic tasks and retrieval for open-domain question answering. We then address sparse retrieval's limitations by developing a query expansion and reranking procedure. Using pre-trained language models, we propose an expansion and reranking pipeline for better query expansion, achieving superior retrieval results both in-domain and out-of-domain, yet retaining sparse retrieval's computational efficiency. In summary, this thesis provides a comprehensive exploration of advancing information retrieval in the generation of large language models. |
first_indexed | 2024-09-23T12:54:50Z |
format | Thesis |
id | mit-1721.1/153774 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T12:54:50Z |
publishDate | 2024 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/153774 2024-03-16T03:33:33Z Information Retrieval with Dense and Sparse Representations Chuang, Yung-Sung Glass, James R. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Information retrieval, at the core of numerous applications such as search engines and open-domain question-answering systems, relies on effective textual representation and semantic matching. However, current approaches can lose nuanced lexical detail information due to an information bottleneck in dense retrieval, or rely on exact lexical matching and thus overlook the broader contextual relevance when using sparse retrieval. This thesis delves into improving both dense and sparse retrieval systems with advanced language models and training strategies. We first introduce DiffCSE, a difference-based contrastive learning framework for unsupervised sentence embedding and dense retrieval that can effectively capture minor differences in sentences, showcasing improved performance in semantic tasks and retrieval for open-domain question answering. We then address sparse retrieval's limitations by developing a query expansion and reranking procedure. Using pre-trained language models, we propose an expansion and reranking pipeline for better query expansion, achieving superior retrieval results both in-domain and out-of-domain, yet retaining sparse retrieval's computational efficiency. In summary, this thesis provides a comprehensive exploration of advancing information retrieval in the generation of large language models. S.M. 2024-03-15T19:23:06Z 2024-03-15T19:23:06Z 2024-02 2024-02-21T17:10:06.811Z Thesis https://hdl.handle.net/1721.1/153774 0000-0002-1723-5063 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Chuang, Yung-Sung Information Retrieval with Dense and Sparse Representations |
title | Information Retrieval with Dense and Sparse Representations |
title_full | Information Retrieval with Dense and Sparse Representations |
title_fullStr | Information Retrieval with Dense and Sparse Representations |
title_full_unstemmed | Information Retrieval with Dense and Sparse Representations |
title_short | Information Retrieval with Dense and Sparse Representations |
title_sort | information retrieval with dense and sparse representations |
url | https://hdl.handle.net/1721.1/153774 |
work_keys_str_mv | AT chuangyungsung informationretrievalwithdenseandsparserepresentations |