Reliable, efficient and light distance computation on high-dimensional vectors
Main Author: | Gao, Jianyang
---|---
Other Authors: | Long Cheng
Format: | Thesis-Doctor of Philosophy
Language: | English
Published: | Nanyang Technological University, 2025
Subjects: | Computer and Information Science; Approximate nearest neighbor search; Johnson-Lindenstrauss transformation; Quantization
Online Access: | https://hdl.handle.net/10356/182813
author | Gao, Jianyang |
author2 | Long Cheng |
description | Advances in deep learning make it possible to encode the semantic information of ubiquitous unstructured data into high-dimensional vectors, which makes the management of vector data an urgent priority. The nearest neighbor (NN) query is a pivotal problem in vector data management. Unfortunately, the NN query has been proven to be intrinsically hard. Researchers have therefore turned to the approximate nearest neighbor (ANN) query, which targets better efficiency by slightly compromising accuracy.
Despite this, existing ANN methods still incur high time and space costs, mainly due to the high dimensionality of the vectors. Specifically, in most in-memory ANN algorithms, the time cost is dominated by comparing distances among vectors: each distance comparison operation (DCO) involves hundreds of arithmetic operations because each vector has hundreds of dimensions. Similarly, the space cost is dominated by storing the vectors. Towards the goal of reducing the time and space costs, we present three methods in this thesis. First, we observe that, for a DCO, if an approximate distance suffices to confirm that a vector is not an NN, then the vector can be discarded without computing its exact distance. Based on this observation, we propose a method named ADSampling. For a DCO, it iteratively and incrementally samples dimensions and computes an approximate distance; it terminates the sampling as soon as this distance is accurate enough to decide the result of the DCO.
Second, to further improve the efficiency of DCOs, we design an efficient distance estimator. We propose a new quantization method named RaBitQ, which maps $D$-dimensional vectors into $D$-bit strings. RaBitQ provides an unbiased and asymptotically optimal estimator with good empirical accuracy. In addition, we introduce efficient implementations of RaBitQ based on bitwise operations or SIMD-based operations.
Third, by extending the RaBitQ method, we develop an algorithm which compresses a floating-point vector into a $B$-bit unsigned integer vector. We prove that our algorithm is asymptotically optimal in terms of the space-accuracy trade-off. |
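The three methods summarized in the description above can be illustrated with short, simplified sketches; the versions below are editorial illustrations under stated assumptions, not the thesis's algorithms. To make the first idea (ADSampling) concrete, the following Python sketch performs a distance comparison with incremental dimension sampling: it accumulates the squared distance over a growing prefix of randomly rotated dimensions and prunes the candidate as soon as a scaled estimate confidently exceeds the pruning threshold. The chunk size `step`, the error multiplier `epsilon0`, and the exact form of the confidence margin are illustrative assumptions; the actual test and its guarantees are developed in the thesis.

```python
import numpy as np

def random_rotation(D, seed=0):
    """A random orthogonal matrix (applied once to the whole dataset in practice)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
    return Q

def incremental_dco(rotated_query, rotated_candidate, threshold, step=32, epsilon0=2.0):
    """Return True iff the candidate is judged closer to the query than `threshold`,
    sampling dimensions incrementally and stopping early when possible (sketch only)."""
    diff = rotated_query - rotated_candidate
    D = diff.shape[0]
    partial, d = 0.0, 0
    while d < D:
        nxt = min(d + step, D)
        partial += float(np.sum(diff[d:nxt] ** 2))   # squared distance over the sampled prefix
        d = nxt
        est = partial * D / d                        # scale the prefix up to all D dimensions
        margin = (1.0 + epsilon0 / np.sqrt(d)) ** 2  # hypothesized multiplicative error bound
        if est > threshold ** 2 * margin:
            return False                             # confidently not within the threshold
    return partial < threshold ** 2                  # all dimensions seen: exact decision
```

Because an orthogonal rotation preserves distances and mixes coordinates, every sampled prefix carries a representative share of the squared distance, which is what makes early termination reliable.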
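For the second method, the sketch below shows the general flavor of binary quantization with bitwise distance estimation: each vector is reduced to the $D$-bit sign pattern of its rotated coordinates, and the angle between two vectors is estimated from the Hamming distance between their bit strings (a classic sign-quantization estimator). This is not the RaBitQ construction or its unbiased, asymptotically optimal estimator, which are developed in the thesis; it only shows why bit-string codes admit very cheap distance estimation.

```python
import numpy as np

def to_bits(x, rotation):
    """D-bit code for a vector: the sign pattern of its rotated coordinates."""
    return rotation @ x > 0

def estimate_cosine(bits_x, bits_y):
    """Estimate cos(angle(x, y)) from the Hamming distance between two codes.
    Each row of a random rotation is (marginally) a uniform random direction,
    so the fraction of disagreeing signs concentrates around angle / pi."""
    D = bits_x.shape[0]
    hamming = np.count_nonzero(bits_x ^ bits_y)   # XOR + popcount over packed words in practice
    return np.cos(np.pi * hamming / D)

# Toy usage: the estimate tightens as D grows.
D = 1024
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((D, D)))  # random rotation shared by all vectors
x, y = rng.standard_normal(D), rng.standard_normal(D)
true_cos = float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(true_cos, estimate_cosine(to_bits(x, R), to_bits(y, R)))
```

When the $D$ bits are packed into 64-bit words, the XOR and popcount map directly onto bitwise and SIMD instructions, which is the reason such codes support fast distance estimation.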
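For the third method, the sketch below is a plain uniform scalar quantizer that compresses a floating-point vector into $B$-bit unsigned integer codes and reconstructs an approximation for distance computation. The per-vector min/max scaling and the choice of $B$ are illustrative assumptions; the extended RaBitQ algorithm in the thesis achieves an asymptotically optimal space-accuracy trade-off, which this naive quantizer does not.

```python
import numpy as np

def quantize(x, B=4):
    """Compress a float vector into B-bit unsigned integer codes (uniform quantizer).
    Returns the codes plus the scaling parameters needed for reconstruction."""
    levels = (1 << B) - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)   # uint8 suffices for B <= 8
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct an approximate float vector from its B-bit codes."""
    return codes.astype(np.float32) * scale + lo

# Toy usage: the relative reconstruction error shrinks as B grows.
rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
codes, lo, scale = quantize(x, B=4)
x_hat = dequantize(codes, lo, scale)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Storing 4-bit codes instead of 32-bit floats shrinks vector storage by a factor of eight; the price is the reconstruction error printed above, and the thesis's contribution is to make this space-accuracy trade-off asymptotically optimal.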
format | Thesis-Doctor of Philosophy |
institution | Nanyang Technological University |
language | English |
publishDate | 2025 |
publisher | Nanyang Technological University |
degree | Doctor of Philosophy
organisations | College of Computing and Data Science; University of Michigan; The Hong Kong University of Science and Technology
contact | c.long@ntu.edu.sg
citation | Gao, J. (2025). Reliable, efficient and light distance computation on high-dimensional vectors. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/182813
funding | Tier 2 MOE-T2EP20220-0011; Tier 2 MOE-T2EP20221-0013; Tier 1 Award (RG77/21)
license | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)
file_format | application/pdf
title | Reliable, efficient and light distance computation on high-dimensional vectors |
topic | Computer and Information Science; Approximate nearest neighbor search; Johnson-Lindenstrauss transformation; Quantization
url | https://hdl.handle.net/10356/182813 |