Reliable, efficient and light distance computation on high-dimensional vectors

The great advances in deep learning enable human beings to extract the semantic information in ubiquitous unstructured data into high-dimensional vectors, leading to an urgent priority of managing vector data. Nearest neighbor (NN) query is a pivotal question in vector data management. Unfortunately...

Full description

Bibliographic Details
Main Author: Gao, Jianyang
Other Authors: Long Cheng
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2025
Subjects:
Online Access:https://hdl.handle.net/10356/182813
_version_ 1826128914070634496
author Gao, Jianyang
author2 Long Cheng
author_facet Long Cheng
Gao, Jianyang
author_sort Gao, Jianyang
collection NTU
description The great advances in deep learning enable human beings to extract the semantic information in ubiquitous unstructured data into high-dimensional vectors, leading to an urgent priority of managing vector data. Nearest neighbor (NN) query is a pivotal question in vector data management. Unfortunately, the NN query has been proved to be intrinsically hard. Researchers, thus, turn to approximate nearest neighbor (ANN) query which targets better efficiency by slightly compromising accuracy. Despite this, existing ANN methods still incur high time and space costs, which are mainly due to the high dimensionality. Specifically, in most in-memory ANN algorithms, the time cost is dominated by the time for comparing the distances among vectors. Each distance comparison operation (DCO) involves hundreds of arithmetic operations as each vector has hundreds of dimensions. Similarly, the space cost is dominated by the space for storing the vectors. Towards the goal of reducing the time and space costs, we present three methods in this thesis. First, we observe that, for a DCO, if an approximate distance can confirm that a vector is not a NN, then it can be discarded without computing exact distances. Based on this observation, we propose a method named ADSampling. For a DCO, it iteratively and incrementally samples a few dimensions and computes an approximate distance. When this distance is accurate enough to decide the result of the DCO, it terminates sampling. Second, to further improve the efficiency of DCOs, we design an efficient distance estimator. We propose a new quantization method named RaBitQ, which maps $D$-dimensional vectors into $D$-bit strings. RaBitQ provides an unbiased and asymptotically optimal estimator with good empirical accuracy. In addition, we introduce efficient implementations of RaBitQ based on bitwise operations or SIMD-based operations. Third, by extending the RaBitQ method, we develop an algorithm which compresses a floating-point vector into a $B$-bit unsigned integer vector. We prove that our algorithm is asymptotically optimal in terms of the space-accuracy trade-off.
first_indexed 2025-03-09T14:57:23Z
format Thesis-Doctor of Philosophy
id ntu-10356/182813
institution Nanyang Technological University
language English
last_indexed 2025-03-09T14:57:23Z
publishDate 2025
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/1828132025-03-03T02:12:23Z Reliable, efficient and light distance computation on high-dimensional vectors Gao, Jianyang Long Cheng College of Computing and Data Science University of Michigan The Hong Kong University of Science and Technology c.long@ntu.edu.sg Computer and Information Science Approximate nearest neighbor search Johnson-Lindenstrauss transformation Quantization The great advances in deep learning enable human beings to extract the semantic information in ubiquitous unstructured data into high-dimensional vectors, leading to an urgent priority of managing vector data. Nearest neighbor (NN) query is a pivotal question in vector data management. Unfortunately, the NN query has been proved to be intrinsically hard. Researchers, thus, turn to approximate nearest neighbor (ANN) query which targets better efficiency by slightly compromising accuracy. Despite this, existing ANN methods still incur high time and space costs, which are mainly due to the high dimensionality. Specifically, in most in-memory ANN algorithms, the time cost is dominated by the time for comparing the distances among vectors. Each distance comparison operation (DCO) involves hundreds of arithmetic operations as each vector has hundreds of dimensions. Similarly, the space cost is dominated by the space for storing the vectors. Towards the goal of reducing the time and space costs, we present three methods in this thesis. First, we observe that, for a DCO, if an approximate distance can confirm that a vector is not a NN, then it can be discarded without computing exact distances. Based on this observation, we propose a method named ADSampling. For a DCO, it iteratively and incrementally samples a few dimensions and computes an approximate distance. When this distance is accurate enough to decide the result of the DCO, it terminates sampling. Second, to further improve the efficiency of DCOs, we design an efficient distance estimator. We propose a new quantization method named RaBitQ, which maps $D$-dimensional vectors into $D$-bit strings. RaBitQ provides an unbiased and asymptotically optimal estimator with good empirical accuracy. In addition, we introduce efficient implementations of RaBitQ based on bitwise operations or SIMD-based operations. Third, by extending the RaBitQ method, we develop an algorithm which compresses a floating-point vector into a $B$-bit unsigned integer vector. We prove that our algorithm is asymptotically optimal in terms of the space-accuracy trade-off. Doctor of Philosophy 2025-03-03T02:12:22Z 2025-03-03T02:12:22Z 2025 Thesis-Doctor of Philosophy Gao, J. (2025). Reliable, efficient and light distance computation on high-dimensional vectors. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/182813 https://hdl.handle.net/10356/182813 en Tier 2 MOE-T2EP20220-0011 Tier 2 MOE-T2EP20221-0013 Tier 1 Award (RG77/21) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle Computer and Information Science
Approximate nearest neighbor search
Johnson-Lindenstrauss transformation
Quantization
Gao, Jianyang
Reliable, efficient and light distance computation on high-dimensional vectors
title Reliable, efficient and light distance computation on high-dimensional vectors
title_full Reliable, efficient and light distance computation on high-dimensional vectors
title_fullStr Reliable, efficient and light distance computation on high-dimensional vectors
title_full_unstemmed Reliable, efficient and light distance computation on high-dimensional vectors
title_short Reliable, efficient and light distance computation on high-dimensional vectors
title_sort reliable efficient and light distance computation on high dimensional vectors
topic Computer and Information Science
Approximate nearest neighbor search
Johnson-Lindenstrauss transformation
Quantization
url https://hdl.handle.net/10356/182813
work_keys_str_mv AT gaojianyang reliableefficientandlightdistancecomputationonhighdimensionalvectors