Reliable, efficient and light distance computation on high-dimensional vectors
Main Author: | Gao, Jianyang
---|---
Other Authors: | Long Cheng
Format: | Thesis-Doctor of Philosophy
Language: | English
Published: | Nanyang Technological University, 2025
Subjects: | Computer and Information Science; Approximate nearest neighbor search; Johnson-Lindenstrauss transformation; Quantization
Online Access: | https://hdl.handle.net/10356/182813
author | Gao, Jianyang |
author2 | Long Cheng |
description | Advances in deep learning make it possible to encode the semantic information of ubiquitous unstructured data into high-dimensional vectors, which makes the management of vector data an urgent priority. The nearest neighbor (NN) query is a pivotal problem in vector data management. Unfortunately, the NN query has been proven to be intrinsically hard. Researchers have therefore turned to the approximate nearest neighbor (ANN) query, which targets better efficiency by slightly compromising accuracy.
Despite this, existing ANN methods still incur high time and space costs, mainly due to the high dimensionality of the vectors. Specifically, in most in-memory ANN algorithms, the time cost is dominated by comparing distances among vectors: each distance comparison operation (DCO) involves hundreds of arithmetic operations because each vector has hundreds of dimensions. Similarly, the space cost is dominated by storing the vectors. Towards the goal of reducing the time and space costs, we present three methods in this thesis. First, we observe that, for a DCO, if an approximate distance suffices to confirm that a vector is not an NN, then the vector can be discarded without computing its exact distance. Based on this observation, we propose a method named ADSampling. For a DCO, it iteratively and incrementally samples dimensions and computes an approximate distance; it terminates the sampling as soon as this distance is accurate enough to decide the result of the DCO.
Second, to further improve the efficiency of DCOs, we design an efficient distance estimator. We propose a new quantization method named RaBitQ, which maps $D$-dimensional vectors into $D$-bit strings. RaBitQ provides an unbiased and asymptotically optimal estimator with good empirical accuracy. In addition, we introduce efficient implementations of RaBitQ based on bitwise operations or SIMD-based operations.
Third, by extending the RaBitQ method, we develop an algorithm which compresses a floating-point vector into a $B$-bit unsigned integer vector. We prove that our algorithm is asymptotically optimal in terms of the space-accuracy trade-off. |
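The three methods summarized in the description above can be illustrated with short, simplified sketches; the versions below are editorial illustrations under stated assumptions, not the thesis's algorithms. To make the first idea (ADSampling) concrete, the following Python sketch performs a distance comparison with incremental dimension sampling: it accumulates the squared distance over a growing prefix of randomly rotated dimensions and prunes the candidate as soon as a scaled estimate confidently exceeds the pruning threshold. The chunk size `step`, the error multiplier `epsilon0`, and the exact form of the confidence margin are illustrative assumptions; the actual test and its guarantees are developed in the thesis.

```python
import numpy as np

def random_rotation(D, seed=0):
    """A random orthogonal matrix (applied once to the whole dataset in practice)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
    return Q

def incremental_dco(rotated_query, rotated_candidate, threshold, step=32, epsilon0=2.0):
    """Return True iff the candidate is judged closer to the query than `threshold`,
    sampling dimensions incrementally and stopping early when possible (sketch only)."""
    diff = rotated_query - rotated_candidate
    D = diff.shape[0]
    partial, d = 0.0, 0
    while d < D:
        nxt = min(d + step, D)
        partial += float(np.sum(diff[d:nxt] ** 2))   # squared distance over the sampled prefix
        d = nxt
        est = partial * D / d                        # scale the prefix up to all D dimensions
        margin = (1.0 + epsilon0 / np.sqrt(d)) ** 2  # hypothesized multiplicative error bound
        if est > threshold ** 2 * margin:
            return False                             # confidently not within the threshold
    return partial < threshold ** 2                  # all dimensions seen: exact decision
```

Because an orthogonal rotation preserves distances and mixes coordinates, every sampled prefix carries a representative share of the squared distance, which is what makes early termination reliable.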
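For the second method, the sketch below shows the general flavor of binary quantization with bitwise distance estimation: each vector is reduced to the $D$-bit sign pattern of its rotated coordinates, and the angle between two vectors is estimated from the Hamming distance between their bit strings (a classic sign-quantization estimator). This is not the RaBitQ construction or its unbiased, asymptotically optimal estimator, which are developed in the thesis; it only shows why bit-string codes admit very cheap distance estimation.

```python
import numpy as np

def to_bits(x, rotation):
    """D-bit code for a vector: the sign pattern of its rotated coordinates."""
    return rotation @ x > 0

def estimate_cosine(bits_x, bits_y):
    """Estimate cos(angle(x, y)) from the Hamming distance between two codes.
    Each row of a random rotation is (marginally) a uniform random direction,
    so the fraction of disagreeing signs concentrates around angle / pi."""
    D = bits_x.shape[0]
    hamming = np.count_nonzero(bits_x ^ bits_y)   # XOR + popcount over packed words in practice
    return np.cos(np.pi * hamming / D)

# Toy usage: the estimate tightens as D grows.
D = 1024
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((D, D)))  # random rotation shared by all vectors
x, y = rng.standard_normal(D), rng.standard_normal(D)
true_cos = float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(true_cos, estimate_cosine(to_bits(x, R), to_bits(y, R)))
```

When the $D$ bits are packed into 64-bit words, the XOR and popcount map directly onto bitwise and SIMD instructions, which is the reason such codes support fast distance estimation.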
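For the third method, the sketch below is a plain uniform scalar quantizer that compresses a floating-point vector into $B$-bit unsigned integer codes and reconstructs an approximation for distance computation. The per-vector min/max scaling and the choice of $B$ are illustrative assumptions; the extended RaBitQ algorithm in the thesis achieves an asymptotically optimal space-accuracy trade-off, which this naive quantizer does not.

```python
import numpy as np

def quantize(x, B=4):
    """Compress a float vector into B-bit unsigned integer codes (uniform quantizer).
    Returns the codes plus the scaling parameters needed for reconstruction."""
    levels = (1 << B) - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)   # uint8 suffices for B <= 8
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct an approximate float vector from its B-bit codes."""
    return codes.astype(np.float32) * scale + lo

# Toy usage: the relative reconstruction error shrinks as B grows.
rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
codes, lo, scale = quantize(x, B=4)
x_hat = dequantize(codes, lo, scale)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Storing 4-bit codes instead of 32-bit floats shrinks vector storage by a factor of eight; the price is the reconstruction error printed above, and the thesis's contribution is to make this space-accuracy trade-off asymptotically optimal.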
format | Thesis-Doctor of Philosophy |
institution | Nanyang Technological University |
language | English |
publishDate | 2025 |
publisher | Nanyang Technological University |
degree | Doctor of Philosophy
organisations | College of Computing and Data Science; University of Michigan; The Hong Kong University of Science and Technology
contact | c.long@ntu.edu.sg
citation | Gao, J. (2025). Reliable, efficient and light distance computation on high-dimensional vectors. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/182813
funding | Tier 2 MOE-T2EP20220-0011; Tier 2 MOE-T2EP20221-0013; Tier 1 Award (RG77/21)
license | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)
file_format | application/pdf
title | Reliable, efficient and light distance computation on high-dimensional vectors |
topic | Computer and Information Science; Approximate nearest neighbor search; Johnson-Lindenstrauss transformation; Quantization
url | https://hdl.handle.net/10356/182813 |