Estimating Frequency Distributions in Data Streams
Streaming algorithms allow for space-efficient processing of massive datasets. The distribution of the frequencies of items in a large dataset is often used to characterize that data: e.g., the data is heavy-tailed, the data follows a power law, or there are many elements that only appear only once...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: |
Massachusetts Institute of Technology
2023
|
Online Access: | https://hdl.handle.net/1721.1/150228 |
_version_ | 1826195986802802688 |
---|---|
author | Chen, Justin Y. |
author2 | Indyk, Piotr |
author_facet | Indyk, Piotr Chen, Justin Y. |
author_sort | Chen, Justin Y. |
collection | MIT |
description | Streaming algorithms allow for space-efficient processing of massive datasets. The distribution of the frequencies of items in a large dataset is often used to characterize that data: e.g., the data is heavy-tailed, the data follows a power law, or there are many elements that only appear only once or twice. In this thesis, we focus on the problem of estimating the profile (a vector representation of the frequency distribution). Given a sequence of m elements from a universe of size n, its profile is a vector φ whose i-th entry φᵢ represents the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishan from 2002 gave an algorithm which estimates any entry φᵢ up to an additive error of ±εD using O(1/ε² log(nm)) bits of space, where D is the number of distinct elements in the stream. We considerably improve on this result by designing an algorithm which estimates the whole profile vector φ, up to overall error ±εm, using O(1/ε2 log(1/ε) + log(nm)) bits. More formally, we give an algorithm that computes an approximate profile φˆ such that the L₁ distance ‖φ − φˆ‖₁ is at most εm. In addition to bounding the error across all coordinates, our space bound separates the terms that depend on 1/ε and those that depend on n and m. Furthermore, we give a lower bound showing that our bound is optimal up to constant factors.
To achieve these results, we introduce two new techniques. First, we develop hashing-based sketches that keep very limited information about the identities of the hashed elements. As a result, elements with different frequencies are mixed together, and need to be unmixed using an iterative “deconvolution” process. Second, we reduce the randomness used by the algorithms in a somewhat subtle way: we first use Nisans generator to ensure that the random variables of interest are O(1)-wise independent, and then we analyze those variables by calculating their moments. (In our setting, using Nisans generator alone would not yield the desired space bound.) The latter technique seems quite versatile, and has been already used for other streaming problems [Ano23]. |
first_indexed | 2024-09-23T10:19:03Z |
format | Thesis |
id | mit-1721.1/150228 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T10:19:03Z |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/1502282023-04-01T03:47:32Z Estimating Frequency Distributions in Data Streams Chen, Justin Y. Indyk, Piotr Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Streaming algorithms allow for space-efficient processing of massive datasets. The distribution of the frequencies of items in a large dataset is often used to characterize that data: e.g., the data is heavy-tailed, the data follows a power law, or there are many elements that only appear only once or twice. In this thesis, we focus on the problem of estimating the profile (a vector representation of the frequency distribution). Given a sequence of m elements from a universe of size n, its profile is a vector φ whose i-th entry φᵢ represents the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishan from 2002 gave an algorithm which estimates any entry φᵢ up to an additive error of ±εD using O(1/ε² log(nm)) bits of space, where D is the number of distinct elements in the stream. We considerably improve on this result by designing an algorithm which estimates the whole profile vector φ, up to overall error ±εm, using O(1/ε2 log(1/ε) + log(nm)) bits. More formally, we give an algorithm that computes an approximate profile φˆ such that the L₁ distance ‖φ − φˆ‖₁ is at most εm. In addition to bounding the error across all coordinates, our space bound separates the terms that depend on 1/ε and those that depend on n and m. Furthermore, we give a lower bound showing that our bound is optimal up to constant factors. To achieve these results, we introduce two new techniques. First, we develop hashing-based sketches that keep very limited information about the identities of the hashed elements. As a result, elements with different frequencies are mixed together, and need to be unmixed using an iterative “deconvolution” process. Second, we reduce the randomness used by the algorithms in a somewhat subtle way: we first use Nisans generator to ensure that the random variables of interest are O(1)-wise independent, and then we analyze those variables by calculating their moments. (In our setting, using Nisans generator alone would not yield the desired space bound.) The latter technique seems quite versatile, and has been already used for other streaming problems [Ano23]. S.M. 2023-03-31T14:41:05Z 2023-03-31T14:41:05Z 2023-02 2023-02-28T14:36:07.306Z Thesis https://hdl.handle.net/1721.1/150228 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Chen, Justin Y. Estimating Frequency Distributions in Data Streams |
title | Estimating Frequency Distributions in Data Streams |
title_full | Estimating Frequency Distributions in Data Streams |
title_fullStr | Estimating Frequency Distributions in Data Streams |
title_full_unstemmed | Estimating Frequency Distributions in Data Streams |
title_short | Estimating Frequency Distributions in Data Streams |
title_sort | estimating frequency distributions in data streams |
url | https://hdl.handle.net/1721.1/150228 |
work_keys_str_mv | AT chenjustiny estimatingfrequencydistributionsindatastreams |