TVOR: Finding Discrete Total Variation Outliers Among Histograms

Pearson's chi-squared test can detect outliers in the data distribution of a given set of histograms. However, in fields such as demographics (e.g., birth years), outliers may be more easily found in terms of histogram smoothness, where techniques such as Whipple's or Myers' ind...


Bibliographic Details
Main Authors: Nikola Banic, Neven Elezovic
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Age heaping; anomaly detection; discrete total variation; expected value; fitting; histogram
Online Access: https://ieeexplore.ieee.org/document/9306761/
_version_ 1819259969709539328
author Nikola Banic
Neven Elezovic
author_facet Nikola Banic
Neven Elezovic
author_sort Nikola Banic
collection DOAJ
description Pearson's chi-squared test can detect outliers in the data distribution of a given set of histograms. However, in fields such as demographics (e.g., birth years), outliers may be more easily found in terms of histogram smoothness, where techniques such as Whipple's or Myers' indices successfully handle only specific anomalies. This paper proposes detecting smoothness outliers among histograms by using the relation between their discrete total variations (DTV) and their respective sample sizes. This relation is mathematically derived to be applicable in all cases and is simplified by an accurate linear model. The deviation of a histogram's DTV from the value predicted by the model is used as the outlier score, and the proposed method is named Total Variation Outlier Recognizer (TVOR). TVOR requires no prior assumptions about the distribution of the histograms' samples, has no hyperparameters that require tuning, is not limited to specific patterns, and is applicable to histograms with the same bins. Each bin can have an arbitrary interval, which can also be unbounded. TVOR finds DTV outliers more easily than Pearson's chi-squared test; for distribution outliers, the opposite holds. TVOR is tested on real census data and successfully finds suspicious histograms. The source code is given at https://github.com/DiscreteTotalVariation/TVOR.
first_indexed 2024-12-23T19:18:27Z
format Article
id doaj.art-2c7f90b933504205b765b8f1d4208cd7
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-23T19:18:27Z
publishDate 2021-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-2c7f90b933504205b765b8f1d4208cd7 | 2022-12-21T17:34:15Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2021-01-01 | vol. 9, pp. 1807-1832 | DOI 10.1109/ACCESS.2020.3047342 | document 9306761 | TVOR: Finding Discrete Total Variation Outliers Among Histograms | Nikola Banic (https://orcid.org/0000-0002-3900-8590), Gideon Brothers, Zagreb, Croatia | Neven Elezovic, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia | https://ieeexplore.ieee.org/document/9306761/ | Age heaping; anomaly detection; discrete total variation; expected value; fitting; histogram
spellingShingle Nikola Banic
Neven Elezovic
TVOR: Finding Discrete Total Variation Outliers Among Histograms
IEEE Access
Age heaping
anomaly detection
discrete total variation
expected value
fitting
histogram
title TVOR: Finding Discrete Total Variation Outliers Among Histograms
title_sort tvor finding discrete total variation outliers among histograms
topic Age heaping
anomaly detection
discrete total variation
expected value
fitting
histogram
url https://ieeexplore.ieee.org/document/9306761/
work_keys_str_mv AT nikolabanic tvorfindingdiscretetotalvariationoutliersamonghistograms
AT nevenelezovic tvorfindingdiscretetotalvariationoutliersamonghistograms
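
The description field above outlines the scoring idea: compute each histogram's discrete total variation (DTV), relate it to the sample size through a linear model, and use the deviation from the model's prediction as the outlier score. The following is only a minimal sketch of that idea, not the authors' implementation (which is in the repository linked in the description). It assumes that DTV is the sum of absolute differences between adjacent bin counts and that the linear model is an ordinary least-squares line of raw DTV against raw sample size; the paper may normalize or transform these quantities differently. The function names discrete_total_variation, fit_line, and dtv_outlier_scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def discrete_total_variation(counts: Sequence[float]) -> float:
    """Assumed definition: sum of absolute differences of adjacent bins."""
    return float(sum(abs(b - a) for a, b in zip(counts, counts[1:])))


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sample sizes are equal; a line cannot be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dtv_outlier_scores(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Score each histogram by its DTV minus the DTV predicted from its size.

    All histograms are assumed to share the same bins. A large positive
    score means the histogram is rougher than its sample size suggests.
    """
    sizes = [float(sum(h)) for h in histograms]
    dtvs = [discrete_total_variation(h) for h in histograms]
    slope, intercept = fit_line(sizes, dtvs)
    return [d - (slope * s + intercept) for s, d in zip(sizes, dtvs)]


if __name__ == "__main__":
    # Made-up toy counts: three smooth histograms and one jagged one.
    toy = [
        [10, 12, 14, 12, 10],
        [20, 24, 28, 24, 20],
        [30, 36, 42, 36, 30],
        [40, 8, 40, 8, 20],
    ]
    for hist, score in zip(toy, dtv_outlier_scores(toy)):
        print(hist, round(score, 3))
```

In this toy input the jagged histogram receives the largest score because its DTV is far above what the fitted line predicts for its sample size. With real data, the fit would be made over many histograms so that a single outlier has little influence on the line.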