HotSpot: Anomaly Localization for Additive KPIs With Multi-Dimensional Attributes

Bibliographic Details
Main Authors: Yongqian Sun, Youjian Zhao, Ya Su, Dapeng Liu, Xiaohui Nie, Yuan Meng, Shiwen Cheng, Dan Pei, Shenglin Zhang, Xianping Qu, Xuanyou Guo
Format: Article
Language: English
Published: IEEE 2018-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/8288614/
Description
Summary: Additive key performance indicators (KPIs) (such as page view (PV), revenue, and error count) with multi-dimensional attributes (such as ISP, Province, and DataCenter) are common and important monitoring metrics in Internet companies. When an anomaly happens to an overall KPI, it is critical but challenging to localize the root cause, which is one (or more) combination of attribute values in multiple dimensions. For example, is the total PV decrease caused by the PV decrease from “Beijing”, or “China Mobile in Beijing”, or “Beijing and Shanghai”? This task is very challenging for two major reasons. First, the PVs of different combinations are interdependent; thus, a PV anomaly at the root cause can change many other PVs at different aggregation levels. Second, there can be tens of thousands of combinations to investigate in the multi-dimensional attribute space, so it is difficult to find the root cause in such a huge search space. To address the first challenge, our approach, HotSpot, uses a novel potential score based on the ripple effect for anomaly propagation that we reveal. To address the second challenge, HotSpot adopts the Monte Carlo Tree Search algorithm and a hierarchical pruning strategy. Using real-world data from a top global search engine, we show that HotSpot achieves a great improvement in effectiveness and robustness, i.e., 95% of all types of root cause cases using HotSpot (compared with only 15% using existing approaches) achieve an F-score over 90%. Operational experience shows that HotSpot can reduce the localization time from more than 1 h of manual effort to less than 20 s.
ISSN: 2169-3536
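
The interdependence described in the summary can be illustrated with a small sketch. It is not HotSpot's potential score or its search procedure; it only shows how, for an additive KPI, a change at one attribute combination shows up in every aggregate that contains it. The PV numbers, the names `expected`, `observed`, `aggregate`, and `changed_combinations`, and the `*` wildcard notation are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical leaf-level page views keyed by (Province, ISP); all values are made up.
expected = {
    ("Beijing", "China Mobile"): 100,
    ("Beijing", "China Unicom"): 80,
    ("Shanghai", "China Mobile"): 120,
    ("Shanghai", "China Unicom"): 60,
}
observed = dict(expected)
observed[("Beijing", "China Mobile")] = 40  # inject a drop of 60 at a single leaf

WILDCARD = "*"  # marks a dimension that is aggregated over


def aggregate(leaves):
    """Sum leaf values into every attribute combination they belong to."""
    n_dims = len(next(iter(leaves)))
    totals = {}
    for key, value in leaves.items():
        # Choose which dimensions stay fixed; the others become wildcards.
        for r in range(n_dims + 1):
            for kept in combinations(range(n_dims), r):
                combo = tuple(
                    key[i] if i in kept else WILDCARD for i in range(n_dims)
                )
                totals[combo] = totals.get(combo, 0) + value
    return totals


def changed_combinations(expected_leaves, observed_leaves):
    """Return the difference (observed - expected) for every combination that changed."""
    exp = aggregate(expected_leaves)
    obs = aggregate(observed_leaves)
    return {combo: obs[combo] - exp[combo] for combo in exp if obs[combo] != exp[combo]}


changes = changed_combinations(expected, observed)
for combo, delta in sorted(changes.items()):
    print(combo, delta)
```

In this toy setup, two dimensions with two values each already give (2 + 1) x (2 + 1) = 9 combinations including the overall total, and the single injected drop changes four of them. With more dimensions and more values per dimension the number of combinations grows multiplicatively, which is the search-space problem the summary refers to.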