Summary: | Traditional data mining techniques expect all data to be managed
within some form of persistent dataset. Recently, for many
emerging applications, such as stock tickers, web-click streams,
and telecom call records, the concept of a \textit{data stream} is
more appropriate than a stored dataset. A data stream is generated
continuously in a dynamic environment, with huge volume, infinite
flow, and fast-changing behavior. Furthermore, streams usually
arrive at a mining system in a push-based manner, while the system
resources available to the mining process are generally restricted
in advance. Consequently, there have been
increasing demands for developing novel techniques that are able
to discover interesting patterns from data streams while working
within system resource constraints. Moreover, it is often desirable
that the mining results returned by these techniques be guaranteed
to lie within some error bound. Once this task is accomplished, it
is strongly believed that the quality of decision making in
streaming environments can be improved significantly. This research
investigates various approximation algorithms for mining useful
patterns from data streams effectively and efficiently under
different system resource constraints. Two fundamental data mining
tasks are explored in the streaming context: frequent pattern
discovery and cluster analysis. The contributions of this research
are as follows:
A novel algorithm named EStream is developed to address the
problem of mining frequent patterns online from data streams with
a precise error guarantee.
|