Capturing recurring concepts in high speed data streams

Sakthithasan, Sripirakas

Capturing recurring concepts in high speed data streams

Files

Whole thesis(3.4 MB)

Date

2015

Authors

Sakthithasan, Sripirakas

Supervisor

Pears, Russel

Item type

Thesis

Degree name

Doctor of Philosophy

Publisher

Auckland University of Technology

Abstract

This research addresses two key issues in high speed data stream mining that are related to each other. One fundamental issue is the detection of concept change that is an inherent feature of data streams in general in order to make timely and accurate structural changes to classification or prediction models. The shortcomings in the past research were addressed in two versions of a change detector that were produced during this research. The second major issue is the detection of recurring patterns in a supervised learning context to gain significant efficiency and accuracy advantages over systems that have severe time constraints on response time to change due to safety and time critical requirements. Capturing recurrent patterns requires the detection of concept change with minimal false positives. This research addresses this latter problem as a pre-requisite to formulating a novel mechanism for recognizing recurrences in a dynamic data stream environment.

The first approach to change detection, termed SeqDrift1 that relies on a detection threshold derived using the Bernstein bound and sequential hypothesis strategy ensured much lower false positive rates and processing time than the most widely used change detector, ADWIN.

The second version of the change detector, SeqDrift2, achieved significant improvement on detection sensitivity over SeqDrift1. This was achieved through two separate strategies. The first was the use of reservoir sampling to retain a larger proportion of older instances thus providing for better contrast with newer arriving instances belonging to a changed concept. The second strategy was to trade off false positive rate for detection delay in an optimization procedure. The net result was that SeqDrift2 achieved much lower detection delay than SeqDrift1 but sacrificed some of its false positive rate when compared to SeqDrift1, while still retaining its superiority with respect to this measure vis-à-vis ADWIN and other change detectors.

Having proposed a robust and efficient mechanism for change detection two different meta-learning schemes for recurrent concept capture were proposed. A novel framework using the two schemes consists of concept change detectors to locate concept boundaries, a Hoeffding tree compressor to exploit the application of Discrete Fourier Transform on Decision Trees to produce compact Fourier Spectra, a forest of Hoeffding Trees to actively learn and a pool of Fourier spectra to be reused on similar recurring concepts.

In the first scheme, termed Fourier Concept Trees (FCT), each Fourier spectrum is separately stored and reused on similar concepts. Accuracy and memory advantages have been empirically shown over an existing method called, MetaCT. In the second scheme, instead of storing each spectrum on its own, an ensemble approach, Ensemble Pool (EP), was adopted whereby several spectra were aggregated into single composite spectrum. The major advantage of this strategy over the first was the reduction in storage overhead as redundancies in separate spectra are eliminated by merging into one single entity. In addition, Fourier spectrum generation is optimized with theoretical guarantees to suit high speed environments. Extensive experimentation that demonstrated the benefits including accuracy stabilization, memory gain, reusability of existing models etc., has been done with a number of synthetic and real world datasets. This includes a case study on a Flight simulator system which is one of the target applications of this research.

Keywords

Recurrent concepts , Concept drift detection , Data stream mining , Discrete Fourier transform , Bernstein bound , Sequential hypothesis testing , Reservoir sampling

Permanent link

https://hdl.handle.net/10292/8647

Collections

Doctoral Theses

Full item page