A Staged Approach to Classification in High Speed Concept Drifting Data Streams
Data stream classification task needs to address challenges of enormous volume, continuous rapid flow, and concept drift of data in the presence of limited computer resources. A successful classifier is expected to result in higher accuracy and throughput under constrained memory conditions. This thesis aims at the problem of increasing throughput without sacrificing the accuracy of the classifier in concept drifting and recurring data streams. The solution introduces a novel stage learning framework that senses the context of data to determine the level of volatility in the stream.
Two learning stages namely, Stage 1 and Stage 2 are defined in accordance with stream volatility. Stage 1 is the high volatility state where many new, previously unseen concepts appear. Stage 1 represents an intensive learning phase in the lifetime of a data stream and hence an ensemble of classifiers is used in this stage as previous research has shown that ensembles excel in such situations. In Stage 1, concepts are captured by the ensemble learner and then stored in an online repository for future use in the event that such concepts recur in the future. In contrast, Stage 2 represents a low volatility state and features a limited learning ability as it relies to a greater extent on concepts already captured in Stage 1. Our empirical results reveal that such a staged approach achieves a significant throughput advantage as a result of the deactivation of computationally expensive ensemble learner in Stage 2.
Two main versions of the stage learning paradigm, namely Staged Online Learner (SOL), and SOL with Incremental Fourier Classifier (SOL-IFC) is presented. The Stage 1 segment of both SOL and SOL-IFC is driven by a forest of Hoeffding decision trees, each of which has its own drift detector. The Stage 2 learning segment of SOL is carried out by Fourier spectral representations of learned decision trees and a derivative in the form of a single decision tree. Although Stage 2 processing overhead of SOL is much smaller than in Stage 1, peaks in processing time occur at concept drifts have to be prevented. Consequently, an extended version of SOL named as SOL-IFC introduces an innovative incrementally adaptive Fourier spectrum arrangement for Stage 2. The IFC comprises of novel noisy feature filtering and instance synopsis generation mechanisms that contribute to further performance enhancements. As per empirical evidence, SOL-IFC outperforms state-of-art algorithms on a combined metric of throughput and classification accuracy while avoiding the issue of peak processing times at drifts with a smoother incremental Fourier adaptation process.
The implementation independence of the staged learning approach is proven by the third version named as MLP-FC that replaces the decision tree forest with a neural network classifier during Stage 1.