Novel methods for distributed and privacy-preserving data stream mining
The growing number of “big” datasets presents many opportunities for data mining, but also raises a variety of new challenges. Datasets may take the form of continuous streams with constantly changing patterns, they may be too widely distributed to be centralised for analysis at a single location, or they may contain sensitive values that data owners are not willing to share due to privacy concerns. Much past research has considered these issues individually, but few existing methods address combinations of these properties. This research therefore develops methods for distributed and privacy-preserving data stream mining: a novel Hierarchical Distributed Stream Miner (HDSM) that learns relationships between the features of separate streams with minimal data transmission to central locations, and two data perturbation methods for privacy-preserving stream mining based on the combination of random projection, random translation, and additive noise. Experimental evaluation of HDSM demonstrates significant improvements in classification accuracy over existing distributed stream mining approaches while minimising data transmission and computational costs. HDSM’s ability to dynamically trade off accuracy against these costs is also demonstrated. Variations of the known input-output Maximum A Posteriori (MAP) attack are developed to experimentally evaluate the data perturbation methods, and the proposed composite methods are shown to achieve a better trade-off between privacy and model accuracy than random projection alone. Finally, an approach is described for combining HDSM with data perturbation to achieve distributed privacy-preserving stream mining.
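
As a rough illustration of what a composite perturbation of this kind might look like, the sketch below applies a random projection, then a random translation, then additive Gaussian noise to each stream record before release. The abstract does not specify how the two proposed methods combine these steps, so the ordering, the Gaussian projection matrix, the scaling, and every name here (`make_perturbation`, `n_projected`, `noise_std`) are assumptions made for illustration only.

```python
import random


def make_perturbation(n_features, n_projected, noise_std, seed=0):
    """Return a function that perturbs one record (hypothetical sketch)."""
    rng = random.Random(seed)
    # Assumed form: Gaussian random projection matrix, scaled by 1/sqrt(k).
    scale = 1.0 / n_projected ** 0.5
    projection = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
        for _ in range(n_projected)
    ]
    # Assumed form: one fixed random translation vector shared by all records.
    translation = [rng.gauss(0.0, 1.0) for _ in range(n_projected)]

    def perturb(record):
        # Step 1: project the record into the lower-dimensional space.
        projected = [
            sum(w * x for w, x in zip(row, record)) for row in projection
        ]
        # Steps 2 and 3: translate, then add fresh noise to each value.
        return [
            p + t + rng.gauss(0.0, noise_std)
            for p, t in zip(projected, translation)
        ]

    return perturb


# Perturb a tiny made-up stream of 4-feature records down to 3 values each.
perturb = make_perturbation(n_features=4, n_projected=3, noise_std=0.05)
stream = [[0.1, 0.5, 0.3, 0.9], [0.2, 0.4, 0.3, 0.8]]
released = [perturb(record) for record in stream]
```

In this sketch the projection matrix and translation vector stay fixed across the stream while the noise is drawn afresh for every record. Raising `noise_std` would be expected to make recovery of the original values harder at some cost to model accuracy, which is the kind of trade-off the evaluation above examines.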