Resource Optimization in Heterogeneous Distributed Data Stream Mining with Performance Assessment on Asthma Hospitalization Predictive Modeling
The Big Data era has presented many opportunities for data mining techniques to discover knowledge patterns across an exponentially growing and diverse collection of data. In many application domains, data is distributed across geographical locations; the data collected at each location may therefore differ in nature from that of its peer nodes in the network, causing heterogeneity in the data. Such heterogeneous distributed databases require mining methods distinct from those used for traditional homogeneous distributed databases, where the structure is identical across locations. Current demand is to analyze data in real time; consequently, stream computing is becoming a popular choice. Data generated continuously and at a high pace at distributed sites are termed distributed data streams. Recent approaches to Distributed Data Mining (DDM) have focused on addressing the heterogeneous nature of data sources. However, such approaches do not prioritize reducing data communication costs, which can be prohibitive in large-scale sensor networks where bandwidth is a limited resource. In fact, high communication and computational costs are the two most prominent problems encountered in heterogeneous distributed environments. Reducing communication in a distributed environment tends to adversely affect classification accuracy; the research challenge therefore lies in balancing transmission cost, computational cost, and accuracy.
This research addresses the heterogeneous distributed data mining problem and extends to the case where data arrives continuously in streaming mode. We propose a suite of algorithms that address specific issues in mining data from heterogeneous distributed streaming settings. Our experiments show that efficiency gains in communication and resource time can be achieved across a wide range of datasets. The first algorithm, Performance Optimizer in Distributed Stream Mining (PODSM), is rooted in Bayesian inference and aims to reduce communication volume and resource time in a heterogeneous DDM environment while retaining prediction accuracy. For one of the datasets, it reduced communication by 34.66% and saved nearly 27% in resource time. The second algorithm, Minimized Tree for Distributed Mining (MTDM), provides an efficient and robust method for learning the relationships among distributed sites using a tree. For one dataset, it saved 37.65% in resource time while improving accuracy by 1.33%. To assess the algorithms’ competency, we validated them on a case study built from real-world datasets that classifies demand for asthma-related emergency hospitalizations as Low or High. PODSM and MTDM attained considerable savings in communication and resource time while preserving accuracy, demonstrating their potential to achieve a good trade-off between accuracy and resource utilization. The study concludes that PODSM and MTDM are proficient at jointly servicing heterogeneous distributed data sources in resource-constrained scenarios. Moreover, the algorithms’ ability to balance accuracy, communication, and resource time makes them flexible enough for a diverse range of applications.