Performance evaluation and extension of Cachejoin in a real-life environment
Active, or real-time, data warehousing is becoming very popular in the business intelligence domain. Building a real-time or active data warehouse requires online processing of a stream of end users' transactions against disk-based master data, also called semi-stream processing. Fundamentally, semi-stream processing joins incoming stream data (transactional data) with disk-based master data, which is slow to retrieve, using an effective join operator. Typically this join operator works with a limited amount of main memory that cannot hold the entire disk-based master data. A number of semi-stream join algorithms have recently been proposed in the literature. Most of them have been tested on synthetic datasets and only a few on real-life datasets, so it is interesting to see how they behave in a real environment. Because each semi-stream join performs differently under different characteristics of the stream data, it is important to select an appropriate semi-stream join based on those characteristics. These join algorithms also use different strategies to access the disk-based master data, e.g. an index (clustered or non-clustered) or no index. Based on an intensive literature review, in this thesis we select a well-known semi-stream join, CACHEJOIN (Cache Join), and implement it at MITRE 10 NZ, one of the leading home improvement and hardware retail stores. We study the behavior and performance of the algorithm under two different datasets (a synthetic dataset and the MITRE 10 NZ dataset). Our performance study shows that CACHEJOIN's performance under the MITRE 10 NZ dataset is very close to its performance under the synthetic dataset. As an extension of our work, we find that the MITRE 10 NZ incoming stream data (transactional data) needs to be joined with two tables in the disk-based master data.
The first join is performed with the product table (sc) using stock_code as the join attribute, while the second join is performed with the customer table (cs_person) using account_code as the join attribute. This gives us an opportunity to extend our existing CACHEJOIN to a two-stage join, in which stream tuples move to the second stage as soon as they complete the first stage. The performance of the two-stage join is studied against the normal CACHEJOIN using the MITRE 10 NZ dataset. After analyzing the performance, we are confident that the extended CACHEJOIN performs reasonably well in the MITRE 10 NZ real environment. As future work, we plan to explore the two-stage join further by trying different semi-stream joins to find the best join combinations, and to explore parallelization by running two parallel nodes to handle the future growth of MITRE 10 NZ transactional data.
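The two-stage idea described above can be illustrated with a minimal sketch. It is not the CACHEJOIN algorithm itself: each stage is simplified to a bounded least-recently-used cache placed in front of a simulated disk lookup, which is an assumption made only to show how a stream tuple flows from the stock_code join into the account_code join under a limited memory budget. The class and function names, the `description` and `customer_name` columns, and the sample rows are all hypothetical.

```python
from collections import OrderedDict


class CachedLookupStage:
    """One join stage: a bounded in-memory cache in front of slow master data."""

    def __init__(self, master, key, capacity):
        self.master = master      # stand-in for a disk-based table: {key value: row}
        self.key = key            # join attribute, e.g. "stock_code"
        self.capacity = capacity  # memory budget, counted in master rows
        self.cache = OrderedDict()
        self.disk_reads = 0       # number of simulated disk accesses

    def lookup(self, value):
        if value in self.cache:
            self.cache.move_to_end(value)       # mark as most recently used
            return self.cache[value]
        self.disk_reads += 1                    # cache miss: go to "disk"
        row = self.master.get(value)
        if row is not None:
            self.cache[value] = row
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used row
        return row

    def join(self, tuples):
        for t in tuples:
            row = self.lookup(t[self.key])
            if row is not None:                 # inner join: drop unmatched tuples
                yield {**t, **row}


def two_stage_join(stream, products, customers, capacity=2):
    """Join each stream tuple with products (sc), then customers (cs_person)."""
    stage1 = CachedLookupStage(products, "stock_code", capacity)
    stage2 = CachedLookupStage(customers, "account_code", capacity)
    # The generators are chained, so a tuple enters stage 2 as soon as it
    # leaves stage 1, without waiting for the rest of the stream.
    return list(stage2.join(stage1.join(stream))), stage1, stage2


if __name__ == "__main__":
    products = {"P1": {"description": "Hammer"}, "P2": {"description": "Nails"}}
    customers = {"A1": {"customer_name": "Alice"}}
    stream = [
        {"stock_code": "P1", "account_code": "A1", "qty": 1},
        {"stock_code": "P1", "account_code": "A1", "qty": 2},
        {"stock_code": "P9", "account_code": "A1", "qty": 3},  # no matching product
    ]
    joined, s1, s2 = two_stage_join(stream, products, customers)
    for row in joined:
        print(row)
    print("disk reads:", s1.disk_reads, s2.disk_reads)
```

In this sketch a repeated stock_code or account_code is served from memory, which mirrors the intuition that frequently occurring keys in the transaction stream benefit most from caching; the disk-probing phase of a real semi-stream join is deliberately left out.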