Performance Evaluation and Extension of Cachejoin in a Real-Life Environment

Georgewashington, Solomon George

Files

Whole thesis (4.51 MB)

Date

2015

Authors

Georgewashington, Solomon George

Supervisor

Naeem, Muhammad Asif

Tegginmath, Shoba

Weber, Gerald

Item type

Thesis

Degree name

Master of Computer and Information Sciences

Publisher

Auckland University of Technology

Abstract

Active, or real-time, data warehousing is becoming very popular in the business intelligence domain. Building a real-time or active data warehouse requires online processing of a stream of end users' transactions against disk-based master data, also called semi-stream processing. Fundamentally, semi-stream processing joins incoming stream data (transactional data) with disk-based master data, which is slow to retrieve, using an effective join operator. Typically this join operator works with a limited amount of main memory, which cannot hold the entire disk-based master data. Recently a number of semi-stream join algorithms have been proposed in the literature. Most of these algorithms have been tested on synthetic datasets, and only a few on real-life datasets, so it is worthwhile to see how they behave in a real environment. Because each semi-stream join performs differently under different characteristics of the stream data, it is important to select an appropriate semi-stream join based on those characteristics. These join algorithms also use different strategies to access the disk-based master data, e.g. an index (clustered or non-clustered) or no index. Based on an intensive literature review, in this thesis we select a well-known semi-stream join, CACHEJOIN (Cache Join), and implement it at MITRE 10 NZ, one of the leading home improvement and hardware retail stores. We study the behavior and performance of the algorithm under two different datasets (a synthetic dataset and the MITRE 10 NZ dataset). Our performance study shows that CACHEJOIN's performance on the MITRE 10 NZ dataset is very close to its performance on the synthetic dataset.

As an extension of our work, we find that the MITRE 10 NZ incoming stream data (transactional data) needs to be joined with two tables in the disk-based master data. The first join is performed with the product table (sc) using stock_code as the join attribute, while the second join is performed with the customer table (cs_person) using account_code as the join attribute. This gives us an opportunity to extend our existing CACHEJOIN to a two-stage join, in which stream tuples move to the second stage as soon as they complete the first stage. The performance of the two-stage join is studied against normal CACHEJOIN using the MITRE 10 NZ dataset. After analyzing the performance, we are confident that the extended CACHEJOIN performs reasonably well in the MITRE 10 NZ real environment. As future work, we plan to explore the two-stage join further by trying different semi-stream joins to find the best join combinations, and to explore parallelization by running two parallel nodes to handle the future growth of MITRE 10 NZ transactional data.
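
The two-stage idea described in the abstract can be illustrated with a small sketch. This is not the thesis's CACHEJOIN implementation: it is a simplified stand-in that places a bounded LRU cache in front of each master table and chains two lookups, first on stock_code and then on account_code. The class CachedLookup, the function two_stage_join, the in-memory dictionaries standing in for the disk-based tables, and the columns product_name, customer_name, and qty are all hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class CachedLookup:
    """Bounded LRU cache in front of a slow master-data table (hypothetical helper)."""

    def __init__(self, master, capacity):
        self.master = master      # dict standing in for a disk-based table
        self.capacity = capacity  # maximum number of rows held in memory
        self.cache = OrderedDict()
        self.disk_reads = 0       # counts simulated disk accesses

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.disk_reads += 1             # cache miss: go to "disk"
        row = self.master.get(key)
        if row is not None:
            self.cache[key] = row
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return row


def two_stage_join(stream, products, customers):
    """Join each stream tuple with the product table, then the customer table."""
    for tup in stream:
        product = products.get(tup["stock_code"])        # stage 1
        if product is None:
            continue                                     # no match: drop tuple
        stage1 = {**tup, **product}
        customer = customers.get(stage1["account_code"])  # stage 2
        if customer is None:
            continue
        yield {**stage1, **customer}


# Assumed sample data; the real sc and cs_person tables are not shown in the source.
sc = {"S1": {"product_name": "Hammer"}, "S2": {"product_name": "Drill"}}
cs_person = {"A1": {"customer_name": "Ann"}, "A2": {"customer_name": "Bob"}}
stream = [
    {"stock_code": "S1", "account_code": "A1", "qty": 2},
    {"stock_code": "S1", "account_code": "A1", "qty": 1},
    {"stock_code": "S2", "account_code": "A2", "qty": 5},
    {"stock_code": "S9", "account_code": "A1", "qty": 1},  # unknown product
]

products = CachedLookup(sc, capacity=1)
customers = CachedLookup(cs_person, capacity=1)
results = list(two_stage_join(stream, products, customers))
```

In this sketch a tuple enters the second stage immediately after it matches in the first, which mirrors the pipeline behavior the abstract describes. Repeated keys are served from memory, so the disk_reads counters give a rough sense of how key skew in the stream affects master-data access.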

Keywords

Extension of Cachejoin, Semi-stream join

Permanent link

https://hdl.handle.net/10292/9257

Collections

Masters Theses
