HYBRIDJOIN for Near Real-time Data Warehousing
Abstract
To make timely and effective decisions, businesses need the latest information from their data warehouse repositories. Keeping these repositories up to date with end-user updates requires near real-time data integration. An important phase of near real-time data integration is data transformation, in which the stream of updates is joined with disk-based master data. The stream-based algorithm Hybrid Join (HYBRIDJOIN) performs well in general but has not been optimized for real-world conditions. In real-world markets, a few products sell far more frequently than the rest, so a large number of sales transactions relate to a small portion of the master data. In the transformation phase, when joining the input stream of sales transactions with disk-based master data, HYBRIDJOIN loads that frequently accessed part of the master data from disk each time it is needed, which significantly increases the disk access cost and degrades performance. In contrast, X-HYBRIDJOIN stores that part of the master data permanently in memory, significantly reducing the disk access overhead. An experimental study is conducted to validate these arguments and to analyze the performance of X-HYBRIDJOIN.
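The effect described in the abstract can be illustrated with a small sketch. This is a deliberately simplified toy model, not the HYBRIDJOIN or X-HYBRIDJOIN algorithm itself: the function name `join_stream`, the dictionary standing in for disk-based master data, and the one-read-per-lookup cost model are all assumptions made for illustration. It only shows why pinning the frequently accessed master rows in memory lowers the number of simulated disk reads when the stream is skewed.

```python
from typing import Dict, Iterable, List, Set, Tuple


def join_stream(
    stream: Iterable[int],
    master: Dict[int, str],
    pinned_keys: Set[int],
) -> Tuple[List[Tuple[int, str]], int]:
    """Join stream keys with master data and count simulated disk reads.

    Hypothetical cost model: every lookup in `master` counts as one disk
    read; rows in `pinned_keys` are read once up front and then served
    from memory.
    """
    # One-time load of the pinned (frequently accessed) rows into memory.
    pinned = {key: master[key] for key in pinned_keys}
    disk_reads = len(pinned)

    joined: List[Tuple[int, str]] = []
    for key in stream:
        if key in pinned:
            row = pinned[key]      # served from memory, no disk access
        else:
            row = master[key]      # stands in for a disk access
            disk_reads += 1
        joined.append((key, row))
    return joined, disk_reads


# Skewed toy stream: product 1 accounts for most of the sales.
master = {1: "A", 2: "B", 3: "C", 4: "D"}
stream = [1, 1, 2, 1, 3, 1, 1, 4, 1, 2]

rows_plain, reads_plain = join_stream(stream, master, set())
rows_pinned, reads_pinned = join_stream(stream, master, {1})
print(reads_plain, reads_pinned)
```

Both calls produce the same joined rows; only the number of simulated disk reads differs. How the frequently accessed part of the master data is actually identified and sized is not stated in the abstract, so the pinned set is simply passed in here.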