HYBRIDJOIN for Near Real-time Data Warehousing
To make timely and effective decisions, businesses need the latest information from their data warehouse repositories. Keeping these repositories up to date with end-user updates requires near real-time data integration. An important phase of near real-time data integration is data transformation, in which the stream of updates is joined with disk-based master data. The stream-based algorithm Hybrid Join (HYBRIDJOIN) performs well in general but has not been optimized for real-world conditions. In real-world markets, a few products sell far more frequently than the rest, so a large number of sales transactions relate to a small portion of the master data. In the transformation phase, to join the input stream of sales transactions with the disk-based master data, HYBRIDJOIN loads that frequently accessed part of the master data from disk each time it is needed, which significantly increases the disk access cost and degrades performance. In contrast, X-HYBRIDJOIN stores that part of the master data permanently in memory, significantly reducing the disk access overhead. To validate these arguments and analyze the performance of X-HYBRIDJOIN, an experimental study is conducted.
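
The effect of keeping the frequently accessed part of the master data in memory can be illustrated with a toy Python sketch. It is not the HYBRIDJOIN or X-HYBRIDJOIN algorithm and does not reproduce their internals; it only counts simulated disk reads for a skewed stream of sales tuples. The names `DiskMasterData`, `join_always_from_disk`, `join_with_resident_hot_part`, and the sample data are all hypothetical and chosen for illustration.

```python
class DiskMasterData:
    """Hypothetical stand-in for disk-based master data; counts simulated reads."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.disk_reads = 0

    def read(self, key):
        # Every call models one disk access.
        self.disk_reads += 1
        return self._rows.get(key)


def join_always_from_disk(stream, master):
    """Join each stream tuple by reading its master row from disk every time."""
    return [(key, qty, master.read(key)) for key, qty in stream]


def join_with_resident_hot_part(stream, master, hot_keys):
    """Load the rows for hot_keys into memory once; read other keys from disk."""
    resident = {k: master.read(k) for k in hot_keys}  # one-time load
    joined = []
    for key, qty in stream:
        row = resident[key] if key in resident else master.read(key)
        joined.append((key, qty, row))
    return joined


# Assumed sample data: 100 products, 80 of 100 sales hit products p0 and p1.
rows = {f"p{i}": f"product-{i}" for i in range(100)}
stream = [("p0", 1)] * 50 + [("p1", 1)] * 30 + [(f"p{i}", 1) for i in range(2, 22)]

baseline = DiskMasterData(rows)
baseline_result = join_always_from_disk(stream, baseline)

cached = DiskMasterData(rows)
cached_result = join_with_resident_hot_part(stream, cached, ["p0", "p1"])

print(baseline.disk_reads, cached.disk_reads)  # 100 22
```

In this assumed workload both joins produce the same output, but the version with a memory-resident hot part performs 22 simulated disk reads (2 for the one-time load plus 20 for the infrequent products) instead of 100. The saving depends entirely on how skewed the stream is.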