Materialization strategies for web-based search computing applications
In this thesis we characterize view materialization in the context of multi-domain heterogeneous search applications. Web data view materialization is presented as a solution to the technical constraints and problems implied by the unknown structure of web data sources. The web data materialization model extends the search computing (SeCo) multi-layered model, in which search services are registered in a service repository that describes the functional information (e.g. invocation end-point, input and output attributes) of data end-points. Our first research goal is to solve the problem of finding a sequence of access patterns that, when executed, produces a materialization output. For this goal we make the following novel contributions: 1) formulation of the building blocks for materialization feasibility analysis; 2) definition of the materialization feasibility analysis method and its accompanying algorithms; 3) a detailed empirical study conducted on a set of materialization tasks of varying schema dependency complexity.
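As an illustration only, the first research goal can be pictured as a reachability check over access patterns: a pattern can be invoked once all of its input attributes are bound, and each invocation binds its output attributes for later patterns. The sketch below is a simplified assumption and not the feasibility analysis method defined in the thesis; the `AccessPattern` class, the `feasible_sequence` function, and the service and attribute names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical description of one way to invoke a search service."""
    name: str
    inputs: frozenset   # attributes that must be bound before the call
    outputs: frozenset  # attributes returned by the call


def feasible_sequence(patterns, seed_attributes, target):
    """Return an executable order of pattern names ending with `target`,
    or None if the target can never be invoked.

    A pattern is executable once every input attribute is bound, either
    by the seed values or by the outputs of patterns executed earlier.
    """
    bound = set(seed_attributes)
    remaining = list(patterns)
    sequence = []
    progress = True
    while progress:
        progress = False
        for pattern in list(remaining):  # iterate over a copy while removing
            if pattern.inputs <= bound:
                sequence.append(pattern.name)
                bound |= pattern.outputs
                remaining.remove(pattern)
                progress = True
                if pattern.name == target:
                    return sequence
    return None


# Hypothetical services: names and attributes are invented for this sketch.
patterns = [
    AccessPattern("TheatersByMovie", frozenset({"movie_title"}),
                  frozenset({"theater_address"})),
    AccessPattern("MoviesByGenre", frozenset({"genre"}),
                  frozenset({"movie_title"})),
    AccessPattern("RestaurantsByAddress", frozenset({"theater_address"}),
                  frozenset({"restaurant_name"})),
]

plan = feasible_sequence(patterns, {"genre"}, "RestaurantsByAddress")
no_plan = feasible_sequence(patterns, {"city"}, "RestaurantsByAddress")
```

Here `plan` lists the patterns in an order in which each one's inputs are already bound, while `no_plan` is `None` because no pattern accepts `city` as its only input. The sketch executes every pattern that becomes available, so the returned sequence may include calls that are not strictly needed to reach the target.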
Our second research goal is to optimize the materialization process so that the optimal sequence, in terms of materialization output efficiency and quality, is always the one executed. For this goal we make the following novel contributions: 1) formulation of a set of performance dimensions and their metrics for web source materialization; 2) a cost model that uses optimization metrics to qualitatively differentiate between web services in terms of materialization time; 3) a query optimization procedure that explores the characteristics of the underlying source data domain in order to prioritize the execution of the most productive queries in terms of their data harvesting power; 4) materialization process optimization strategies based on the web source performance dimension metrics and the query optimization procedure; 5) a detailed empirical study conducted on several relevant web-based data sources that clearly shows the effectiveness of the proposed solution.
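To make the idea of prioritizing the most productive queries concrete, the following sketch ranks candidate queries by an assumed "harvest rate", that is, expected new results per second of invocation time. This metric, the `CandidateQuery` class, and the example figures are hypothetical stand-ins chosen for illustration; the performance dimensions and cost model actually proposed in the thesis are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class CandidateQuery:
    """Hypothetical candidate query against a web data source."""
    keyword: str
    expected_new_results: float  # assumed estimate of not-yet-seen tuples
    expected_seconds: float      # assumed estimate of response time


def harvest_rate(query):
    """Assumed productivity metric: expected new results per second."""
    if query.expected_seconds <= 0:
        raise ValueError("expected_seconds must be positive")
    return query.expected_new_results / query.expected_seconds


def prioritize(queries):
    """Order queries so the most productive ones are executed first."""
    return sorted(queries, key=harvest_rate, reverse=True)


# Invented estimates, for illustration only.
candidates = [
    CandidateQuery("comedy", expected_new_results=100, expected_seconds=2.0),
    CandidateQuery("drama", expected_new_results=30, expected_seconds=0.5),
    CandidateQuery("western", expected_new_results=10, expected_seconds=1.0),
]

ordered_keywords = [q.keyword for q in prioritize(candidates)]
```

In this toy setting "drama" is scheduled first because it is expected to yield 60 new results per second, ahead of "comedy" at 50 and "western" at 10. A real materialization run would need to re-estimate these figures as results arrive, since the number of unseen tuples shrinks with every executed query.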