Summary: | In recent years, the number of publicly accessible databases on the Internet has grown dramatically, and all indicators suggest that this growth will continue in the years to come. Unfortunately, retrieving information from these databases is not easy, for several reasons. The first complication is distribution. Not every query can be answered by the data in a single database. Useful relations may be broken into fragments that are distributed among distinct databases. In horizontal fragmentation, the rows of a relation are split across multiple databases. In vertical fragmentation, the columns are split. Distributed databases can exhibit mixtures of these types of fragmentation. A second complication in database integration is heterogeneity, which may be notational or conceptual. Notational heterogeneity concerns access languages and protocols: one source may require SQL while another requires OQL and a third uses an ad hoc notation. This sort of heterogeneity can usually be handled through commercial products (such as the Sybase Openserver). However, even if we assume that all databases use a standard language and protocol, there can still be conceptual heterogeneity, i.e., differences in the relational schema and vocabulary. Distinct databases may use different words to refer to the same concept, and/or they may use the same word to refer to different concepts. Reassembling the distributed fragments of a database in the face of heterogeneity is doubly difficult. Mediation is a technology that inserts intelligent processing modules, called mediators, between servers and clients to provide value-added processing. A number of contractors now have the capability to build the required application interfaces and to implement the architecture. The platforms and languages used vary, and there is some discussion of style, such as whether fat or thin mediators are preferable. These contractors interact with their customers to acquire domain knowledge. 
As more implementations enter practice, the infrastructure grows, and we expect that mediators can be installed rapidly and maintained by their owners. The main goal of this research is to transform the problem of answering queries using views into a semantic query optimization problem (which we call semantic query transformation, since it interleaves the query planning and query execution processes) and to show that additional semantic knowledge in the form of integrity constraints can help generate more efficient query plans. Such plans are suited to data integration systems over network-bound, autonomous data sources, ranging from conventional databases on a LAN or intranet to Web-based sources (both HTML and XML) across the Internet. In pursuing this goal, four derived goals were identified: (1) to present a language for modeling the contents of the information sources; (2) to propose algorithms that transform the problem of answering queries using views into a semantic query optimization problem; (3) to extend the algorithms to find maximally contained query plans in the presence of functional dependencies in the world schema; and (4) to test the completeness and soundness of the algorithms.
|