Links

 
DB2 article
 
Federation and schema
 
Oracle virtual directory

 

Paper Summaries

 

A federated architecture for database systems (McLeod & Heimbigner, 1980)

 

This paper details a federated approach to managing decentralized database systems. A decentralized database is a collection of structured information that may be logically and/or physically distributed. Logical decentralization is the division of the database into components that can be accessed separately, while physical decentralization concerns the allocation of data to different nodes. First, arguments favoring both types of decentralization are presented. Then a federated system is described as consisting of a number of logical components that are related but independent, each with its own component schema defined by a component DBA. The principal goal of each component is to satisfy its most frequent and important users (usually the local ones). These disparate components are held together by a federal schema, which is supported by the federal controller and administered by a federal DBA. The federal DBA resolves conflicts between components, defines the federal schema(s), relates them to component schemas, and specifies each component's interface.

 

There are many options for logical distribution – a single, global federal schema, a separate federal schema for each pair of components, a federal schema associated with each component, or a hierarchy of federal schemas. Given a federated setup, there is a range of possible views that a given component user can have. At one end, the federal and local schemas are so well integrated that the user cannot tell whether local or non-local data is being accessed; in this setup, the user could be oblivious to potentially expensive non-local references. At the other end, the federal schema is separate from the local one, so the user must address each one explicitly.

 

With physical distribution, better performance results from placing data close to its principal sources and users. Moreover, storing data redundantly provides reliability and survivability. However, the complexity introduced by duplicate data, as well as by the combination of logical and physical decentralization, must be addressed.

 

The federal controller is the entity in charge of the federation. The controller should perform seven steps (specified in the paper) for each request from a component. There are three approaches to federal controller placement – running it on a dedicated node in the network, co-locating it on a node with a component, or distributing parts of it across different components.

 

The component controllers should have three important features – allowing local users and the federal controller concurrent access to data, communicating results back to the federal controller, and recognizing locally issued requests that require the federal schema and forwarding them to the federal controller.

 

Querying Distributed RDF Data Sources with SPARQL (Quilitz & Leser, 2007)

 

The paper describes the architecture and functioning of DARQ, a first-of-its-kind SPARQL query engine that provides transparent query access to multiple, distributed SPARQL endpoints. First, a brief overview of existing techniques for relational databases and of the relevant features of the SPARQL query language is provided.

 

A query is processed in four stages – parsing, query planning, optimization, and query execution. DARQ has the architecture of a mediator-based information system (MBIS), with the query engine playing the role of the mediator. Non-RDF data sources are wrapped with tools such as D2R and SquirrelRDF. DARQ has access to service descriptions of the data sources, which it uses for query planning and optimization.

 

A service description, represented in RDF, contains data descriptions in the form of capabilities. Capabilities are expressed as a set of tuples (p, r), where p is a predicate occurring in the data source and r is a constraint on subjects and objects. A service description also contains statistical information: the total number of triples in the data source and, optionally for each capability, the number of triples with the given predicate, the selectivity if the subject is bound, and the selectivity if the object is bound. DARQ also supports querying data sources whose service descriptions declare limitations on access patterns.
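
 

A minimal sketch of how such a service description might be modeled in code (the class and field names below are illustrative assumptions, not DARQ's actual representation):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Capability:
    """One (p, r) capability: a predicate plus a constraint on subjects/objects."""
    predicate: str                              # predicate URI occurring in the source
    constraint: Optional[str] = None            # e.g. a filter on subject and object values
    n_triples: Optional[int] = None             # optional: triples with this predicate
    sel_subject_bound: Optional[float] = None   # optional: selectivity if the subject is bound
    sel_object_bound: Optional[float] = None    # optional: selectivity if the object is bound

@dataclass
class ServiceDescription:
    """Statistics and capabilities a DARQ-style planner uses for source selection."""
    endpoint: str                               # SPARQL endpoint URL
    total_triples: int                          # total number of triples in the source
    capabilities: list[Capability] = field(default_factory=list)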

 

Query planning – the process of deciding which data sources can contribute to answering a query – is performed separately for each filtered basic graph pattern in a SPARQL query. Relevant data sources are found by matching triple patterns against the capabilities of the data sources, comparing their predicates. As a consequence, DARQ only supports queries with bound predicates. The results of source selection are used to build sub-queries, each consisting of a set of triple patterns, a set of value constraints, and a data source identifier, which are sent to the appropriate data sources.
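
 

The source-selection step can be sketched as follows (hypothetical endpoints and a simplified capability representation; not DARQ's actual implementation):

# Illustrative sketch of predicate-based source selection.
capabilities = {
    "http://people.example.org/sparql": {"foaf:name", "foaf:mbox"},   # hypothetical endpoint
    "http://papers.example.org/sparql": {"dc:creator", "dc:title"},   # hypothetical endpoint
}

def select_sources(triple_pattern):
    """Return every endpoint whose capabilities mention the pattern's predicate."""
    s, p, o = triple_pattern
    if p.startswith("?"):
        # An unbound predicate has nothing to match against, which is why
        # DARQ requires bound predicates.
        raise ValueError("unbound predicate - cannot match against capabilities")
    return [endpoint for endpoint, preds in capabilities.items() if p in preds]

# The pattern below would be routed only to the papers endpoint.
print(select_sources(("?pub", "dc:creator", "?author")))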

 

The optimizer attempts to build a feasible and cost-effective query plan after considering access limitations. Optimization consists of logical and physical optimization. In logical optimization, the query is first rewritten so that basic graph patterns are merged; second, value constraints are moved into sub-queries, when possible, to reduce the size of intermediate results. Physical optimization aims to find the best execution plan and uses a cost model to do so. Since network latency and bandwidth have the highest impact on execution time when querying distributed sources, the goal of DARQ is to reduce the amount of data transferred and the number of data transmissions. Iterative dynamic programming is used for optimization, and two join implementations are supported – nested-loop join and bind join. Only the latter is possible for sources with limitations on access patterns. A simple cost model is used to estimate the size of a query result.
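
 

The difference between the two join strategies can be sketched as follows (modeling sub-query results as lists of variable-binding dictionaries is an assumption made for illustration; DARQ's internals differ):

# In a real engine each call to a sub-query is a network round trip, so the
# bind join trades more requests for much smaller transferred results.

def nested_loop_join(left_results, right_subquery):
    """Fetch the entire right side once, unconstrained, then match locally."""
    right_results = right_subquery({})          # no bindings passed along
    out = []
    for lrow in left_results:
        for rrow in right_results:
            if all(rrow.get(k, v) == v for k, v in lrow.items()):
                out.append({**lrow, **rrow})
    return out

def bind_join(left_results, right_subquery):
    """Send each left-hand binding into the right sub-query (works with bound-only access)."""
    out = []
    for lrow in left_results:
        for rrow in right_subquery(lrow):        # right side already filtered remotely
            out.append({**lrow, **rrow})
    return out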

 

Experiments with four queries show that the optimizations significantly improve execution time. For the non-optimized queries, all sub-queries were combined with bind joins in their order of appearance. While all optimized queries returned results, three of the four non-optimized queries timed out after 10 minutes. It should be noted that only limited statistical information was used for optimization. Also, optimized queries would not outperform non-optimized ones if the sub-queries are very unselective.

 

Future work could improve the amount and usage of statistical information and add mapping and translation rules between the vocabularies of different SPARQL endpoints.

 

Sharing Data on the Grid Using Ontologies and distributed SPARQL Queries (Langegger, Blochl & Woss, 2007)

 

Grid computing refers to efforts to bring the disparate resources connected by the Internet together to realize their collaborative potential. For scientific collaborations, large-scale sharing of scientific data between grids is needed. Virtual integration of distributed, heterogeneous data sources based on Semantic Web technology would facilitate such sharing. The authors propose a middleware that can enable such integration. Such middleware must be able to map local data models and schemas to several global domain ontologies.

 

As a first step, local ontologies were created and then mapped to global concepts – an approach that became too complex due to the inherent heterogeneity. As a second approach, a mediator-wrapper architecture was chosen. In this setup, clients connect to the mediator and submit SPARQL queries expressed against global domain ontologies. The mediator parses the query and generates multiple query plans using an iterative dynamic programming algorithm. The various wrappers help overcome model and schema heterogeneity by providing source-specific access methods. For data sources that do not offer direct access to their data, the wrapper is part of the mediator. The mediator has a local cache (to perform global joins) as well as a catalog, which contains the global domain ontologies and information about registered data sources.
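
 

A very rough sketch of the mediator's control flow under this setup (all names are illustrative assumptions, not the middleware's actual API):

def natural_join(left, right):
    """Join two lists of binding dictionaries on their shared variables."""
    return [{**l, **r} for l in left for r in right
            if all(r.get(k, v) == v for k, v in l.items())]

def answer_query(subplans):
    """subplans: list of (wrapper, subquery) pairs chosen by the query planner."""
    cache = []                                   # local cache of partial results
    for wrapper, subquery in subplans:
        cache.append(wrapper.execute(subquery))  # each wrapper hides its source's model
    result = [{}]                                # identity element for the join
    for partial in cache:
        result = natural_join(result, partial)   # global join performed at the mediator
    return result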

 

The authors acknowledge that this work is only a first step and that performance tests, extensions, and improvements in optimization will be needed to prove the effectiveness of this approach.

 

Query Rewriting for Semantic Web Information Integration (Kolas, 2007)

 

This paper describes attempts to apply existing work on information integration systems to information integration using Semantic Web technologies. First, it presents a mediator-wrapper architecture to which the type of query rewriting the paper describes is applicable. Semantic Query Decomposition (SQD), the mediator, receives the query, derives a query plan over the various sources, and optimizes and executes this plan. Queries are expressed in SPARQL, using terms from the domain ontology (defined in OWL). Data sources are described by data source ontologies, also in OWL. Mappings from the source ontologies to the domain ontology are described by rules. As far as wrappers go, the paper assumes that a functioning Semantic Bridge for Relational Databases (SBRD) and Semantic Bridge for Web Services (SBWS) exist and are capable of performing the appropriate mappings.

 

Then, two approaches for restricting the mappings between local and target schemas to make query rewriting more tractable – Global-as-View (GAV) and Local-as-View (LAV) – are presented. The advantages and disadvantages of each approach with respect to a number of goals – an independent domain ontology, mutually independent data-source-to-domain-ontology mappings, extracting all meaning from the source into the domain ontology, and a natural mapping between source and domain ontologies – are laid out, and a case is made for a modified GAV approach. Independence of the domain ontology and mutually independent data-source-to-domain-ontology mappings are achieved by removing restrictions of a traditional GAV rule in SWRL and by adding other restrictions. Further restrictions are applied to glean negations from information contained in relational databases, which operate under a closed-world assumption, unlike Semantic Web technologies, which are based on an open-world assumption and monotonic reasoning.

 

A working prototype based on the architecture has been developed. SQD, SBRD, and SBWS are in the works. 

 

Applying Semantic Web in Mobile and Ubiquitous Computing: Will Policy-Awareness Help? (Lassila, 2005)

 

Mobile and ubiquitous computing presents many technological challenges that are not present in the personal computing model. Much attention has been focused on improving the user interfaces and interaction of these devices; however, users are often "attention-constrained" while using them, so autonomous operation of mobile devices is an area that deserves more attention. Automation is best enabled when systems are more interoperable.

 

In mobile computing environments, devices should be capable of dynamically discovering other systems and forming coalitions with them, including devices and services not encountered before, without unnecessary human interaction. In order to achieve such "serendipitous interoperability", qualitatively stronger representations of device and service descriptions are necessary. Semantic Web Services may serve as a more suitable paradigm for representing device functionality than the traditional standards-based approach, which requires anticipating future scenarios.

 

Context awareness can prove useful for ubiquitous computing in adapting to different usage situations and locations along multiple dimensions of system operation – information retrieval, user interfaces, service discovery, security & privacy, and automation & autonomy. Semantic Web techniques pertaining to policy-awareness can boost mobile and ubiquitous computing in multiple ways as well – access, autonomy, and contracts.

 

Semantic Integration of Relational Data using SPARQL (Weng, Miao, Zhang, & Lu, 2008)

 

Providing users with uniform access to heterogeneous data sources requires understanding and interacting with disparate source schemas. This paper provides a solution for data integration based on a mediator-wrapper architecture. An ontology is used as the mediated schema for describing data sources. A logic-based approach is used since existing work on query rewriting and query answering using views is not applicable to an ontology-based mediated schema.

 

There are three sub-problems in data integration – the mediated schema, the query language, and the query rewriting algorithm. As mentioned above, an ontology serves as the mediated schema, and SPARQL fits nicely as the query language. The third problem involves rewriting a conjunctive query using views; MiniCon is an effective algorithm for this task and has been refined in this work.

 

The architecture is comparable to other mediator-wrapper architectures. However, the query processor contains, in addition to the parser, query rewriter, and query dispatcher, a translator that converts SPARQL to Datalog; the translator is necessary since MiniCon operates on Datalog. The mediated schema provides a uniform query interface and serves as a shared vocabulary for the wrappers. Data source descriptions are expressed as SPARQL queries. Wrappers translate incoming queries into source-specific queries. The translation algorithm used to convert SPARQL to Datalog is presented in the paper.
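
 

The core idea behind such a translation can be sketched as follows (this is the standard encoding of a basic graph pattern over a ternary triple predicate, not necessarily the paper's exact algorithm):

# Terms starting with '?' become Datalog variables (uppercased);
# everything else becomes a constant.

def term(t):
    return t[1:].upper() if t.startswith("?") else f"'{t}'"

def bgp_to_datalog(head_vars, triple_patterns, rule_name="q"):
    """Encode a basic graph pattern as one conjunctive Datalog rule over triple/3."""
    head = f"{rule_name}({', '.join(term(v) for v in head_vars)})"
    body = ", ".join(f"triple({term(s)}, {term(p)}, {term(o)})"
                     for s, p, o in triple_patterns)
    return f"{head} :- {body}."

# SELECT ?name WHERE { ?person foaf:name ?name . ?person foaf:mbox ?mbox }
print(bgp_to_datalog(["?name"],
                     [("?person", "foaf:name", "?name"),
                      ("?person", "foaf:mbox", "?mbox")]))
# -> q(NAME) :- triple(PERSON, 'foaf:name', NAME), triple(PERSON, 'foaf:mbox', MBOX).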

 

SPARQL features such as CONSTRUCT and ASK were not used in this work. The architecture may be extended to RDF, XML, web forms, etc.

 

OptARQ: A SPARQL Optimization Approach based on Triple Pattern Selectivity Estimation (Bernstein, Kiefer, & Stoker, 2007)

 

A technical report that explains a static optimization approach called OptARQ. The aim of OptARQ is to reduce the intermediate result sets produced by triple patterns. The report is organized as follows. First, the concept of selectivity is explained, followed by a description of selectivity estimation. Then, a cost function, which is used to rank triple patterns, is defined. The overall cost of a triple pattern is the product of the costs of its subject, predicate, and object (see the paper for the definition of each). A statistical model that makes use of the cost function and its components is described. The optimization consists of rewriting FILTER variables, moving FILTER conditions closer to the relevant triple patterns, and reordering triple patterns by selectivity. Only FILTER expressions connected by a logical AND operator can be decomposed. Moreover, a value in a FILTER expression can be used to replace a variable in a triple pattern only if the equality (=) operator is used in the expression. Lastly, a variable cannot be substituted if it is referenced elsewhere in the query.
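
 

The reordering step can be sketched as follows (the selectivity values are crude placeholders, not the report's actual estimation model):

# Illustrative sketch of selectivity-based reordering of triple patterns.
# OptARQ derives its estimates from a precomputed statistical model of the dataset.

def component_selectivity(component, stats):
    if component.startswith("?"):
        return 1.0                    # an unbound component does not restrict results
    return stats.get(component, 0.1)  # a bound component: estimated selectivity

def pattern_cost(triple_pattern, stats):
    s, p, o = triple_pattern
    # Overall cost is the product of the subject, predicate, and object selectivities.
    return (component_selectivity(s, stats) *
            component_selectivity(p, stats) *
            component_selectivity(o, stats))

def reorder(patterns, stats):
    """Execute the most selective (lowest-cost) triple patterns first."""
    return sorted(patterns, key=lambda tp: pattern_cost(tp, stats))

stats = {"foaf:name": 0.3, "rdf:type": 0.9, "ex:Alice": 0.001}        # hypothetical values
patterns = [("?x", "rdf:type", "?t"), ("ex:Alice", "foaf:name", "?n")]
print(reorder(patterns, stats))       # the more selective pattern is moved to the front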

 

OptARQ was compared against ARQ, KAON2, Sesame SPARQL, and Sesame SeRQL using the SwetoDblp dataset. Given the same non-optimized queries as input, OptARQ executed them faster than the other four frameworks.

 

This work has some important limitations. SPARQL variables constrained by an inequality operator are not considered when estimating selectivity. The calculations of subject and object selectivity are very basic and imprecise. The framework also does not consider the behavior of triple pattern selectivity for joined variables. Lastly, the optimizer fails to optimize queries that contain the OPTIONAL or UNION keywords.

 

Design and Implementation of the CALO Query Manager (Ambite et al., 2006)

 

Networked Graphs: A Declarative Mechanism for SPARQL Rules, SPARQL Views and RDF Data Integration on the Web (Schenk & Staab, 2008)

 

A Semantic Web Middleware for Virtual Data Integration on the Web (Langegger, Woss, & Blochl, 2007)