Querying Heterogeneous Semi-Structured Data
by Vassilis Christophides, Michel Scholl and Anne-Marie Vercoustre
The central issue in the Aquarelle project is to provide uniform
access to heterogeneous collections of data on the Internet. The information
discovery system is based on a set of so-called Access Points which provide
a uniform Z39.50 based access to the various collections. This flat and
minimal access model has the advantage of simplicity but does not exploit
the benefits of querying archives using more structure and semantics.
In this minimal approach to heterogeneous data, Access Points are not
used to describe the various cultural objects but just as a support to
map local data structures into a 'common access model'. The advantages
are:
- legacy cultural databases (ie, archives) are not altered within the
Aquarelle network and integration of new information sources is trivial
- semantic discrepancies among heterogeneous data sources are roughly
captured by this minimal access model
- mediation between the Access Server and heterogeneous data servers
is simplified.
However, most of the structural and semantic richness of the
archives and folders is lost for querying, since no translation is
foreseen between the richly structured archives and folders on the one hand
and the Access Points view in the Access Server on the other. Yet the
structure of the data sources is extremely useful:
- for facilitating query refinement and improving precision, compared
to keyword or full-text based search
- for addressing fine-grained chunks of information more easily than
with hypertext navigation
- for enabling sophisticated data integration from various data sources.
New Trends in Querying Heterogeneous Data Sources
Providing integrated access to multiple, distributed, heterogeneous
databases and other information sources has been studied in the database
research community for well over a decade, from Multidatabase/Federated
approaches (Pegasus, Amos, Garlic) to new-generation mediator-based systems
(TSIMMIS, DISCO, Information Manifold).
The feature common to all multidatabase architectures is the existence
of a Canonical or Common Data Model (CDM) to reduce the complexity of the
problem of mapping data and commands between the different data models
and languages of the component sources. Such an approach is appropriate
for integrating a small number of sources whose structure is known and
stable.
New-generation systems aim to integrate a large number of sources whose
data may have no structure or only implicit structure, such as the Web or
Information Retrieval Systems (IRS). In this context, a global schema, or
even a set of federated schemas, is hard to implement. Instead, the emerging
mediation services embed the knowledge needed to process specific sources
of information. Each source is wrapped with a translator (or wrapper)
that logically converts the underlying data objects into a common information
format.
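The wrapper idea can be pictured with a minimal Python sketch. It is an
illustration only, not the architecture of TSIMMIS, DISCO or any other system
named above: the CommonRecord format, the two wrapper functions, the Mediator
class and the field names 'tl' and 'text' are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CommonRecord:
    """Hypothetical common information format shared by all wrappers."""
    source: str
    title: str
    body: str


def wrap_structured(doc: Dict[str, str]) -> CommonRecord:
    # Hypothetical structured source with fields named 'tl' and 'text'.
    return CommonRecord(source="structured", title=doc["tl"], body=doc["text"])


def wrap_plain_text(text: str) -> CommonRecord:
    # Hypothetical unstructured source: the first line is taken as the title.
    first, _, rest = text.partition("\n")
    return CommonRecord(source="plain", title=first.strip(), body=rest.strip())


class Mediator:
    """Keeps (wrapper, raw items) pairs and queries them uniformly."""

    def __init__(self) -> None:
        self._sources: List[Tuple[Callable, list]] = []

    def register(self, wrapper: Callable, items: Iterable) -> None:
        self._sources.append((wrapper, list(items)))

    def search_title(self, keyword: str) -> List[CommonRecord]:
        # Each raw item is converted by its wrapper before being filtered,
        # so the filter only ever sees the common format.
        keyword = keyword.lower()
        return [
            record
            for wrapper, items in self._sources
            for record in map(wrapper, items)
            if keyword in record.title.lower()
        ]
```

Adding a new source in this sketch only requires a new wrapper function;
the mediator and the queries stay unchanged.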
Querying and Integrating Heterogeneous Data with Incomplete Structure
In the context of Aquarelle, the Verso research team of INRIA has been
exploring a complementary approach to querying and integrating heterogeneous
databases: use of a language called POQL developed in the project on top
of the DBMS O2, as a first step towards querying data without complete
knowledge of their structure. The idea is that the user does not know in
advance which servers to query.
Instead of artificially mapping the structure and semantics of folders onto
Aquarelle Z39.50 Access Points (APs), one might use POQL for AP-based
queries. Because the structure does not have to be fully specified, several
sources that do not have the same structure can be integrated to a certain
extent. The power of this approach, which queries both structure
and data at the same time, is illustrated by the INRIA demo (http://cosmos.inria.fr:8080/poql.html)
on a database of the French Inventaire whose documents obey the SGML CI
DTD.
For instance, a user who wants to find all folders containing Cognac
in their title would issue the following query:
    select f
    from Folders{f}@P.#A(x)
    where x contains Cognac and name(#A) contains tl;
where Folders is the name of a folder server database, f is a variable
ranging over the folders, @P is a path variable that expresses
navigation through the unknown structure of folders, and #A is a
variable ranging over the attributes that end the paths. The filtering
condition then specifies that the required attributes (ie, the Access Points)
contain tl in their name and that the corresponding values x contain the string
Cognac. This is logically equivalent to defining an Access Point
title and mapping it to the related elements of folders,
for instance tl-cl (for classeurs), tl-th (for thematics),
tl-dos (for dossiers) and tl-obj (for objets) in the CI
DTD.
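The effect of the path variable and the attribute variable can be imitated
in a few lines of Python. This is only a sketch of the query's meaning, not
POQL and not the Verso implementation: folders are assumed to be nested
dictionaries and lists, the functions leaf_paths and select_folders and the
sample data are hypothetical, and "contains" is taken to be a case-sensitive
substring test.

```python
from typing import Any, Iterator, List, Tuple


def leaf_paths(node: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, str, str]]:
    """Yield (path, attribute name, value) for every string-valued attribute.

    'path' plays the role of the path variable @P, the attribute name that
    of #A, and the value that of x.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            if isinstance(child, str):
                yield path, key, child
            else:
                yield from leaf_paths(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from leaf_paths(child, path + (index,))


def select_folders(folders: List[dict], value_part: str, name_part: str) -> List[dict]:
    """Keep folders having, at any depth, an attribute whose name contains
    name_part and whose value contains value_part."""
    return [
        folder
        for folder in folders
        if any(
            name_part in name and value_part in value
            for _path, name, value in leaf_paths(folder)
        )
    ]


# Hypothetical folders with three different internal structures.
FOLDERS = [
    {"id": "f1", "dossier": {"tl-dos": "Ferme a Cognac"}},
    {"id": "f2", "classeur": [{"tl-cl": "Eglises de Saintes"}], "note": "Cognac"},
    {"id": "f3", "thematique": {"objets": [{"tl-obj": "Alambic de Cognac"}]}},
]

matches = select_folders(FOLDERS, "Cognac", "tl")
```

The call names neither tl-dos nor tl-obj, nor the paths leading to them.
Folder f2 is rejected because its only value containing Cognac sits under an
attribute whose name does not contain tl.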
The advantage of POQL is that one has to specify in the query neither
all the possibilities (tl-cl, tl-th, tl-dos, etc) nor the paths that lead
to those classeurs, thematics, etc. Furthermore, and
this is more important, POQL queries are to a certain degree independent
of the data structure.
Geo-Referenced Navigation and Querying
In many cultural heritage applications, folders are related to geographical
areas, ie, they are geo-referenced by a point or a zone depending on the
scale. The association of folders to geographical maps is useful for at
least two reasons:
- user interface, navigation, query refinement: instead of access point
based access to information, the user might want to navigate through a
geographical area zooming from a country scale down to a county scale before
deciding which folder(s) to access. At each scale, points featuring folders
are displayed on a background map
- querying: as in geographical information systems (GIS), the user might
want to combine the usual search criteria with spatial ones: "give
me the folders associated with the 18th century farms located within 5km
of Cognac".
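The example query in the second item combines ordinary criteria with a
distance test, which can be sketched in Python as follows. This is an
illustration under stated assumptions, not the prototype described here:
folders are assumed to be geo-referenced by a single point, the field names
and sample folders are hypothetical, the coordinates given for Cognac are
approximate, and distance is computed with the haversine formula on a
spherical Earth.

```python
import math
from typing import Dict, List

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

# Approximate coordinates of Cognac (assumption, for illustration only).
COGNAC = (45.6958, -0.3287)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def farms_near(folders: List[Dict], centre, radius_km: float, century: int) -> List[Dict]:
    """Combine the usual criteria (kind, century) with a spatial one."""
    lat0, lon0 = centre
    return [
        f for f in folders
        if f["kind"] == "farm"
        and f["century"] == century
        and haversine_km(lat0, lon0, f["lat"], f["lon"]) <= radius_km
    ]


# Hypothetical geo-referenced folders.
FOLDERS = [
    {"id": "farm-near", "kind": "farm", "century": 18, "lat": 45.70, "lon": -0.30},
    {"id": "farm-far", "kind": "farm", "century": 18, "lat": 45.75, "lon": -0.63},
    {"id": "church-near", "kind": "church", "century": 18, "lat": 45.70, "lon": -0.33},
    {"id": "farm-late", "kind": "farm", "century": 19, "lat": 45.69, "lon": -0.33},
]

nearby = farms_near(FOLDERS, COGNAC, radius_km=5.0, century=18)
```

A real GIS would use a spatial index and also handle zones; the linear scan
over points above is only meant to make the combination of criteria concrete.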
The current experiment by the French Ministry of Culture (Inventaire),
jointly with INRIA and Euroclid, aims at prototyping web access to
geo-referenced folders structured according to the CI SGML DTD,
including the two features above. The results of this experiment should
be considered in the Aquarelle context.
Please contact:
Michel Scholl INRIA
Tel: +33 01 39 63 53 29
E-mail: Michel.Scholl@inria.fr
Anne-Marie Vercoustre INRIA
Tel: +33 01 39 63 56 62
E-mail: Anne-marie.Vercoustre@inria.fr