Metadata: An Overview and some Issues
by Keith G Jeffery
Information systems today face large problems. There is a need
to manage and exploit the explosion of information appearing on
multiple Web sites with highly variable standards of data quality
and currency - and of course there is the need to know that such
data sources exist. The problem has four major aspects: Data
Quality, Query Quality, Answer Quality and the Integration of
Heterogeneous Sources.
Data quality: Data quality can be improved by better data collection
facilities (including help and explanation with examples) and
better validation controlled by constraints, with automated conversion
of unit values where required.
Query quality: Query quality can be improved not only by classical
query optimisation using knowledge about the database size and
structure, but also by assisting the user to formulate the query
best to meet the requirement - by means of online help, explanation,
examples.
Answer Quality: The answer to a query commonly includes values
and structures that are unfamiliar to the user; explanations and
help, and hyperlinked descriptions of units, precision, calibration
or similar terms could help the user to understand the results
better.
Integration of Heterogeneous Sources: First, there is the need
to know that a source exists and to know something of its characteristics.
Heterogeneous data sources commonly have disparate schemas and
there is a need to understand the differences, even when apparently
reconciled by one of the integration techniques.
The Solution - Metadata
For all of the above to be realised, there is one essential and
common ingredient: metadata. Let us consider briefly how it may
be used in each of the cases:
Metadata for Data Collection: Metadata is necessary for validation
through schema and constraints, using value-sets and domain range
limits and even more sophisticated logic tests. It is necessary
for online help / explanation, and - in the form of a multilingual
thesaurus - for translation.
Metadata for Queries: Metadata is necessary for validation through
schema and constraints, for online help / explanation, for translation,
and for optimisation: both user assistance in proposing more appropriate
terms (synonyms, super- / sub-terms) and performance optimisation
since the metadata stores the structural indexes into the databases,
optimal access paths, optimal query segmentation and distribution
for parallelism and minimal network transfers.
Metadata for Answers: Metadata from the schema, together with associated
metadata held as domain ontology information (in a knowledge-based
system), is necessary for answer consolidation, for online help /
explanation, and for translation.
Metadata for Integration: Metadata can catalogue sources of
information at a high level so that they become visible. The well-known
web indexing systems such as [AltaVista] or [Excite] do this in
a very general way. An example in the field of CRIS (Current Research
Information Systems) is the Bergen system [BergenCRIS] which points
to structured information systems for CRIS. Metadata, when used
with an inferencing mechanism, is the key resource to find matching
data structure and content despite heterogeneous representations
and languages so allowing integration across heterogeneous data
sources.
Similarly, metadata provides the information necessary for customisation
of standard products allowing integration into the desktop / office
environment.
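The query-assistance use of a thesaurus described above under Metadata
for Queries (proposing synonyms and super- / sub-terms) can be sketched
as follows. This is only an illustration under stated assumptions: the
ThesaurusEntry structure, the THESAURUS contents and the expand_query
function are hypothetical names invented for this example, not part of
any system discussed in the article.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """Metadata held for one term (hypothetical structure)."""
    synonyms: set[str] = field(default_factory=set)
    broader: set[str] = field(default_factory=set)   # super-terms
    narrower: set[str] = field(default_factory=set)  # sub-terms


# Illustrative entries only; a real thesaurus would be far larger.
THESAURUS = {
    "metadata": ThesaurusEntry(
        synonyms={"data about data"},
        broader={"information management"},
        narrower={"schema", "catalogue record"},
    ),
}


def expand_query(term: str, include_broader: bool = False) -> list[str]:
    """Return the term plus its synonyms and sub-terms, sorted."""
    key = term.lower()
    entry = THESAURUS.get(key)
    if entry is None:
        # No metadata for this term: the query is passed through unchanged.
        return [term]
    terms = {key} | entry.synonyms | entry.narrower
    if include_broader:
        terms |= entry.broader
    return sorted(terms)


if __name__ == "__main__":
    print(expand_query("Metadata"))
```

Super-terms are opt-in here because widening a query usually lowers
precision; whether that trade-off is wanted is left to the caller.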
Metadata
At present most Information Systems make very limited use of metadata
and - since it supports all the user-friendly easy-to-use features
and extends the range of available information features outlined
above - perhaps this explains why these Information Systems have
been less successful than the few information systems really using
metadata. Having outlined above how useful metadata could be for
Information Systems, let us consider exactly what metadata is.
The aim is to decide what kinds of metadata are useful for Information
Systems and how best to generate, maintain and use metadata for
the benefit of end-users of Information Systems.
Metadata is data about data. It is therefore of great utility:
- to any Information System which aspires to be more than a simple,
inflexible, unfriendly information source - use of metadata can
allow dynamic optimisation and flexibility and allow integration
over heterogeneous distributed information
- to any end-user requiring help, explanation, data quality assurance,
assistance in finding relevant information, or assistance in integrating
information from heterogeneous sources.
Distributed RDBMSs use metadata extensively. Web indexing systems
(such as [AltaVista] and [ExCite]) are based on sophisticated metadata.
Metadata is clearly of great importance. Perhaps the earliest
use of metadata was in computerised library catalogue systems
based on IR techniques where the catalogue card record is metadata
describing the real data in the book or other primary publication.
Sadly this same field of endeavour is where metadata has hardly
been developed further, and yet this is the very area of Information
Systems technology where metadata could exert the greatest leverage.
There have been many attempts to standardise metadata structure
and content for specific application areas. In the world of libraries
the [MARC] standard for catalogue records allows some interworking.
Unfortunately there are more than 50 major variants and so interworking
is not as easy as one might expect. Similarly, in many scientific
areas (eg space science, particle physics) there are metadata
standards. In the world of commerce there is [EDI] / [EDIFACT].
There have been attempts to agree a standard European Patient
Medical Record. Perhaps the most successful is in the field of
Engineering: the EXPRESS language describing the STEP data exchange
format with commercial support [STEPTOOLS].
The increasing requirement for interworking among systems handling
grey electronic literature has caused the internet community to
propose the [Dublin Core] as a metadata standard and, subsequently,
to provide converters between the standards [UKOLN]. In the field
of CRIS a common metadata form for exchange has been proposed
and is now used for metadata catalogues in the ERGO Project.
The rapid spread of the Web has dramatically increased the requirement
for metadata standards to allow a global browsing and querying
capability. The creation of [W3C] (World Wide Web Consortium)
provided the forum for intense work on metadata [W3Cmetadata].
The main results have been PICS (Platform for Internet Content
Selection), which allows categorisation of Web information in a way
similar to film censorship, and RDF (Resource Description Framework),
an XML-based W3C standard that followed the Netscape MCF (Meta Content
Framework) and Microsoft XML-Data proposals. RDF has gained widespread
acceptance and subsumes PICS.
Kinds
Here we propose that there are three main kinds of metadata: schema,
navigational and associative.
Schema metadata is an intensional description of extensional instances.
Typically a schema consists of: database {name, size, security
authorisations}, attributes {name, type, constraints}. Some of
the constraints concern the attribute domain, some are inter-attribute
and as such may express relationships. The schema intension has
a formal logic relationship to the data instances. This is important
in ensuring data quality. It also provides a formal basis for
systems.
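A minimal sketch of schema metadata used for validation is given below.
It assumes a schema holding attributes {name, type, constraints} as
described above, with value-set and range constraints on the attribute
domain plus inter-attribute rules. The Attribute and Schema classes, the
PROJECTS schema and its attribute names are all hypothetical and chosen
only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Attribute:
    name: str
    type: type
    allowed: Optional[frozenset] = None   # value-set constraint
    minimum: Optional[float] = None       # domain range limits
    maximum: Optional[float] = None


@dataclass
class Schema:
    database: str
    attributes: list[Attribute]
    # Inter-attribute constraints: (description, predicate over the record).
    rules: list[tuple[str, Callable[[dict], bool]]] = field(default_factory=list)

    def validate(self, record: dict[str, Any]) -> list[str]:
        """Return a list of constraint violations (empty if the record is valid)."""
        errors = []
        for attr in self.attributes:
            if attr.name not in record:
                errors.append(f"{attr.name}: missing")
                continue
            value = record[attr.name]
            if not isinstance(value, attr.type):
                errors.append(f"{attr.name}: expected {attr.type.__name__}")
                continue
            if attr.allowed is not None and value not in attr.allowed:
                errors.append(f"{attr.name}: not in value-set")
            if attr.minimum is not None and value < attr.minimum:
                errors.append(f"{attr.name}: below minimum")
            if attr.maximum is not None and value > attr.maximum:
                errors.append(f"{attr.name}: above maximum")
        # Inter-attribute rules are only checked once each attribute is sound.
        if not errors:
            for description, holds in self.rules:
                if not holds(record):
                    errors.append(f"rule violated: {description}")
        return errors


# Hypothetical schema for a database of research projects.
PROJECTS = Schema(
    database="projects",
    attributes=[
        Attribute("status", str, allowed=frozenset({"active", "completed"})),
        Attribute("start_year", int, minimum=1980, maximum=2100),
        Attribute("end_year", int, minimum=1980, maximum=2100),
    ],
    rules=[("start_year <= end_year",
            lambda r: r["start_year"] <= r["end_year"])],
)

if __name__ == "__main__":
    print(PROJECTS.validate(
        {"status": "paused", "start_year": 1996, "end_year": 1999}))
```

Because the constraints are data rather than code, the same schema
object could in principle drive both data-entry validation and the
online help shown to the user.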
Navigational metadata provides information on how to get to an
information resource. Mechanisms include: filename, DB name +
navigational algorithm, DB name + predicate (query), URL (Uniform
Resource Locator), URL + predicate (query) or various combinations
of them. They may also be obtained via web-indexing mechanisms
(such as [AltaVista] or [ExCite]), which themselves make extensive
use of metadata. Navigational metadata has no formal logic relationship
to the data instances.
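The "URL + predicate (query)" mechanism can be illustrated with a short
sketch. The Locator class and the example.org address are assumptions
made for this example, and the code assumes the base URL carries no
query string of its own.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class Locator:
    """Navigational metadata: where a resource is and how to select from it."""
    url: str
    predicate: dict[str, str] = field(default_factory=dict)

    def resolve(self) -> str:
        """Combine the URL and the predicate into one retrievable address."""
        if not self.predicate:
            return self.url
        # urlencode escapes the values and keeps the dict's insertion order.
        return f"{self.url}?{urlencode(self.predicate)}"


if __name__ == "__main__":
    projects = Locator(
        "http://example.org/cris/projects",   # placeholder address
        {"country": "UK", "topic": "metadata"},
    )
    print(projects.resolve())
```

Note that the locator says nothing about the structure or meaning of
what is retrieved, which is the sense in which navigational metadata
has no formal logic relationship to the data instances.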
Associative metadata provides additional information for application
assistance. The assistance may improve performance, accuracy or
precision of the system and / or provide assistance to the end-user
through a domain aware supportive user interface. The main kinds
of associative metadata are:
- descriptive: catalogue record (eg [Dublin Core])
- restrictive: content rating (eg PICS) or security, privacy (cryptography,
digital signatures) [W3C]
- supportive: dictionaries, thesauri, hyperglossaries [VHG], domain
ontologies eg [PROTÉGÉ]
Associative metadata usually does not have a formal logic relationship
to data instances although there may be systematic association
relationships.
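As an illustration of descriptive associative metadata, the sketch
below builds a catalogue record restricted to the fifteen elements of
the Dublin Core element set. The make_record helper is a hypothetical
name for this example; the record values are taken from this article's
own title and byline.

```python
# The fifteen elements of the Dublin Core element set.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def make_record(**fields: str) -> dict[str, str]:
    """Build a descriptive record, rejecting names outside Dublin Core."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)


if __name__ == "__main__":
    record = make_record(
        title="Metadata: An Overview and some Issues",
        creator="Keith G Jeffery",
    )
    print(record)
```

The record describes the resource but places no constraint on its
content, which matches the observation above that associative metadata
usually has no formal logic relationship to the data instances.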
Metadata and Dataweb
In order to combine the benefits of universal access (Web) with
the benefits of structured, quality-controlled data managed in a
database, various teams have worked on linking Web and database
systems. CLRC-RAL was early into this field and has experimented with
several techniques since 1993, currently basing the departmental
web on Microsoft ASP technology. Much of the information now available
over the web is held and managed within databases linked to the
web through CGI (Common Gateway Interface) and scripts in a language
such as Perl or Tcl.
The data in these structured databases behind a web interface
is essentially invisible to web indexing systems such as [AltaVista]
or [Excite]. Since this is usually structured, managed, high quality
data its use might be preferable to authored html pages. The problem
is how to make it visible to web-indexing or information-cataloguing
systems, in a way that is universally acceptable and utilised.
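One possible approach, sketched here purely as an assumption and not as
an agreed solution to the problem just stated, is to publish a
descriptive record for each database row as HTML meta elements that an
indexing system can read. The meta_tags function, the column names and
the column-to-element mapping are hypothetical; the "DC." prefix follows
the common convention for embedding Dublin Core in HTML.

```python
from html import escape


def meta_tags(row: dict[str, str], mapping: dict[str, str]) -> str:
    """Render selected columns of one database row as <meta> elements.

    `mapping` gives, for each column name, the Dublin Core element
    under which its value should be published.
    """
    lines = []
    for column, element in mapping.items():
        if column in row:
            # Escape the value so it is safe inside an HTML attribute.
            content = escape(row[column], quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    row = {"name": "Example project", "leader": "A. N. Other & team"}
    print(meta_tags(row, {"name": "title", "leader": "creator"}))
```

Whether indexing systems would honour such elements, and whether one
form could be universally accepted, is exactly the open issue raised
above.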
Conclusion
The key to the Future of Information Systems is Metadata. However,
there are serious issues to be addressed:
- standard form for metadata: the W3C RDF is general and uses XML
as the language - is this sufficient?
- sub-forms of metadata by application domain: will they all be
based on the same basic data model and language to allow cross-domain
interoperation?
- visibility of dataweb sources: dataweb technology is progressively
being adopted; is there a standard mechanism for making such structured
and hopefully quality information sources visible on the web through
metadata?
The set of articles within this special theme in ERCIM News document
all the aspects of metadata mentioned above. They cover data quality,
query quality, answer quality and heterogeneous information integration.
They exhibit aspects of schema, navigational and associative metadata.
These principles can be seen in the following ways: KINE uses a
knowledge-based programming approach to hold metadata (see article)
whereas GEN.LIB from CRCIM uses a programming library approach
(see article). The work of SARI and of the W3C (page 25) is concerned
with RDF. The use of metadata for enhanced query is discussed in the
articles from ETH Zurich (see article) and the joint INRIA/FORTH work
on Artemis (see article), the latter also using metadata for integration.
Articles from ICS/FORTH on health care (see article) and RAL on ERGO
and CERIF (see article) describe applications using metadata, and
Webstore from GMD (see article) details middleware for integration.
Several of the articles describe the use of intelligent agents
(with associated metadata persistent stores) for reconciling heterogeneity
and for assisting at other interfaces (eg query) - by many this
is seen as the way forward for use of metadata to improve the
effectiveness of information systems in a global setting.
Please contact:
Keith Jeffery - CLRC
Tel: +44 1235 44 6103
E-mail: kgj@rl.ac.uk