DELOS Workshop on Cross-Language Information Retrieval
by Páraic Sheridan
The third workshop of the DELOS Working Group, on the topic of 'Cross-Language
Information Retrieval', was hosted by ETH Zurich on 5-7 March 1997. DELOS
is a working group funded by the IT Long Term Research programme of the
European Commission to investigate existing and emerging technologies
and issues relevant to digital libraries.
The DELOS Working Group is just one of a series of ERCIM-sponsored initiatives
aimed at promoting research and operational activities in the Digital Library
field. The DELOS consortium consists mainly of members of ERCIM institutes.
As was borne out by the workshop presentations, many research
projects addressing issues of digital information repositories in Europe
must deal with information in several languages, even when multi-lingual
or cross-language information retrieval is not a central theme of the project.
We distinguish between the two as follows. Multi-lingual information retrieval
involves a collection in several languages, but a user's search query is
always evaluated only against those documents in the query language.
Cross-language information retrieval is the case where a user's query may
retrieve documents in languages other than the language of the query.
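
The distinction can be illustrated with a minimal sketch. Everything in it
is an assumption made for illustration: the three toy documents, the tiny
translation table and the function names are hypothetical, and simple term
overlap stands in for a real retrieval model. It does not represent any
system presented at the workshop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    lang: str
    terms: frozenset


def multilingual_search(query_terms, query_lang, docs):
    """Match the query only against documents in the query's own language."""
    q = set(query_terms)
    return [d.doc_id for d in docs if d.lang == query_lang and q & d.terms]


def cross_language_search(query_terms, query_lang, docs, translations):
    """Match the query against documents in any language via term translation."""
    hits = []
    for d in docs:
        if d.lang == query_lang:
            q = set(query_terms)
        else:
            # Replace each query term by its candidate translations, if any.
            table = translations.get((query_lang, d.lang), {})
            q = set()
            for term in query_terms:
                q.update(table.get(term, []))
        if q & d.terms:
            hits.append(d.doc_id)
    return hits


# Hypothetical toy collection and translation table.
DOCS = [
    Doc("en-1", "en", frozenset({"library", "digital"})),
    Doc("de-1", "de", frozenset({"bibliothek", "digital"})),
    Doc("fr-1", "fr", frozenset({"bibliothèque", "numérique"})),
]
TRANSLATIONS = {
    ("en", "de"): {"library": ["bibliothek", "bücherei"]},
    ("en", "fr"): {"library": ["bibliothèque"]},
}

print(multilingual_search(["library"], "en", DOCS))
print(cross_language_search(["library"], "en", DOCS, TRANSLATIONS))
```

With the English query "library", the first call returns only the English
document, while the second also reaches the German and French documents.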
A total of 27 participants attended the workshop, representing 9 different
European countries, as well as invited speakers from the United States
and Korea, who helped to broaden the discussions beyond the European perspective.
Apart from the geographical diversity of the participants, their backgrounds
in Information Retrieval, Computational Linguistics, Lexicography, Controlled
Vocabulary Thesauri and Internet Technology also helped to bring many
different viewpoints to the discussions of the work presented.
To set the scene for the workshop, Doug Oard of the University of Maryland
gave an overview of Cross-Language Information Retrieval in the USA, including
a schematic breakdown of the various approaches: corpus-based (parallel,
comparable or unaligned corpora) versus knowledge-based (dictionaries or
ontologies). He presented a substantial amount of US-based research on
cross-language retrieval, and showed that current approaches achieve
between 50% and 75% of the retrieval performance of the comparable
monolingual task. He was followed by Sung-Huyn Myaeng of the
National University Taejon, Korea, who gave an in-depth presentation of
the particular problems of working with Asian languages, including the
use of different scripts, the problem of word segmentation and the similar
problem of compound noun analysis. This was appropriately followed by Martin
Duerst, University of Zurich, who, in recognition of the increasing role
of the World Wide Web in this area of research, detailed the emerging HTTP
and HTML standards for supporting multi-script and multi-language information
on the Web.
Other presentations from European researchers focussed on the approaches
being adopted for cross-language and multi-language retrieval in various
projects such as Twenty-One, MULINEX, Aquarelle, ILIAD and MedExplore,
some of which are funded by the European Commission.
A common sentiment expressed was that, even in cases where multilinguality
was not a core concern of the project consortia, it was a topic that had
to be addressed given the European dimension. We therefore saw some novel
approaches to cross-language retrieval being taken by these researchers.
A further important theme was the identification, conflation and
use of multi-word terms for cross-language retrieval, given the observation
that these can greatly reduce translation ambiguity.
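
The effect of multi-word terms on translation ambiguity can be sketched
with a toy example. The dictionary entries, the function names and the
greedy longest-match strategy below are all assumptions chosen for
illustration, not a method reported by any of the projects named above.

```python
# Toy English-to-French dictionary (hypothetical entries, illustration only).
WORD_DICT = {
    "hard": ["dur", "difficile"],
    "disk": ["disque"],
    "drive": ["conduire", "lecteur", "promenade"],
}
# Multi-word terms translated as a single unit (hypothetical entry).
PHRASE_DICT = {
    ("hard", "disk", "drive"): ["disque dur"],
}


def translate(tokens, use_phrases):
    """Return one list of candidate translations per matched unit."""
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        if use_phrases:
            # Try the longest multi-word term starting at position i first.
            for n in range(len(tokens) - i, 1, -1):
                key = tuple(tokens[i:i + n])
                if key in PHRASE_DICT:
                    out.append(PHRASE_DICT[key])
                    i += n
                    matched = True
                    break
        if not matched:
            # Fall back to word-by-word lookup; unknown words pass through.
            out.append(WORD_DICT.get(tokens[i], [tokens[i]]))
            i += 1
    return out


def ambiguity(candidates):
    """Number of distinct translated queries the candidates allow."""
    total = 1
    for options in candidates:
        total *= len(options)
    return total


query = ["hard", "disk", "drive"]
print(ambiguity(translate(query, use_phrases=False)))  # 2 * 1 * 3 = 6
print(ambiguity(translate(query, use_phrases=True)))   # 1
```

Translated word by word, the query admits six candidate translations, most
of them wrong; recognised as a single multi-word term, it admits one.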
From the Information Retrieval point of view, David Hull of the Rank Xerox
Research Centre, Grenoble, France, presented a weighted Boolean model
for cross-language retrieval, and Páraic Sheridan of ETH
Zurich presented a method of using a retrieval model to build information
structures called similarity thesauri for cross-language retrieval. The
presentation of similarity thesauri showed how this approach has also been
applied to cross-language retrieval of speech documents, and a demonstration
of the EuroSpider retrieval system was given (http://www.eurospider.ch/).
Approaches from the Computational Linguistics perspective were presented
by Carol Peters and Eugenio Picchi of CNR, Italy, who showed how the use
of comparable corpora together with lexical resources could bring to light
useful translation equivalences for cross-language retrieval, and by Piek
Vossen of the University of Amsterdam, who presented the EuroWordNet project
(http://www.let.uva.nl/~ewn/), which is augmenting the Princeton WordNet
of English with wordnets in Dutch, Italian and Spanish. The workshop concluded
with a discussion of the important issue of evaluating different approaches
to cross-language information retrieval. Participants highlighted as highly
significant the fact that this year's Text Retrieval Conference (TREC-6)
will include a track evaluating cross-language retrieval.
The next DELOS workshop will address Image Indexing and Retrieval, and
will take place in Pisa, Italy, 28-30 August 1997, in conjunction with the
First European Conference on Research and Advanced Technology for Digital
Libraries (see announcements on page 49).
Please contact:
Costantino Thanos - IEI-CNR
Tel: +39 50 593429
E-mail: thanos@iei.pi.cnr.it