EuroSearch: a Federation of European Search Engines
by Martin Braschler, Mounia Lalmas, Luigi Madella and Carol Peters
The objective of the EuroSearch project is to build a federation
of European search engines. The main aims of the federation are
to join forces in order to compete more effectively with global
search engines, to enhance the visibility of European web sites,
and to help preserve European linguistic and cultural diversity.
The technologies under development will provide linguistic support
for querying over different search services and enable the automatic
generation of catalogues that best reflect local cultures within
the federation.
The EuroSearch consortium brings together industrial partners (Italia
On-Line, Pisa; CINET, Barcelona; EuroSpider Information Technology,
Zurich) and academic partners (CNR, Pisa, and Dortmund University).
Its decision to build a federation of European search engines stems
from the observation that the World Wide Web is still dominated by
US culture and that little effort has so far been put into promoting
European web sites. A study of the incoming traffic of the services
provided by the EuroSearch partners found that about 70% of it
comes from the same country or from countries using the same language;
more than 50% of the outgoing traffic is directed to the US, while
almost all the rest remains in the country of origin. This situation
has been attributed mainly to language barriers between European
countries, poor multilingual support in traditional search engines,
and the US cultural dominance of the most popular web catalogues.
The EuroSearch project thus aims to help restore linguistic
and cultural equilibrium on the Web by building a pan-European
federation of national search and categorization services. The
main objectives of the federation are to:
- promote traffic across Europe by exchanging links and sharing
services
- provide language support for query translation
- provide tools for automatic categorization in order to overcome
the high costs of traditional catalogues, which only large
international organisations can currently afford.
The Cross-Language Approach
The aim of the EuroSearch distributed, multilingual service is
to permit users to enter queries in their own or their preferred
language, and to carry out search and information retrieval over
some or all of the federation's national sites.
Differences in the partners' document collections and indexing
mechanisms have led to the implementation of different search
strategies, depending on the collection to be queried. The cross-language
search component of EuroSearch thus activates two distinct types
of searching:
- Query translation using a multilingual lexicon; this employs the
pivot language concept, and semantic indicators are assigned to
polysemous words so that the user can disambiguate senses
interactively. Queries can also be expanded using corpus-extracted
data.
- Similarity thesaurus technology; a multilingual similarity thesaurus
contains entries linking each term in one language to a list of
similar terms in another. Each linked term is assigned a similarity
value based on statistical co-occurrence, ie how often the two terms
co-occur in similar texts taken from training data.
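As an illustration of the second approach, the following minimal
Python sketch builds a toy similarity thesaurus from pairs of
comparable documents and uses it to translate a query. It is not the
EuroSearch implementation: the function names, the cosine-style
similarity measure over shared document pairs, and the sample data are
all assumptions made for the example.

```python
import math
from collections import defaultdict


def build_similarity_thesaurus(doc_pairs, top_k=3):
    """Build a toy thesaurus from (source_terms, target_terms) pairs.

    Assumption: each pair holds the terms of two comparable texts,
    one per language. Similarity is the cosine of the two terms'
    document-occurrence sets (an illustrative choice).
    """
    src_docs = defaultdict(set)  # source term -> ids of pairs containing it
    tgt_docs = defaultdict(set)  # target term -> ids of pairs containing it
    for doc_id, (src_terms, tgt_terms) in enumerate(doc_pairs):
        for term in set(src_terms):
            src_docs[term].add(doc_id)
        for term in set(tgt_terms):
            tgt_docs[term].add(doc_id)

    thesaurus = {}
    for s_term, s_ids in src_docs.items():
        scored = []
        for t_term, t_ids in tgt_docs.items():
            shared = len(s_ids & t_ids)
            if shared:
                sim = shared / math.sqrt(len(s_ids) * len(t_ids))
                scored.append((t_term, sim))
        # Highest similarity first; ties broken alphabetically.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        thesaurus[s_term] = scored[:top_k]
    return thesaurus


def translate_query(query_terms, thesaurus):
    """Return target-language terms ranked by summed similarity."""
    scores = defaultdict(float)
    for term in query_terms:
        for target, sim in thesaurus.get(term, []):
            scores[target] += sim
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical Italian/English training pairs, for illustration only.
SAMPLE_PAIRS = [
    (["motore", "ricerca"], ["search", "engine"]),
    (["motore", "auto"], ["engine", "car"]),
    (["ricerca", "scientifica"], ["search", "science"]),
]
```

Calling translate_query(["motore", "ricerca"],
build_similarity_thesaurus(SAMPLE_PAIRS)) yields a weighted list of
English terms rather than a single translation, which is what allows
a thesaurus of this kind to double as a query expansion resource.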
The languages covered are currently German, Italian and Spanish,
plus English. The two approaches are integrated through the development
of common translation server interfaces and data exchange formats;
this will facilitate future extensions of the EuroSearch components.
A preliminary, simplified prototype of a Translation Server has
been developed and integrated into the Arianna search engine, allowing
queries to be formulated in Italian and directed to Alta Vista.
This server will be extended with a corpus-based query expansion
mechanism. The integration of the linguistic resources into all the
federated services is due to be completed in 1999.
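The lexicon-based translation path can be pictured with a small
Python sketch. Everything in it is hypothetical: the lexicon entries,
the pivot concept identifiers, the function name translate_via_pivot
and the choose_sense callback (which stands in for the interactive
disambiguation step) are assumptions for illustration and do not
describe the actual Translation Server.

```python
# Hypothetical lexicon: source word -> list of (semantic indicator, pivot concept).
SOURCE_TO_PIVOT = {
    "it": {
        "rete": [("computing", "NETWORK"), ("fishing", "NET")],
        "ricerca": [("general", "SEARCH")],
    }
}

# Hypothetical mapping from pivot concepts to target-language words.
PIVOT_TO_TARGET = {
    "NETWORK": {"en": "network", "de": "Netzwerk", "es": "red"},
    "NET": {"en": "net", "de": "Netz", "es": "red"},
    "SEARCH": {"en": "search", "de": "Suche", "es": "búsqueda"},
}


def translate_via_pivot(words, source_lang, target_lang, choose_sense):
    """Translate query words through pivot concepts.

    choose_sense(word, candidates) is called only for polysemous words
    and must return one pivot concept; in an interactive service the
    user would make this choice from the semantic indicators.
    """
    lexicon = SOURCE_TO_PIVOT.get(source_lang, {})
    translated = []
    for word in words:
        candidates = lexicon.get(word, [])
        if not candidates:
            # Unknown words are passed through untranslated.
            translated.append(word)
            continue
        if len(candidates) == 1:
            concept = candidates[0][1]
        else:
            concept = choose_sense(word, candidates)
        translated.append(PIVOT_TO_TARGET[concept][target_lang])
    return translated


def pick_computing_sense(word, candidates):
    """Example policy: always prefer the 'computing' sense."""
    for indicator, concept in candidates:
        if indicator == "computing":
            return concept
    return candidates[0][1]
```

Routing every word through a pivot concept means that adding a new
language requires only one new mapping to and from the pivot, rather
than a separate lexicon for each language pair.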
The Automatic Categorization Technology
Another important goal of the project is to facilitate the creation
of Web catalogues by developing techniques for the automatic categorization
of documents. In this way, even small corporations will be able
to develop their own catalogues.
The categorization approach is based on an automatic textual
analysis of web documents that associates weighted terms with
documents. The weighted terms are determined using the
description-oriented indexing approach developed at the University
of Dortmund, which takes into account two kinds of features:
- features specific to web documents (whether a term appears in a
title, a heading, or is highlighted)
- features common to all text documents (term frequency).
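The features listed above can be collected with a few lines of
Python. The sketch below uses the standard html.parser module; the
class name, the choice of tags treated as headings or highlighting,
and the simple tokenisation are assumptions made for the example, not
a description of the project's own analyser.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
HIGHLIGHT_TAGS = {"b", "strong", "i", "em"}  # assumed notion of "highlighted"


class TermFeatureParser(HTMLParser):
    """Collect per-term features from one HTML document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.features = {}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        in_title = "title" in self.open_tags
        in_heading = any(t in HEADING_TAGS for t in self.open_tags)
        highlighted = any(t in HIGHLIGHT_TAGS for t in self.open_tags)
        for term in re.findall(r"\w+", data.lower()):
            f = self.features.setdefault(
                term,
                {"tf": 0, "in_title": False,
                 "in_heading": False, "highlighted": False},
            )
            f["tf"] += 1  # term frequency over the whole document
            f["in_title"] = f["in_title"] or in_title
            f["in_heading"] = f["in_heading"] or in_heading
            f["highlighted"] = f["highlighted"] or highlighted


def extract_term_features(html):
    """Return {term: feature dict} for an HTML string."""
    parser = TermFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features
```

Each term ends up with one small feature record per document, which
is the kind of input a weight estimation step would then consume.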
The weights are determined probabilistically using the Least Squares
Polynomial (LSP) approach and a test-bed of pre-categorized documents
taken from the Computers and Internet part of the Yahoo! catalogue.
This approach produces two main results:
- the automatic classification of new documents into appropriate
categories
- the determination of documents that belong to given categories.
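The idea behind estimating weights by least squares can be sketched
as follows. This is a simplified illustration and not the project's
LSP implementation: it fits only a first-degree polynomial, the
feature layout in feature_vector is an assumption, and the training
values would in practice come from the pre-categorized documents.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_least_squares(X, y):
    """Return coefficients minimising the squared error of X . coeffs vs y.

    Solves the normal equations (X^T X) coeffs = X^T y.
    """
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)]
           for i in range(n)]
    Xty = [sum(row[i] * target for row, target in zip(X, y))
           for i in range(n)]
    return solve_linear_system(XtX, Xty)


def feature_vector(in_title, in_heading, highlighted, tf):
    """Assumed layout: constant term, three web features, term frequency."""
    return [1.0, float(in_title), float(in_heading),
            float(highlighted), float(tf)]


def estimate_weight(coeffs, x):
    """Evaluate the fitted linear polynomial for one feature vector."""
    return sum(c * v for c, v in zip(coeffs, x))
```

In use, X would hold one feature vector per term-document pair from
the training catalogue and y the corresponding known outcomes; the
fitted polynomial then assigns a weight to any new term-document pair.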
The approach is fully automatic, and is portable to the various
languages involved in the federation. We are currently applying
our techniques to German web documents from the DINO-online catalogue.
A preliminary on-line prototype is now running, and an engineered
version is available in the Arianna catalogue. This is one of the
first examples in the world of automatically generated catalogues
available on the Web.
For further information and demos, see the EuroSearch Web site
at: http://eurosearch.iol.it/
Please contact:
Luigi Madella - Italia Online
Tel: +39 050 944258
E-mail: l.madella@pisa.iol.it