Building Harmonised Semantic Lexicons
by Nicoletta Calzolari
SIMPLE is a project sponsored by EC DGXIII in the framework of
the Language Engineering programme. To our knowledge, this project
represents the first attempt to develop wide-coverage semantic
lexicons for a large number of languages (12), with a harmonised
common model that encodes structured semantic types and semantic
(subcategorisation) frames. Even though SIMPLE is a lexicon-building
project, it also addresses challenging research issues and provides
a framework for testing and evaluating the maturity of the current
state of the art in lexical semantics grounded on, and connected
to, a syntactic foundation.
Many theoretical approaches are currently tackling different aspects
of semantics. However, such approaches have to be tested i) with
wide-coverage implementations, and ii) with respect to their actual
usefulness and usability in real-world systems both of mono- and
multi-lingual nature. The SIMPLE project addresses point i) directly,
while providing the necessary platform to allow application projects
to address point ii).
SIMPLE is consistent with the strategic EC policy of providing
a core set of language resources for the EU languages, and should
be considered a follow-up to the PAROLE project (see http://www.ilc.pi.cnr);
SIMPLE adds a semantic layer to a subset of the existing morphological
and syntactic layers developed by PAROLE. The semantic lexicons
(about 10,000 word meanings) are built in a harmonised way for
the 12 PAROLE languages. These lexicons will be partially corpus-based,
exploiting the harmonised and representative corpora built within
PAROLE. In this way, the semantic encoding will respect actual
corpus distinctions. The lexicons are designed with future
cross-language linking in mind: they share and are built around
the same core ontology and the same set of semantic templates.
The base concepts identified by EuroWordNet (about 800 senses
at a high level in the taxonomy) are used as a common set of senses,
so that a cross-language link for all the 12 languages is already
provided automatically through their link to the EuroWordNet Interlingual
Index.
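
The linking mechanism described above can be illustrated with a minimal sketch. It assumes that each language-specific sense is stored with a pointer to one Interlingual Index record, so that two senses in different languages are linked whenever they point to the same record. The sense labels, record identifiers and function names below are hypothetical and are not taken from SIMPLE or EuroWordNet data.

```python
from collections import defaultdict

# Hypothetical toy data: (language, sense label) -> Interlingual Index record.
# Identifiers and sense labels are invented for illustration only.
ILI_LINKS = {
    ("it", "cane_1"): "ILI-0001",
    ("en", "dog_1"): "ILI-0001",
    ("fr", "chien_1"): "ILI-0001",
    ("it", "casa_1"): "ILI-0002",
    ("en", "house_1"): "ILI-0002",
}


def build_index(links):
    """Invert the mapping: index record -> {language: sense label}."""
    index = defaultdict(dict)
    for (lang, sense), record in links.items():
        index[record][lang] = sense
    return dict(index)


def linked_sense(sense, source, target, links, index):
    """Return the target-language sense sharing an index record, or None."""
    record = links.get((source, sense))
    if record is None:
        return None
    return index.get(record, {}).get(target)


INDEX = build_index(ILI_LINKS)
```

With this layout, no pairwise bilingual mapping is needed: adding a language only requires linking its senses to the shared index.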
The Model
In the first stage of the project, the formal representation of
the conceptual core of the lexicons was specified, i.e. the basic
structured set of meaning-types (the SIMPLE ontology). This
constitutes a common starting point on which to base the building
of the language-specific semantic lexicons. The development of
12 harmonised semantic lexicons requires strong mechanisms for
guaranteeing uniformity and consistency. The multilingual aspect
translates into the need to identify elements of the semantic
vocabulary for structuring word meanings which are both
language-independent and able to capture linguistically useful
generalisations for different NLP tasks.
The SIMPLE model is based on the recommendations of the EAGLES
Lexicon/Semantics Working Group (http://www.ilc.pi.cnr.it/EAGLES96/rep2)
and on extensions of Generative Lexicon theory. An essential characteristic
is its ability to capture the various dimensions of word meaning.
The basic vocabulary relies on an extension of qualia structure
for structuring the semantic/conceptual types as a representational
device for expressing the multi-dimensional aspect of word meaning.
The model has a high degree of generality in that it provides
the same mechanisms for generating broad-coverage and coherent
concepts independently of their grammatical/semantic category
(entities, events, qualities, etc.).
In order to combine the theoretical framework with the practical
lexicographic task of lexicon encoding, we have created a common
library of language-independent template-types, which act as
blueprints for any given type, reflecting the conditions of
well-formedness and providing constraints for lexical items belonging
to that type. The relevance of this approach for building consistent
resources is that types both provide the formal specifications
and guide subsequent encoding, thus satisfying theoretical and
practical methodological requirements.
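
How a template-type can serve both as a specification and as an encoding guide may be sketched as follows. This is only an illustration under stated assumptions: it uses the four qualia roles of Generative Lexicon theory (formal, constitutive, agentive, telic) rather than SIMPLE's extended set, and the template name, required roles and example entry are hypothetical.

```python
from dataclasses import dataclass, field

# The four qualia roles of Generative Lexicon theory. SIMPLE extends this
# set; the extension is not modelled here.
QUALIA_ROLES = ("formal", "constitutive", "agentive", "telic")


@dataclass(frozen=True)
class Template:
    """A language-independent blueprint for one semantic type."""
    name: str
    required_roles: frozenset


@dataclass
class Entry:
    """A language-specific lexical item encoded against a template."""
    lemma: str
    template: Template
    qualia: dict = field(default_factory=dict)


def check(entry):
    """Return a list of well-formedness problems (empty if none)."""
    problems = []
    for role in entry.qualia:
        if role not in QUALIA_ROLES:
            problems.append(f"unknown role: {role}")
    for role in sorted(entry.template.required_roles):
        if role not in entry.qualia:
            problems.append(f"missing role: {role}")
    return problems


# Hypothetical template and entry, for illustration only.
ARTIFACT = Template("Artifact", frozenset({"formal", "agentive", "telic"}))
knife = Entry("knife", ARTIFACT, {"formal": "instrument", "telic": "cut"})
```

Calling check(knife) reports the missing agentive role, which is the kind of feedback that lets the same template constrain entries in every language.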
The large number of languages covered by SIMPLE is reflected in
the size of its Consortium: Università di Pisa (coordinator: A.
Zampolli), Erli (now Lexiquest)-Paris, Institute for Language
and Speech Processing-Athens, Institut d'Estudis Catalans, University
of Birmingham, Univ. of Sheffield, Det Danske Sprog-og Litteraturselskab,
Center for Sprogteknologi-Copenhagen, Språkdata-Göteborgs Universitet,
University of Helsinki, Instituut voor Nederlandse Lexicologie-Leiden,
Université de Liège BELTEXT, Centro de Linguística da Universidade
de Lisboa, Instituto de Engenharia de Sistemas e Computadores-Lisboa,
Fundacion Bosch Gimpera Universitat de Barcelona, Institut für
Deutsche Sprache, Istituto di Linguistica Computazionale - CNR
Pisa, University of Graz.
Please contact:
Nicoletta Calzolari - ILC-CNR
Tel: +39 050 560 481
E-mail: glottolo@ilc.pi.cnr.it