ILEX: a Lexical Resource for Italian
by Giacomo Ferrari
The objective of Natural Language
Processing is to build computational systems that handle natural language,
for tasks such as information and knowledge storage and retrieval, document
processing, and generation of texts and abstracts, or that use natural
language as a means of access to and interaction with computers.
After several decades, during which researchers have carried out theoretical
studies and built prototypes of natural language processing systems, it
has become clear that such products are in principle feasible, but the
realization of large and complex natural-language-based services is prevented
by the unavailability of the linguistic knowledge necessary to make these
systems work. In particular, procedures for language analysis or generation
do not have adequate resources from which to acquire syntactic knowledge,
i.e. grammar rules expressed in a computer oriented formalism, and lexical
knowledge, i.e. large computational dictionaries. Thus, in recent years,
much effort has been devoted to the building of such linguistic resources,
as well as other kinds of linguistic data banks, which can be used as support
in the development of refined and exhaustive knowledge of a language.
The Italian Situation
In Italy, as in other countries, lexical databases are in a more advanced
state of development than other language resources, such as computational
grammars, tagged corpora (textual data bases where words and segments of
text are classified and labelled accordingly), or tree banks (repositories
of syntactic trees for fragments of natural language sentences).
Efforts aimed at building a repository of lexical knowledge for Italian
date back to the second half of the '60s, when the construction of a Machine
Dictionary of Italian was begun by Antonio Zampolli and his group in Pisa.
This first computational dictionary consists of a list of roughly 100,000
entries - fully tagged for their lexico-syntactic and usage categories
- from which nearly one million forms can be (semi-)automatically derived.
About 250,000 definitions are also provided.
Similar enterprises have also been carried out at the Universities of Venice
and Turin, although neither has worked on such a large scale. In the following
decades, there were several initiatives in a number of research centres
including some Italian companies, such as Synthema, Thamus, Olivetti, CSELT,
Sogei etc. However, the resulting products were expected to satisfy only
very specific requirements and thus were not designed to be generalizable;
in addition, in all these projects, the number of words treated has been
relatively small compared with the Italian Machine Dictionary.
Recently, the European Community has funded a number of projects on
natural language, which - either as a primary result or as a by-product
- have produced lexical data for Italian. This is the case of PAROLE, which
is producing a set of about 20,000 words morpho-syntactically coded
according to the guidelines given by EAGLES and GENELEX; SPARKLE, whose
objective is the implementation of tools for automatic lexical acquisition;
and CRISTAL, a project on intelligent information retrieval, which uses
a dictionary of 40,000 forms from 8,100 entries. (See Ercim News, No. 26,
section on Computational Linguistics.)
The assumption that a computational dictionary can be used as a reference
list of common knowledge lies at the base of the American project CYC,
which has no counterpart for Italian. Another American project, WordNet,
has produced a monolingual American English lexicon, accessible on the Web,
in which words are connected by conceptual links. It has stimulated
the setting up of EuroWordNet, funded by the European Community, which
operates on 30,000 nouns and 15,000 verbs for four European languages
including Italian.
The ILEX Project
Experience acquired so far in the building of lexical repositories
for different purposes and using different methodologies has highlighted
the need for a national language resource that is complete at all levels
of lexical description and that can be employed in different kinds of natural
language applications without serious porting efforts.
The ILEX project was thus begun two years ago by a consortium formed
by the Istituto per la Ricerca Scientifica e Tecnologica in Trento, and
the Universities of Venice, Turin and Vercelli. The aim is to build a computational
lexicon with a minimum core of 30,000 entries, fully coded in accordance
with the most recent international standards in the sector. The following
information will be encoded:
- Part-of-Speech (POS) and syntactic-semantic classes, plus all the codes
necessary to accurately describe the morphological behaviour of a word;
this information will be used by procedures for morphological analysis
and generation
- syntactic subcategorization codes, describing the syntactic behaviour
of entries, especially with respect to their argument structure and dependent
constituents
- conceptual relations between words, in the style of WordNet, as well
as compositional semantic information which can be used by programmes for
the semantic analysis of sentences.
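As a rough illustration of how the three layers listed above might sit
together in a single entry, here is a minimal Python sketch. It is not the
ILEX format: the class name, the field names, the code values (such as
"conj-1-are") and the sample relation are all assumptions made for this
example only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexicalEntry:
    """One lexicon entry; all field names and code values are illustrative."""
    lemma: str
    # Morphological layer: POS plus a code for inflectional behaviour.
    pos: str
    inflection_class: str
    # Syntactic layer: subcategorization frames (argument structure).
    subcat_frames: List[str] = field(default_factory=list)
    # Semantic layer: WordNet-style relations, keyed by relation name.
    relations: Dict[str, List[str]] = field(default_factory=dict)

    def related(self, relation: str) -> List[str]:
        """Return the lemmas linked to this entry by the given relation."""
        return self.relations.get(relation, [])


# A hand-written example entry for the verb "mangiare" (to eat).
mangiare = LexicalEntry(
    lemma="mangiare",
    pos="V",
    inflection_class="conj-1-are",          # hypothetical morphological code
    subcat_frames=["SUBJ", "SUBJ OBJ"],     # hypothetical frame labels
    relations={"hypernym": ["consumare"]},  # hypothetical conceptual link
)

print(mangiare.related("hypernym"))
```

Keeping each layer in its own field means that a morphological analyser, a
parser, or a semantic component can read only the part it needs.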
Other information will be encoded modularly, in separate but compatible
files. The aim of ILEX is not only to create a repository of lexical knowledge,
but also a distributed access system, which takes advantage of Internet
facilities to offer integration and modularity. The dictionary will be
easily accessible at various levels (i.e. entire vocabulary or sublexica)
and updating procedures will be automatic, easy, and fast.
Please contact:
Giacomo Ferrari, University of Vercelli
Tel: +39 161 228224
E-mail: ferrari@zeus.vc.unipmn.it