ERCIM News No.26 - July 1996 - INRIA

Analysing Information from Large Documentary Bases - The ILC Project

by Yannick Toussaint and Jean Royaute

The ILC Project (Infométrie, Langage et Connaissance) is a collaboration between the DIALOGUE Team of the INRIA-Lorraine & CRIN-CNRS Laboratory in Nancy and the Infometry Research Program of the INIST-CNRS Laboratory. It aims at partly modelling and structuring the knowledge written in large documentary bases. This modelling will facilitate information analysis. The project is part of the ILIAD project, supported by the French National Cognitive Science Program (GIS 'Science de la Cognition').

The tools and methods currently being developed in the ILC project should enable a human operator to grasp the information content of a text without reading it sequentially. Information analysis is the step that follows information retrieval. It is based on methods specific to informetrics, which use statistical techniques of data analysis. These are combined with approaches from large-corpus linguistics for identifying term structures and for locating them, and the relations between them, in the texts. Techniques from artificial intelligence are called upon to collect and organise the knowledge that emerges from these linguistic and statistical processes.

We assume that most of the information in technical texts is located in noun phrases. Our strategy for analysing information therefore relies on robust, partial linguistic processing based on the identification of terms and noun phrases. Combining statistical and linguistic methods, we search the texts for the conceptual links that exist between terms in the domain. We pay special attention to the identification of a set of linguistic connectors and to certain domain-specific predicative structures.

We divided the project into two phases. The first, which is now near completion, consists in building an automatic process that recognises terms from a thesaurus in texts and classifies these texts according to criteria of term co-occurrence.

The second phase aims to identify structures in the texts, predicative or not, that could reveal a conceptual link between two terms. This should lead to the construction of a knowledge base containing the terms and their conceptual relations, with the initial thesaurus as its main structure.

Searching terms in corpora and classifying them

The first phase of the project combines three different stages:

A probabilistic approach, which relies on morphological and weighted contextual rules, in order to tag terms from a thesaurus. The resource we use is the AGROVOC thesaurus, a trilingual thesaurus covering the agricultural domain. We re-accentuated the French entries using a semi-automatic procedure. The Brill tagger was then trained on this corpus.
A computational linguistics approach focusing on the partial treatment of noun phrases. It relies on the identification of terms and their variations in corpora. For example, "storage of medical data" is a variant form of "medical data storage". This process uses the FASTR analyser (developed by C. Jacquemin, IRIN-Nantes, France), which works with unification grammars written in the PATR-II formalism.
A statistical approach using the NDOC tool (developed at INIST-CNRS, Nancy), based on term co-occurrence in corpora. A statistical index, called Equivalence, gives the degree of association of two terms. A hierarchical classification algorithm then allows the definition of clusters of close terms.
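The third stage can be illustrated with a minimal sketch. It is not the NDOC implementation: it assumes that the Equivalence index is the one commonly used in co-word analysis, E(a, b) = c_ab^2 / (c_a * c_b), where c_a is the number of documents containing term a and c_ab the number containing both terms. It also replaces the hierarchical classification algorithm with a simple single-link grouping above a threshold. The function names, the sample documents and the threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def equivalence_index(docs):
    """Map each co-occurring term pair to c_ab**2 / (c_a * c_b).

    docs is an iterable of term sets, one per document. The formula is an
    assumption (the usual co-word analysis index), not taken from NDOC.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        terms = set(terms)
        term_counts.update(terms)
        # Sort so that each pair is always stored in the same order.
        pair_counts.update(combinations(sorted(terms), 2))
    return {
        (a, b): c_ab ** 2 / (term_counts[a] * term_counts[b])
        for (a, b), c_ab in pair_counts.items()
    }


def cluster_terms(index, threshold):
    """Group terms linked by an association >= threshold (single link).

    A simplified stand-in for a hierarchical classification algorithm.
    """
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path compression
            term = parent[term]
        return term

    for (a, b), score in index.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for term in list(parent):
        clusters.setdefault(find(term), set()).add(term)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical documents, each reduced to the thesaurus terms found in it.
docs = [
    {"wheat", "yield", "fertilizer"},
    {"wheat", "yield"},
    {"cattle", "milk"},
    {"cattle", "milk", "yield"},
]
index = equivalence_index(docs)
clusters = cluster_terms(index, threshold=0.5)
```

With this index, two terms that always appear together score 1.0, and the score falls towards 0 as they appear together less often relative to their individual frequencies. The choice of threshold controls how coarse the resulting clusters are.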

Conclusion and future work

In order to integrate these three stages, we had to develop robust linguistic tools such as a lemmatiser for French. Identical tools were developed for English, and the results of the experiments on the same domain are very close to those for French.

The second phase of the project starts next month. We will then focus on three points:

the specification of how predicate structures could be used to make relations explicit inside a cluster or between two clusters;
the representation model that could be used to structure the information;
the structuring of the data according to the model, in order to obtain a valid hierarchical classification.

See also:
http://www.loria.fr/exterieur/equipe/dialogue

Please contact:
Yannick Toussaint - INRIA Lorraine & CRIN-CNRS
Tel: +33 83 59 20 91
E-mail: Yannick.Toussaint@inria.fr
