Tools and Techniques for Digital Libraries

ERCIM News No.27 - October 1996

by Jacques Ducloy

Two important new institutes have recently been created in the Lorraine region: INRIA Lorraine, which has strengthened a traditional research area in Computer Science, and INIST, a national documentary centre, which has created a new activity in Technical and Scientific Information. To initiate a common R&D activity between these two institutes, two complementary issues must be addressed: helping information specialists to handle the new technologies, and encouraging computer scientists to work on distributed documentation systems, i.e. on Digital Libraries.

With this common objective in mind, INRIA Lorraine, in partnership with the Computer Science Research Center in Nancy, is now working in three main directions: the first activity is dedicated to the development of basic tools for scientific or technical information management and document engineering; a second activity is based on the results of the first and focuses on Information Retrieval Systems; a third important activity is concentrated on the study of linguistic and terminological tools.

A main result of the first activity is the DILIB (Document and Information LIBrary) project: a workbench which uses SGML for coding information and for designing tool interfaces. A guiding principle is that all kinds of information, whether external or internal, are coded according to the SGML standard.

DILIB's kernel is a toolkit whose basic part consists of an SGML tree-handling library and a set of Unix-like commands for handling SGML records. This permits the design of generic tools which can be applied to any kind of information. For instance, using an SGML path mechanism, a selection of all Unimarc records having Dunod as publisher could be written like this:
SgmlSelect -g unimarc/f210/sc#=Dunod
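
The idea behind such a path-based selection can be sketched in Python. This is only an illustration, not DILIB code: it assumes that the records are well-formed enough to be parsed as XML, and it reads the `#=` operator as "text content equals", which is an interpretation of the example above rather than a documented rule. The function name `select_records` and the sample records are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def select_records(records, path, value):
    """Yield the records whose element at `path` has text equal to `value`.

    `path` is a slash-separated element path whose first component is the
    record's root element, e.g. "unimarc/f210/sc" (hypothetical convention).
    """
    root_name, _, rest = path.partition("/")
    for raw in records:
        root = ET.fromstring(raw)
        if root.tag != root_name:
            continue
        # With no remaining path, compare against the root element itself.
        candidates = root.findall(rest) if rest else [root]
        if any((el.text or "").strip() == value for el in candidates):
            yield raw

# Invented sample records: field 210, subfield c holds the publisher name.
records = [
    "<unimarc><f200><sa>Compilers</sa></f200><f210><sc>Dunod</sc></f210></unimarc>",
    "<unimarc><f200><sa>Networks</sa></f200><f210><sc>Masson</sc></f210></unimarc>",
]
selected = list(select_records(records, "unimarc/f210/sc", "Dunod"))
```

Because the filter only knows about paths and values, the same function could be pointed at any other record type simply by changing the path.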

The same filter (SgmlSelect) can be applied to an inverted file to select authors whose frequency is lower (or higher) than a given threshold. This 'SGML philosophy' can be applied to many areas relating to Digital Libraries, for instance the design of Information Retrieval Systems. DILIB contains several basic components (such as modules to manage large sets of records) which can be combined to build complex WWW applications linking heterogeneous sets of data. An interesting feature is the automatic generation of a 'hierarchical type of thesaurus' using clustering mechanisms. Other techniques using neural approaches have been tested (in the framework of the INIST NEURODOC project for instance).
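
The frequency-based selection on an inverted file mentioned above can be pictured with a small Python sketch. It makes several assumptions: the inverted file is modelled as a plain dictionary from author to record identifiers (DILIB's actual file layout is not described here), "frequency" is taken to mean the number of records in which an author appears, and the helper names and sample data are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(records, field):
    """Map each value of `field` to the sorted list of record ids containing it."""
    inverted = defaultdict(set)
    for rec_id, record in records.items():
        for value in record.get(field, []):
            inverted[value].add(rec_id)
    return {value: sorted(ids) for value, ids in inverted.items()}

def select_by_frequency(inverted, threshold, greater=True):
    """Keep entries whose frequency is above (or below) `threshold`."""
    if greater:
        return {k: v for k, v in inverted.items() if len(v) > threshold}
    return {k: v for k, v in inverted.items() if len(v) < threshold}

# Invented bibliographic records with an "author" field.
records = {
    "r1": {"author": ["Durand", "Martin"]},
    "r2": {"author": ["Durand"]},
    "r3": {"author": ["Durand", "Petit"]},
}
authors = build_inverted_file(records, "author")
frequent = select_by_frequency(authors, threshold=1, greater=True)
rare = select_by_frequency(authors, threshold=2, greater=False)
```

Here `frequent` keeps only the author appearing in more than one record, while `rare` keeps those appearing in fewer than two.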

In order to handle real Digital Libraries, i.e. databases which contain full-text records or articles, we are now working with DIENST repositories. DIENST, a distributed Digital Library protocol, is being developed by Cornell University in the framework of NCSTRL (Networked Computer Science Technical Reports Library). Last year, at an ERCIM meeting in Budapest, we presented a first example of coupling a DILIB browsing graph with a DIENST repository.

Another result of the SGML approach is strong interoperability, which allows a library belonging to a small organization to be integrated into several different networks. For example, the library of our laboratory is a member of the ERCIM network of Digital Libraries (using DIENST and RFC 1807) and of GRISELI, a French project for collecting technical reports which uses the ISO 12083 cataloguing format.

However, the biggest challenge for a Digital Library on the international scene is to master multilinguality and to allow a given community of end users to search large volumes of data using a common terminology base. We are now working on this topic within the framework of the MedExplore project (which is funded by CNRS GIS Sciences de la Cognition). Its aim is to allow a medical expert to search through various kinds of data, including structured data (for instance the MEDLINE database and the UMLS thesaurus) and raw textual information coming from the Internet.

A simplified example of the strategy we intend to apply is the following. If we want to generate a hypertext from a set of raw documents, the first step, and a key point of our approach, is to build up a significant set of some hundred keywords. This collection of terminological items can be produced in several ways, for instance using a clustering mechanism. We could then apply complementary tools and techniques to produce a browsing graph (for instance by extracting significant relationships from a large thesaurus) and an indexing engine. For each operation, the limited number of keywords allows a choice between purely automatic and human-assisted processing. This kind of technique, coming from bibliometric or scientific surveys, can be used in many areas dealing with indexing or multilinguality in the Digital Libraries sector, especially if we want to enforce a federative way of working.
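
The three steps of this strategy (keyword set, browsing graph, index) can be sketched in a few lines of Python. The sketch deliberately swaps in simpler techniques than those named above: keywords are ranked by document frequency instead of being produced by clustering, and graph links come from co-occurrence in documents instead of thesaurus relationships. All function names, the stopword list and the sample documents are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "is", "on"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def pick_keywords(docs, limit):
    """Step 1: keep the `limit` terms found in the largest number of documents."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    # Descending document frequency, then alphabetical order for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:limit]]

def build_index(docs, keywords):
    """Step 2: map each keyword to the set of documents that contain it."""
    index = {k: set() for k in keywords}
    for doc_id, text in docs.items():
        for term in set(tokenize(text)) & set(keywords):
            index[term].add(doc_id)
    return index

def build_graph(index, min_shared=2):
    """Step 3: link two keywords when they share at least `min_shared` documents."""
    graph = {k: set() for k in index}
    for a, b in combinations(sorted(index), 2):
        if len(index[a] & index[b]) >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Invented raw documents.
docs = {
    "d1": "Indexing of medical records in a digital library",
    "d2": "A thesaurus for medical indexing",
    "d3": "Digital library protocols",
}
keywords = pick_keywords(docs, limit=4)
index = build_index(docs, keywords)
graph = build_graph(index, min_shared=2)
```

Because the keyword list is short, each intermediate result (`keywords`, `index`, `graph`) is small enough to be reviewed or corrected by hand before the next step, which is the choice between automatic and human-assisted processing described above.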

Figure: Simplified strategy for hypertext generation.

In the framework of ERCIM, we are submitting two proposals: the first aims at setting up a European Digital Library on Computer Sciences and Applied Mathematics; the second intends to create a Euro-Mediterranean network for information on public health. Both projects must handle multilinguality, a key point for European Digital Libraries.

Please contact:
Jacques Ducloy - INRIA Lorraine & CRIN CNRS
Tel: +33 83 59 30 38
E-mail: Jacques.Ducloy@inria.fr
