Contrastive Indexing of Full Text Documents
by Laurent Romary and Patrice Bonhomme
In the context of the general Aquarelle scenario, the creation of
folders allows a user to put together pieces of information which he considers
useful for his own purpose. In particular, he may include textual fields
which in turn have to be made accessible for further retrieval. To this
end, we designed a full text indexing method which, rather than providing
an absolute set of indexes for each textual field, aims at contrasting
each of them to the other fields the user might point to either in the
same folder or within other folders he has created or extracted from an
Aquarelle server.
The basic idea behind the contrastive indexing method is to consider
a given document or rather the set of tokens it contains as a sample taken
from the set of all tokens belonging to the reference corpus of documents
it belongs to. The frequency of the token within the document can then
be compared to the expected distribution computed from the reference corpus,
in order to evaluate whether it is inkeeping with it, or on the contrary
too far from it not to be interpreted as indicating a particular relevance
for the document. For each document, we thus compute a set of so-called
contrasting tokens which is a good indication of its informational content
relatively to the contents of the documents it is compared to. As a consequence,
this method has different interesting properties which both from a linguistic
and information retrieval point of view makes it a good option for an optimal
full text indexing mechanism:
- there is no need for a specific list of grammatical words (ie stop-list)
which have to be avoided during the indexing process, since these are generally
subject to a uniform distribution among a set of documents taken from the
same field or belonging to the same textual genre (eg historical descriptions,
newspaper articles etc)
- furthermore, not only grammatical words are being dropped from the
candidate list of indexing terms, but also those words which, although
being meaningful from an absolute point of view, are uniformly represented
within the reference set of documents and thus are not relevant for the
description of any specific one. For instance, within a set of documents
describing historical buildings words like architecture or architectural
are not likely to appear as contrasting tokens:
- as a consequence, the method can be seen at least to a large
extent as being language independent, as it relies on a local model
of linguistic distribution. Still, even if this has not been specifically
tested, we might expect that for highly inflectional languages the result
might not be as good as those observed for French and English, unless a
lemmatizing phase is considered
- finally it can be observed that a given document can be indexed differently
according to the set of documents which contextualizes it, thus providing
a way to account for the differents viewpoints users might project onto
it when building up folders.
The full text indexing module has been considered as a semi-automatic
process provided to the user during the folder editing stage. As a matter
of fact the user always has the possibility to edit and validate the set
of candidate terms before these are actually inserted within the folder
itself.
Given the robustness of the method as we have observed it in our first
trials within the Aquarelle project, we have thought of extending it towards
a general mechanism of content identification within a set of more less
homogeneous documents. Indeed, what results from the contrastive indexing
process is a kind of thematic description of the document in comparison
with a given reference which acts as a background, hence the possibility
to iteratively group together documents with similar descriptions and further
to build up a thematic map of the whole reference database. This concept
has been recently applied within a project funded by the DGLF (Délégation
Générale à la Langue Française) aiming at automatically
producing thematic descriptions of a given web site. The contrastive indexing
method, combined with a hierarchical clustering algorithm has allowed us
to produce topic maps of a given web site independently of its actual language
or content domain.
Please contact:
Laurent Romary and Patrice Bonhomme - LORIA
Tel: +33 3 8359 2037
E-mail: {romary,bonhomme}@loria.fr