FileTrack (Multilingual IR) Functional Spec

Cormac Redmond — credmond85 /at/ gmail

Abstract

This work proposes and evaluates a probabilistic multilingual retrieval system. The completed system will provide the user with summarized results of a search over multilingual text documents. Relevance feedback will also be implemented to make searching more accurate. The system is intended for local searching on individual machines, and users will interact with it through a conventional user interface.

Finally, this work proposes an approach to designing and evaluating the Information Retrieval (IR) system.

1. Introduction

1.1 Overview

The proposed system is a probabilistic multilingual retrieval system. The completed system will allow a user to search multilingual text files on their computer. Summarization is implemented to aid the user in making a relevant choice. Users will interact with the system through a conventional user interface, allowing them to submit searches and view results. Search queries may be in any language, and all documents will be searchable.

1.2 Users

Users are expected to be people who work with large volumes of documents in multiple languages and who need such a facility, whether in a personal or business-oriented capacity.

2. Functional Description

Multilingual Information Retrieval (MLIR) refers to the ability to process a query for information in any language, search a collection of objects (including text), and return the most relevant objects.

A typical MLIR system is composed of a number of components, as outlined below.

2.1 Pre-processing, Indexing & File Structures

An inverted index is an index structure that stores a mapping from words to their locations in a document or a set of documents. It is the most popular data structure used in document retrieval systems. The inverted index file will be used in conjunction with a user's search terms to determine which documents contain a search word and where within those documents it occurs. It will be implemented using a hash function and hash table for efficient lookup.
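As an illustration, a minimal sketch of such an index in Java is given below; the Posting class, its field names and the choice of HashMap are illustrative assumptions rather than part of this specification.

import java.util.*;

/** Minimal sketch of an inverted index: term -> postings (document id plus word positions). */
public class InvertedIndex {

    /** One posting records where a term occurs within a single document. */
    public static class Posting {
        public final String docId;
        public final List<Integer> positions = new ArrayList<>();
        public Posting(String docId) { this.docId = docId; }
    }

    // A HashMap gives (amortised) constant-time lookup from a term to its postings.
    private final Map<String, Map<String, Posting>> index = new HashMap<>();

    /** Record an occurrence of an (already stemmed, lower-cased) term at a word position. */
    public void add(String term, String docId, int position) {
        index.computeIfAbsent(term, t -> new HashMap<>())
             .computeIfAbsent(docId, d -> new Posting(d))
             .positions.add(position);
    }

    /** Return the postings for a term, or an empty collection if no document contains it. */
    public Collection<Posting> lookup(String term) {
        return index.getOrDefault(term, Collections.emptyMap()).values();
    }
}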

Upon installation, and at regular intervals thereafter, the system will create and update the index files. Each collection will consist of a set of documents in a particular language, and each collection will have its own index file.

Creating and updating the index files is a complex process. It poses the problem of identifying the language of each document, and the further problem of documents that contain multiple languages. Automatic language identification is beyond the scope of this system and will be handled using third-party software. The Rosette Language Identifier [1] is an example of software written for this exact purpose, allowing applications to use its APIs.

Using this, or similar third-party software, documents containing multiple languages will be added to multiple collections and hence exist in multiple index files.

In order to create the index files, the system will process each text file on the system word by word.

Before referring to the index file, the system must first pre-process user-entered search words.

Pre-processing search words is known as conflation: reducing variant word forms to a common stem, for example by removing the commoner morphological and inflexional endings from words in English. For English requests, the system will use the well-known Porter stemming algorithm [12] to carry out conflation. For other languages, the system will use the stemming algorithms provided by Snowball [2].

The system will also disregard "stop words": common words such as 'a', 'the', 'is', etc. Lists of stop words for many languages are available from Snowball [2] and will be referred to when choosing stop words.
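The sketch below illustrates how this pre-processing step might look in Java; the Stemmer interface is a placeholder for the language-specific Snowball stemmers, and the stop-word set would in practice be loaded from the Snowball lists rather than supplied by hand.

import java.util.*;

/** Sketch of query/document pre-processing: lower-case, drop stop words, stem. */
public class Preprocessor {

    /** Stands in for a language-specific Snowball stemmer (e.g. the Porter stemmer for English). */
    public interface Stemmer {
        String stem(String word);
    }

    private final Set<String> stopWords;
    private final Stemmer stemmer;

    public Preprocessor(Set<String> stopWords, Stemmer stemmer) {
        this.stopWords = stopWords;
        this.stemmer = stemmer;
    }

    /** Convert raw text into the list of index terms it contributes. */
    public List<String> process(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || stopWords.contains(token)) continue;  // drop stop words
            terms.add(stemmer.stem(token));                              // conflate to the stem
        }
        return terms;
    }
}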

2.2 Best Match Searching using the Probabilistic Model

Once the index files are created, the system will use what is known as the probabilistic model to retrieve and rank results from each collection of documents (by referring to the appropriate index files). This model has been shown to retrieve good results, particularly for novice users. It consists of a weighting scheme that determines which documents are relevant and, more importantly, how relevant they are compared to the others.

The probabilistic model query-document matching score ms(j) for document d(j) is the sum of the weights of the query terms that occur in the document:

  ms(j) = Σ w(i, j), summed over all query terms t(i) present in d(j)

where w(i, j) is a term weighting scheme.

In this case, the weighting scheme to be used is the "Okapi BM25 formula", which in its usual form is:

  cw(i, j) = cfw(i) × tf(i, j) × (K1 + 1) / (K1 × ((1 − b) + b × ndl(j)) + tf(i, j))

where cw(i, j) is the combined weight, cfw(i) is the collection frequency weight of term t(i), tf(i, j) is the number of occurrences of term t(i) in document d(j), K1 and b are tuning constants, and ndl(j) is the normalised document length:

  ndl(j) = dl(j) / (average document length over the collection)

where dl(j) is the total number of term occurrences in document d(j).
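The following sketch shows how the combined weight and matching score might be computed in Java; the values of K1 and b are the conventional BM25 tuning constants (1.2 and 0.75) and are an assumption here, since this specification does not fix them.

/** Sketch of the Okapi BM25 combined weight for one term in one document. */
public class Bm25 {

    private static final double K1 = 1.2;   // term-frequency saturation (conventional value, assumed)
    private static final double B  = 0.75;  // document-length normalisation (conventional value, assumed)

    /**
     * @param tf    number of occurrences of term t(i) in document d(j)
     * @param cfw   collection frequency weight of t(i), e.g. log(N / n(i))
     * @param dl    total number of term occurrences in d(j)
     * @param avgDl average document length over the collection
     */
    public static double combinedWeight(int tf, double cfw, double dl, double avgDl) {
        double ndl = dl / avgDl;  // normalised document length
        return cfw * tf * (K1 + 1) / (K1 * ((1 - B) + B * ndl) + tf);
    }

    /** The matching score ms(j) is simply the sum of the weights of the query terms present in d(j). */
    public static double matchingScore(double[] termWeights) {
        double ms = 0;
        for (double w : termWeights) ms += w;
        return ms;
    }
}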

Virtually all of the major systems at TREC [9] now use the Okapi BM25 formula (according to Vector and Probabilistic Ranking [3]), which incorporates the Robertson-Sparck Jones weights. This allows normal term weighting but also takes user relevance feedback into consideration. Relevance feedback refers to a user explicitly telling the system which documents are relevant, allowing the system to re-weight the query and re-search the documents, which should yield more relevant results.

When relevance feedback is available, terms are reweighted by replacing the cfw(i) part with rw(i), the Robertson-Sparck Jones relevance weight, which is defined as follows:

  rw(i) = log[ ((r(i) + 0.5) × (N − n(i) − R + r(i) + 0.5)) / ((n(i) − r(i) + 0.5) × (R − r(i) + 0.5)) ]

where r(i) is the number of known relevant documents that term t(i) occurs in, N is the total number of documents in the collection, R is the total number of known relevant documents in the collection, and n(i) is the number of documents that term t(i) occurs in.

All terms occurring in the known relevant documents are potentially useful expansion terms. These terms are first ranked using the Robertson offer weight ow(i):

  ow(i) = r(i) × rw(i)

The top ranking terms will then be added to the original query.
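A sketch of this relevance-feedback step in Java is given below; the representation of the per-term statistics and the number of expansion terms to keep are illustrative assumptions, not values fixed by this specification.

import java.util.*;
import java.util.stream.Collectors;

/** Sketch of query expansion from relevance feedback using RSJ and offer weights. */
public class RelevanceFeedback {

    /** Robertson-Sparck Jones relevance weight rw(i). */
    public static double rw(int r, int n, int R, int N) {
        return Math.log(((r + 0.5) * (N - n - R + r + 0.5))
                      / ((n - r + 0.5) * (R - r + 0.5)));
    }

    /** Robertson offer weight ow(i) = r(i) * rw(i). */
    public static double ow(int r, int n, int R, int N) {
        return r * rw(r, n, R, N);
    }

    /**
     * Pick the expansion terms with the highest offer weights.
     * candidateStats maps each candidate term to {r(i), n(i)}.
     */
    public static List<String> expansionTerms(Map<String, int[]> candidateStats, int R, int N, int howMany) {
        return candidateStats.entrySet().stream()
                .sorted((a, b) -> Double.compare(
                        ow(b.getValue()[0], b.getValue()[1], R, N),
                        ow(a.getValue()[0], a.getValue()[1], R, N)))
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}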

Once the above algorithms are applied, the system will have a matching score for each relevant document in a particular collection.

2.3 Multilingual Searching

In order to facilitate multilingual searching, the probabilistic model described in Section 2.2 will be applied to each collection on every search. Each collection is a set of documents in a particular language.

In order to search a collection, the system will translate the search request into the language of each collection the user has chosen to search, and then perform a search on those collections. This approach has a number of advantages:

  • No storage overhead
  • Translation will be fast

Query translation can, however, sometimes be of poor quality and yield poor results. Nevertheless, as the system is designed for use on a single PC, the alternative of translating whole documents into a single language (e.g. English) would not be sustainable in terms of storage.

The user will be given a choice as to which language collections to search and the results will be displayed as separate ranked lists.

Translation of queries will be handled by reliable third-party software before the queries are pre-processed as described in Section 2.1. Many examples of such software exist.
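The sketch below outlines this per-collection search flow in Java; the Translator, QueryPreprocessor and LanguageCollection interfaces are placeholders for the third-party translation software, the pre-processing component of Section 2.1 and the per-language indexes, and are assumptions made for illustration.

import java.util.*;

/** Sketch of the multilingual search flow: translate the query, pre-process it, then search each selected collection. */
public class MultilingualSearch {

    /** Placeholder for the third-party query translation software. */
    public interface Translator {
        String translate(String query, String targetLanguage);
    }

    /** Placeholder for the pre-processing component (stop-word removal and stemming) for one language. */
    public interface QueryPreprocessor {
        List<String> process(String query);
    }

    /** Placeholder for one per-language collection and its index file. */
    public interface LanguageCollection {
        String language();
        List<String> rankedSearch(List<String> queryTerms);  // returns a ranked list of document identifiers
    }

    /** Returns one ranked result list per selected collection, keyed by language. */
    public static Map<String, List<String>> search(String query,
                                                   List<LanguageCollection> selectedCollections,
                                                   Translator translator,
                                                   Map<String, QueryPreprocessor> preprocessors) {
        Map<String, List<String>> resultsPerLanguage = new LinkedHashMap<>();
        for (LanguageCollection collection : selectedCollections) {
            String translatedQuery = translator.translate(query, collection.language());
            List<String> terms = preprocessors.get(collection.language()).process(translatedQuery);
            resultsPerLanguage.put(collection.language(), collection.rankedSearch(terms));
        }
        return resultsPerLanguage;
    }
}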

2.4 Summarization

After the system has successfully retrieved a ranked list of documents, the next step will be to summarize these documents, in order to provide the user with a better insight into a document’s contents.

No definitive summarization method exists today, although some approaches have proven quite promising. The most useful summary is one customized to reflect the information need expressed in the query.

According to the article "Advantages of Query Biased Summaries in Information Retrieval" [4], the use of query-biased summaries significantly improves both the accuracy and the speed of users' relevance judgments of the information within the documents. As a result, it may be easier for users to decide on the relevance of the retrieved documents to their queries.

Although there have been attempts to produce coherent summaries using language generation and artificial intelligence techniques (Jacobs & Rau [5]; McKeown & Radev [6]), these are capable of processing texts only within a narrow domain whose characteristics are predictable and well understood (e.g. news stories, financial and commercial reports). There is not enough evidence that such systems will be able to handle domain-independent text in the foreseeable future. It is for this reason that this system will implement the relatively new query-based summarization approach.

2.5 Query-Based Summarization

Query-based summarization works by finding and weighting sentences based on their location and structure within the document, in conjunction with the user's search query.

The structure of a document often gives a clue to what sentences are relevant.

Title

The title of a document generally expresses the main topic covered in the document, so sentences containing title terms are assigned a positive weight.

Location

In agreement with Brandow et al. [7], who suggested that improvements can be made by heavily weighting sentences appearing at the beginning of a document, the first two sentences of an article are given additional weight.

Section headings are also given a heading weight.

Term Occurrence

The system locates clusters of significant words (Luhn [8]) within sentences and assigns scores to them accordingly. To achieve this, the system first uses a term occurrence (TO) count within each document to determine which terms are significant. According to "Advantages of Query Biased Summaries in Information Retrieval" [4], a term occurrence threshold of 7 is appropriate for determining a term's significance; for larger documents this threshold should be increased.

The system will then use Luhn's approach [8] of determining a relationship between two significant terms within a sentence: two terms are considered significantly related if both of them are significant and no more than four non-significant words occur between them.
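A sketch of this cluster scoring in Java is shown below; it assumes the sentence has already been tokenised and stemmed, and the cluster score used, (significant words in the cluster)² divided by the number of words spanned by the cluster, follows Luhn's original formulation.

import java.util.*;

/** Sketch of Luhn's significant-word cluster score for a single sentence. */
public class LuhnScorer {

    private static final int MAX_GAP = 4;  // at most four non-significant words between significant ones

    /** Score = (significant words in the best cluster)^2 / (total words spanned by that cluster). */
    public static double sentenceScore(List<String> sentenceTerms, Set<String> significantTerms) {
        double best = 0;
        int clusterStart = -1, lastSignificant = -1, significantCount = 0;
        for (int i = 0; i < sentenceTerms.size(); i++) {
            if (!significantTerms.contains(sentenceTerms.get(i))) continue;
            if (lastSignificant >= 0 && i - lastSignificant - 1 <= MAX_GAP) {
                significantCount++;          // extend the current cluster
            } else {
                clusterStart = i;            // start a new cluster at this significant word
                significantCount = 1;
            }
            lastSignificant = i;
            int clusterLength = lastSignificant - clusterStart + 1;
            best = Math.max(best, (double) (significantCount * significantCount) / clusterLength);
        }
        return best;
    }
}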

Query Biasing

Finally, the essence of query-based summarization is a weight based on the user's search terms. Here, each sentence is given an additional weight, determined by the number of query words that occur in the sentence: the more occurrences, the higher the score.

The actual score of a sentence, in relation to a specific query, will be derived by dividing the square of the number of query terms included in that sentence by the total number of terms in the specific query.

This score is now added to the other scores.
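The sketch below shows one way the individual sentence weights might be combined in Java; the title, location and heading weights are illustrative constants rather than values fixed by this specification, and the query-bias component follows the formula just described (counting the distinct query terms present in the sentence).

import java.util.*;

/** Sketch of the final sentence score: structural weights plus the query-bias score. */
public class SentenceScorer {

    // Illustrative structural weights; the specification does not fix their values.
    private static final double TITLE_WEIGHT    = 1.0;
    private static final double LOCATION_WEIGHT = 1.0;
    private static final double HEADING_WEIGHT  = 0.5;

    /** Query-bias score = (query terms present in the sentence)^2 / (total terms in the query). */
    public static double queryBiasScore(List<String> sentenceTerms, List<String> queryTerms) {
        if (queryTerms.isEmpty()) return 0;
        Set<String> sentence = new HashSet<>(sentenceTerms);
        long hits = queryTerms.stream().distinct().filter(sentence::contains).count();
        return (double) (hits * hits) / queryTerms.size();
    }

    /** Total score: Luhn cluster score + query-bias score + structural weights. */
    public static double totalScore(List<String> sentenceTerms, List<String> queryTerms,
                                    boolean containsTitleTerm, boolean isLeadingSentence,
                                    boolean isHeading, double luhnClusterScore) {
        double score = luhnClusterScore + queryBiasScore(sentenceTerms, queryTerms);
        if (containsTitleTerm)  score += TITLE_WEIGHT;
        if (isLeadingSentence)  score += LOCATION_WEIGHT;
        if (isHeading)          score += HEADING_WEIGHT;
        return score;
    }
}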

Once these weights are applied, the system will have a ranked and structured list of summaries, which will appear underneath each search result.

3. Implementation and Evaluation

3.1 Development

The system will be developed in a high-level object-oriented language, such as Java, C++ or C#.

The user interface will be a conventional GUI, providing the user with a choice of the following:

  • Search location
  • Query language choice and query submission
  • Language(s) to include in search

Upon search completion, the user will be presented with a list of results containing a summary for each ranked document. Optionally, the user can mark the most relevant documents in order to refine the search and obtain more accurate results.

Determining Document Types

In order to include a document in an index file, the system must be able to determine which files are text documents and which are not. It must also be able to access the content within a document. This can be done using the Win32 API or third-party APIs, such as those provided upon installation of Microsoft Word, Adobe Acrobat, OpenOffice, etc. Equivalents exist for a UNIX environment.

Automatic File Indexing

The system will run a background process, requiring no user intervention, to scan the file system and update the index files regularly.

3.2 Evaluation

Once the system is implemented, it will be evaluated. Evaluation will be performed against a test collection provided by TREC [9].

Each of these collections consists of a set of documents, a set of topics (questions), and a corresponding set of relevance judgments (right answers).

Precision

One approach to evaluating an IR system's performance is average precision, obtained by averaging, over all test requests, the average precision achieved for each request:

  Average Precision = (1 / Q) × Σ AP(q), summed over the Q test requests

where the average precision for a request, AP(q), is the mean of the precision values observed at each of the R(q) relevant documents retrieved for that request:

  AP(q) = (1 / R(q)) × Σ (precision at each relevant document retrieved for request q)

and precision itself is defined as:

  precision = (number of relevant documents retrieved) / (total number of documents retrieved)

These results will then be compared to the known results of the test collection provided by TREC. The higher the precision values, the better the IR system is at not retrieving non-relevant documents.

Recall

Recall is the ability of an IR system to retrieve all or most of the relevant documents in the collection. It is the proportion of the relevant documents that are retrieved:

  recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection)

Recall can likewise be averaged over all requests to give the average recall of the IR system, and these results will also be compared with the known results of the test collection.

The higher the average recall, the better the IR system is at finding all of the relevant documents.

Fall-out

Fall-out is the proportion of the non-relevant documents in the collection that are retrieved:

  fall-out = (number of non-relevant documents retrieved) / (total number of non-relevant documents in the collection)

It complements recall by measuring how well the system avoids retrieving non-relevant documents, and can also be used in evaluating the IR system.
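The sketch below shows how these measures might be computed in Java against a TREC-style set of relevance judgments; the representation of the judgments as a set of relevant document identifiers per topic is an assumption made for illustration.

import java.util.*;

/** Sketch of the evaluation measures: precision, recall, fall-out and average precision. */
public class Evaluation {

    /** Proportion of retrieved documents that are relevant. */
    public static double precision(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0 : (double) hits / retrieved.size();
    }

    /** Proportion of the relevant documents that were retrieved. */
    public static double recall(List<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0 : (double) hits / relevant.size();
    }

    /** Proportion of the non-relevant documents in the collection that were retrieved. */
    public static double fallout(List<String> retrieved, Set<String> relevant, int collectionSize) {
        long falseHits = retrieved.stream().filter(d -> !relevant.contains(d)).count();
        int nonRelevant = collectionSize - relevant.size();
        return nonRelevant == 0 ? 0 : (double) falseHits / nonRelevant;
    }

    /** Average of the precision values observed at each relevant document in the ranked list. */
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        double sum = 0;
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0 : sum / relevant.size();
    }
}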

Performing these tests against the test collection will provide an accurate evaluation of the complete IR system.

4. References

[1] Rosette Language Identifier – Automatically identify text in over 40 languages, http://www.basistech.com/language-identification.

[2] Snowball, Dr Martin Porter & Richard Boulton, http://snowball.tartarus.org

[3] Vector and Probabilistic Ranking, Ray Larson & Marti Hearst, University of California, Berkeley, School of Information Management and Systems.

[4] Advantages of Query Biased Summaries in Information Retrieval, Anastasios Tombros & Mark Sanderson.

[5] Jacobs, P.S., and Rau, L.F. 1990. SCISOR: Extracting information from on-line news. Communications of the ACM, 33(11): 88-97, November 1990.

[6] McKeown, K., and Radev, D.R. 1995. Generating summaries from multiple news articles. In Fox, E.A., Ingwersen, P., and Fidel, R. eds. Proceedings of the Eighteenth Annual ACM SIGIR Conference on Research and Development in Information Retrieval, 74-82. ACM Press, July 1995.

[7] Brandow, R., Mitze, K., and Rau, L.F. 1995. Automatic condensation of electronic publications by sentence selection. Information Processing & Management 31(5): 675-685, September 1995.

[8] Luhn, H.P. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development 2(2): 159-165, April 1958.

[9] TREC, Text REtrieval Conference, http://trec.nist.gov.

[10] Evaluation of a Query-biased Document Summarisation Approach for the Question Answering Task, Mingfang Wu, Ross Wilkinson & Cécile Paris.

[11] Dublin City University at CLEF 2006: Cross-Language Speech Retrieval (CL-SR) Experiments, Gareth J. F. Jones, Ke Zhang and Adenike M. Lam-Adesina, Centre for Digital Video Processing & School of Computing.

[12] The Porter Stemming Algorithm, Revised 2006, Martin Porter.

[13] Multilingual Information Management: Current Levels and Future Abilities, 1999, Eduard Hovy, Nancy Ide & Robert Frederking.