Student HCI Online Research Experiments
SHORE 2001 : Layout and Readability  

Cross-Language Information Retrieval: Layout Strategies for Gloss Translation

Authors

Eiman M. Elnahrawy (eiman@cs.umd.edu)

Nagia M. Ghanem (ghanem@cfar.umd.edu)

Moustafa A. Youssef (moustafa@cs.umd.edu)

Abstract

Cross-Language Information Retrieval (CLIR) has been referred to as "the problem of finding documents for people, which they cannot read". However, this is not strictly true. For example, multilingual searchers might issue a single query against a multilingual collection, or a searcher with a limited active vocabulary but good reading comprehension in a second language might prefer to issue queries in their most fluent language.

In this experiment, we study some issues in the user interface design of CLIR systems. In particular, we study some enhancements to the current user interface design for the University of Maryland Translingual Information Retrieval System web page. The current search result web page uses a gloss translation approach in which the words of the document are translated, one by one, and the three most common translations for each word are displayed horizontally between parentheses. The user can use this gloss translation to assess whether or not the document is relevant to the search topic. We study the effect of changing the number of translations displayed for each word, e.g. displaying the two or four most common translations rather than the current three.
The results show that users report higher subjective satisfaction with fewer translations per word. However, no statistically significant difference was found between the treatments in terms of speed, fallout, or recall measured on non-relevant documents.


Introduction

As Internet resources such as the World Wide Web (WWW) become accessible to more and more countries, and technological advances overcome the network, interface and computer system differences that have impeded information access, it is increasingly common for searchers to wish to explore collections of documents that are not written in their native languages. Beyond merely accepting extended character sets and performing language identification, information retrieval systems should be able to help users search for information across language boundaries.

A Cross Language Information Retrieval (CLIR) system retrieves documents in a language that is different from the query language [6]. A user of a CLIR system can enter a query in one language, but the returned documents will be in the language of the document collection. Such a system is useful, for example, when a user has some knowledge of the document language but has difficulty formulating effective queries in it. These users might very well be able to distinguish good documents from bad documents based on their limited knowledge. They could then send the documents that they have judged relevant on to a translation service bureau or machine translation system [4].

A more useful CLIR system would accept queries in one language and return documents in the same language, despite the fact that the underlying text retrieval system has indexed documents in a different language. The acceptance of a system by users depends heavily on the manner in which the translated documents are presented and on its usefulness. One method for presenting the translated document is to provide a gloss translation of the document. In a gloss translation of a document:

- If the unknown language word has a single gloss in the dictionary, show that gloss.

- If the unknown language word has multiple glosses in the dictionary, show up to n of them for some customizable n.

- If the unknown language word is not found in the dictionary, then show the unknown language word itself.
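The three rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's actual code (which the paper describes as Perl scripts): the function names, the toy dictionary, and the choice to separate glosses with spaces inside parentheses are all assumptions, and the dictionary is assumed to list glosses from most to least common.

```python
def gloss_word(word, dictionary, n=3):
    """Return the gloss for one source-language word, following the three rules."""
    glosses = dictionary.get(word)
    if not glosses:
        # Rule 3: word not in the dictionary, show the source word itself.
        return word
    if len(glosses) == 1:
        # Rule 1: a single gloss is shown as-is.
        return glosses[0]
    # Rule 2: show up to n glosses (assumed ordered most common first).
    return "(" + " ".join(glosses[:n]) + ")"


def gloss_document(words, dictionary, n=3):
    """Gloss a segmented document word by word."""
    return " ".join(gloss_word(w, dictionary, n) for w in words)


# Hypothetical toy dictionary; the keys stand in for source-language words.
toy_dictionary = {
    "src_a": ["bank"],
    "src_b": ["run", "operate", "manage", "flow"],
}

print(gloss_document(["src_a", "src_b", "src_c"], toy_dictionary, n=2))
```

Changing `n` from 2 to 4 corresponds to the kind of variation in the number of displayed translations that this experiment studies.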

This presentation of the translated document provides enough information to support two critical decisions facing the user who has received these documents [8], namely:

- Deciding whether a link is worth following.

- Deciding whether some text is worth having translated. 

Another factor determining the acceptance of a CLIR system by users is the user interface of the system and the facilities it provides. A good user interface for a CLIR system should efficiently support the following functions: starting the search process, specifying the query, viewing retrieval results in context, and interactive relevance feedback [5]. Researchers in the area of cross-language information retrieval give special attention to these factors, and many studies have been performed to evaluate different systems.

The Keizai system [1,2] searches Japanese and Korean web data starting from an English query. Its user interface uses a document thumbnail visualization: when the mouse pointer is positioned over a document line, both the original and the English summaries for that document are dynamically displayed in the bottom portion of the display, giving users a very quick way of evaluating document relevance. Two metrics are used for relevance judgment: False Hit Ratio and False Drop Ratio.

Determining the best number of translations displayed per word in a gloss translation of a retrieved document is an important issue that differentiates among implementations. While the University of Maryland Translingual Information Retrieval System [3] displays the three most common English word translations for each Chinese word, the system described in [4] shows only one translation per word. The first system takes English queries and produces English gloss translations of Chinese documents, and the latter takes English queries and produces English gloss translations of Spanish documents. It is demonstrated that relatively simple term substitution and disambiguation approaches could be viable for cross language text retrieval.

Another important question is whether a CLIR system has the same performance for collections in different languages. To answer this question, a system that translates Spanish and French natural language queries into English was implemented in [7], and the performance for the two languages was compared. The authors stated that linguistic differences might have caused the performance differences.

The extent to which a gloss translation presentation of documents helps the user to make decisions is evaluated in [8]. In the experiment, all subjects are faced with the same categorization problem, but some of those subjects are given materials in English to categorize while other subjects are given the same content in the form of a gloss translation. The effectiveness of the gloss translation is measured by distance measures between the decisions made by subjects who read the gloss translation and the ideal case (defined by the behavior of subjects who received materials in English).

To compare the effectiveness of different CLIR systems, there should be some measures. One commonly used single-valued measure for a retrieval system is van Rijsbergen’s F measure, which is a weighted harmonic mean of recall and precision [6]:

$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

where P is the precision (the fraction of the selected documents that are relevant), R is the recall (the fraction of the relevant documents that are selected), and β is the ratio of the relative importance that the searcher ascribes to recall and precision. It is often assumed that β = 1 (which results in the unweighted harmonic mean).

Another measure used is the fallout, which here denotes the precision measured on non-relevant documents.
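As a small worked illustration, the standard form of van Rijsbergen's F measure, F = (β² + 1)PR / (β²P + R), can be computed as follows. The function name and the convention of returning 0 when both precision and recall are 0 are assumptions made for this sketch.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (van Rijsbergen's F)."""
    if precision == 0 and recall == 0:
        # Assumed convention: avoid dividing by zero when nothing is right.
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# With beta = 1 the measure is the plain harmonic mean of P and R.
print(f_measure(1.0, 0.5))            # 2 * 1.0 * 0.5 / 1.5
# With beta = 2, recall is weighted more heavily than precision.
print(f_measure(1.0, 0.5, beta=2.0))  # 5 * 0.5 / 4.5
```

With β = 1 and equal precision and recall, the measure reduces to that common value, which is a quick sanity check on any implementation.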

In this experiment, we study some issues in the user interface design of CLIR systems. In particular, we study some enhancements to the current user interface design for the University of Maryland Translingual Information Retrieval System web page [9]. The current search result web page uses a gloss translation approach in which the words of the document are translated, one by one, and the three most common translations for each word are displayed horizontally between parentheses, with the most common translation placed first. This first translation is displayed in bold for emphasis. Moreover, the words of the original query are displayed in red.

Users can use this gloss translation to assess whether or not the document is relevant to the search topic. We study the effect of changing the number of translations, e.g. displaying the two, three or four most common translations, rather than always displaying three. Displaying a single translation per word was not considered because the user may mistakenly assume that this gloss translation is correct. A small number of translations per word (e.g. 2 translations) may be too few for users to reach a correct judgment. However, a large number of translations per word (e.g. 4 translations) may be unnecessary and may slow the users down. The experiment seeks to find the optimal number of translations per word.

We hypothesize that increasing the number of translations per word will decrease user error. It will also decrease the number of pages on which users can make a decision. The reasoning is that more translations per word give the user more information with which to make an accurate decision. However, more translations per word also lead to more scrolling. Scrolling can be slow and may disrupt the perception of spatial layout [10].

The subjects' proposed task is to perform a search using a query that is automatically generated from a topic description and then judge the relevance (relevant, irrelevant, or do not know) of as many documents as time allows. They are given a fixed time (5 minutes per query) to perform the required task.


Experiment

Introduction and Hypothesis

The goal of this experiment was to study some layout strategies for gloss translation. In the current web page for the “University of Maryland Translingual Information Retrieval System”, the words of the document are translated, one by one, and the three most common translations for each word are displayed between parentheses. The first translation is displayed in bold and the query words are displayed in red. Using this gloss translation, the users can assess whether or not the document is relevant to the search topic. In our experiment we studied how the users’ assessment is affected by changing the number of translations, e.g. displaying the two, three or four most common translations for each word.

Hypothesis 1: Increasing the number of displayed translations per word will decrease the percentage of documents on which the user can make a decision within a fixed period of time.

Users will spend more time reading and making a decision when more translations per word are displayed.

Hypothesis 2: Increasing the number of displayed translations per word will improve the users’ assessment.

Users can make better assessments by using the extra information provided by the extra number of translations.

We expected that users would prefer fewer translations per word, because they would prefer to make a quick and accurate decision. We explored this by asking an informal question in our subjective satisfaction questionnaire.

Variables

Independent variable: The number of displayed translations per word. It has three treatments: displaying the two, three, or four most common translations per word (1 x 3 experiment).

The effect of changing the number of displayed translations on the users’ assessment can be measured by the fallout (the fraction of the selected documents that are non-relevant) and the Recall measured on non-relevant documents (the fraction of the non-relevant documents that are selected). Together with speed and user satisfaction, this gave the experiment four dependent variables, as follows:

Dependent variables:

1. The percentage of documents on which the user could make a decision within a fixed period of time.

2. Precision measured using non-relevant documents (Fallout).

3. Recall measured using non-relevant documents.

4. The user satisfaction.
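The first three dependent variables can be derived from a subject's recorded judgments roughly as sketched below. The paper does not spell out the exact bookkeeping, so several choices here are assumptions: the function and argument names are hypothetical, every recorded judgment (including "do not know") is counted as a decision, and Precision and Recall on non-relevant documents are computed only over the documents the subject actually judged.

```python
def nonrelevant_metrics(judgments, truth, total_retrieved):
    """Compute percentage judged, and precision/recall on non-relevant documents.

    judgments: dict doc_id -> "relevant" | "irrelevant" | "unknown"
               (only documents the subject reached within the time limit)
    truth:     dict doc_id -> True if the document is actually relevant
    total_retrieved: number of documents shown for the query
    """
    # Assumption: any recorded judgment counts as a decision.
    pct_judged = 100.0 * len(judgments) / total_retrieved

    marked_irrelevant = [d for d, j in judgments.items() if j == "irrelevant"]
    truly_irrelevant = [d for d in judgments if not truth[d]]
    correct = [d for d in marked_irrelevant if not truth[d]]

    # Fraction of documents marked non-relevant that really are non-relevant.
    precision = len(correct) / len(marked_irrelevant) if marked_irrelevant else None
    # Fraction of the judged non-relevant documents that were marked as such.
    recall = len(correct) / len(truly_irrelevant) if truly_irrelevant else None
    return pct_judged, precision, recall


# Hypothetical example: 4 of 16 retrieved documents were judged.
judgments = {1: "irrelevant", 2: "irrelevant", 3: "unknown", 4: "relevant"}
truth = {1: False, 2: False, 3: False, 4: True}
print(nonrelevant_metrics(judgments, truth, total_retrieved=16))
```

In this made-up example, a "do not know" answer on a non-relevant document lowers the Recall but leaves the Precision untouched.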

 

Pilot Study Results

Three subjects participated in the pilot study. The three participants were students in the computer science department at the University of Maryland. Each subject spent about fifteen minutes answering the questionnaire, reading the experiment instructions, and completing the training. Each subject was guided through the training session by one of us, who answered any questions about the system and the tasks. The subjects were allowed to spend a maximum of five minutes judging the relevance of the documents for each query. Since we had three treatments and two queries for each treatment, the actual experiment took 30 minutes. The users answered the subjective satisfaction survey in about five minutes. Only two of the three subjects answered the subjective satisfaction survey; the third subject left before answering it. The pilot study pointed out some problems in the experiment and the tasks. Following is a list of the changes we made based on the pilot study results:

1.  The instructions form was modified to be more compact and to contain explanations for the meaning of the bold and the red words in the system layout, since all the subjects complained that the instructions form was very long, and asked questions about the meaning of the bold and the red words in the system layout.

2.  The tasks were modified to be shorter. The new tasks were to judge the relevance of documents for one query per treatment rather than two queries per treatment, since all subjects complained that the tasks were very long.

The subjective satisfaction survey for the pilot study showed that the subjects preferred displaying one or two translations per word to displaying three or four translations per word. This result was not surprising, as it matched our hypothesis about the users’ preferences.

 

Subjects

Eighteen subjects participated in the experiment: 14 males and 4 females, aged 20 to 30 years. All the subjects were students at the University of Maryland. Fifteen were graduate students, one was a senior, one a junior, and one a sophomore. The majority were Computer Science students; the others were Electrical Engineering, Mechanical Engineering, or Computer Engineering students. Seventeen subjects used the web daily to locate information and only one was an occasional user (three to five times a week). Seven of the subjects used only English when browsing the web; the others used English alongside other languages such as Arabic, Korean, Turkish, Spanish, French, and Farsi. Nine subjects had intermediate experience with search engines (searching for a combination of 2 or more words), 8 had advanced experience (using a search query), and one was a novice user (searching for a specific keyword). Only 5 subjects had previously located information in languages they did not know, either by first translating the search query into the target language or by using systems that allow writing the query in one language while searching documents in another. Only 2 of the eighteen subjects had previously used web sites that perform automatic translation, such as AltaVista. The subjects were all proficient or native English speakers, with no knowledge of the Chinese language.

Each subject tested all three treatments for the number of displayed translations per word. To minimize the effect of growing familiarity with the system on performance in each treatment, we varied the order in which subjects tested the treatments: the 3 treatments can be ordered in 6 different ways, and with 18 subjects we assigned 3 subjects to each of the 6 orderings. For example, the first three subjects tested 2, then 3, then 4 translations per word, the next three subjects tested 3, then 2, then 4 translations per word, and so on.
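The counterbalancing scheme described here (6 orderings of 3 treatments, 3 subjects per ordering) can be generated mechanically. The sketch below is illustrative only: the function name is hypothetical, and the orderings come out in the sequence produced by `itertools.permutations`, which is not necessarily the sequence the authors used.

```python
from collections import Counter
from itertools import permutations


def assign_orders(subjects, treatments=(2, 3, 4)):
    """Assign each subject one ordering of the treatments, evenly balanced."""
    orders = list(permutations(treatments))  # 3! = 6 orderings
    if len(subjects) % len(orders) != 0:
        raise ValueError("number of subjects must be a multiple of the number of orderings")
    per_order = len(subjects) // len(orders)
    # Consecutive blocks of subjects share the same ordering.
    return {s: orders[i // per_order] for i, s in enumerate(subjects)}


assignment = assign_orders(list(range(1, 19)))  # subjects numbered 1..18
for subject, order in assignment.items():
    print(subject, order)

# Every ordering is used by exactly 3 subjects.
print(Counter(assignment.values()))
```

Balancing the orderings this way means that each treatment appears equally often in the first, second, and third position across subjects.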

 

Material

Software: The experiment software is based on the “University of Maryland Translingual Information Retrieval System”. In the current system, users can search Chinese documents by issuing a query in English. The words of the Chinese document are translated, one by one, and the three most common translations for each word are displayed between parentheses. The users then use the search results to assess the relevance of the retrieved documents to their queries. Our experiment software was implemented by modifying the code of the current system (Perl scripts that perform segmentation, translation, indexing, etc.) to support the display of two and four translations per word. About 12,500 documents were used. All the documents were articles taken from the “Hong Kong Daily News” newspaper. A single script took on average half an hour to run, and each run produced about 4GB of data.

Paper-Based Forms: Three paper-based forms were prepared for the experiment: the Instructions form, the Questionnaire, and the Subjective Satisfaction Survey. The Instructions form had three sections: an introduction section, a practice section, and a tasks section. The Questionnaire consisted of some questions about the subjects (e.g. name, age, gender, etc.) and some questions about their previous experience with using the web for locating information. The Subjective Satisfaction Survey included questions about the system and its layout, the training, the subjects’ preferences, comments, and suggestions. (The three forms can be found in the Appendices.)

Procedures and Problems

The experiment was performed in groups of one, two, or three subjects at a time. Each one of us was responsible for conducting the experiment for six subjects. Each session of the experiment had the following steps:

  1. The subject fills out the Background Questionnaire form.
  2. The subject reads the Instructions form.
  3. The subject performs a training session in which a brief introduction and explanation of the system and its layout is given by one of us, then the subject is allowed five minutes to practice.
  4. The subject performs three tasks. Each task corresponds to one of the three treatments. The treatments were permuted as mentioned before. Each task is to perform a search using a query that is automatically generated from a topic description and then judge the relevance (relevant, irrelevant, or do not know) of as many documents as time allows. The subject is allowed a maximum of five minutes to complete each task.
  5. The subject answers the Subjective Satisfaction Survey form.

When the subject starts the experiment, a “Welcome Page” is displayed as shown in Figure 1.

Figure 1: Welcome Page

When the subject clicks “Practice Query”, the practice query web page is displayed as shown in Figure 2.

Figure 2: The “Practice Query” web page

When the subject chooses a query and clicks “Search”, the automatically generated query is displayed as shown in Figure 3.

Figure 3: The automatically generated query web page

When the subject clicks “View documents retrieved”, the search results are displayed in a new window like the one shown in Figure 4.

Figure 4: The search results

The subject uses the above window to judge the relevance. The subjects can judge the relevance either based upon the document title only or by viewing the whole document by clicking the document title. Figure 5 shows a typical layout for a whole document.

Figure 5: Document layout

The actual tasks had the same layout as the one used in the training session, except for the number of displayed translations per word, which varied according to the treatment. The subjects’ judgments were automatically stored in files.

The main problem we encountered was the loading time of the “whole document” web page. This web page always took a long time to load. This affected the subjects’ behaviors as some of them avoided loading the whole documents and used only the title of the document to judge the relevance or judged the relevance as “do not know”.


Results

We recorded the number of documents whose relevance each subject was able to judge within the allowed five minutes and derived the Precision and the Recall from the raw data. Since most of the displayed documents were not relevant to the search query, we measured the Precision and the Recall for the non-relevant documents rather than the relevant documents. A single factor analysis of variance (ANOVA) across the three treatments showed no statistically significant differences at Alpha = 0.05 in the percentage of documents judged, the fallout, or the Recall measured on non-relevant documents, so the null hypothesis cannot be rejected. We also collected the results for the subjective satisfaction from the Subjective Satisfaction Survey forms. (The complete set of raw data can be found in the Appendices section.)

 

1. The percentage of documents on which the user could make a decision

Figure 1 shows the average percentage of the documents on which the user could make a decision, along with the standard deviations, for the three treatments.

Figure 1: The average percentage of the documents on which the user could make a decision

 

The following tables summarize the results for the single factor ANOVA:

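| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| 2 words | 18 | 462.00 | 25.67 | 165.76 |
| 3 words | 18 | 406.00 | 22.56 | 112.85 |
| 4 words | 18 | 374.00 | 20.78 | 101.24 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 220.44 | 2 | 110.22 | 0.87 | 0.42 | 3.18 |
| Within Groups | 6457.56 | 51 | 126.62 | | | |
| Total | 6678.00 | 53 | | | | |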

We also analyzed the treatments pairwise (each pair of treatments) using ANOVA at Alpha = 0.05, and the differences were again not statistically significant. The pairwise ANOVA results are shown in the Appendices section.
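For readers who want to check the arithmetic, the single factor ANOVA quantities can be recomputed from the group summary statistics alone (count, mean, variance). The sketch below is an illustration under that assumption, with a hypothetical function name; it does not compute the P-value or the critical F value, which require the F distribution.

```python
def anova_from_summary(groups):
    """One-way ANOVA from per-group (count, mean, sample variance) triples."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n

    # Between-groups: spread of the group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups: pooled spread inside each group.
    ss_within = sum((n - 1) * var for n, _, var in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": ms_between / ms_within,
    }


# Summary values reported for the percentage of documents judged
# (means recomputed as Sum / Count to limit rounding error).
groups = [
    (18, 462.0 / 18, 165.76),  # 2 translations per word
    (18, 406.0 / 18, 112.85),  # 3 translations per word
    (18, 374.0 / 18, 101.24),  # 4 translations per word
]
result = anova_from_summary(groups)
print(result)  # F comes out at roughly 0.87
```

Because the reported variances are rounded to two decimals, the recomputed within-groups sum of squares differs slightly from the tabulated 6457.56, but the F ratio agrees to two decimals.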

2. Precision measured on non-relevant documents (Fallout)

Figure 2 shows the average fallout along with the standard deviation for the three treatments.

Figure 2: The average Precision

The following tables summarize the results for the single factor ANOVA: 

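| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| 2 words | 18 | 18 | 1 | 0 |
| 3 words | 18 | 18 | 1 | 0 |
| 4 words | 18 | 18 | 1 | 0 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 0 | 2 | 0 | 65535 | 0 | 3.187 |
| Within Groups | 0 | 51 | 0 | | | |
| Total | 0 | 53 | | | | |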

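The raw data showed no differences between the three treatments: the Precision had the same value for all the subjects in all three treatments (mean = 1, variance = 0). With zero variance both between and within groups, the F ratio is 0/0 and therefore undefined, so the F and P-value entries in the table above carry no information. We did not need to run additional paired t-tests.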

 

3. Recall measured on non-relevant documents

Figure 3 shows the average Recall along with the standard deviation for the three treatments.

 

Figure 3: The average Recall

 The following tables summarize the results for the single factor ANOVA: 

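| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| 2 words | 18 | 12.93 | 0.72 | 0.08 |
| 3 words | 18 | 11.46 | 0.64 | 0.06 |
| 4 words | 18 | 13.24 | 0.74 | 0.04 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 0.10 | 2 | 0.05 | 0.82 | 0.44 | 3.18 |
| Within Groups | 3.11 | 51 | 0.06 | | | |
| Total | 3.21 | 53 | | | | |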

4. Subjective Satisfaction

Figures 4-10 summarize the results for the subjective satisfaction survey. For the optimal number of translations per word, the majority of the subjects (66% of the subjects) preferred displaying only 2 translations per word, 17% of them preferred displaying only 1 translation, 11% of the subjects preferred displaying 4 translations, and only 6% preferred displaying 3 translations per word.


 Figure 4: Optimal number of translations per word  

We were interested in the subjects’ responses for the following five questions: 

Q1-a: Overall Reaction to the System (1:Difficult -- 9:Easy)

Q1-b: Overall Reaction to the System (1:Frustrating -- 9:Satisfying)

Q2  : Overall System Efficiency (1:Efficient -- 9:Inefficient)

Q3  : Screen Layout was Helpful (1:Never -- 9:Always)

Q4  : Learning to Operate the System (1:Difficult -- 9:Easy)

 

Figure 5 summarizes the subjects’ responses to the previous questions. Figures 6-10 show the subjects’ responses to each of these questions.

 

Figure 5: Average Subjective Satisfaction

   

Figure 6: Q1-a: Overall Reaction to the system

Figure 7:  Q1-b: Overall Reaction to the system

 

Figure 8: Q2:  Overall system efficiency

 

Figure 9: Q3: Screen layout was helpful

 

Figure 10: Q4: Learning to operate the system


Discussion

Number of documents in which the user took a decision

Though the number of documents in which the user took a decision does decrease as the number of translations per word increases as we hypothesized, the difference is not statistically significant. The large variance in the number of documents in which th