The results show that users report higher subjective
satisfaction with fewer translations per word. However, no statistically
significant difference was found between the treatments in terms of speed,
fallout, or recall measured on non-relevant documents.
Introduction
As Internet resources such as
the World Wide Web (WWW) become accessible to more and more
countries, and technological advances overcome the network,
interface and computer system differences which have impeded
information access, it is now more common for searchers to wish to
explore collections of documents that are not written in their native
languages. Beyond merely accepting extended character sets and
performing language identification, information retrieval
systems should help users search for information
across language boundaries.
A Cross Language Information
Retrieval (CLIR) system retrieves documents in a language that is
different from the query language [6]. A user of a CLIR system can
enter a query in one language, but the returned documents will be in
the language of the document collection. One example showing the
usefulness of such a system is when a user might have some knowledge
of the document language but has difficulty formulating effective
queries. These users might very well be able to distinguish good
documents from bad documents based on their limited knowledge. Such
users could then send the documents that they have judged relevant
on to a translation service bureau or machine translation system [4].
A more useful CLIR system would
accept queries in one language and return documents in the same
language, despite the fact that the underlying text retrieval system
has indexed documents in a different language. The acceptance of a
system by users depends heavily on the way in which the
translated documents are presented and on its usefulness. One method
for presenting the translated document is to provide a gloss
translation of the document. In a gloss translation of a document:
- If the unknown language word has a single gloss in the dictionary, show
that gloss.
- If the unknown language word has multiple glosses in the dictionary,
show up to n of them for some customizable n.
- If the unknown language word is not found in the dictionary, then show
the unknown language word itself.
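The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary is assumed to map each word to a list of glosses ordered from most to least common, the document is assumed to be already segmented into words, and the names `gloss_word`, `gloss_translate`, and `toy_dictionary` are hypothetical and do not come from any of the systems discussed here.

```python
from typing import Dict, List


def gloss_word(word: str, dictionary: Dict[str, List[str]], n: int = 3) -> str:
    """Return the gloss for one source-language word.

    `dictionary` maps a word to its glosses, most common first
    (an assumed layout, not the format of any particular system).
    """
    glosses = dictionary.get(word, [])
    if not glosses:
        # Rule 3: the word is not in the dictionary, so show the word itself.
        return word
    if len(glosses) == 1:
        # Rule 1: a single gloss, so show that gloss.
        return glosses[0]
    # Rule 2: several glosses, so show at most n of them.
    return "(" + ", ".join(glosses[:n]) + ")"


def gloss_translate(words: List[str], dictionary: Dict[str, List[str]], n: int = 3) -> str:
    """Gloss an already segmented document word by word."""
    return " ".join(gloss_word(w, dictionary, n) for w in words)


# Toy dictionary with illustrative entries only.
toy_dictionary = {
    "banco": ["bank", "bench", "shoal", "pew"],
    "rojo": ["red"],
}

print(gloss_translate(["banco", "rojo", "xyz"], toy_dictionary, n=2))
# prints: (bank, bench) red xyz
```

Changing `n` is the only thing that differs between presentations that show one, two, three, or four translations per word.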
This
presentation of the translated document provides enough information
to support two critical decisions facing the user who has received
such a document [8], namely:
- Deciding whether a link is worth following.
- Deciding whether some text is worth having translated.
Another factor determining the
acceptance of a CLIR system by users is the user interface of the
system and the facilities it provides for the users. A good user
interface for a CLIR system should efficiently support the following
functions: starting the search process, specifying the query,
viewing retrieval results in context, and interactive
relevance feedback [5]. Researchers in the area of cross-language
information retrieval give special attention to these factors, and
many studies have been performed to evaluate different systems.
The Keizai system [1,2] searches
Japanese and Korean web data starting from an English query. Its
user interface uses a document thumbnail visualization: when the mouse
pointer is positioned over a document line, both the original and the
English summaries are displayed dynamically in the bottom portion of
the display, giving users a very quick way of evaluating document
relevance. Two metrics are used for
relevance judgment: False Hit Ratio and False Drop Ratio.
Determining the best number of
translations displayed per word in a gloss translation of a
retrieved document is an important issue that differentiates among
implementations. While the University of Maryland Translingual
Information Retrieval System [3] displays the three most common
English word translations for each Chinese word, the system
described in [4] shows only one translation per word. The first
system takes English queries and produces English gloss translations
of Chinese documents, and the latter takes English queries and
produces English gloss translations of Spanish documents. It is
demonstrated that relatively simple term substitution and
disambiguation approaches could be viable for cross language text
retrieval.
Another important question is
whether a CLIR system has the same performance for collections in
different languages. To answer this question, a system that
translates Spanish and French natural language queries into English
was implemented in [7], and its performance for the two languages
was compared. The authors stated that
linguistic differences might have caused the performance
differences.
The extent to which a gloss
translation presentation of documents helps the user to make
decisions is evaluated in [8]. In the experiment, all subjects are
faced with the same categorization problem, but some of those
subjects are given materials in English to categorize while other
subjects are given the same content to categorize but in the form of
gloss translation. The effectiveness of such a gloss translation is
measured by the distance between the decisions made by
subjects who read the gloss translation and the ideal case (defined
by the behavior of subjects who received materials in English).
To compare the effectiveness of
different CLIR systems, quantitative measures are needed.
One commonly used single-valued measure for a retrieval system is
van Rijsbergen’s F measure, which is a weighted harmonic mean of
recall and precision [6]:

$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

where P is the precision (the
fraction of the selected documents that are relevant), R is the
recall (the fraction of the relevant documents that are selected),
and β is the ratio of the relative importance that the searcher
ascribes to recall and precision. It is often assumed that β = 1
(which results in the unweighted harmonic mean).
Another measure used is the
fallout, which in this paper denotes the precision measured on
non-relevant documents.
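As a small worked illustration of the F measure, the sketch below evaluates the standard weighted harmonic mean for given precision and recall values. The function name `f_measure` and the numbers passed to it are made up for illustration, and returning 0 when the denominator is zero is a convention chosen here, not something prescribed by the text.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (van Rijsbergen's F)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Convention chosen here: avoid dividing by zero when nothing is retrieved.
        return 0.0
    return (b2 + 1.0) * precision * recall / denominator


# Illustrative values only: precision 0.5, recall 1.0.
print(round(f_measure(0.5, 1.0), 4))            # beta = 1: unweighted harmonic mean
print(round(f_measure(0.5, 1.0, beta=2.0), 4))  # beta = 2: recall weighted more heavily
```

With β = 1 the result is 2PR / (P + R); larger values of β pull the score toward the recall value.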
In this experiment, we study
some issues in the user interface design of CLIR systems. In
particular, we study some enhancements to the current user interface
design for the University of Maryland Translingual Information
Retrieval System web page [9]. The current search result web page
includes a gloss translation approach in which the words of the
document are translated, one by one, and the three most common
translations for each word are displayed horizontally between
parentheses so that the most common translation is placed first.
This word is displayed in bold for emphasis. Moreover, the words of
the original query are displayed in red.
Users can use this gloss
translation to assess whether or not the document is relevant to the
search topic. We study the effect of changing the number of
translations, i.e. displaying the two, three, or
four most common
translations rather than always displaying three. Displaying
a single translation per word was not considered because
the user may mistakenly assume that this gloss translation is
correct. A small number of translations per word (e.g. 2
translations) may be too few for users to reach a correct judgment.
However, a large number of translations per word (e.g. 4
translations) may be unnecessary and may slow the users down. The
experiment seeks to find the optimal number of translations per
word.
We hypothesize that increasing
the number of translations per word will decrease user error. We
also expect it to decrease the number of pages on which users can make a decision.
The reasoning is that increasing the number of
translations per word gives the user more information with which to make
a more accurate decision. However, a larger number of translations per
word leads to more scrolling. Scrolling can be slow and may disrupt
the perception of spatial layout [10].
The proposed subjects' task is
to perform a search using a query that is automatically generated
from a topic description and then judge the relevance (relevant,
irrelevant, or do not know) of as many documents as time allows.
They are given a fixed time (5 minutes per query) to perform
the required task.
Experiment
Introduction and Hypothesis
The goal of this experiment
was to study some layout strategies for gloss translation. In the
current web page for the “University of Maryland Translingual
Information Retrieval System”, the words of the document are
translated, one by one, and the three most common translations for
each word are displayed between parentheses. The first translation
is displayed in bold and the query words are displayed in red. Using
this gloss translation, the users can assess whether or not the
document is relevant to the search topic. In our experiment we
studied how the users’ assessment is affected by changing the number
of translations, i.e. displaying the two, three, or four most common
translations for each word.
Hypothesis 1: Increasing
the number of displayed translations per word will decrease the
percentage of documents on which the user can make a decision within a
fixed period of time.
Users
will spend more time reading and making a decision when more
translations per word are displayed.
Hypothesis 2: Increasing
the number of displayed translations per word will improve the
users’ assessment.
Users can make better
assessments by using the extra information provided by the
additional translations.
We expected that users would prefer fewer translations per word,
because they would prefer to
make a quick and accurate decision. We explored this by
asking an informal question in our subjective satisfaction questionnaire.
Variables
Independent
variable: The number of displayed translations per word. It has
three treatments: displaying the two, three, or four most common
translations per word (1 x 3 experiment).
The effect of changing the number of displayed translations on the
users’ assessment can be measured by the fallout (the fraction
of the selected documents that are non-relevant) and the Recall
measured on non-relevant documents (the
fraction of the non-relevant documents that are selected). This
experiment therefore had four dependent variables, as follows:
Dependent
variables:
1. The percentage of documents on which the user could make a decision
within a fixed period of time.
2. Precision measured using non-relevant documents (Fallout).
3. Recall measured using non-relevant documents.
4. The user satisfaction.
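To make the second and third dependent variables concrete, the following sketch computes precision and recall with the non-relevant documents treated as the target class. It rests on assumptions that the text does not spell out: "selected" is taken to mean documents a subject marked as irrelevant, the recall denominator is taken to be all truly non-relevant documents in the result list, and the function name, labels, and document identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple


def nonrelevant_scores(
    judgments: Dict[str, str], truly_nonrelevant: Set[str]
) -> Tuple[float, float]:
    """Return (precision, recall) measured on non-relevant documents.

    judgments maps a document id to "relevant", "irrelevant", or "do not know".
    Assumption: a document counts as "selected" when the subject marked it
    "irrelevant"; "do not know" answers are not counted as selections.
    """
    selected = {doc for doc, label in judgments.items() if label == "irrelevant"}
    hits = selected & truly_nonrelevant
    # Precision on non-relevant documents (called fallout in this paper).
    precision = len(hits) / len(selected) if selected else 0.0
    # Recall on non-relevant documents.
    recall = len(hits) / len(truly_nonrelevant) if truly_nonrelevant else 0.0
    return precision, recall


# Made-up judgments for one subject and one query.
example_judgments = {
    "d1": "irrelevant",
    "d2": "irrelevant",
    "d3": "relevant",
    "d4": "do not know",
}
example_truth = {"d1", "d2", "d4", "d5"}

print(nonrelevant_scores(example_judgments, example_truth))
# prints: (1.0, 0.5)
```

In this toy case every document marked irrelevant really is non-relevant, so the precision is 1, while only two of the four non-relevant documents were marked, so the recall is 0.5.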
Pilot Study Results
Three
subjects participated in the pilot study. The three participants
were students in the computer science department at the University
of Maryland. Each subject spent about fifteen minutes answering the
questionnaire, reading the experiment instructions, and completing
the training. One of us guided each subject through the training session
to answer any questions about the system and the tasks.
The subjects were allowed a maximum of five minutes to
judge the relevance of the documents for each query. Since we had
three treatments and two queries for each treatment, the actual
experiment took 30 minutes. The subjects answered the subjective
satisfaction survey in about five minutes. Only two of the three
subjects answered the subjective satisfaction survey; the third
left before answering it. The pilot study pointed
out some problems in the experiment and the tasks. The following is a
list of the changes we made based on the pilot study results:
1. The
instructions form was modified to be more compact and to contain
explanations for the meaning of the bold and the red words in the
system layout, since all the subjects complained that the
instructions form was very long, and asked questions about the
meaning of the bold and the red words in the system layout.
2. The tasks were
modified to be shorter. The new tasks were to judge the relevance
of documents for one query per treatment rather than two queries
per treatment, since all subjects complained that the tasks were
very long.
The
subjective satisfaction survey for the pilot study showed that the
subjects preferred displaying one or two translations per word to
displaying three or four translations per word. This result was not
surprising, as it matched our expectation about the users’
preferences.
Subjects
Eighteen
subjects participated in the experiment: 14 males and 4 females.
The subjects' ages ranged from 20 to 30 years. All the subjects were
students at the University of Maryland: fifteen graduate students, one
senior, one junior, and one sophomore. The majority were Computer
Science students; the others were Electrical Engineering, Mechanical
Engineering, or Computer Engineering students. Seventeen subjects
used the web daily to locate information and only one was
an occasional user (three to five times a week). Seven of the
subjects used only English when browsing the web; the others
used English alongside other languages such as Arabic, Korean,
Turkish, Spanish, French, and Farsi. Nine subjects had intermediate
experience with search engines (searching for a combination of two or
more words), eight had advanced experience (using a search query), and
one was a novice user (searching for a specific keyword). Only five
subjects had previously located information in languages they did not
know, either by first translating the search query into the target
language or by using systems that allow writing the query in one
language while searching documents in another. Only two of the eighteen
subjects had previously used web sites that perform automatic
translation, such as AltaVista. The subjects were all proficient
or native English speakers, with no knowledge of the Chinese
language. Each subject tested all three treatments for the number of
displayed translations per word. To minimize the effect of growing
familiarity with the system on user performance in each treatment,
we varied the order in which the subjects tested the treatments:
the 3 treatments can be permuted in 6 different
ways, and with 18 subjects we
assigned 3 subjects to each of the 6 permutations. For example,
the first three subjects tested 2, then 3, then 4 translations per
word; the next three subjects tested 3, then 2, then 4
translations per word; and so on.
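The counterbalancing scheme described above can be sketched as follows. This is an illustration only: the function name `assign_orders` is hypothetical, and the permutations are enumerated in the order produced by Python's `itertools.permutations`, which need not match the order in which the authors actually assigned them.

```python
from collections import Counter
from itertools import permutations

TREATMENTS = (2, 3, 4)  # number of translations displayed per word
SUBJECTS_PER_ORDER = 3  # 18 subjects / 6 orderings


def assign_orders(num_subjects: int = 18) -> dict:
    """Map subject numbers (1-based) to the order in which they see the treatments."""
    orders = list(permutations(TREATMENTS))  # the 6 possible orderings
    assignment = {}
    for index in range(num_subjects):
        # Consecutive blocks of three subjects share one ordering.
        order = orders[(index // SUBJECTS_PER_ORDER) % len(orders)]
        assignment[index + 1] = order
    return assignment


assignment = assign_orders()
for subject, order in assignment.items():
    print(subject, order)

# Each of the 6 orderings is used by exactly 3 subjects.
print(Counter(assignment.values()))
```

Because every ordering is used equally often, each treatment appears in the first, second, and third position the same number of times, which is what limits the influence of practice effects on any single treatment.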
Material
Software: The
experiment software is based on the “University of Maryland
Translingual Information Retrieval System”. In the current system,
users can search Chinese documents by issuing a query in English.
The words of the Chinese document are translated, one by one, and
the three most common translations for each word are displayed
between parentheses. The users then use the search results to assess
the relevance of the retrieved document to their queries. Our
experiment software was implemented by modifying the code of the
current system (Perl scripts that perform segmentation, translation,
indexing, etc.) to support the display of two and four
translations per word. About 12,500 documents were used. All the
documents were articles taken from the “Hong Kong Daily News”
newspaper. A single script took on average half an hour to run,
and each run produced about 4GB of data.
Paper-Based Forms:
Three paper-based forms were prepared for the experiment: the
Instructions form, the Questionnaire, and the Subjective
Satisfaction Survey. The Instructions form had three sections: an
introduction section, a practice section, and the tasks section.
The Questionnaire consisted of some questions about the subjects
(e.g. name, age, gender, etc.) and some questions about their
previous experience with using the web for locating information. The
Subjective Satisfaction Survey included questions about the system
and its layout, the training, the subjects’ preferences, comments,
and suggestions. (The three forms can be found in the Appendices.)
Procedures and Problems
The
experiment was performed in groups of one, two, or three subjects at
a time. Each one of us was responsible for conducting the experiment
for six subjects. Each session of the experiment had the following
steps:
- The subject fills out the Background
Questionnaire form.
- The subject reads the Instructions form.
- The subject performs a training session
in which a brief introduction and explanation of the system and
its layout is given by one of us, then the subject is allowed
five minutes to practice.
- The subject performs three tasks. Each task
corresponds to one of the three treatments. The treatments were
permuted as mentioned before. Each task is to perform a search
using a query that is automatically generated from a topic
description and then judge the relevance (relevant, irrelevant,
or do not know) of as many documents as time allows. The subject
is allowed a maximum of five minutes to complete each task.
- The subject answers the Subjective Satisfaction
Survey form.
When the subject starts the experiment, a
“Welcome Page” is displayed as shown in Figure 1.

Figure 1: Welcome Page
When the subject clicks
“Practice Query”, the practice query web page is displayed as
shown in Figure 2.

Figure 2: The
“Practice Query” web page
When the subject chooses a
query and clicks “Search”, the automatically generated query is
displayed as shown in Figure 3.

Figure 3: The
automatically generated query web page
When the subject clicks
“View documents retrieved”, the search results are displayed
in a new window like the one shown in Figure 4.

Figure 4: The Search
Results
The subject uses the
above window to judge the relevance. The subjects can judge the
relevance either based upon the document title only or by viewing
the whole document by clicking the document title. Figure 5 shows a
typical layout for a whole document.

Figure 5: Document
layout
The actual tasks had
the same layout as the one used in the training session, except for the
number of displayed translations per word, which varied according to
the treatment. The subjects’ judgments were automatically stored
in files.
The main problem we
encountered was the loading time of the “whole document” web
page, which always took a long time to load. This affected
the subjects’ behavior: some of them avoided loading the whole
documents and either used only the title of the document to judge the
relevance or judged the relevance as “do not know”.
Results
We recorded the number
of documents whose relevance each subject was able to judge
within the allowed five minutes and derived the Precision and the
Recall from the raw data. Since most of the displayed documents were
not relevant to the search query, we measured the Precision and the
Recall for the non-relevant documents rather than the relevant
documents. A single factor analysis of variance (ANOVA) on the
three treatments showed no statistically significant differences at
Alpha = 0.05 in the percentage of documents whose relevance the subject
was able to judge, the fallout, or the Recall measured
on non-relevant documents,
so the null hypothesis cannot be rejected. We also collected the
results for the subjective satisfaction from the Subjective
Satisfaction Survey forms. (The complete set of raw data can be
found in the Appendices section.)
1. The percentage of documents on which the user could
make a decision
Figure
1 shows the average percentage of documents on which the user could make
a decision, along with the standard deviations, for the three
treatments.
Figure 1:
The average percentage of documents on which the user could make a
decision
The following tables
summarize the results for the single factor ANOVA:
| Groups | Count | Sum | Average | Variance |
| 2 words | 18.00 | 462.00 | 25.67 | 165.76 |
| 3 words | 18.00 | 406.00 | 22.56 | 112.85 |
| 4 words | 18.00 | 374.00 | 20.78 | 101.24 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
| Between Groups | 220.44 | 2.00 | 110.22 | 0.87 | 0.42 | 3.18 |
| Within Groups | 6457.56 | 51.00 | 126.62 | | | |
| Total | 6678.00 | 53.00 | | | | |
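The ANOVA table for the percentage of documents can be cross-checked from the group summary alone. The sketch below recomputes the sums of squares and the F ratio from the reported counts, sums, and variances; the function name is hypothetical, the variances are assumed to be sample variances (divisor n - 1), and small differences from the published figures are expected because the inputs are rounded.

```python
def one_way_anova_from_summary(counts, sums, variances):
    """Recompute a single factor ANOVA from per-group counts, sums, and sample variances."""
    k = len(counts)
    total_n = sum(counts)
    grand_mean = sum(sums) / total_n
    means = [s / n for s, n in zip(sums, counts)]

    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(counts, means))
    # Within-groups sum of squares: pooled spread inside each group.
    ss_within = sum((n - 1) * v for n, v in zip(counts, variances))

    df_between = k - 1
    df_within = total_n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": ms_between / ms_within,
    }


# Values taken from the summary table for the 2, 3, and 4 translation treatments.
result = one_way_anova_from_summary(
    counts=[18, 18, 18],
    sums=[462.0, 406.0, 374.0],
    variances=[165.76, 112.85, 101.24],
)
F_CRIT = 3.18  # critical value reported in the table for Alpha = 0.05

print(round(result["ss_between"], 2), round(result["ss_within"], 2))
print(round(result["F"], 2), result["F"] < F_CRIT)
```

The recomputed F ratio is about 0.87, well below the reported critical value of 3.18, which agrees with the conclusion that the difference between treatments is not statistically significant.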
We also analyzed the
treatments pairwise (each pair of treatments) using ANOVA at Alpha
= 0.05, and the results were again not statistically
significant. The pairwise ANOVA results are shown in the Appendices
section.
2. Precision measured on non-relevant documents
(Fallout)
Figure 2 shows the average fallout along with the standard
deviation for the three treatments.
Figure 2:
The average Precision
The
following tables summarize the results for the single factor ANOVA:
| Groups | Count | Sum | Average | Variance |
| 2 words | 18 | 18 | 1 | 0 |
| 3 words | 18 | 18 | 1 | 0 |
| 4 words | 18 | 18 | 1 | 0 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
| Between Groups | 0 | 2.00 | 0 | 65535 | 0 | 3.187 |
| Within Groups | 0 | 51.00 | 0 | | | |
| Total | 0 | 53.00 | | | | |
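The
raw data showed no differences between the three
treatments: the Precision had the same value
for all the subjects in all three treatments (mean = 1, variance =
0). With zero variance both between and within groups, the F ratio is
undefined, so the F and P-value entries in the table above are not
meaningful. We did not need to run additional paired t-tests.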
3. Recall measured on non-relevant documents
Figure 3 shows the average Recall along with the standard
deviation for the three treatments.
Figure 3:
The average Recall
The
following tables summarize the results for the single factor ANOVA:
| Groups | Count | Sum | Average | Variance |
| 2 words | 18.00 | 12.93 | 0.72 | 0.08 |
| 3 words | 18.00 | 11.46 | 0.64 | 0.06 |
| 4 words | 18.00 | 13.24 | 0.74 | 0.04 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
| Between Groups | 0.10 | 2.00 | 0.05 | 0.82 | 0.44 | 3.18 |
| Within Groups | 3.11 | 51.00 | 0.06 | | | |
| Total | 3.21 | 53.00 | | | | |
4. Subjective Satisfaction
Figures
4-10 summarize the results for the subjective satisfaction survey.
For the optimal number of translations per word, the majority of the
subjects (66% of the subjects) preferred displaying only 2
translations per word, 17% of them preferred displaying only 1
translation, 11% of the subjects preferred displaying 4
translations, and only 6% preferred displaying 3 translations per
word.
Figure 4:
Optimal number of translations per word
We
were interested in the subjects’ responses to the following five
questions:
Q1-a: Overall Reaction to the System (1:Difficult -- 9:Easy)
Q1-b: Overall Reaction to the System (1:Frustrating -- 9:Satisfying)
Q2: Overall System Efficiency (1:Efficient -- 9:Inefficient)
Q3: Screen Layout was Helpful (1:Never -- 9:Always)
Q4: Learning to Operate the System (1:Difficult -- 9:Easy)
Figure
5 summarizes the subjects’ responses to the previous questions.
Figures 6-10 show the subjects’ responses to each of these
questions.
Figure 5: Average Subjective Satisfaction
Figure 6: Q1-a: Overall Reaction to the system
Figure 7: Q1-b: Overall Reaction to the system
Figure 8: Q2: Overall system efficiency
Figure 9: Q3: Screen layout was helpful
Figure 10: Q4: Learning to operate the system
Discussion
Number of documents on which the user made a
decision
Though the number of
documents on which the user made a decision does decrease as the
number of translations per word increases, as we hypothesized, the
difference is not statistically significant. The large variance in
the number of documents in which th