Managing the `Evaluator Effect` in User Testing

breakdowns in the interface. This means that for
small groups of evaluators, the detection rate
becomes overly high. The second measure, Cohen’s
Kappa (Cohen, 1960), rests on a similar
assumption. It assumes that the total number of
breakdowns is known (or that it can reliably be
estimated). For small numbers of evaluators this
typically is not the case. Therefore, Hertzum and
Jacobsen (2001) suggest using the any-two
agreement measure in such cases.
The any-two agreement measure is defined as the
average of
$$\frac{|P_i \cap P_j|}{|P_i \cup P_j|}$$
(over all ½n(n-1) pairs of evaluators).
In this equation, Pi and Pj are the sets of problems
detected by evaluator i and evaluator j, and n is the
number of evaluators. For the studies reported here,
the any-two agreement measure is used.
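The definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the studies: the function name `any_two_agreement`, the problem identifiers, and the handling of two empty sets are assumptions made for the example.

```python
from itertools import combinations


def any_two_agreement(problem_sets):
    """Average of |Pi ∩ Pj| / |Pi ∪ Pj| over all pairs of evaluators."""
    pairs = list(combinations(problem_sets, 2))  # all ½n(n-1) pairs
    if not pairs:
        raise ValueError("need at least two evaluators")
    ratios = []
    for p_i, p_j in pairs:
        union = p_i | p_j
        # Assumption: two evaluators who both found nothing agree fully.
        ratios.append(len(p_i & p_j) / len(union) if union else 1.0)
    return sum(ratios) / len(ratios)


# Hypothetical problem identifiers for three evaluators.
evaluators = [
    {"b1", "b2", "b3"},
    {"b2", "b3", "b4"},
    {"b3", "b5"},
]
agreement = any_two_agreement(evaluators)
print(f"{agreement:.2f}")
```

With only two evaluators there is a single pair, so the measure reduces to the number of problems found by both divided by the number found by either.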
As one of the units of comparison, the number of
breakdowns is used. In section 3.1 it was explained
how evaluators can use DEVAN to create lists of
occurrences of breakdowns in interactions. In
studies two and three, DEVAN was used to its full
extent; in study one, evaluators only used DEVAN’s
checklist of breakdown indication types (which is
somewhat comparable to Jacobsen et al.’s (1998) list
of usability criteria), as well as a specified format for
reporting the breakdowns (comparable to that of
Jacobsen et al. (1998)).
In Jacobsen et al.’s (1998) study, Unique
Problem Tokens (UPTs) were chosen as units of
comparison. In their study, each evaluator's problem
report (problem reports are comparable to
breakdown descriptions in DEVAN) was examined
as to whether it was unique or duplicated. Thus, a
final list of UPTs was created. To be able to
compare the present studies’ evaluator effects to the
evaluator effect Jacobsen et al. (1998) found,
duplicate breakdowns were filtered out. Breakdowns
had to be similar in considerable detail (in content as
well as in level of abstraction) to be considered
duplicates. For example, the breakdown “user is
puzzled about how to get a still image” is considered
different from the more concrete, but similar,
breakdown “user tries <stop> to get a still image”.
Also, in its detailed content, the breakdown “user
uses <cursor down> instead of <cursor right>
while trying to set a menu-item” is different from
“user uses <cursor down> instead of <cursor left>
while trying to set a menu-item”. Thus these are
regarded as unique breakdowns.
4 Results
Table 2 shows the evaluator effects that were found
in the three studies, as well as the evaluator effect
found by Jacobsen et al. (1998). The evaluator
effects were measured in terms of occurrences of
breakdowns as well as in terms of UPTs. Of the
three studies reported in this paper, the data analysis
process of study one (the Jammin’ Draw study)
compares best to that of Jacobsen et al. The study
seems to yield somewhat higher agreements than
Occurrences of breakdowns

                                     Detected by       Detected    Any-two
                                     both evaluators   in total    agreement
Study 1: Jammin’ Draw user A (n=2)    8                 15         53%
         Jammin’ Draw user B (n=2)    9                 15         60%
Study 2: Thermostat (n=2)            30                 49         61%
Study 3: TV video user A (n=2)       43                 54         80%
         TV video user B (n=2)       66                110         60%
Jacobsen et al. (1998):
         Multimedia authoring
         software (n=4)               -                  -          -

Unique Problem Tokens (UPTs)

                                     Detected by       Detected    Any-two
                                     both evaluators   in total    agreement
Study 1: Jammin’ Draw user A (n=2)    8                 15         53%
         Jammin’ Draw user B (n=2)    9                 14         64%
Study 2: Thermostat (n=2)            21                 33         64%
Study 3: TV video user A (n=2)       26                 32         81%
         TV video user B (n=2)       42                 61         69%
Jacobsen et al. (1998):
         Multimedia authoring         -                 27*)       42%
         software (n=4)                                 39**)
                                                        93***)

Table 2. Overview of measured evaluator effects. As a measure for the evaluator effect the ‘Any-two agreement’ as
proposed by Hertzum and Jacobsen (2001) is used. Its value ranges from 0% in case of no agreement to 100% in case of
full agreement; n signifies the number of evaluators.
*) Average number of UPTs found by two evaluators based on data from one user. This figure compares best to the figures
of study 1, 2 and 3.
**) Average number of UPTs found by two evaluators based on data from four users.
***) Number of UPTs found by four evaluators based on data from four users.
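
As a check on how the agreement figures for the studies with n = 2 can be read: with two evaluators there is only one pair, so the average in the any-two agreement reduces to a single ratio. The worked values below assume that "detected in total" counts the union of both evaluators' findings, an interpretation that the table figures are consistent with. The symbol $A$ is introduced here only for illustration.

$$
A = \frac{|P_1 \cap P_2|}{|P_1 \cup P_2|}, \qquad
\frac{8}{15} \approx 53\%, \quad
\frac{43}{54} \approx 80\%, \quad
\frac{42}{61} \approx 69\%
$$

These correspond to Jammin’ Draw user A, TV video user A (occurrences of breakdowns), and TV video user B (UPTs), respectively.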