To evaluate the category model just described, we will compare the model's
categorial judgments to the Berlin and Kay data, i.e. the data to which it
was fitted. We will do this for each
color sample in the Berlin and Kay stimulus set, which is actually a
superset of the data the model was fitted to.
Note that our
model (normalized Gaussians) implies convex Basic Color category regions,
and while that is not always the case for the regions as depicted in
Figure
(e.g. the green region), it is generally true for
the regions as mapped onto the OCS surface in the various color spaces,
perhaps with some minor local exceptions.
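As a minimal sketch of why normalized Gaussians imply such category regions, the following code illustrates one plausible form of the model: each category is represented by a Gaussian in color-space coordinates, and memberships are obtained by normalizing the Gaussians so that they sum to one at every point. The means, covariances, and function names here are illustrative assumptions, not the fitted parameters from the previous section.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Multivariate Gaussian density at color-space point x (illustrative)."""
    d = x - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt(np.linalg.det(sigma) * (2 * np.pi) ** len(x))
    return np.exp(-0.5 * d @ inv @ d) / norm

def memberships(x, categories):
    """Normalize the per-category Gaussians so memberships sum to 1 at x.

    categories -- list of (mean, covariance) pairs, one per color category
    """
    g = np.array([gaussian(x, mu, sigma) for mu, sigma in categories])
    return g / g.sum()
```

Because each category's membership is highest near its own Gaussian mean and falls off monotonically (relative to the competing categories), the region where a category wins is convex for reasonable parameter settings, which is the property discussed above.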
Figures
to
show the results of this comparison.
From visual inspection of these figures it is apparent that the
categorization works better in some spaces than in others. In some spaces a
certain category may ``bleed'' outside of the Berlin and Kay category
boundaries. In the XYZ space, for example, green intrudes into the brown
region (and these two regions' boundaries do not touch in the Berlin and
Kay data), and black into green; blue spills over into the purple region,
purple somewhat into pink, white into pink, and green and blue slightly
into gray. Other categories exceed their boundaries without intruding into
another basic category's region, e.g. yellow in all three spaces (but
particularly in L*a*b*). Another type of mismatch is a category that does
not fill enough of its region (which may coincide with exceeding the
boundaries in other areas), e.g. orange, brown, green, blue, and purple in
XYZ. The same types of errors occur in all three spaces, but to varying
extents. The L*a*b* space seems to perform best in general, followed by the
NPP space and the XYZ space. Some errors do not seem as serious as others:
the row of white judgments for the very pale blue and pink stimuli in
the top row in the NPP space does not seem like a particularly serious
mistake, but judging gray as green or blue in the XYZ space, or as purple in
the NPP space, does. This judgment is qualitative in nature, of course, and
somewhat subjective. In addition, performance on color-related tasks may be
considered more important than this type of theoretical evaluation (see
Chapter ).
To try and get at a more objective measure of performance, we now turn to
quantifying the ``goodness of fit'' of the category models. There are two
concurrent criteria we can use: how well the extent (area or volume,
depending on which representation we use) of a model category fits the
extent of the corresponding category in the data, and how close the model
focus of each category is to the corresponding category focus (or focal
samples) in the data. The following error metric attempts to capture the
first of these (the extent of the categories), and is computed over the
complete set of Berlin and Kay stimuli:
\[
E = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(p_i - e_i\right)^2}
\]
where $E$ is the total error (again a Root Mean Square error metric),
$N$ is the number of stimuli in the set,
$p_i$ is the predicted (model) and $e_i$ is the expected (data) membership
value for stimulus $s_i$,
$m_{\max}(s_i)$ is the maximum membership value for stimulus $s_i$ over all
category model functions,
$t$ is the membership threshold as before,
$c_{\max}(s_i)$ is the category yielding the $m_{\max}(s_i)$,
$C(s_i)$ is the set of all
data categories that the stimulus belongs to (decided by being within the
category boundaries as indicated on Figures
ff.),
$m_c(s_i)$ is the membership value for stimulus $s_i$ in model
category $c$, and
$s_i$ is stimulus $i$.
The complicated way of determining $p_i$ and $e_i$
is necessary to deal with various cases such as the predicted
category matching or mismatching the expected category, and with discrete
data versus continuous model. In addition to the error over the complete
data set, the square error for each stimulus is added to the running total
for category $c$ iff $c = c_{\max}(s_i)$ or $c \in C(s_i)$,
i.e. either when the data or the model says it should belong to category
$c$, so we will get an error when the model category is either too small
or too large, as compared to the same category in the data.
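The extent-error computation can be sketched in code under one plausible reading of the metric: the predicted membership is the model's maximum membership where it clears the threshold (and zero otherwise), and the expected membership is one where the stimulus lies inside the corresponding Berlin and Kay category region (and zero otherwise). The function and parameter names here are illustrative assumptions, not the exact case analysis used in the text.

```python
import math

def extent_error(stimuli, model_membership, data_categories, t=0.5):
    """RMS error between thresholded model memberships and data memberships.

    stimuli          -- list of color-space stimuli
    model_membership -- dict: category -> function giving membership at a stimulus
    data_categories  -- dict: stimulus index -> set of data categories it belongs to
    t                -- membership threshold
    """
    total, n = 0.0, len(stimuli)
    for i, s in enumerate(stimuli):
        # Model's winning category and its (thresholded) membership.
        m = {c: f(s) for c, f in model_membership.items()}
        c_max = max(m, key=m.get)
        p = m[c_max] if m[c_max] >= t else 0.0
        # Data membership: 1 iff the winning category matches the data region.
        e = 1.0 if c_max in data_categories.get(i, set()) else 0.0
        total += (p - e) ** 2
    return math.sqrt(total / n)
```

In a full implementation the squared error per stimulus would also be accumulated per category whenever either the model or the data assigns the stimulus to that category, as described above.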
The error in the placement of the category model foci is determined as
follows: for each category $c$, determine the stimulus $s_c$ that is
closest to its focus $f_c$, using the regular Euclidean
distance metric on the color space coordinates. This is also the stimulus
with the maximum membership value $m_c(s_c)$, since that is a monotonic
function of distance to the focus. Then the center of gravity $g_c$
of the data focal stimuli is determined as in
equation
, and finally the focal error is computed as
the Euclidean distance between the two foci:
\[
E_f = \sqrt{\sum_{d=1}^{D}\left(s_{c,d} - g_{c,d}\right)^2}
\]
where $E_f$ is the focal error,
$D$ is the number of dimensions of the
color space (in our case 3), and
$x_d$ is the $d$-th coordinate of point
$x$. This is approximately equivalent to the distance between
the data focus and the orthogonal projection of the model focus onto the
OCS surface (there may be small deviations because the stimuli and data
focus may lie slightly below the OCS surface). The focal error is computed
per category and averaged over all categories.
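The focal-error computation above is straightforward; a minimal sketch (with illustrative function names) takes the model focus for a category, the data's focal stimuli for that category, and returns their Euclidean separation:

```python
import math

def centroid(points):
    """Center of gravity of a list of D-dimensional points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def focal_error(model_focus, data_focal_stimuli):
    """Euclidean distance between the model focus and the center of
    gravity of the data's focal stimuli for one category."""
    g = centroid(data_focal_stimuli)
    return math.sqrt(sum((f - gd) ** 2 for f, gd in zip(model_focus, g)))
```

Averaging this quantity over all categories gives the per-space focal error reported below.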
The computed errors for the color spaces of interest are shown in
Figures
to
.
Comparing these figures, we get confirmation that the L*a*b* space performs best in terms of the extent of the color categories (leftmost bar of the left part of the figures, marked ``ALL''). For each of the categories individually except yellow, the error is smaller in L*a*b* space than in the other two spaces, and for black, gray, and white (i.e. the gray axis), the error is zero in L*a*b* space but not in the other two spaces. Between the XYZ and NPP spaces the differences are smaller, with the overall error virtually the same. For some categories, such as pink and white, the error is smaller in NPP space; for others, such as red and yellow, the error is smaller in XYZ space. In terms of the error of focus location, the three spaces are comparable overall (leftmost bar marked ``AVG'' in the figures), with some per-category variation among the spaces.