doi:10.1016/S0950-3293(03)00007-7
Food Quality and Preference 14 (2003) 425–434
Statistical testing of individual differences in sensory profiling
Per Bruun Brockhoff *
Department of Mathematics and Physics, Royal Veterinary and Agricultural University, Thorvaldsensvej 40, DK-1871 Frederiksberg C, Denmark
Received 29 July 2002; received in revised form 21 November 2002; accepted 28 November 2002
Abstract
A generalization of the approach of Brockhoff and Skovgaard [Food Quality & Preference, 5 (1994), 215] is given. The emphasis is on univariate assessor performance in sensory profiling. Statistical significance tests for differences between assessors in scaling, variability and sensitivity are given. A test for disagreement effects is also presented. In addition, the approach provides individual scaling, variability, disagreement and sensitivity values that can be used for subsequent tabulation, plotting and statistical analysis. The method of maximum likelihood is used throughout and all computations are implemented in a SAS® macro, PANMODEL, that is available via the author's homepage: http://www.dina.kvl.dk/~per.
© 2003 Elsevier Science Ltd. All rights reserved.
Keywords: Analysis of variance; Assessor performance; Individual differences; Maximum likelihood analysis; Multiplicative interaction; SAS macro; Sensory profile data; Statistical testing
1. Introduction
The investigation of individual differences in sensory profiling has been the topic of several papers over the years, see, for example, Lundahl and McDaniel (1990, 1991), Naes and Solheim (1991), and Brockhoff and Skovgaard (1994). The present paper is a continuation of the latter with two main novel contributions: (1) providing a statistical significance test for sensitivity differences; (2) presenting the entire approach as a practical tool for the analysis of individual differences in a full sensory profile data set, supported by a publicly available SAS® macro. A conceptual difference between the presented approach and other approaches is that formal statistical models are specified and formal maximum likelihood theory is used. Although earlier methods also provided significance tests for certain effects, we give here a more comprehensive collection of formally performed likelihood ratio tests.

Consider an over-simplified sensory experiment with no session effects. Assume also that we have a simple one-way sample structure, that is, the samples represent only one single 'treatment factor'. Let $Y_{asr}$ denote the $r$th replicate of a sensory score given by assessor $a$ for sample $s$, where $r = 1, \ldots, R$, $a = 1, \ldots, A$ and $s = 1, \ldots, S$. The standard ANOVA model with sample and assessor main effects together with the sample×assessor interaction effect may be written as

$$Y_{asr} = \alpha_a + \nu_s + \gamma_{as} + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma^2) \ \text{and independent} \tag{1}$$
The purpose of applying this model is to obtain information about the samples ($\nu_s$) while accounting properly for possible assessor differences. For this we would typically assume that $\alpha_a$ and $\gamma_{as}$ are random variables, leading to a so-called mixed model ANOVA. Viewpoints on how to perform an ANOVA properly in the sensory context are given in the papers by Steinsholt (1998) and Naes and Langsrud (1998) and in discussion papers, for example, Brockhoff (1998b). This is not the issue of the present approach. Instead the focus is on the individual differences. Fig. 1 illustrates what differences we may encounter on a univariate scale. Together, scaling differences and disagreements constitute the assessor×sample interaction. The disagreement part of the interaction is, apart from the exact inverse ranking, what Stone and Sidel (1985) refer to as 'cross-over' interaction. And the scaling part corresponds to 'magnitude interaction' in the terminology of Stone and Sidel (1985).
* Tel.: +45-35-28-23-61; fax: +45-35-28-23-50.
E-mail address: pmb@kvl.dk (P.B. Brockhoff).
Fig. 1. The four basic assessor differences for a single sensory attribute. By disagreement we mean all interaction effects not attributable to scaling differences. As indicated, this will include ranking differences (apart from the exact inverse ranking, which will come up as a scaling effect).
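
To make the split of the interaction into a scaling part and a disagreement part concrete, the following minimal sketch regresses one assessor's sample means on the panel consensus. It is an illustration only and not part of the paper's method: the function name `scaling_and_disagreement` and all numbers are invented for the example. The slope plays the role of an individual scaling constant and the residuals are what remains as disagreement.

```python
from statistics import mean

def scaling_and_disagreement(assessor_means, consensus):
    """Regress one assessor's sample means on the consensus sample means.

    Returns (slope, residuals): the slope acts as a scaling constant,
    the residuals are the part of the interaction left as disagreement.
    """
    x_bar = mean(consensus)
    y_bar = mean(assessor_means)
    sxx = sum((x - x_bar) ** 2 for x in consensus)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(consensus, assessor_means))
    slope = sxy / sxx
    residuals = [(y - y_bar) - slope * (x - x_bar)
                 for x, y in zip(consensus, assessor_means)]
    return slope, residuals

# Hypothetical sample means on a common scale (four samples).
consensus = [2.0, 4.0, 6.0, 8.0]
wide_user = [1.0, 5.0, 9.0, 13.0]      # same ranking, wider use of the scale
inverse_ranker = [8.0, 6.0, 4.0, 2.0]  # exact inverse ranking
cross_over = [2.0, 8.0, 4.0, 6.0]      # ranking differs from the consensus

for name, scores in [("wide_user", wide_user),
                     ("inverse_ranker", inverse_ranker),
                     ("cross_over", cross_over)]:
    slope, residuals = scaling_and_disagreement(scores, consensus)
    print(name, round(slope, 3), [round(e, 3) for e in residuals])
```

In this toy data the wide user and the exact inverse ranker are both described completely by a slope (2 and -1 respectively, with zero residuals), which is why the exact inverse ranking shows up as a scaling effect. Only the cross-over assessor leaves non-zero residuals.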
We start out with a discussion of how the acknowledgement of the possible differences illustrated in Fig. 1 leads to the so-called basic assessor model. Next, the generalized assessor model is presented. This is the key model for the entire approach. The following two sections then state more precisely which significance tests we perform and which individual values we compute. A short section gives the principles of the underlying computations without going into details, and finally an example is given showing the features of this approach using the available SAS® macro PANMODEL. Some technical details are given in the Appendix.
2. The basic assessor model

The standard ANOVA model given by (1) accounts for differences of type 1, 2 and 3, cf. Fig. 1. The assessor level differences are the main effects ($\alpha_a$). Scaling as well as disagreement differences will come up as interaction effects ($\gamma_{as}$). Variability differences are not accounted for in (1): one of the standard assumptions behind usual ANOVA is variance homogeneity. But it is easy to specify 'the same' model allowing for different assessor variabilities:

$$Y_{asr} = \alpha_a + \nu_s + \gamma_{as} + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_{a0}^2) \ \text{and independent} \tag{2}$$

So this model takes all four listed assessor differences into account and will indeed serve as a basic 'reference' model for the approach of the paper.

Note that the only difference between (1) and (2) is the subscript $a$ on the error variance. This means that a comparison of how well each of the two models fits the data will lead to a statistical significance test for the hypothesis of equal variances ('variability homogeneity'). Following this maximum likelihood principle of testing for effects we can achieve a test for disagreement: we must specify a model including every type of assessor difference except for the disagreement. This is the basic assessor model, which was given a thorough treatment in Brockhoff and Skovgaard (1994):

$$Y_{asr} = \alpha_a + \beta_a \nu_s + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_a^2) \ \text{and independent} \tag{3}$$

To explain how this model comes up, let us take the reference model (2) as starting point and argue step by step how to construct a model including assessor differences of type 1, 2 and 4 and not type 3, cf. Fig. 1. Inclusion of the $\alpha_a$- and $\sigma_a^2$-parameters ensures that type 1 and type 4 are present. The $\nu_s$-parameters (sample main effects) should also be there as we still want the model to express sample differences. The only thing not reused from (2) at this point is the interaction term $\gamma_{as}$. First note again that the assessor×sample interaction effect may be seen as a sum of scaling effects and non-linear (non-scaling) effects:

Interaction = Scaling + Disagreement

Instead of the general interaction term $\gamma_{as}$ we must specify a term that expresses scaling effects only. The differences in scaling can be expressed by specifying individual scaling constants $\tilde\beta_a$. Interpreting the sample main effects $\nu_s$ in (2) as true sample values of the sensory property in question (on some common scale), the scaling effect is then the product $\tilde\beta_a \nu_s$. This should then be substituted for $\gamma_{as}$ in (2), leading to

$$Y_{asr} = \alpha_a + \nu_s + \tilde\beta_a \nu_s + \varepsilon_{asr} = \alpha_a + (1 + \tilde\beta_a)\,\nu_s + \varepsilon_{asr}$$

Renaming $\beta_a = 1 + \tilde\beta_a$, we obtain the basic assessor model (3). This shows that although at first sight the sample main effects seem to be missing in the assessor model, they are indeed present as a part of the scaling effect.

The sensitivity is defined as the squared 'signal-to-noise' ratio:

$$\text{Sensitivity}_a = \beta_a^2 / \sigma_a^2$$

Comparing data fits for model (3) with model (2) gives a test for disagreements. In fact, this comparison amounts to a comparison of the estimated individual variances in the two models: $\hat\sigma_{a0}^2$ and $\hat\sigma_a^2$. The former is the actual within-sample variation for each individual whereas the latter also includes the disagreement effects (if any). We can thus, by a comparison on the individual level, obtain individual measures of disagreement as:

$$\text{Disagreement}_a = \sqrt{\hat\sigma_a^2 - \hat\sigma_{a0}^2}$$

Using the assessor model as reference model we will be able to test for scaling effects by fitting a restricted version with no scaling effects. We will also be able to test for sensitivity differences using a similar approach. And finally, applying the assessor model for each sensory attribute will provide estimates for individual scalings, variabilities, disagreements and sensitivities for each attribute. These can be used to investigate the individual differences in further detail by summary statistics and/or plotting. We return to the interpretation of the individual parameters in Section 5. In the basic assessor model there is a simple one-way sample structure. In the following it will be clear that it all generalizes to arbitrarily complex sample structures without changing the setup for testing and estimating individual differences.

3. The general assessor model

In most realistic situations, the experimental design structure will have a complexity such that the basic assessor model will be too simplistic. The complex structure may include crossed and nested sample treatment effects, order and carry-over effects, session and blocking effects, etc. As indicated in Brockhoff and Skovgaard (1994) it is possible to extend the basic assessor model in such a way that one may include an arbitrarily complex structure while maintaining the assessor part with individual differences in level, scaling and variability in the model. In short, the generalization is carried out by substituting for $\nu_s$ in (3) the sum of all effects in play. In our context, in which information about the assessor differences is the target, it means that we can acknowledge the true structure of the data and still do all the testing outlined earlier.

For a precise definition of the generalized assessor model we would have to introduce a notation for the general linear model. Instead we illustrate how it works for the example given in the present paper. We assume now, more realistically, that the replications were carried out in several sessions, and we want to account for the replication main effects. Replication may here represent a single session, if all samples are tested in each session, or a 'collection' of sessions in time. The general assessor model now becomes

$$Y_{asr} = \alpha_a + \beta_a(\nu_s + \rho_r) + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_a^2) \ \text{and independent} \tag{3a}$$

This is then a model that allows for multiplicative interaction between assessors and everything else, but only through the same individual scaling constants. This goes together with the following thinking: no matter what goes on that might change the perception of the samples, this will eventually 'go through' each individual's use of the scale before becoming an actual observed sensory score. From the point of view of the present paper this generality does not change anything regarding the target information: statistical significance tests for individual differences together with individual estimates for scaling, variability, disagreement and sensitivity.

The general version of the reference model (2) may now be written as

$$Y_{asr} = \alpha_a + \nu_s + \rho_r + \gamma_{as} + \delta_{ar} + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_{a0}^2) \ \text{and independent} \tag{2a}$$

When, in the following, we refer to (2a) and (3a) we think of these as the general definitions of the models. These general models are really specified by a specification of all effects one would include in an ANOVA model except for all effects related to assessors; we call these the sample structure effects.
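
As an illustration of how the sample structure effects pass through each individual's use of the scale in the general assessor model (3a), the following minimal sketch simulates scores from that model. It is not taken from the paper or from PANMODEL: the function name `simulate_general_assessor_model` and all parameter values are assumptions made for the example.

```python
import random

def simulate_general_assessor_model(alpha, beta, sigma, nu, rho, seed=0):
    """Simulate scores[a][s][r] = alpha[a] + beta[a] * (nu[s] + rho[r]) + noise.

    alpha, beta, sigma: per-assessor level, scaling and standard deviation.
    nu, rho: sample and replication main effects (the sample structure effects).
    """
    rng = random.Random(seed)
    return [
        [
            [alpha[a] + beta[a] * (nu[s] + rho[r]) + rng.gauss(0.0, sigma[a])
             for r in range(len(rho))]
            for s in range(len(nu))
        ]
        for a in range(len(alpha))
    ]

# Hypothetical parameter values, chosen only for illustration.
alpha = [5.0, 7.0]
beta = [0.5, 1.5]
nu = [-2.0, 0.0, 2.0]
rho = [-0.4, 0.4]

# With the noise switched off, the replication shift seen by assessor a
# is beta[a] * (rho[1] - rho[0]): the same session effect, individually scaled.
noise_free = simulate_general_assessor_model(alpha, beta, [0.0, 0.0], nu, rho)
shifts = [noise_free[a][0][1] - noise_free[a][0][0] for a in range(2)]
print(shifts)
```

The two assessors experience the same replication effect, but the assessor with the larger scaling constant shows a shift three times as large, which is the multiplicative interaction the model allows for.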
4. Statistical significance tests

We will perform significance tests for the following four assessor effects:

(I) Variability differences (4 in Fig. 1)
(II) Presence of disagreements (3 in Fig. 1)
(III) Scaling differences (2 in Fig. 1)
(IV) Sensitivity differences

More formally this corresponds to the testing of the following hypotheses:

H(I): $\sigma_1 = \cdots = \sigma_A$ (versus (2a))
H(II): Model (3a) holds true (versus (2a))
H(III): $\beta_1 = \cdots = \beta_A$ (versus (3a))
H(IV): $\beta_1/\sigma_1 = \cdots = \beta_A/\sigma_A$ (versus (3a))

The test in (I) is carried out as a Bartlett corrected $\chi^2$-test as in Brockhoff and Skovgaard (1994).¹ The more general sample structure does not give any difficulties for the Bartlett test. The tests in (II), (III) and (IV) are based on the maximum likelihood method, in which two nested models are compared by the likelihood ratio statistic. We use here approximate F-test statistics as described in Brockhoff (1998a).

¹ The formula for the Bartlett test in Brockhoff and Skovgaard (1994) has a misprint: the symbols R and P must be interchanged.

The test for equal sensitivities in (IV) is new in this context. In Brockhoff and Skovgaard (1994) it was shown that the quantities $\beta_a^2/\sigma_a^2$ are measures of individual sensitivities that relate closely to F-test statistics from individually performed ANOVAs. This, however, is only true if in fact there are no disagreements. We also see that the tests in (III) and (IV) are carried out 'versus (3a)'. From a strictly statistical point of view this means that we assume that the assessor model (3a) holds true, and under that assumption we test the scaling and sensitivity differences. However, if disagreement is present the tests still make sense, but the interpretation of scaling (and consequently of sensitivity) changes: it now has a 'consensus' interpretation. If an individual does not agree with the panel as a whole he/she will get lower scaling and sensitivity values. To understand this, one may take a look at the basic assessor model (3). The scalings appear as individual regression coefficients in regressions on the common ('consensus') values of the samples $\nu_s$. If an individual ranks entirely differently from the consensus he/she may very well obtain an individual regression coefficient of zero although he/she uses all of the scale. And as the sensitivities are direct functions of the scalings the same interpretation applies: an individual may be a good 'discriminator' as measured by the individual F-test statistic, but if his/her ranks of discrimination do not agree with the panel as a whole, he/she will not be called 'sensitive' in the sense of the present paper.

5. Individual parameters

In addition to the four significance tests, an important source of information is the estimated individual parameters. We now let $p$, $p = 1, \ldots, P$, be a numbering of the sensory attributes. We then have as an output from the macro individual scalings, variabilities (standard deviations), disagreements and ('signed') sensitivities for each combination of assessor and attribute:

Scalings: $\hat\beta_{ap}$
Standard deviations: $\hat\sigma_{a0,p}$
Disagreements: $\sqrt{\hat\sigma_{ap}^2 - \hat\sigma_{a0,p}^2}$
Signed sensitivities: $\hat\beta_{ap}/\hat\sigma_{ap}$

The 'signed' sensitivities will have the same sign as the scaling factor. Note that the standard deviations are the real individual variability measures, that is, the individually sized residuals from the reference model (2a). These figures summarize what goes on in the panel with respect to scaling, variability, disagreement and sensitivity. The interpretations should be combined with the outcomes of the significance tests. Two different aims emerge at this point:

- Within panel comparisons of the assessors.
- Between panel comparisons, when several panels evaluated the same kind of samples.

And the possible tools for this may be categorized as:

1. Tabulation.
2. Plotting.
3. Further statistical analysis.

It is beyond the scope of this paper to cover all six combinations of aims and tools. A few tables and plots for the within panel comparisons of the assessors are given. For each of the four effects an assessor-by-attribute table is presented. If the effect is significant for an attribute the values for the two most extreme assessors are given. Biplots of the full matrix of individual values are used subsequently.

6. Computations

In order to obtain the tests of Section 4 we must be able to maximize the likelihood for each of the four models:

1. Model (2a)
2. Model (3a)
3. Model (3a) with the additional restriction that $\beta_1 = \cdots = \beta_A$.
4. Model (3a) with the additional restriction that $\beta_1/\sigma_1 = \cdots = \beta_A/\sigma_A$.

For 1, 2 and 3 this was derived for the basic assessor model in Brockhoff and Skovgaard (1994) and this can be generalized directly to the more general setting of the present paper. The model (2a) is simply fitted by fitting for each assessor the model given by the sample structure effects. The general assessor model (3a) is fitted by a type of alternating least squares algorithm, alternating between individual regressions (for 'known' values of the sample structure effects) and weighted ANOVAs (for 'known' values of the individual parameters $\alpha_a$, $\beta_a$ and $\sigma_a$). The model (3a) with the homogeneous scaling restriction $\beta_1 = \cdots = \beta_A$ is fitted by a simplified version of the alternating algorithm just outlined. In this case the algorithm corresponds to an iterative weighted least squares procedure. The derivation of an algorithm for the model (3a) with the homogeneous sensitivity restriction $\beta_1/\sigma_1 = \cdots = \beta_A/\sigma_A$ requires a little consideration, but follows along the same lines. Details are given in the Appendix.

All computations presented in the paper are implemented as the SAS® macro PANMODEL that is available through the author's homepage: http://www.dina.kvl.dk/~per. The later example will illustrate what the macro provides and what it can do. No exact SAS® syntax will be given as this may be subject to change over time. The macro is self-explanatory.
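
The alternating scheme described above can be sketched for the simplest case. The following is a minimal illustration and not the PANMODEL implementation: it assumes the basic assessor model (3) with balanced data and a one-way sample structure, so the 'weighted ANOVA' step reduces to a weighted average per sample. The function name `fit_basic_assessor_model`, the identifiability constraints (sample values summing to zero, scalings averaging one), the fixed iteration count and the variance floor are all choices made for this example.

```python
from statistics import mean

def fit_basic_assessor_model(y, n_iter=50, var_floor=1e-12):
    """Alternating least squares sketch for Y_asr = alpha_a + beta_a*nu_s + e_asr.

    y[a][s] is the list of replicate scores of assessor a on sample s
    (balanced data assumed). Returns (alpha, beta, sigma, nu) with
    sum(nu) = 0 and mean(beta) = 1 as identifiability constraints.
    """
    A, S = len(y), len(y[0])
    cell = [[mean(y[a][s]) for s in range(S)] for a in range(A)]
    alpha = [mean(cell[a]) for a in range(A)]
    # Start from the centred panel means of the samples.
    nu = [mean(cell[a][s] for a in range(A)) for s in range(S)]
    nu_bar = mean(nu)
    nu = [v - nu_bar for v in nu]
    beta, var = [1.0] * A, [1.0] * A
    for _ in range(n_iter):
        # Step 1: individual regressions on the current sample values.
        snn = sum(v * v for v in nu)
        for a in range(A):
            beta[a] = sum(nu[s] * (cell[a][s] - alpha[a]) for s in range(S)) / snn
            rss = sum((obs - alpha[a] - beta[a] * nu[s]) ** 2
                      for s in range(S) for obs in y[a][s])
            n_obs = sum(len(y[a][s]) for s in range(S))
            var[a] = max(rss / n_obs, var_floor)  # ML variance estimate
        # Rescale so that the scalings average to one.
        c = mean(beta)
        beta = [b / c for b in beta]
        nu = [v * c for v in nu]
        # Step 2: weighted least squares update of the sample values.
        denom = sum(beta[a] ** 2 / var[a] for a in range(A))
        nu = [sum(beta[a] * (cell[a][s] - alpha[a]) / var[a] for a in range(A)) / denom
              for s in range(S)]
        nu_bar = mean(nu)
        nu = [v - nu_bar for v in nu]
    sigma = [v ** 0.5 for v in var]
    return alpha, beta, sigma, nu

# Hypothetical data: three assessors, four samples, two replicates each.
true_alpha = [5.0, 6.0, 7.0]
true_beta = [0.5, 1.0, 1.5]
true_nu = [-3.0, -1.0, 1.0, 3.0]
noise = [0.2, 0.4, 0.6]  # replicate scatter per assessor
y = [[[true_alpha[a] + true_beta[a] * v - noise[a],
       true_alpha[a] + true_beta[a] * v + noise[a]] for v in true_nu]
     for a in range(3)]

alpha, beta, sigma, nu = fit_basic_assessor_model(y)
sensitivity = [b ** 2 / s ** 2 for b, s in zip(beta, sigma)]
print(beta, sigma, sensitivity)
```

In this toy data the three assessors differ in both scaling and variability, yet the ratio of scaling to standard deviation is the same for all of them, so the squared signal-to-noise ratios coincide.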
7. Example

We consider here data from a study on frozen peas, see Bech, Hansen, and Wienberg (1997). Sixteen pea samples were profiled by a panel of 11 assessors. They were evaluated in three replicates using 14 descriptors measured on a continuous line scale from 0 to 15. We applied the PANMODEL macro with the sample main effects on 16 levels and the replication main effect on three levels as the sample structure effects, cf. (2a) and (3a). The current version of the macro PANMODEL takes as input a SAS® data set with the entire sensory data as it would be prepared to perform regular analyses of variance for each attribute. The output from PANMODEL is three plain text files (ASCII):

1. A log-file with results of data checks, iteration convergence information and detailed significance test results.
2. A result file with summary tables of the significance tests and extreme individuals.
3. A data file with the estimated individual standard deviations, disagreements, scalings and sensitivities for each assessor and attribute.

The data check is an automated procedure that checks whether there are some 'singularities' in the data that make it impossible to carry out the analyses. If this is the case the 'singularities' are removed and the removal is registered in the log-file. For instance, it will take out an assessor for a given attribute if the assessor has exactly the same score (possibly missing) for all observations on that attribute. It makes no sense to look for a scaling and variability value for such an assessor. This data check makes the macro robust against program breakdowns for 'messy' data sets (imbalance, lots of missing values, low information data, too many specified attributes, etc.).

The current version of the PANMODEL macro provides first of all a table corresponding to Table 1. In Table 1 the main significance results are presented. We see that for all attributes except for mealiness and juiciness there is a significant difference (at the 5% level) in the assessors' variability. Whenever not specified we use the 5% test level. There are significant disagreement problems in only a few cases, but the assessors do indeed use the scale differently for almost all attributes (except for pea odour). And perhaps the most interesting of them all: for all attributes except for pea taste the assessors have different sensitivities. The assessors did vary (significantly) in their variability on pea taste and they also differed in the scaling, but apparently these two effects cancel each other out such that the sensitivities are not significantly different. For mealiness and juiciness, for which the variabilities were homogeneous, the differences in scaling imply differences in the sensitivity.

Next the PANMODEL macro gives for each of the four features tables of the most extreme individuals. In Table 2 we see the least and the most reproducible assessor for each attribute when in fact significant differences exist. These standard deviations should be interpreted in light of the line scale [0,15]. A standard deviation of 2.10 for pea odour assessed by assessor 6 expresses that the average deviation from his/her average value for replications of identical samples is 2.10 for assessor 6. Using the standard normal population

Table 1
Summary of statistical tests of individual differences

Attribute      | Variability | Disagreement | Scaling | Sensitivity
Pea odour      | ***         |              |         | ***
Pod odour      | **          |              | **      | ***
Sweet odour    | ***         |              | **      | ***
Earthy odour   | ***         |              | ***     | ***
Pea taste      | ***         | **           | *       |
Pod taste      | **          | *            | ***     | ***
Sweet taste    | *           | *            | *       | *
Bitterness     | ***         |              | ***     | **
Earthy taste   | ***         | *            | ***     | ***
Crispiness     | ***         | **           | ***     | ***
Juiciness      |             |              | ***     | ***
Hardness       | ***         |              | ***     | ***
Mealiness      |             |              | ***     | ***
Skin viscosity | **          |              | **      | ***

* 5% significance.
** 1% significance.
*** 0.1% significance.
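
The data check described in this section, which takes out an assessor for an attribute when all of his/her scores on it are identical, can be sketched as follows. This is a hypothetical illustration and not the macro's code: the record layout `(assessor, attribute, score)`, the use of `None` for a missing score, and the function names `constant_score_pairs` and `drop_pairs` are assumptions. Missing values are ignored when comparing scores, so a pair with only missing scores is flagged as well.

```python
def constant_score_pairs(records):
    """Return the (assessor, attribute) pairs whose non-missing scores are all identical.

    records: iterable of (assessor, attribute, score), score is None when missing.
    """
    seen = {}
    for assessor, attribute, score in records:
        values = seen.setdefault((assessor, attribute), set())
        if score is not None:
            values.add(score)
    # Zero or one distinct value means no usable variation for this pair.
    return sorted(key for key, values in seen.items() if len(values) <= 1)

def drop_pairs(records, pairs):
    """Remove all records belonging to the flagged (assessor, attribute) pairs."""
    flagged = set(pairs)
    return [rec for rec in records if (rec[0], rec[1]) not in flagged]

# Hypothetical records for one attribute and three assessors.
records = [
    ("A1", "Pea odour", 7.5), ("A1", "Pea odour", 9.0),
    ("A2", "Pea odour", 0.0), ("A2", "Pea odour", 0.0),
    ("A3", "Pea odour", None), ("A3", "Pea odour", None),
]
flagged = constant_score_pairs(records)
cleaned = drop_pairs(records, flagged)
print(flagged, len(cleaned))
```

Returning the flagged pairs separately from the cleaned data mirrors the idea of registering each removal in a log before the analysis proceeds.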