Food Quality and Preference 14 (2003) 425–434

www.elsevier.com/locate/foodqual

Statistical testing of individual differences in sensory profiling

Per Bruun Brockhoff *

Department of Mathematics and Physics, Royal Veterinary and Agricultural University, Thorvaldsensvej 40, DK-1871 Frederiksberg C, Denmark

Received 29 July 2002; received in revised form 21 November 2002; accepted 28 November 2002

Abstract

A generalization of the approach of Brockhoff and Skovgaard [Food Quality & Preference, 5 (1994), 215] is given. The emphasis is on univariate assessor performance in sensory profiling. Statistical significance tests for differences between assessors in scaling, variability and sensitivity are given. A test for a disagreement effect is also presented. In addition, the approach provides individual scaling, variability, disagreement and sensitivity values that can be used for subsequent tabulation, plotting and statistical analysis. The method of maximum likelihood is used throughout, and all computations are implemented in a SAS® macro, PANMODEL, that is available via the author’s homepage: http://www.dina.kvl.dk/~per.

Keywords: Analysis of variance; Assessor performance; Individual differences; Maximum likelihood analysis; Multiplicative interaction; SAS macro; Sensory profile data; Statistical testing

1. Introduction

The investigation of individual differences in sensory profiling has been the topic of several papers over the years; see, for example, Lundahl and McDaniel (1990, 1991), Naes and Solheim (1991), and Brockhoff and Skovgaard (1994). The present paper is a continuation of the latter with two main novel contributions: (1) providing a statistical significance test for sensitivity differences; (2) presenting the entire approach as a practical tool for the analysis of individual differences in a full sensory profile data set, supported by a publicly available SAS® macro. A conceptual difference between the presented approach and other approaches is that formal statistical models are specified and formal maximum likelihood theory is used. Although earlier methods also provided significance tests for certain effects, we give here a more comprehensive collection of formally performed likelihood ratio tests.

Consider an over-simplified sensory experiment with no session effects. Assume also that we have a simple one-way sample structure, that is, the samples represent only one single ‘treatment factor’. Let $Y_{asr}$ denote the rth replicate of a sensory score given by assessor a for sample s, where r=1,...,R, a=1,...,A and s=1,...,S. The standard ANOVA model with sample and assessor main effects together with the sample×assessor interaction effect may be written as

$$Y_{asr} = \alpha_a + \nu_s + \gamma_{as} + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma^2) \text{ and independent} \tag{1}$$
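To make model (1) concrete, the following minimal Python sketch simulates a balanced data set from it and computes the usual ANOVA sums of squares for assessor, sample, interaction and error. The function names, the parameter values and the use of Python's standard library are assumptions made purely for illustration; they are not part of the PANMODEL macro.

```python
import random

def simulate_model1(alpha, nu, gamma, sigma, R, seed=1):
    """Draw y[a][s][r] = alpha[a] + nu[s] + gamma[a][s] + N(0, sigma^2) noise."""
    rng = random.Random(seed)
    return [[[alpha[a] + nu[s] + gamma[a][s] + rng.gauss(0.0, sigma)
              for _ in range(R)]
             for s in range(len(nu))]
            for a in range(len(alpha))]

def anova_sums_of_squares(y):
    """Sums of squares of the balanced two-way ANOVA with interaction."""
    A, S, R = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[a][s]) / R for s in range(S)] for a in range(A)]
    row = [sum(cell[a]) / S for a in range(A)]                       # assessor means
    col = [sum(cell[a][s] for a in range(A)) / A for s in range(S)]  # sample means
    grand = sum(row) / A
    obs = [(a, s, v) for a in range(A) for s in range(S) for v in y[a][s]]
    return {
        "assessor": S * R * sum((m - grand) ** 2 for m in row),
        "sample": A * R * sum((m - grand) ** 2 for m in col),
        "interaction": R * sum((cell[a][s] - row[a] - col[s] + grand) ** 2
                               for a in range(A) for s in range(S)),
        "error": sum((v - cell[a][s]) ** 2 for a, s, v in obs),
        "total": sum((v - grand) ** 2 for a, s, v in obs),
    }

# Illustrative (made-up) values: 3 assessors, 4 samples, 2 replicates.
alpha = [5.0, 6.0, 7.0]
nu = [-2.0, -1.0, 1.0, 2.0]
gamma = [[0.0] * 4 for _ in range(3)]   # no interaction in this toy example
y = simulate_model1(alpha, nu, gamma, sigma=0.5, R=2)
ss = anova_sums_of_squares(y)
sigma2_hat = ss["error"] / (3 * 4 * (2 - 1))   # pooled error variance estimate
```

In a balanced design the four component sums of squares add up to the total, which is a convenient sanity check on any implementation of the decomposition.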

The purpose of applying this model is to obtain information about the samples ($\nu_s$) while accounting properly for possible assessor differences. For this we would typically assume that $\alpha_a$ and $\gamma_{as}$ are random variables, leading to a so-called mixed model ANOVA. Viewpoints on how to perform an ANOVA properly in the sensory context are given in the papers by Steinsholt (1998) and Naes and Langsrud (1998) and in discussion papers, for example, Brockhoff (1998b). This is not the issue of the present approach. Instead, the focus is on the individual differences. Fig. 1 illustrates what differences we may encounter on a univariate scale. Together, scaling differences and disagreements constitute the assessor×sample interaction. The disagreement part of the interaction is, apart from the exact inverse ranking, what Stone and Sidel (1985) refer to as ‘cross-over’ interaction, and the scaling part corresponds to ‘magnitude interaction’ in the terminology of Stone and Sidel (1985).
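The split of the interaction into a scaling part and a disagreement part can be illustrated numerically. The sketch below (Python, with made-up values and hypothetical names; the scaling constants are written `beta` here) builds noise-free scores in which all assessors agree perfectly on the ranking of the samples but use different portions of the scale. The usual interaction term is still non-zero: it is a pure ‘magnitude’ interaction.

```python
def interaction_terms(cell):
    """Interaction residuals: cell mean - row mean - column mean + grand mean."""
    A, S = len(cell), len(cell[0])
    row = [sum(cell[a]) / S for a in range(A)]
    col = [sum(cell[a][s] for a in range(A)) / A for s in range(S)]
    grand = sum(row) / A
    return [[cell[a][s] - row[a] - col[s] + grand for s in range(S)]
            for a in range(A)]

# Made-up values: level (alpha), scaling (beta) and common sample values (nu).
alpha = [4.0, 6.0, 5.0]
beta = [0.5, 1.0, 1.5]
nu = [-3.0, -1.0, 0.0, 4.0]

# Noise-free scores with scaling differences only (no disagreement).
cell = [[alpha[a] + beta[a] * nu[s] for s in range(len(nu))]
        for a in range(len(alpha))]
inter = interaction_terms(cell)

# With scaling only, the interaction factorises as
# (beta_a - mean beta) * (nu_s - mean nu).
beta_bar = sum(beta) / len(beta)
nu_bar = sum(nu) / len(nu)
product = [[(b - beta_bar) * (v - nu_bar) for v in nu] for b in beta]

# Every assessor orders the samples in the same way.
rankings = [sorted(range(len(nu)), key=lambda s: cell[a][s])
            for a in range(len(alpha))]
```

Any interaction that cannot be written in this product form is what the paper calls disagreement.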

* Tel.: +45-35-28-23-61; fax: +45-35-28-23-50.

E-mail address: pmb@kvl.dk (P.B. Brockhoff).

doi:10.1016/S0950-3293(03)00007-7


Fig. 1. The four basic assessor differences for a single sensory attribute. By disagreement we mean all interaction effects not attributable to scaling differences. As indicated, this will include ranking differences (apart from the exact inverse ranking, which will come up as a scaling effect).

We start out with a discussion of how the acknowledgement of the possible differences illustrated in Fig. 1 leads to the so-called basic assessor model. Next, the generalized assessor model is presented. This is the key model for the entire approach. The following two sections then state more precisely which significance tests we perform and which individual values we compute. A short section gives the principles of the underlying computations without going into details, and finally an example is given showing the features of this approach using the available SAS® macro PANMODEL. Some technical details are given in the Appendix.

2. The basic assessor model

The standard ANOVA model given by (1) accounts for differences of type 1, 2 and 3, cf. Fig. 1. The assessor level differences are the main effects ($\alpha_a$). Scaling as well as disagreement differences will come up as interaction effects ($\gamma_{as}$). Variability differences are not accounted for in (1): one of the standard assumptions behind the usual ANOVA is variance homogeneity. But it is easy to specify ‘the same’ model allowing for different assessor variabilities:

$$Y_{asr} = \alpha_a + \nu_s + \gamma_{as} + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_{a0}^2) \text{ and independent} \tag{2}$$

So this model takes all four listed assessor differences into account and will indeed serve as a basic ‘reference’ model for the approach of the paper.

Note that the only difference between (1) and (2) is the subscript on the error variance $\sigma_{a0}^2$. This means that a comparison of how well each of the two models fits the data will lead to a statistical significance test for the hypothesis of equal variances (‘variability homogeneity’). Following this maximum likelihood principle of testing for effects we can achieve a test for disagreement: we must specify a model including every type of assessor difference except for the disagreement. This is the basic assessor model, given a thorough treatment in Brockhoff and Skovgaard (1994):

$$Y_{asr} = \alpha_a + \beta_a \nu_s + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_a^2) \text{ and independent} \tag{3}$$

To explain how this model comes up, let us take the reference model (2) as starting point and argue step by step how to construct a model including assessor differences of type 1, 2 and 4 and not type 3, cf. Fig. 1. Inclusion of the $\alpha_a$- and $\sigma_a^2$-parameters ensures that type 1 and type 4 are present. The $\nu_s$-parameters (sample main effects) should also be there as we still want the model to express sample differences. The only thing not reused from (2) at this point is the interaction term $\gamma_{as}$. First note again that the assessor×sample interaction effect may be seen as a sum of scaling effects and non-linear (non-scaling) effects:

Interaction = Scaling + Disagreement

Instead of the general interaction term $\gamma_{as}$ we must specify a term that expresses scaling effects only. The differences in scaling can be expressed by specifying individual scaling constants $\tilde\beta_a$. Interpreting the sample main effects $\nu_s$ in (2) as true sample values of the sensory property in question (on some common scale), the scaling effect is then the product $\tilde\beta_a \nu_s$. This should then be substituted for $\gamma_{as}$ in (2), leading to

$$Y_{asr} = \alpha_a + \nu_s + \tilde\beta_a \nu_s + \varepsilon_{asr} = \alpha_a + (1 + \tilde\beta_a)\,\nu_s + \varepsilon_{asr}$$

Renaming $\beta_a = 1 + \tilde\beta_a$, we obtain the basic assessor model (3). This shows that although, at first sight, the sample main effects seem to be missing in the assessor model, they are indeed present as a part of the scaling effect.

The sensitivity is defined as the squared ‘signal-to-noise’ ratio:

$$\text{Sensitivity}_a = \beta_a^2 / \sigma_a^2$$

Comparing data fits for model (3) with model (2) gives a test for disagreements. In fact, this comparison amounts to a comparison of the estimated individual variances in the two models: $\hat\sigma_{a0}^2$ and $\hat\sigma_a^2$. The former is the actual within-sample variation for each individual, whereas the latter also includes the disagreement effects (if any). We can thus, by a comparison on the individual level, obtain individual measures of disagreement as:

$$\text{Disagreement}_a = \sqrt{\hat\sigma_a^2 - \hat\sigma_{a0}^2}$$

Using the assessor model as reference model, we will be able to test for scaling effects by fitting a restricted version with no scaling effects. We will also be able to test for sensitivity differences using a similar approach. And finally, applying the assessor model for each sensory attribute will provide estimates of individual scalings, variabilities, disagreements and sensitivities for each attribute. These can be used to investigate the individual differences in further detail by summary statistics and/or plotting. We return to the interpretation of the individual parameters in Section 5. In the basic assessor model there is a simple one-way sample structure. In the following it will be clear that it all generalizes to arbitrarily complex sample structures without changing the setup for testing and estimating individual differences.

3. The general assessor model

In most realistic situations, the experimental design structure will have a complexity such that the basic assessor model will be too simplistic. The complex structure may include crossed and nested sample treatment effects, order and carry-over effects, session and blocking effects, etc. As indicated in Brockhoff and Skovgaard (1994), it is possible to extend the basic assessor model in such a way that one may include an arbitrarily complex structure while maintaining the assessor part with individual differences in level, scaling and variability in the model. In short, the generalization is carried out by substituting for $\nu_s$ in (3) the sum of all effects in play. In our context, in which information about the assessor differences is the target, it means that we can acknowledge the true structure of the data and still do all the testing outlined earlier.

For a precise definition of the generalized assessor model we would have to introduce a notation for the general linear model. Instead we illustrate how it works for the example given in the present paper. We assume now, more realistically, that the replications were carried out in several sessions, and we want to account for the replication main effects. Replication may here represent a single session, if all samples are tested in each session, or a ‘collection’ of sessions in time. The general assessor model now becomes

$$Y_{asr} = \alpha_a + \beta_a(\nu_s + \rho_r) + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_a^2) \text{ and independent} \tag{3a}$$

This is then a model that allows for multiplicative interaction between assessors and everything else, but only through the same individual scaling constants. This goes together with the following thinking: no matter what goes on that might change the perception of the samples, this will eventually ‘go through’ each individual’s use of the scale before becoming an actual observed sensory score. From the point of view of the present paper this generality does not change anything regarding the target information: statistical significance tests for individual differences together with individual estimates for scaling, variability, disagreement and sensitivity.

The general version of the reference model (2) may now be written as

$$Y_{asr} = \alpha_a + \nu_s + \rho_r + \gamma_{as} + \delta_{ar} + \varepsilon_{asr}, \qquad \varepsilon_{asr} \sim N(0, \sigma_{a0}^2) \text{ and independent} \tag{2a}$$

When, in the following, we refer to (2a) and (3a) we think of these as the general definitions of the models. These general models are really specified by a specification of all effects one would include in an ANOVA model except for all effects related to assessors; we call these the sample structure effects.

4. Statistical significance tests

We will perform significance tests for the following four assessor effects:

(I) Variability differences (4 in Fig. 1)
(II) Presence of disagreements (3 in Fig. 1)
(III) Scaling differences (2 in Fig. 1)
(IV) Sensitivity differences


More formally this corresponds to the testing of the following hypotheses:

H(I): $\sigma_{10}^2 = \cdots = \sigma_{A0}^2$ (versus (2a))
H(II): Model (3a) holds true (versus (2a))
H(III): $\beta_1 = \cdots = \beta_A$ (versus (3a))
H(IV): $\beta_1/\sigma_1 = \cdots = \beta_A/\sigma_A$ (versus (3a))

The test in (I) is carried out as a Bartlett corrected $\chi^2$-test as in Brockhoff and Skovgaard (1994).¹ The more general sample structure does not give any difficulties for the Bartlett test. The tests in (II), (III) and (IV) are based on the maximum likelihood method, in which two nested models are compared by the likelihood ratio statistic. We use here approximate F-test statistics as described in Brockhoff (1998a).

¹ The formula for the Bartlett test in Brockhoff and Skovgaard (1994) has a misprint: the symbols R and P must be interchanged.

The test for equal sensitivities in (IV) is new in this context. In Brockhoff and Skovgaard (1994) it was shown that the quantities $\beta_a^2/\sigma_a^2$ are measures of individual sensitivities that relate closely to F-test statistics from individually performed ANOVAs. This, however, is only true if in fact there are no disagreements. We also see that the tests in (III) and (IV) are carried out ‘versus (3a)’. From a strictly statistical point of view this means that we assume that the assessor model (3a) holds true, and under that assumption we test the scaling and sensitivity differences. However, if disagreement is present the tests still make sense, but the interpretation of scaling (and consequently of sensitivity) changes: it now has a ‘consensus’ interpretation. If an individual does not agree with the panel as a whole, he/she will get lower scaling and sensitivity values. To understand this, one may take a look at the basic assessor model (3). The scalings appear as individual regression coefficients in regressions on the common (‘consensus’) values of the samples, $\nu_s$. If an individual ranks entirely differently from the consensus, he/she may very well obtain an individual regression coefficient of zero although he/she uses all of the scale. And as the sensitivities are direct functions of the scalings, the same interpretation applies: an individual may be a good ‘discriminator’ as measured by the individual F-test statistic, but if his/her ranks of discrimination do not agree with the panel as a whole, he/she will not be called ‘sensitive’ in the sense of the present paper.

5. Individual parameters

In addition to the four significance tests, an important source of information is the estimated individual parameters. We now let p, p=1,...,P, be a numbering of the sensory attributes. We then have as an output from the macro individual scalings, variabilities (standard deviations), disagreements and (‘signed’) sensitivities for each combination of assessor and attribute:

Scalings: $\hat\beta_{ap}$
Standard deviations: $\hat\sigma_{a0,p}$
Disagreements: $\sqrt{\hat\sigma_{ap}^2 - \hat\sigma_{a0,p}^2}$
Signed sensitivities: $\hat\beta_{ap}/\hat\sigma_{ap}$

The ‘signed’ sensitivities will have the same sign as the scaling factor. Note that the standard deviations are the real individual variability measures, that is, the individually sized residuals from the reference model (2a). These figures summarize what goes on in the panel with respect to scaling, variability, disagreement and sensitivity. The interpretations should be combined with the outcomes of the significance tests. Two different aims emerge at this point:

Within panel comparisons of the assessors.
Between panel comparisons, when several panels evaluated the same kind of samples.

And the possible tools for this may be categorized as:

1. Tabulation.
2. Plotting.
3. Further statistical analysis.

It is beyond the scope of this paper to cover all six combinations of aims and tools. A few tables and plots for the within panel comparisons of the assessors are given. For each of the four effects an assessor-by-attribute table is presented. If the effect is significant for an attribute, the values for the two most extreme assessors are given. Biplots of the full matrix of individual values are used subsequently.

6. Computations

In order to obtain the tests of the previous section we must be able to maximize the likelihood for each of the four models:

1. Model (2a)
2. Model (3a)
3. Model (3a) with the additional restriction that $\beta_1 = \cdots = \beta_A$.
4. Model (3a) with the additional restriction that $\beta_1/\sigma_1 = \cdots = \beta_A/\sigma_A$.

For 1, 2 and 3 this was derived for the basic assessor model in Brockhoff and Skovgaard (1994), and this can be generalized directly to the more general setting of the present paper. The model (2a) is simply fitted by fitting for each assessor the model given by the sample structure effects. The general assessor model (3a) is fitted by a type of alternating least squares algorithm, alternating between individual regressions (for ‘known’ values of the sample structure effects) and weighted ANOVAs (for ‘known’ values of the individual parameters $\alpha_a$, $\beta_a$ and $\sigma_a$). The model (3a) with the homogeneous scaling restriction $\beta_1 = \cdots = \beta_A$ is fitted by a simplified version of the alternating algorithm just outlined. In this case the algorithm corresponds to an iterative weighted least squares procedure. The derivation of an algorithm for the model (3a) with the homogeneous sensitivity restriction $\beta_1/\sigma_1 = \cdots = \beta_A/\sigma_A$ requires a little consideration, but follows along the same lines. Details are given in the Appendix.

All computations presented in the paper are implemented as the SAS® macro PANMODEL, which is available through the author’s homepage: http://www.dina.kvl.dk/~per. The later example will illustrate what the macro provides and what it can do. No exact SAS® syntax will be given as this may be subject to change over time. The macro is self-explanatory.
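The alternating scheme for the basic assessor model (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PANMODEL implementation: the function names are hypothetical, the data are assumed complete, and the normalisation used (sample values with mean zero, mean scaling equal to one) is just one convenient way of making the parameters identifiable. Step 1 is an ordinary regression per assessor for ‘known’ sample values; step 2 is a weighted least squares update of the sample values for ‘known’ individual parameters.

```python
import random

def regress(scores_a, nu):
    """Regress one assessor's scores on nu; return (alpha, beta, sigma2)."""
    xs = [nu[s] for s, reps in enumerate(scores_a) for _ in reps]
    ys = [v for reps in scores_a for v in reps]
    n = len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    beta = sum((x - xbar) * (v - ybar) for x, v in zip(xs, ys)) / sxx
    alpha = ybar - beta * xbar
    rss = sum((v - alpha - beta * x) ** 2 for x, v in zip(xs, ys))
    return alpha, beta, max(rss / n, 1e-12)   # ML variance estimate, floored

def fit_basic_assessor_model(y, n_iter=500, tol=1e-10):
    """Alternating least squares for y[a][s][r] = alpha_a + beta_a*nu_s + error."""
    A, S = len(y), len(y[0])
    # Start from the centred sample means across the panel.
    nu = [sum(sum(y[a][s]) / len(y[a][s]) for a in range(A)) / A
          for s in range(S)]
    mean_nu = sum(nu) / S
    nu = [v - mean_nu for v in nu]
    for _ in range(n_iter):
        # Step 1: individual regressions for 'known' sample values.
        fits = [regress(y[a], nu) for a in range(A)]
        # Step 2: weighted least squares for 'known' individual parameters.
        new_nu = []
        for s in range(S):
            num = sum(b / s2 * sum(v - al for v in y[a][s])
                      for a, (al, b, s2) in enumerate(fits))
            den = sum(b * b / s2 * len(y[a][s])
                      for a, (al, b, s2) in enumerate(fits))
            new_nu.append(num / den)
        # Normalise (assumed convention): mean(nu) = 0 and mean(beta) = 1.
        mean_nu = sum(new_nu) / S
        mean_beta = sum(f[1] for f in fits) / A
        new_nu = [(v - mean_nu) * mean_beta for v in new_nu]
        change = max(abs(u - v) for u, v in zip(new_nu, nu))
        nu = new_nu
        if change < tol:
            break
    fits = [regress(y[a], nu) for a in range(A)]   # parameters matching final nu
    return fits, nu

# Toy data with made-up 'true' values (mean scaling 1, sample values centred).
rng = random.Random(0)
true_beta = [0.5, 1.0, 1.5, 1.0]
true_nu = [-5.0, -3.0, -1.0, 1.0, 3.0, 5.0]
y = [[[5.0 + b * v + rng.gauss(0.0, 0.2) for _ in range(3)] for v in true_nu]
     for b in true_beta]
fits, nu_hat = fit_basic_assessor_model(y)
beta_hat = [f[1] for f in fits]
```

Each of the two steps can only increase the likelihood, which is why such an alternation is a natural choice here; the restricted models (homogeneous scaling or sensitivity) would need modified versions of step 1.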

7. Example

We consider here data from a study on frozen peas, see Bech, Hansen, and Wienberg (1997). Sixteen pea samples were profiled by a panel of 11 assessors. They were evaluated in three replicates using 14 descriptors measured on a continuous line scale from 0 to 15. We applied the PANMODEL macro with the sample main effects on 16 levels and the replication main effect on three levels as the sample structure effects, cf. (2a) and (3a). The current version of the macro PANMODEL takes as input a SAS® data set with the entire sensory data as it would be prepared to perform regular analyses of variance for each attribute. The output from PANMODEL is three plain text files (ASCII):

1. A log-file with results of data checks, iteration convergence information and detailed significance test results.
2. A result file with summary tables of the significance tests and extreme individuals.
3. A data file with the estimated individual standard deviations, disagreements, scalings and sensitivities for each assessor and attribute.

The data check is an automated procedure that checks whether there are some ‘singularities’ in the data that make it impossible to carry out the analyses. If this is the case, the ‘singularities’ are removed and the removal is registered in the log-file. For instance, it will take out an assessor for a given attribute if the assessor has exactly the same score (possibly missing) for all observations on that attribute. It makes no sense to look for a scaling and variability value for such an assessor. This data check makes the macro robust against program breakdowns for ‘messy’ data sets (imbalance, lots of missing values, low information data, too many specified attributes, etc.).

The current version of the PANMODEL macro provides first of all a table corresponding to Table 1. In Table 1 the main significance results are presented. We see that for all attributes except for mealiness and juiciness there is a significant difference (at the 5% level) in the assessors’ variability. Whenever not specified we use the 5% test level. There are significant disagreement problems in only a few cases, but the assessors do indeed use the scale differently for almost all attributes (except for pea odour). And perhaps the most interesting of them all: for all attributes except for pea taste the assessors have different sensitivities. The assessors did vary (significantly) in their variability on pea taste and they also differed in the scaling, but apparently these two effects cancel each other out such that the sensitivities are not significantly different. For mealiness and juiciness, for which the variabilities were homogeneous, the differences in scaling imply differences in the sensitivity.

Next the PANMODEL macro gives for each of the four features tables of the most extreme individuals. In Table 2 we see the least and the most reproducible assessor for each attribute when in fact significant differences exist. These standard deviations should be interpreted in light of the line scale [0,15]. A standard deviation of 2.10 for pea odour assessed by assessor 6 expresses that the average deviation from his/her average value for replications of identical samples is 2.10. Using the standard normal population

Table 1
Summary of statistical tests of individual differences

Columns: Attribute; Variability; Disagreement; Scaling; Sensitivity (significant entries listed left to right; blank cells omitted)

Pea odour       ***  ***
Pod odour       **  **  ***
Sweet odour     ***  **  ***
Earthy odour    ***  ***  ***
Pea taste       ***
Pod taste       ***  ***
Sweet taste     *  *  *
Bitterness      ***  ***  **
Earthy taste    ***  ***  ***
Crispiness      ***  **  ***  ***
Juiciness       ***  ***
Hardness        ***  ***  ***
Mealiness       ***  ***
Skin viscosity  **  **  ***

* 5% significance.
** 1% significance.
*** 0.1% significance.
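The data check described in this section can be sketched as follows. This is a minimal Python illustration with hypothetical function names and a made-up record layout (assessor, attribute, score), not the actual PANMODEL code: it flags every assessor-by-attribute combination whose non-missing scores are all identical, removes those records and returns the flagged combinations so that they can be written to a log.

```python
def find_singular_pairs(records):
    """Return (assessor, attribute) pairs whose non-missing scores are all identical.

    records: list of (assessor, attribute, score), with score None if missing.
    """
    seen = {}
    for assessor, attribute, score in records:
        values = seen.setdefault((assessor, attribute), set())
        if score is not None:
            values.add(score)
    # A pair is 'singular' if it has at most one distinct non-missing score.
    return sorted(key for key, values in seen.items() if len(values) <= 1)

def drop_singular_pairs(records):
    """Remove all records of flagged pairs; also return the pairs for logging."""
    flagged = set(find_singular_pairs(records))
    kept = [r for r in records if (r[0], r[1]) not in flagged]
    return kept, sorted(flagged)

# Made-up example: assessor 2 gives a constant score for pea odour.
records = [
    (1, "pea odour", 7.5), (1, "pea odour", 8.0),
    (2, "pea odour", 5.0), (2, "pea odour", 5.0), (2, "pea odour", None),
    (2, "sweet taste", 3.0), (2, "sweet taste", 9.0),
]
kept, removed = drop_singular_pairs(records)
```

Only the offending combination is removed; the same assessor stays in the analysis for the attributes on which he or she shows variation.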
