SPSS TWOSTEP CLUSTER – A FIRST EVALUATION*
Johann Bacher†, Knut Wenzig‡, Melanie Vogler§
Universität Erlangen-Nürnberg
SPSS 11.5 and later releases offer a two step clustering method. To the authors' knowledge, the procedure has not been used in the social sciences until now. This situation is surprising: The widely used clustering algorithms, k-means clustering and agglomerative hierarchical techniques, suffer from well known problems, whereas SPSS TwoStep clustering promises to solve at least some of these problems. In particular, mixed type attributes can be handled and the number of clusters is automatically determined. These properties are promising. Therefore, SPSS TwoStep clustering is evaluated in this paper by a simulation study.

Summarizing the results of the simulations, SPSS TwoStep performs well if all variables are continuous. The results are less satisfactory if the variables are of mixed type. One reason for this unsatisfactory finding is the fact that differences in categorical variables are given a higher weight than differences in continuous variables. Different combinations of the categorical variables can dominate the results. In addition, SPSS TwoStep clustering is not able to correctly detect models with no cluster structure. Latent class models show a better performance. They are able to detect models with no underlying cluster structure, they result more frequently in correct decisions, and they yield less biased estimators.
Keywords: SPSS TwoStep clustering, mixed type attributes, model based clustering, latent class models

1 INTRODUCTION
SPSS 11.5 and later releases offer a two step clustering method (SPSS 2001, 2004). To the authors' knowledge, the procedure has not been used in the social sciences until now. This situation is surprising: The widely used clustering algorithms, k-means clustering and agglomerative hierarchical techniques, suffer from well known problems (for instance, Bacher 2000: 223; Everitt et al. 2001: 94-96; Huang 1998: 288), whereas SPSS TwoStep clustering promises to solve at least some of these problems. In particular, mixed type attributes can be handled and the number of clusters is automatically determined. These properties are promising.
* AUTHORS' NOTE: The study was supported by the Staedtler Stiftung Nürnberg (Project: „Was leisten Clusteranalyseprogramme? Ein systematischer Vergleich von Programmen zur Clusteranalyse“, in English: "What do cluster analysis programs achieve? A systematic comparison of programs for cluster analysis"). We would like to thank SPSS Inc. (Technical Support), Jeroen Vermunt and David Wishart for invaluable comments on an earlier draft of the paper and Bettina Lampmann-Ende for her help with the English version.
† bacher@wiso.uni-erlangen.de
‡ knut@wenzig.de
§ vogler.m@gmx.de
Therefore, SPSS TwoStep clustering will be evaluated in this paper. The following questions will be analyzed:

1. How is the problem of commensurability (different scale units, different measurement levels) solved?
2. Which assumptions are made for models with mixed type attributes?
3. How well does SPSS TwoStep, especially the automatic detection of the number of clusters, perform in the case of continuous variables?
4. How well does SPSS TwoStep, especially the automatic detection of the number of clusters, perform in the case of variables with different measurement levels (mixed type attributes)?

The model of SPSS TwoStep clustering will be described in the next section. The evaluation will be done in section 3. Section 4
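2 SPSS TWOSTEP CLUSTERING

where

\[
\xi_i = -n_i \left( \sum_{j=1}^{p} \frac{1}{2} \log\left(\hat\sigma_{ij}^2 + \hat\sigma_j^2\right) - \sum_{j=1}^{q} \sum_{l=1}^{m_j} \hat\pi_{ijl} \log\left(\hat\pi_{ijl}\right) \right) \tag{2}
\]

\[
\xi_s = -n_s \left( \sum_{j=1}^{p} \frac{1}{2} \log\left(\hat\sigma_{sj}^2 + \hat\sigma_j^2\right) - \sum_{j=1}^{q} \sum_{l=1}^{m_j} \hat\pi_{sjl} \log\left(\hat\pi_{sjl}\right) \right) \tag{3}
\]

\[
\xi_{\langle i,s \rangle} = -n_{\langle i,s \rangle} \left( \sum_{j=1}^{p} \frac{1}{2} \log\left(\hat\sigma_{\langle i,s \rangle j}^2 + \hat\sigma_j^2\right) - \sum_{j=1}^{q} \sum_{l=1}^{m_j} \hat\pi_{\langle i,s \rangle jl} \log\left(\hat\pi_{\langle i,s \rangle jl}\right) \right) \tag{4}
\]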
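$\xi_v$ can be interpreted as a kind of dispersion (variance) within cluster $v$ ($v = i, s, \langle i,s \rangle$). $\xi_v$ consists of two parts. The first part, $-n_v \sum_{j=1}^{p} \frac{1}{2} \log(\hat\sigma_{vj}^2 + \hat\sigma_j^2)$, measures the dispersion of the continuous variables $x_j$ within cluster $v$. If only $\hat\sigma_{vj}^2$ were used, $d(i,s)$ would be exactly the decrease in the log-likelihood function after merging clusters $i$ and $s$. The term $\hat\sigma_j^2$ is added to avoid the degenerate situation $\hat\sigma_{vj}^2 = 0$. The entropy $-n_v \sum_{j=1}^{q} \sum_{l=1}^{m_j} \hat\pi_{vjl} \log(\hat\pi_{vjl})$ is used in the second part as a measure of dispersion for the categorical variables.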
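To make the two parts of $\xi_v$ concrete, the following minimal Python sketch evaluates the dispersion term for small clusters and combines three such terms into a distance. It rests on assumptions that are not stated in this section: the log-likelihood distance is taken to be $d(i,s) = \xi_i + \xi_s - \xi_{\langle i,s \rangle}$, variances are estimated with divisor $n_v$, and the function names `xi` and `distance` as well as the data layout are purely illustrative and not part of SPSS.

```python
import math
from collections import Counter


def xi(cont_rows, cat_rows, overall_var):
    """Dispersion term xi_v of one cluster, following equations (2)-(4).

    cont_rows:   one tuple of p continuous values per case
    cat_rows:    one tuple of q categorical values per case (same length)
    overall_var: the p variances sigma_j^2 of the whole sample
    """
    n = len(cont_rows)
    cont_part = 0.0
    for j, sigma2_j in enumerate(overall_var):
        col = [row[j] for row in cont_rows]
        mean = sum(col) / n
        # Assumption: within-cluster variance with divisor n.
        sigma2_vj = sum((x - mean) ** 2 for x in col) / n
        cont_part += 0.5 * math.log(sigma2_vj + sigma2_j)

    # Sum of pi * log(pi) over all categorical variables and categories.
    cat_part = 0.0
    q = len(cat_rows[0]) if cat_rows else 0
    for j in range(q):
        counts = Counter(row[j] for row in cat_rows)
        for count in counts.values():
            pi = count / n
            cat_part += pi * math.log(pi)

    return -n * (cont_part - cat_part)


def distance(cluster_i, cluster_s, overall_var):
    """Assumed distance d(i, s) = xi_i + xi_s - xi_<i,s>."""
    cont_i, cat_i = cluster_i
    cont_s, cat_s = cluster_s
    xi_merged = xi(cont_i + cont_s, cat_i + cat_s, overall_var)
    return (xi(cont_i, cat_i, overall_var)
            + xi(cont_s, cat_s, overall_var)
            - xi_merged)


# Toy data: one continuous and one categorical variable, two cases per cluster.
cluster_a = ([(0.0,), (1.0,)], [("a",), ("a",)])
cluster_b = ([(5.0,), (6.0,)], [("b",), ("b",)])
print(distance(cluster_a, cluster_b, overall_var=[1.0]))
```

In this toy example the continuous variable contributes $2 \log 6 \approx 3.58$ to the distance and the single categorical variable contributes $4 \log 2 \approx 2.77$, which gives a feel for how strongly a difference in one categorical variable can weigh against a clear separation on a continuous variable.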
Similar to agglomerative hierarchical clustering, those clusters with the smallest distance $d(i,s)$ are merged in each step. The log-likelihood function for the step with $k$ clusters is computed as

\[
l_k = \sum_{v=1}^{k} \xi_v. \tag{5}
\]
The function $l_k$ is not the exact log-likelihood function (see above). The function can be interpreted as dispersion within clusters. If only categorical variables are used, $l_k$ is the entropy within $k$ clusters.
Number of clusters. The number of clusters can be automatically determined. A two phase estimator is used. Akaike's Information Criterion

\[
\mathrm{AIC}_k = -2 l_k + 2 r_k, \tag{6}
\]

where $r_k$ is the number of independent parameters, or the Bayesian Information Criterion

\[
\mathrm{BIC}_k = -2 l_k + r_k \log n \tag{7}
\]
is computed in the first phase. $\mathrm{BIC}_k$ or $\mathrm{AIC}_k$ results in a good initial estimate of the maximum number of clusters (Chiu et al. 2001: 266). The maximum number of clusters is set equal to the number of clusters at which the ratio $\mathrm{BIC}_k / \mathrm{BIC}_1$ is smaller than $c_1$ (currently $c_1 = 0.04$)[1] for the first time (personal information of SPSS Technical Support). In table 1 this is the case for eleven clusters.
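The first phase can be sketched in a few lines of Python. The functions below compute $\mathrm{AIC}_k$ and $\mathrm{BIC}_k$ from $l_k$ as in equations (6) and (7), invert equation (7) to recover $l_k$, and return the first $k$ whose ratio falls below $c_1$. Only the threshold $c_1 = 0.04$ is taken from the text; the function names, the fallback when no ratio falls below the threshold, and the example ratios are assumptions made for illustration.

```python
import math


def aic(l_k, r_k):
    """Equation (6): AIC_k = -2 l_k + 2 r_k."""
    return -2.0 * l_k + 2.0 * r_k


def bic(l_k, r_k, n):
    """Equation (7): BIC_k = -2 l_k + r_k log n."""
    return -2.0 * l_k + r_k * math.log(n)


def loglik_from_bic(bic_k, r_k, n):
    """Invert equation (7) to recover l_k from a reported BIC value."""
    return (r_k * math.log(n) - bic_k) / 2.0


def first_k_below(ratios, c1=0.04):
    """Return the first k (1-based) whose ratio is smaller than c1.

    ratios[0] belongs to k = 1, ratios[1] to k = 2, and so on.
    Assumption: if no ratio is below c1, the largest k examined is returned.
    """
    for k, ratio in enumerate(ratios, start=1):
        if ratio < c1:
            return k
    return len(ratios)


# Hypothetical ratios for k = 1..5, not taken from table 1.
example_ratios = [1.0, 0.5, 0.2, 0.03, 0.01]
print(first_k_below(example_ratios))
```

Passing the ratios in as a plain sequence keeps the sketch independent of how they are obtained from the SPSS output.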
The second phase uses the ratio change $R(k)$ in distance for $k$ clusters, defined as

\[
R(k) = d_{k-1} / d_k, \tag{8}
\]

where $d_{k-1}$ is the distance if $k$ clusters are merged to $k-1$ clusters. The distance $d_k$ is defined similarly.[2] The number of clusters is obtained for the solution where a big jump of the ratio change occurs.[3]

The ratio change is computed as

\[
R(k_1) / R(k_2) \tag{11}
\]

for the two largest values of $R(k)$ ($k = 1, 2, \ldots, k_{max}$; $k_{max}$ obtained from the first phase). If the ratio change is larger than the threshold value $c_2$ (currently $c_2 = 1.15$[4]), the number of clusters is set equal to $k_1$; otherwise the number of clusters is set equal to the solution with $\max(k_1, k_2)$. In table 1, the two largest values of $R(k)$ are reported for three clusters ($R(3) = 2.129$; largest value) and for eight clusters ($R(8) = 1.952$). The ratio is $2.129 / 1.952 = 1.091$ and is smaller than the threshold value of 1.15. Hence the maximum of 3 and 8, that is, the solution with eight clusters, is selected as the best solution.

[1] The value is based on simulation studies of the authors of SPSS TwoStep Clustering (personal information of SPSS Technical Support, 2004-05-24).
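The decision rule of the second phase is easy to get wrong by hand, so a minimal Python sketch may help. It derives $R(k)$ from a set of merge distances as in equation (8) and then applies the threshold rule with $c_2 = 1.15$. The function names and the dictionary layout are hypothetical; the values $R(3) = 2.129$ and $R(8) = 1.952$ in the usage lines are the ones quoted from table 1.

```python
def ratio_changes(d):
    """Equation (8): R(k) = d_{k-1} / d_k for every k with both distances.

    d maps a cluster count k to the distance d_k.
    """
    return {k: d[k - 1] / d[k] for k in d if (k - 1) in d}


def choose_number_of_clusters(ratio_by_k, c2=1.15):
    """Apply the second-phase rule to a mapping k -> R(k).

    k1 and k2 are the cluster counts with the largest and second largest
    R(k). If R(k1) / R(k2) exceeds c2, k1 is chosen; otherwise max(k1, k2).
    """
    ranked = sorted(ratio_by_k, key=ratio_by_k.get, reverse=True)
    k1, k2 = ranked[0], ranked[1]
    if ratio_by_k[k1] / ratio_by_k[k2] > c2:
        return k1
    return max(k1, k2)


# The two largest ratio changes reported in the text for table 1.
print(choose_number_of_clusters({3: 2.129, 8: 1.952}))
```

With the two values from the text, the ratio 1.091 stays below 1.15, so the sketch returns eight clusters, in line with the decision described above.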
Cluster membership assignment. Each object is assigned deterministically to the closest cluster according to the distance measure used to find the clusters. The deterministic assignment may result in biased estimates of the cluster profiles if the clusters overlap (Bacher 1996: 311–314, Bacher 2000).
Modification. The procedure allows the user to define an outlier treatment. The user must specify a value for the fraction of noise, e.g. 5 (= 5%). A leaf (pre-cluster) is considered a potential outlier cluster if its number of cases is less than the defined fraction of the maximum cluster size. Outliers are ignored in the second step.
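The outlier rule can be illustrated with a short Python sketch. It assumes that the "maximum cluster size" refers to the size of the largest leaf (pre-cluster) and that the noise value is entered as a percentage; the function name and the leaf sizes are hypothetical.

```python
def potential_outlier_leaves(leaf_sizes, noise_percent=5.0):
    """Return the indices of leaves flagged as potential outlier clusters.

    A leaf is flagged if its number of cases is below noise_percent percent
    of the largest leaf size (assumed reading of the rule in the text).
    """
    threshold = noise_percent / 100.0 * max(leaf_sizes)
    return [i for i, size in enumerate(leaf_sizes) if size < threshold]


# Hypothetical leaf sizes: the largest leaf has 200 cases, so with 5% noise
# every leaf with fewer than 10 cases is flagged.
print(potential_outlier_leaves([200, 150, 8, 12, 3]))
```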
Output. Compared to the k-means algorithm (QUICK CLUSTER) or agglomerative hierarchical techniques (CLUSTER), SPSS has improved the output significantly. An additional module allows the user to statistically test the influence of variables on the classification and to compute confidence levels.
3 EVALUATION

3.1 Commensurability
Clustering techniques (k-means clustering, hierarchical techniques etc.) require commensurable variables (for instance, Fox 1982). This implies interval or ratio scaled variables with equal scale units. In the case of different scale units, the variables are usually standardized by the range (normalized to the range [0,1], range weighted) or z-transformed to have zero mean and unit standard deviation (autoscaling, standard scoring, standard deviation weights). If the variables have different measurement levels, either a general distance measure (like Gower's general similarity measure; Gower 1971) may be used or the nominal (and ordinal) variables may be transformed to dummies and treated as quantitative[5] (Bender et al. 2001; Wishart 2003).
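The usual remedies named above can be sketched in Python as follows. The three helper functions are illustrative only: the z-transformation assumes the sample standard deviation with divisor $n - 1$, and the dummy coding creates one indicator per category in sorted order. None of the names refer to an SPSS procedure.

```python
import math


def range_standardize(values):
    """Normalize a variable to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def z_transform(values):
    """Transform a variable to zero mean and unit standard deviation.

    Assumption: the sample standard deviation (divisor n - 1) is used.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]


def dummy_code(values):
    """Turn a nominal variable into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


print(range_standardize([2, 4, 6]))
print(z_transform([1, 2, 3]))
print(dummy_code(["a", "b", "a"]))
```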
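[2] The distances $d_k$ can be computed from the output in the following way:

\[
d_k = l_{k-1} - l_k \tag{9}
\]

\[
l_v = (r_v \log n - \mathrm{BIC}_v) / 2 \quad \text{or} \quad l_v = (2 r_v - \mathrm{AIC}_v) / 2 \quad \text{for } v = k, k-1 \tag{10}
\]

However, using BIC or AIC results in different solutions.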
[3] The exact decision rules are described only vaguely in the relevant literature and the software documentation (SPSS 2001; Chiu et al. 2001). Therefore, we report the exact decision rule based on personal information of SPSS Technical Support. A documentation in the output, like "solution x was selected because ...", would be helpful for the user.
[4] Like $c_1$, $c_2$ is based on simulation studies of the authors of SPSS TwoStep Clustering (personal information of SPSS Technical Support, 2004-05-24).
[5] The term "quantitative" will be used for interval or ratio scaled variables.