Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng
Department of Computer Science
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
rng@cs.ubc.ca
Jiawei Han
School of Computing Science
Simon Fraser University
Burnaby, B.C., V5A 1S6, Canada
han@cs.sfu.ca
Abstract
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient.
1 Introduction
Data mining in general is the search for hidden patterns that may exist in large databases. Spatial data mining in particular is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. Because of the huge amounts (usually, terabytes) of spatial data that may be obtained from satellite images, medical equipment, video cameras, etc., it is costly and often unrealistic for users to examine spatial data in detail. Spatial data mining aims to automate such a knowledge discovery process. Thus, it plays an important role in a) extracting interesting spatial patterns and features; b) capturing intrinsic relationships between spatial and non-spatial data; c) presenting data regularity concisely and at higher conceptual levels; and d) helping to reorganize spatial databases to accommodate data semantics, as well as to achieve better performance.
Many excellent studies on data mining have been conducted, such as those reported in [1, 2, 4, 7, 11, 13, 16]. [1] considers the problem of inferring classification functions from samples; [2] studies the problem of mining association rules between sets of data items; [7] proposes an attribute-oriented approach to knowledge discovery; [11] develops a visual feedback querying system to support data mining; and [16] includes many interesting studies on various issues in knowledge discovery, such as finding functional dependencies between attributes. However, most of these studies are concerned with knowledge discovery on non-spatial data, and the study most relevant to our focus here is [13], which studies spatial data mining. More specifically, [13] proposes a spatial data-dominant knowledge extraction algorithm and a non-spatial data-dominant one, both of which aim to extract high-level relationships between spatial and non-spatial data. However, both algorithms suffer from the following problems. First, the user or an expert must provide the algorithms with spatial concept hierarchies, which may not be available in many applications. Second, both algorithms conduct their spatial exploration primarily by merging regions at a certain level of the hierarchy to a larger region at a higher level.
Thus, the quality of the results produced by both algorithms relies quite crucially on the appropriateness of the hierarchy to the given data. The problem for most applications is that it is very difficult to know a priori which hierarchy will be the most appropriate. Discovering this hierarchy may itself be one of the reasons to apply spatial data mining.
To deal with these problems, we explore whether cluster analysis techniques are applicable. Cluster analysis is a branch of statistics that in the past three decades has been intensely studied and successfully applied to many applications. To the spatial data mining task at hand, the attractiveness of cluster analysis is its ability to find structures or clusters directly from the given data, without relying on any hierarchies. However, cluster analysis has been applied rather unsuccessfully in the past to general data mining and machine learning. The complaints are that cluster analysis algorithms are ineffective and inefficient. Indeed, for cluster analysis algorithms to work effectively, there needs to be a natural notion of similarity among the "objects" to be clustered. And traditional cluster analysis algorithms are not designed for large data sets, say more than 2000 objects.
For spatial data mining, our approach here is to apply cluster analysis only to the spatial attributes, for which natural notions of similarity exist (e.g. Euclidean or Manhattan distances). As will be shown in this paper, in this way cluster analysis techniques are effective for spatial data mining. As for the efficiency concern, we develop our own cluster analysis algorithm, called CLARANS, which is designed for large data sets. More specifically, we will report in this paper:
• the development of CLARANS, which is based on randomized search and is partly motivated by two existing algorithms well-known in cluster analysis, called PAM and CLARA; and
• the development of two spatial mining algorithms, SD(CLARANS) and NSD(CLARANS).
Given the nature of spatial data mining, and the fact that CLARANS is based on randomized search, the methodology we have adopted here is one based on experimentation. In particular, we will present:
• experimental results showing that CLARANS is more efficient than the existing algorithms PAM and CLARA; and
• experimental evidence and analysis demonstrating the effectiveness of SD(CLARANS) and NSD(CLARANS) for spatial data mining.
The paper is organized as follows. Section 2 introduces PAM and CLARA. Section 3 presents our clustering algorithm CLARANS, as well as experimental results comparing the performance of CLARANS, PAM and CLARA. Section 4 studies spatial data mining and presents two spatial data mining algorithms, SD(CLARANS) and NSD(CLARANS). Section 5 gives an experimental evaluation of the effectiveness of SD(CLARANS) and NSD(CLARANS) for spatial data mining. Section 6 discusses how SD(CLARANS) and NSD(CLARANS) can assist in further spatial discoveries, and how they can contribute towards the building of a general-purpose and powerful spatial data mining package in the future.
2 Clustering Algorithms based on Partitioning
In the past 30 years, cluster analysis has been widely applied to many areas such as medicine (classification of diseases), chemistry (grouping of compounds), social studies (classification of statistical findings), and so on. Its main goal is to identify structures or clusters present in the data. While there is no general definition of a cluster, algorithms have been developed to find several kinds of clusters: spherical, linear, drawn-out, etc. See [10, 18] for more detailed discussions and analyses of these issues. Among all the existing clustering algorithms, we have chosen the k-medoid methods as the basis of our algorithm for the following reasons. First, unlike many other partitioning methods, the k-medoid methods are very robust to the existence of outliers (i.e. data points that are very far away from the rest of the data points). Second, clusters found by k-medoid methods do not depend on the order in which the objects are examined. Furthermore, they are invariant with respect to translations and orthogonal transformations of data points. Last but not least, experiments have shown that the k-medoid methods described below can handle very large data sets quite efficiently. See [10] for a more detailed comparison of k-medoid methods with other partitioning methods. In this section, we present the two best-known k-medoid methods on which our algorithm is based.
2.1 PAM
PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw [10]. To find k clusters, PAM's approach is to determine a representative object for each cluster. This representative object, called a medoid, is meant to be the most centrally located object within the cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is the most similar. More precisely, if Oj is a non-selected object, and Oi is a (selected) medoid, we say that Oj belongs to the cluster represented by Oi if d(Oj, Oi) = min_Oe d(Oj, Oe), where the notation min_Oe denotes the minimum over
all medoids Oe, and the notation d(Oa, Ob) denotes the dissimilarity or distance between objects Oa and Ob. All the dissimilarity values are given as inputs to PAM. Finally, the quality of a clustering (i.e. the combined quality of the chosen medoids) is measured by the average dissimilarity between an object and the medoid of its cluster.
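To make this quality measure concrete, the following minimal Python sketch (an illustration only; the Euclidean dissimilarity and the function names are assumptions, since PAM itself simply takes the dissimilarity values as input) computes the average dissimilarity of a clustering for a given set of medoids.

import math

def euclidean(a, b):
    # Dissimilarity between two objects represented as coordinate tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_medoid(obj, medoids, dist=euclidean):
    # The medoid most similar (least dissimilar) to obj.
    return min(medoids, key=lambda m: dist(obj, m))

def clustering_cost(objects, medoids, dist=euclidean):
    # Average dissimilarity between each object and the medoid of its cluster.
    return sum(dist(o, closest_medoid(o, medoids, dist)) for o in objects) / len(objects)

if __name__ == "__main__":
    objs = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(clustering_cost(objs, [(0, 0), (10, 10)]))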
To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. In particular, to calculate the effect of such a swap between Oi and Oh, PAM computes costs Cjih for all non-selected objects Oj. Depending on which of the following cases Oj is in, Cjih is defined by one of the equations below.
First Case: suppose Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj be more similar to Oj,2 than to Oh, i.e. d(Oj, Oh) ≥ d(Oj, Oj,2), where Oj,2 is the second most similar medoid to Oj. Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the cluster represented by Oj,2. Hence, the cost of the swap as far as Oj is concerned is:
Cjih = d(Oj, Oj,2) - d(Oj, Oi).    (1)
This equation always gives a non-negative Cjih, indicating that there is a non-negative cost incurred in replacing Oi with Oh.
Second Case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less similar to Oj,2 than to Oh, i.e. d(Oj, Oh) < d(Oj, Oj,2). Then, if Oi is replaced by Oh, Oj would belong to the cluster represented by Oh. Thus, the cost for Oj is given by:
Cjih = d(Oj, Oh) - d(Oj, Oi).    (2)
Unlike in Equation (1), Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.
Third Case: suppose that Oj currently belongs to a cluster other than the one represented by Oi. Let Oj,2 be the representative object of that cluster. Furthermore, let Oj be more similar to Oj,2 than to Oh. Then even if Oi is replaced by Oh, Oj would stay in the cluster represented by Oj,2. Thus, the cost is:
Cjih = 0.    (3)
Fourth Case: Oj currently belongs to the cluster represented by Oj,2. But Oj is less similar to Oj,2 than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from that of Oj,2. Thus, the cost is:
Cjih = d(Oj, Oh) - d(Oj, Oj,2),    (4)
and is always negative. Combining the four cases above, the total cost of replacing Oi with Oh is given by:
TCih = Σj Cjih.    (5)
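The four cases can be evaluated directly: for each non-selected object Oj, Cjih is the change in the distance from Oj to its closest medoid when Oi is swapped out for Oh. The sketch below is an illustration under the same assumptions as the earlier sketch (objects as coordinate tuples, Euclidean dissimilarity, hypothetical helper names); it sums these contributions to obtain TCih of Equation (5).

import math

def dissimilarity(a, b):
    # Illustrative Euclidean dissimilarity; PAM only needs the d(.,.) values.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def swap_cost(objects, medoids, o_i, o_h, d=dissimilarity):
    # TCih of Equation (5): the total change in cost if the selected medoid
    # o_i is replaced by the non-selected object o_h.  Each non-selected
    # object contributes Cjih = (new distance to its closest medoid) minus
    # (old distance to its closest medoid), which covers all four cases.
    new_medoids = [m for m in medoids if m != o_i] + [o_h]
    total = 0.0
    for o_j in objects:
        if o_j in medoids or o_j == o_h:
            continue
        total += min(d(o_j, m) for m in new_medoids) - min(d(o_j, m) for m in medoids)
    return total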
We now present Algorithm PAM.
Algorithm PAM
1. Select k representative objects arbitrarily.
2. Compute TCih for all pairs of objects Oi, Oh where Oi is currently selected, and Oh is not.
3. Select the pair Oi, Oh which corresponds to the minimum TCih. If the minimum TCih is negative, replace Oi with Oh, and go back to Step (2).
4. Otherwise, for each non-selected object, find the most similar representative object. Halt.
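A sketch of the whole PAM loop follows. It is an illustration rather than a definitive implementation: the swap_cost helper from the previous sketch is passed in as a parameter, and the arbitrary initial selection of Step (1) is realized here by random sampling.

import random

def pam(objects, k, swap_cost):
    # Step (1): select k representative objects arbitrarily (here: at random).
    medoids = random.sample(objects, k)
    while True:
        # Steps (2)-(3): evaluate TCih for every (selected, non-selected) pair
        # and pick the swap with the smallest (most negative) total cost.
        best_tc, o_i, o_h = min(
            ((swap_cost(objects, medoids, o_i, o_h), o_i, o_h)
             for o_i in medoids
             for o_h in objects if o_h not in medoids),
            key=lambda t: t[0],
        )
        if best_tc >= 0:
            # Step (4): no improving swap exists; the medoids are final.
            return medoids
        medoids = [m for m in medoids if m != o_i] + [o_h]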
Experimental results show that PAM works satisfactorily for small data sets (e.g. 100 objects in 5 clusters [10]). But it is not efficient in dealing with medium and large data sets. This is not too surprising if we perform a complexity analysis on PAM. In Steps (2) and (3), there are altogether k(n - k) pairs of Oi, Oh. For each pair, computing TCih requires the examination of (n - k) non-selected objects. Thus, Steps (2) and (3) combined are of O(k(n - k)^2). And this is the complexity of only one iteration. Thus, it is obvious that PAM becomes too costly for large values of n and k. This analysis motivates the development of CLARA.
2.2 CLARA
Designed by Kaufman and Rousseeuw to handle large data sets, CLARA (Clustering LARge Applications) relies on sampling [10]. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. The point is that if the sample is drawn in a sufficiently random way, the medoids of the sample would approximate the medoids of the entire data set. To come up with better approximations, CLARA draws multiple samples and gives the best clustering as the output. Here, for accuracy, the quality of a clustering is measured based on the average dissimilarity of all objects in the entire data set, and not only of those objects in the samples. Experiments reported in [10] indicate that 5 samples of size 40 + 2k give satisfactory results.
Algorithm CLARA
1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2k objects randomly from the entire data set (see the note below), and call Algorithm PAM to find k medoids of the sample.
3. For each object Oj in the entire data set, determine which of the k medoids is the most similar to Oj.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step (2) as the best set of medoids obtained so far.
5. Return to Step (1) to start the next iteration.
Note: [10] reports a useful heuristic for drawing samples. Apart from the first sample, subsequent samples include the best set of medoids found so far. In other words, apart from the first iteration, subsequent iterations draw 40 + k objects to add on to the best k medoids.
Complementary to PAM, CLARA performs satisfactorily for large data sets (e.g. 1000 objects in 10 clusters). Recall from Section 2.1 that each iteration of PAM is of O(k(n - k)^2). But for CLARA, by applying PAM just to the samples, each iteration is of O(k(40 + k)^2 + k(n - k)). This explains why CLARA is more efficient than PAM for large values of n.
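The sketch below illustrates the five steps of Algorithm CLARA. It assumes a pam(sample, k) routine and a clustering_cost(objects, medoids) function such as those sketched in Section 2, both hypothetical names, and it incorporates the sampling heuristic described in the note above.

import random

def clara(objects, k, pam, clustering_cost, num_samples=5):
    # Draw several samples, run PAM on each, and keep the medoid set whose
    # average dissimilarity over the ENTIRE data set is the lowest.
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        if best_medoids is None:
            sample = random.sample(objects, min(len(objects), 40 + 2 * k))
        else:
            # Heuristic from [10]: later samples keep the best medoids found so far.
            extra = random.sample(objects, min(len(objects), 40 + k))
            sample = list(set(best_medoids) | set(extra))
        medoids = pam(sample, k)
        cost = clustering_cost(objects, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost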
3 A Clustering Algorithm based on Randomized Search
In this section, we will present our clustering algorithm CLARANS (Clustering Large Applications based on RANdomized Search). We will first introduce CLARANS by giving a graph abstraction of it. Then, after describing the details of the algorithm, we will present experimental results showing that CLARANS outperforms CLARA and PAM in terms of both efficiency and effectiveness. In the next section, we will show how CLARANS can be used to provide effective spatial data mining.
3.1 Motivation of CLARANS: a Graph Abstraction
Given n objects, the process described above of finding k medoids can be viewed abstractly as searching through a certain graph. In this graph, denoted by Gn,k, a node is represented by a set of k objects {Om1, ..., Omk}, intuitively indicating that Om1, ..., Omk are the selected medoids. The set of nodes in the graph is the set { {Om1, ..., Omk} | Om1, ..., Omk are objects in the data set }.
Two nodes are neighbors (i.e. connected by an arc) if their sets differ by only one object. More formally, two nodes S1 = {Ow1, ..., Owk} and S2 = {Ov1, ..., Ovk} are neighbors if and only if the cardinality of the intersection of S1 and S2 is k - 1, i.e. |S1 ∩ S2| = k - 1. It is easy to see that each node has k(n - k) neighbors. Since a node represents a collection of k medoids, each node corresponds to a clustering. Thus, each node can be assigned a cost that is defined to be the total dissimilarity between every object and the medoid of its cluster. It is not difficult to see that if objects Oi and Oh are the difference between neighbors S1 and S2 (i.e. Oi, Oh ∉ S1 ∩ S2, but Oi ∈ S1 and Oh ∈ S2), the cost differential between the two neighbors is exactly given by TCih defined in Equation (5).
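As an illustration of this graph view, the following sketch (with hypothetical names) represents a node as a set of k medoids and draws one random neighbor by swapping a single medoid for a non-selected object; this is exactly the kind of neighbor sampling that CLARANS, described next, performs at every step.

import random

def random_neighbor(node, objects):
    # A node of G(n,k) is a set of k medoids; a neighbor differs from it in
    # exactly one object.  Swap one current medoid for one non-selected
    # object; the cost difference between the two nodes is TCih of Eq. (5).
    node = set(node)
    o_i = random.choice(sorted(node))
    o_h = random.choice([o for o in objects if o not in node])
    return (node - {o_i}) | {o_h}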
By now, it is obvious that PAM can be viewed as a search for a minimum on the graph Gn,k. At each step, all the neighbors of the current node are examined. The current node is then replaced by the neighbor with the deepest descent in costs. And the search continues until a minimum is obtained. For large values of n and k (like n = 1000 and k = 10), examining all k(n - k) neighbors of a node is time consuming. This accounts for the inefficiency of PAM for large data sets.
On the other hand, CLARA tries to examine fewer neighbors and restricts the search to subgraphs that are much smaller in size than the original graph Gn,k. However, the problem is that the subgraphs examined are defined entirely by the objects in the samples. Let Sa be the set of objects in a sample. The subgraph GSa,k consists of all the nodes that are subsets (of cardinality k) of Sa. Even though CLARA thoroughly examines GSa,k via PAM, the trouble is that the search is fully confined within GSa,k. If M is the minimum node in the original graph Gn,k, and if M is not included in GSa,k, M will never be found in the search of GSa,k, regardless of how thorough the search is. To atone for this deficiency, many, many samples would need to be collected and processed.
Like CLARA, our algorithm CLARANS does not check every neighbor of a node. But unlike CLARA, it does not restrict its search to a particular subgraph. In fact, it searches the original graph Gn,k. One key difference between CLARANS and PAM is that the former only checks a sample of the neighbors of a node. But unlike CLARA, each sample is drawn dynamically in the sense that no nodes corresponding to particular objects are eliminated outright. In other words, while CLARA draws a sample of nodes at the beginning of a search, CLARANS draws a sample of neighbors in each step of a search. This has the benefit of not confining a search to a localized area. As will be shown later, a search by CLARANS gives higher quality clusterings than CLARA, and CLARANS requires a very small number of searches. We now present the details of
Algorithm CLARANS.
3.2 CLARANS
Algorithm CLARANS
1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
2. Set current to an arbitrary node in Gn,k.
3. Set j to 1.
4. Consider a random neighbor S of current, and based on Equation (5) calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S, and go to Step (3).
6. Otherwise, increment j by 1. If j ≤ maxneighbor, go to Step (4).
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost. If the former is less than mincost, set mincost to the cost of current, and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step (2).
Steps (3) to (6) above search for nodes with progressively lower costs. But if the current node has already been compared with the maximum number of the neighbors of the node (specified by maxneighbor) and is still of the lowest cost, the current node is declared to be a "local" minimum. Then in Step (7), the cost of this local minimum is compared with the lowest cost obtained so far. The lower of the two costs is stored in mincost. Algorithm CLARANS then repeats the search for other local minima, until numlocal of them have been found.
As shown above, CLARANS has two parameters: the maximum number of neighbors examined (maxneighbor), and the number of local minima obtained (numlocal). The higher the value of maxneighbor, the closer CLARANS is to PAM, and the longer each search of a local minimum takes. But the quality of such a local minimum is higher, and fewer local minima need to be obtained. Like many applications of randomized search [8, 9], we rely on experiments to determine the appropriate values of these parameters. All the performance results of CLARANS quoted in the remainder of this paper are based on the version of CLARANS that sets numlocal to 2 and maxneighbor to the larger value between 1.25% of k(n - k) and 250. See [15] for more information on how and why these specific values are chosen.
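A compact sketch of Steps (1) to (8) is given below. It is an illustration rather than the actual implementation: it assumes a clustering_cost(objects, medoids) function like the one sketched in Section 2, represents objects as coordinate tuples, and, for brevity, recomputes the full clustering cost of each neighbor instead of the cost differential of Equation (5). The default maxneighbor follows the parameter setting described above.

import random

def clarans(objects, k, clustering_cost, numlocal=2, maxneighbor=None):
    n = len(objects)
    if maxneighbor is None:
        # Parameter setting used for the experiments reported in the paper.
        maxneighbor = max(250, int(0.0125 * k * (n - k)))
    best_node, mincost = None, float("inf")              # Step (1)
    for _ in range(numlocal):
        current = set(random.sample(objects, k))          # Step (2)
        current_cost = clustering_cost(objects, current)
        j = 1                                             # Step (3)
        while j <= maxneighbor:
            # Step (4): pick one random neighbor and compare costs.
            o_i = random.choice(sorted(current))
            o_h = random.choice([o for o in objects if o not in current])
            neighbor = (current - {o_i}) | {o_h}
            neighbor_cost = clustering_cost(objects, neighbor)
            if neighbor_cost < current_cost:              # Step (5)
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:                                         # Step (6)
                j += 1
        if current_cost < mincost:                        # Step (7)
            mincost, best_node = current_cost, current
    return best_node, mincost                             # Step (8)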
3.3 Experimental Results: CLARANS vs PAM
[Figure 1: Efficiency: CLARANS vs PAM (runtime versus number of objects)]
In the following we present experimental results comparing CLARANS with PAM. As discussed before, for large and medium data sets, it is obvious that CLARANS, while producing clusterings of very comparable quality, is much more efficient than PAM. Thus, our focus here was to compare the two algorithms on small data sets. We applied both algorithms to data sets with 40, 60, 80 and 100 points in 5 clusters. Figure 1 shows the runtime taken by both algorithms. Note that for all those data sets, the clusterings produced by both algorithms are of the same quality (i.e. same average distance). Thus, the difference between the two algorithms is determined by their efficiency. It is evident from Figure 1 that even for small data sets, CLARANS outperforms PAM significantly. As expected, the performance gap between the two algorithms grows as the data set increases in size.
3.4 Experimental Results: CLARANS vs CLARA
In this series of experiments, we compared CLARANS with CLARA. As discussed in Section 2.2, CLARA is not designed for small data sets. Thus, we ran this set of experiments on data sets whose number of objects exceeds 100. And the objects were organized in different numbers of clusters, as well as in different types of clusters [15].
When we conducted this series of experiments run-