Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng
Department of Computer Science
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
rng@cs.ubc.ca
Jiawei Han
School of Computing Science
Simon Fraser University
Burnaby, B.C., V5A 1S6, Canada
han@cs.sfu.ca
Abstract
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient.
1 Introduction
Data mining in general is the search for hidden patterns that may exist in large databases. Spatial data mining in particular is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. Because of the huge amounts (usually, terabytes) of spatial data that may be obtained from satellite images, medical equipment, video cameras, etc., it is costly and often unrealistic for users to examine spatial data in detail. Spatial data mining aims to automate such a knowledge discovery process. Thus, it plays an important role in a) extracting interesting spatial patterns and features; b) capturing intrinsic relationships between spatial and non-spatial data; c) presenting data regularity concisely and at higher conceptual levels; and d) helping to reorganize spatial databases to accommodate data semantics, as well as to achieve better performance.
Many excellent studies on data mining have been conducted, such as those reported in [1, 2, 4, 7, 11, 13, 16]. [1] considers the problem of inferring classification functions from samples; [2] studies the problem of mining association rules between sets of data items; [7] proposes an attribute-oriented approach to knowledge discovery; [11] develops a visual feedback querying system to support data mining; and [16] includes many interesting studies on various issues in knowledge discovery, such as finding functional dependencies between attributes. However, most of these studies are concerned with knowledge discovery on non-spatial data, and the study most relevant to our focus here is [13], which studies spatial data mining. More specifically, [13] proposes a spatial data-dominant knowledge extraction algorithm and a non-spatial data-dominant one, both of which aim to extract high-level relationships between spatial and non-spatial data. However, both algorithms suffer from the following problems. First, the user or an expert must provide the algorithms with spatial concept hierarchies, which may not be available in many applications. Second, both algorithms conduct their spatial exploration primarily by merging regions at a certain level of the hierarchy to a larger region at a higher level.
Thus, the quality of the results produced by both algorithms relies quite crucially on the appropriateness of the hierarchy to the given data. The problem for most applications is that it is very difficult to know a priori which hierarchy will be the most appropriate. Discovering this hierarchy may itself be one of the reasons to apply spatial data mining.
To deal with these problems, we explore whether cluster analysis techniques are applicable. Cluster analysis is a branch of statistics that in the past three decades has been intensely studied and successfully applied to many applications. To the spatial data mining task at hand, the attractiveness of cluster analysis is its ability to find structures or clusters directly from the given data, without relying on any hierarchies. However, cluster analysis has been applied rather unsuccessfully in the past to general data mining and machine learning. The complaints are that cluster analysis algorithms are ineffective and inefficient. Indeed, for cluster analysis algorithms to work effectively, there needs to be a natural notion of similarity among the "objects" to be clustered. And traditional cluster analysis algorithms are not designed for large data sets, say more than 2000 objects.
For spatial data mining, our approach here is to apply cluster analysis only to the spatial attributes, for which natural notions of similarity exist (e.g. Euclidean or Manhattan distances). As will be shown in this paper, in this way cluster analysis techniques are effective for spatial data mining. As for the efficiency concern, we develop our own cluster analysis algorithm, called CLARANS, which is designed for large data sets. More specifically, we will report in this paper:
• the development of CLARANS, which is based on randomized search and is partly motivated by two existing algorithms well-known in cluster analysis, called PAM and CLARA; and
• the development of two spatial mining algorithms, SD(CLARANS) and NSD(CLARANS).
Given the nature of spatial data mining, and the fact that CLARANS is based on randomized search, the methodology we have adopted here is one based on experimentation. In particular, we will present:
• experimental results showing that CLARANS is more efficient than the existing algorithms PAM and CLARA; and
• experimental evidence and analysis demonstrating the effectiveness of SD(CLARANS) and NSD(CLARANS) for spatial data mining.
The paper is organized as follows. Section 2 introduces PAM and CLARA. Section 3 presents our clustering algorithm CLARANS, as well as experimental results comparing the performance of CLARANS, PAM and CLARA. Section 4 studies spatial data mining and presents two spatial data mining algorithms, SD(CLARANS) and NSD(CLARANS). Section 5 gives an experimental evaluation of the effectiveness of SD(CLARANS) and NSD(CLARANS) for spatial data mining. Section 6 discusses how SD(CLARANS) and NSD(CLARANS) can assist in further spatial discoveries, and how they can contribute towards the building of a general-purpose and powerful spatial data mining package in the future.
2 Clustering Algorithms based on Partitioning
In the past 30 years, cluster analysis has been widely applied to many areas such as medicine (classification of diseases), chemistry (grouping of compounds), social studies (classification of statistical findings), and so on. Its main goal is to identify structures or clusters present in the data. While there is no general definition of a cluster, algorithms have been developed to find several kinds of clusters: spherical, linear, drawn-out, etc. See [10, 18] for more detailed discussions and analyses of these issues. Among all the existing clustering algorithms, we have chosen the k-medoid methods as the basis of our algorithm for the following reasons. First, unlike many other partitioning methods, the k-medoid methods are very robust to the existence of outliers (i.e. data points that are very far away from the rest of the data points). Second, clusters found by k-medoid methods do not depend on the order in which the objects are examined. Furthermore, they are invariant with respect to translations and orthogonal transformations of data points. Last but not least, experiments have shown that the k-medoid methods described below can handle very large data sets quite efficiently. See [10] for a more detailed comparison of k-medoid methods with other partitioning methods. In this section, we present the two best-known k-medoid methods on which our algorithm is based.
2.1 PAM
PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw [10]. To find k clusters, PAM's approach is to determine a representative object for each cluster. This representative object, called a medoid, is meant to be the most centrally located object within the cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is the most similar. More precisely, if Oj is a non-selected object, and Oi is a (selected) medoid, we say that Oj belongs to the cluster represented by Oi if d(Oj, Oi) = min_Oe d(Oj, Oe), where the notation min_Oe denotes the minimum over
all medoids Oe, and the notation d(Oa, Ob) denotes the dissimilarity or distance between objects Oa and Ob. All the dissimilarity values are given as inputs to PAM. Finally, the quality of a clustering (i.e. the combined quality of the chosen medoids) is measured by the average dissimilarity between an object and the medoid of its cluster.
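To make this quality measure concrete, the following minimal Python sketch (an illustration only; the Euclidean dissimilarity and the function names are assumptions, since PAM itself simply takes the dissimilarity values as input) computes the average dissimilarity of a clustering for a given set of medoids.

import math

def euclidean(a, b):
    # Dissimilarity between two objects represented as coordinate tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_medoid(obj, medoids, dist=euclidean):
    # The medoid most similar (least dissimilar) to obj.
    return min(medoids, key=lambda m: dist(obj, m))

def clustering_cost(objects, medoids, dist=euclidean):
    # Average dissimilarity between each object and the medoid of its cluster.
    return sum(dist(o, closest_medoid(o, medoids, dist)) for o in objects) / len(objects)

if __name__ == "__main__":
    objs = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(clustering_cost(objs, [(0, 0), (10, 10)]))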
To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. In particular, to calculate the effect of such a swap between Oi and Oh, PAM computes costs Cjih for all non-selected objects Oj. Depending on which of the following cases Oj is in, Cjih is defined by one of the equations below.
First Case: suppose Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj be more similar to Oj,2 than to Oh, i.e. d(Oj, Oh) ≥ d(Oj, Oj,2), where Oj,2 is the second most similar medoid to Oj. Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the cluster represented by Oj,2. Hence, the cost of the swap as far as Oj is concerned is:
Cjih = d(Oj, Oj,2) - d(Oj, Oi).    (1)
This equation always gives a non-negative Cjih, indicating that there is a non-negative cost incurred in replacing Oi with Oh.
Second Case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less similar to Oj,2 than to Oh, i.e. d(Oj, Oh) < d(Oj, Oj,2). Then, if Oi is replaced by Oh, Oj would belong to the cluster represented by Oh. Thus, the cost for Oj is given by:
Cjih = d(Oj, Oh) - d(Oj, Oi).    (2)
Unlike in Equation (1), Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.
Third Case: suppose that Oj currently belongs to a cluster other than the one represented by Oi. Let Oj,2 be the representative object of that cluster. Furthermore, let Oj be more similar to Oj,2 than to Oh. Then even if Oi is replaced by Oh, Oj would stay in the cluster represented by Oj,2. Thus, the cost is:
Cjih = 0.    (3)
Fourth Case: Oj currently belongs to the cluster represented by Oj,2. But Oj is less similar to Oj,2 than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from that of Oj,2. Thus, the cost is:
Cjih = d(Oj, Oh) - d(Oj, Oj,2),    (4)
and is always negative. Combining the four cases above, the total cost of replacing Oi with Oh is given by:
TCih = Σj Cjih.    (5)
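The four cases can be evaluated directly: for each non-selected object Oj, Cjih is the change in the distance from Oj to its closest medoid when Oi is swapped out for Oh. The sketch below is an illustration under the same assumptions as the earlier sketch (objects as coordinate tuples, Euclidean dissimilarity, hypothetical helper names); it sums these contributions to obtain TCih of Equation (5).

import math

def dissimilarity(a, b):
    # Illustrative Euclidean dissimilarity; PAM only needs the d(.,.) values.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def swap_cost(objects, medoids, o_i, o_h, d=dissimilarity):
    # TCih of Equation (5): the total change in cost if the selected medoid
    # o_i is replaced by the non-selected object o_h.  Each non-selected
    # object contributes Cjih = (new distance to its closest medoid) minus
    # (old distance to its closest medoid), which covers all four cases.
    new_medoids = [m for m in medoids if m != o_i] + [o_h]
    total = 0.0
    for o_j in objects:
        if o_j in medoids or o_j == o_h:
            continue
        total += min(d(o_j, m) for m in new_medoids) - min(d(o_j, m) for m in medoids)
    return total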
We now present Algorithm PAM.
Algorithm PAM
1. Select k representative objects arbitrarily.
2. Compute TCih for all pairs of objects Oi, Oh where Oi is currently selected, and Oh is not.
3. Select the pair Oi, Oh which corresponds to the minimum TCih. If the minimum TCih is negative, replace Oi with Oh, and go back to Step (2).
4. Otherwise, for each non-selected object, find the most similar representative object. Halt.
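A sketch of the whole PAM loop follows. It is an illustration rather than a definitive implementation: the swap_cost helper from the previous sketch is passed in as a parameter, and the arbitrary initial selection of Step (1) is realized here by random sampling.

import random

def pam(objects, k, swap_cost):
    # Step (1): select k representative objects arbitrarily (here: at random).
    medoids = random.sample(objects, k)
    while True:
        # Steps (2)-(3): evaluate TCih for every (selected, non-selected) pair
        # and pick the swap with the smallest (most negative) total cost.
        best_tc, o_i, o_h = min(
            ((swap_cost(objects, medoids, o_i, o_h), o_i, o_h)
             for o_i in medoids
             for o_h in objects if o_h not in medoids),
            key=lambda t: t[0],
        )
        if best_tc >= 0:
            # Step (4): no improving swap exists; the medoids are final.
            return medoids
        medoids = [m for m in medoids if m != o_i] + [o_h]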
Experimental results show that PAM works satisfactorily for small data sets (e.g. 100 objects in 5 clusters [10]). But it is not efficient in dealing with medium and large data sets. This is not too surprising if we perform a complexity analysis on PAM. In Steps (2) and (3), there are altogether k(n - k) pairs of Oi, Oh. For each pair, computing TCih requires the examination of (n - k) non-selected objects. Thus, Steps (2) and (3) combined are of O(k(n - k)^2). And this is the complexity of only one iteration. Thus, it is obvious that PAM becomes too costly for large values of n and k. This analysis motivates the development of CLARA.
2.2 CLARA
Designed by Kaufman and Rousseeuw to handle large data sets, CLARA (Clustering LARge Applications) relies on sampling [10]. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. The point is that if the sample is drawn in a sufficiently random way, the medoids of the sample would approximate the medoids of the entire data set. To come up with better approximations, CLARA draws multiple samples and gives the best clustering as the output. Here, for accuracy, the quality of a clustering is measured based on the average dissimilarity of all objects in the entire data set, and not only of those objects in the samples. Experiments reported in [10] indicate that 5 samples of size 40 + 2k give satisfactory results.
Algorithm CLARA
1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2k objects randomly from the entire data set (see the note below), and call Algorithm PAM to find k medoids of the sample.
3. For each object Oj in the entire data set, determine which of the k medoids is the most similar to Oj.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step (2) as the best set of medoids obtained so far.
5. Return to Step (1) to start the next iteration.
Note: [10] reports a useful heuristic for drawing samples. Apart from the first sample, subsequent samples include the best set of medoids found so far. In other words, apart from the first iteration, subsequent iterations draw 40 + k objects to add on to the best k medoids.
Complementary to PAM, CLARA performs satisfactorily for large data sets (e.g. 1000 objects in 10 clusters). Recall from Section 2.1 that each iteration of PAM is of O(k(n - k)^2). But for CLARA, by applying PAM just to the samples, each iteration is of O(k(40 + k)^2 + k(n - k)). This explains why CLARA is more efficient than PAM for large values of n.
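The sketch below illustrates the five steps of Algorithm CLARA. It assumes a pam(sample, k) routine and a clustering_cost(objects, medoids) function such as those sketched in Section 2, both hypothetical names, and it incorporates the sampling heuristic described in the note above.

import random

def clara(objects, k, pam, clustering_cost, num_samples=5):
    # Draw several samples, run PAM on each, and keep the medoid set whose
    # average dissimilarity over the ENTIRE data set is the lowest.
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        if best_medoids is None:
            sample = random.sample(objects, min(len(objects), 40 + 2 * k))
        else:
            # Heuristic from [10]: later samples keep the best medoids found so far.
            extra = random.sample(objects, min(len(objects), 40 + k))
            sample = list(set(best_medoids) | set(extra))
        medoids = pam(sample, k)
        cost = clustering_cost(objects, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost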
3 A Clustering Algorithm based on Randomized Search
In this section, we will present our clustering algorithm CLARANS (Clustering Large Applications based on RANdomized Search). We will first introduce CLARANS by giving a graph abstraction of it. Then, after describing the details of the algorithm, we will present experimental results showing that CLARANS outperforms CLARA and PAM in terms of both efficiency and effectiveness. In the next section, we will show how CLARANS can be used to provide effective spatial data mining.
3.1 Motivation of CLARANS: a Graph Abstraction
Given n objects, the process described above of finding k medoids can be viewed abstractly as searching through a certain graph. In this graph, denoted by Gn,k, a node is represented by a set of k objects {Om1, ..., Omk}, intuitively indicating that Om1, ..., Omk are the selected medoids. The set of nodes in the graph is the set { {Om1, ..., Omk} | Om1, ..., Omk are objects in the data set }.
Two nodes are neighbors (i.e. connected by an arc) if their sets differ by only one object. More formally, two nodes S1 = {Ow1, ..., Owk} and S2 = {Ov1, ..., Ovk} are neighbors if and only if the cardinality of the intersection of S1 and S2 is k - 1, i.e. |S1 ∩ S2| = k - 1. It is easy to see that each node has k(n - k) neighbors. Since a node represents a collection of k medoids, each node corresponds to a clustering. Thus, each node can be assigned a cost that is defined to be the total dissimilarity between every object and the medoid of its cluster. It is not difficult to see that if objects Oi and Oh are the difference between neighbors S1 and S2 (i.e. Oi, Oh ∉ S1 ∩ S2, but Oi ∈ S1 and Oh ∈ S2), the cost differential between the two neighbors is exactly given by TCih defined in Equation (5).
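As an illustration of this graph view, the following sketch (with hypothetical names) represents a node as a set of k medoids and draws one random neighbor by swapping a single medoid for a non-selected object; this is exactly the kind of neighbor sampling that CLARANS, described next, performs at every step.

import random

def random_neighbor(node, objects):
    # A node of G(n,k) is a set of k medoids; a neighbor differs from it in
    # exactly one object.  Swap one current medoid for one non-selected
    # object; the cost difference between the two nodes is TCih of Eq. (5).
    node = set(node)
    o_i = random.choice(sorted(node))
    o_h = random.choice([o for o in objects if o not in node])
    return (node - {o_i}) | {o_h}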
By now, it is obvious that PAM can be viewed as a search for a minimum on the graph Gn,k. At each step, all the neighbors of the current node are examined. The current node is then replaced by the neighbor with the deepest descent in costs. And the search continues until a minimum is obtained. For large values of n and k (like n = 1000 and k = 10), examining all k(n - k) neighbors of a node is time consuming. This accounts for the inefficiency of PAM for large data sets.
On the other hand, CLARA tries to examine fewer neighbors and restricts the search to subgraphs that are much smaller in size than the original graph Gn,k. However, the problem is that the subgraphs examined are defined entirely by the objects in the samples. Let Sa be the set of objects in a sample. The subgraph GSa,k consists of all the nodes that are subsets (of cardinality k) of Sa. Even though CLARA thoroughly examines GSa,k via PAM, the trouble is that the search is fully confined within GSa,k. If M is the minimum node in the original graph Gn,k, and if M is not included in GSa,k, M will never be found in the search of GSa,k, regardless of how thorough the search is. To atone for this deficiency, many, many samples would need to be collected and processed.
Like CLARA, our algorithm CLARANS does not check every neighbor of a node. But unlike CLARA, it does not restrict its search to a particular subgraph. In fact, it searches the original graph Gn,k. One key difference between CLARANS and PAM is that the former only checks a sample of the neighbors of a node. But unlike CLARA, each sample is drawn dynamically in the sense that no nodes corresponding to particular objects are eliminated outright. In other words, while CLARA draws a sample of nodes at the beginning of a search, CLARANS draws a sample of neighbors in each step of a search. This has the benefit of not confining a search to a localized area. As will be shown later, a search by CLARANS gives higher quality clusterings than CLARA, and CLARANS requires a very small number of searches. We now present the details of
Algorithm CLARANS.
3.2 CLARANS
Algorithm CLARANS
1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
2. Set current to an arbitrary node in Gn,k.
3. Set j to 1.
4. Consider a random neighbor S of current, and based on Equation (5) calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S, and go to Step (3).
6. Otherwise, increment j by 1. If j ≤ maxneighbor, go to Step (4).
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost. If the former is less than mincost, set mincost to the cost of current, and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step (2).
Steps (3) to (6) above search for nodes with progressively lower costs. But if the current node has already been compared with the maximum number of the neighbors of the node (specified by maxneighbor) and is still of the lowest cost, the current node is declared to be a "local" minimum. Then in Step (7), the cost of this local minimum is compared with the lowest cost obtained so far. The lower of the two costs is stored in mincost. Algorithm CLARANS then repeats the search for other local minima, until numlocal of them have been found.
As shown above, CLARANS has two parameters: the maximum number of neighbors examined (maxneighbor), and the number of local minima obtained (numlocal). The higher the value of maxneighbor, the closer CLARANS is to PAM, and the longer each search of a local minimum takes. But the quality of such a local minimum is higher, and fewer local minima need to be obtained. Like many applications of randomized search [8, 9], we rely on experiments to determine the appropriate values of these parameters. All the performance results of CLARANS quoted in the remainder of this paper are based on the version of CLARANS that sets numlocal to 2 and maxneighbor to the larger value between 1.25% of k(n - k) and 250. See [15] for more information on how and why these specific values are chosen.
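A compact sketch of Steps (1) to (8) is given below. It is an illustration rather than the actual implementation: it assumes a clustering_cost(objects, medoids) function like the one sketched in Section 2, represents objects as coordinate tuples, and, for brevity, recomputes the full clustering cost of each neighbor instead of the cost differential of Equation (5). The default maxneighbor follows the parameter setting described above.

import random

def clarans(objects, k, clustering_cost, numlocal=2, maxneighbor=None):
    n = len(objects)
    if maxneighbor is None:
        # Parameter setting used for the experiments reported in the paper.
        maxneighbor = max(250, int(0.0125 * k * (n - k)))
    best_node, mincost = None, float("inf")              # Step (1)
    for _ in range(numlocal):
        current = set(random.sample(objects, k))          # Step (2)
        current_cost = clustering_cost(objects, current)
        j = 1                                             # Step (3)
        while j <= maxneighbor:
            # Step (4): pick one random neighbor and compare costs.
            o_i = random.choice(sorted(current))
            o_h = random.choice([o for o in objects if o not in current])
            neighbor = (current - {o_i}) | {o_h}
            neighbor_cost = clustering_cost(objects, neighbor)
            if neighbor_cost < current_cost:              # Step (5)
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:                                         # Step (6)
                j += 1
        if current_cost < mincost:                        # Step (7)
            mincost, best_node = current_cost, current
    return best_node, mincost                             # Step (8)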
3.3 Experimental Results: CLARANS vs PAM
[Figure 1: Efficiency: CLARANS vs PAM (runtime versus number of objects)]
In the following we present experimental results comparing CLARANS with PAM. As discussed before, for large and medium data sets, it is obvious that CLARANS, while producing clusterings of very comparable quality, is much more efficient than PAM. Thus, our focus here was to compare the two algorithms on small data sets. We applied both algorithms to data sets with 40, 60, 80 and 100 points in 5 clusters. Figure 1 shows the runtime taken by both algorithms. Note that for all those data sets, the clusterings produced by both algorithms are of the same quality (i.e. same average distance). Thus, the difference between the two algorithms is determined by their efficiency. It is evident from Figure 1 that even for small data sets, CLARANS outperforms PAM significantly. As expected, the performance gap between the two algorithms grows as the data set increases in size.
3.4 Experimental Results: CLARANS vs CLARA
In this series of experiments, we compared CLARANS with CLARA. As discussed in Section 2.2, CLARA is not designed for small data sets. Thus, we ran this set of experiments on data sets whose number of objects exceeds 100. And the objects were organized in different numbers of clusters, as well as in different types of clusters [15].
When we conducted this series of experiments run-