This chapter describes how to identify clusters in the data using unsupervised cluster analysis methods.
- 18.1. Introduction
- 18.2. Clustering with sdcluster command
- 18.2.1. Clustering all data irrespective of classes
- 18.2.2. How to cluster an already clustered data set?
- 18.2.3. Changing default cluster names
- 18.2.4. Removing cluster labels
- 18.3. Clustering algorithms
- 18.3.1. k-means algorithm
- 18.3.1.1. Obtaining cluster labels with sdkmeans
- 18.3.1.2. Accessing prototypes derived by sdkmeans
- 18.3.2. k-centres algorithm
- 18.3.3. Gaussian mixture model
18.1. Introduction
Cluster analysis, also called unsupervised learning, groups similar observations into clusters. In this way, we may learn what typical situations occur in our data set. Cluster analysis aims at answering questions such as:
- Is there structure in my data?
- What typical clusters/groups are in this data set?
There are two typical uses of cluster analysis. The first aims at interpretation of the clustering result. For example, we cluster an image (a set of pixels in RGB feature space) and obtain a set of 10 clusters that represent different image regions. We then interpret (give meaning to) some of these clusters. In this way, we identify that some clusters represent sky, buildings or trees. With these new labels, we may train specific classifiers applicable to new images.
Secondly, cluster analysis is often adopted as a tool for building a flexible data representation in complex supervised problems. The resulting set of clusters is used, for example, as prototypes for building better classifiers in multi-modal problems.
With perClass, you may quickly cluster even very large data sets, easily interpret the results, and directly leverage this information for building your classifiers.
18.1.1. Example of using cluster analysis to build a pixel classifier
Say we want to detect road in images. However, all we have are images without road labels. Instead of painting the road region by hand, we will use clustering to define class labels.
First, we will load the image:
>> im=sdimage('roadsign18.bmp','sddata')
412160 by 3 sddata, class: 'unknown'
and visualize it:
>> sdimage(im)
The image shows a scene captured from a car. Note that we started from an image without any label information. Therefore, the image data set contains only the 'unknown' class.
We will now use cluster analysis to find clusters in this data set. We use the high-level sdcluster command on the im data set and specify the clustering algorithm (sdkmeans for k-means) and the number of clusters (5):
>> a=sdcluster(im,sdkmeans,5)
[class 'unknown' 5 clusters]
412160 by 3 sddata, 5 'cluster' groups: [42222 53456 47132 51678 217672]
sdcluster returns the data set a with new 'cluster' labels. We may visualize this image:
>> sdimage(a)
By hovering over the image, we may see that clusters C1 and C3 correspond mostly to the road, clusters C2 and C4 to grass, and cluster C5 to sky.
Assigning meaningful names to the clusters is the interpretation step. We may rename the cluster labels directly in the sdimage figure with the 'rename class' command in the right-click menu or use the sdrelab function to quickly assign descriptive names:
>> b=sdrelab(a,{[1 3],'road',[2 4],'grass',5,'sky'})
1: C1 -> road
2: C2 -> grass
3: C3 -> road
4: C4 -> grass
5: C5 -> sky
412160 by 3 sddata, 3 'cluster' groups: 'road'(89354) 'grass'(105134) 'sky'(217672)
We may now build the road detector on our three-class problem. We will model the road class with a Gaussian model and use the other two classes to fix the detector threshold by minimizing the error:
>> pd=sddetect(b,'road',sdgauss)
1: road -> road
2: grass -> non-road
3: sky -> non-road
sequential pipeline 3x1 'Gaussian model+Decision'
1 Gaussian model 3x1 one class, 1 component
2 Decision 1x1 thresholding ROC on road at op 1367
The detector may be applied to a new image:
>> im2=sdimage('roadsign11.bmp','sddata')
412160 by 3 sddata, class: 'unknown'
>> c=im2.*pd
412160 by 3 sddata, 2 classes: 'road'(96068) 'non-road'(316092)
The .* operator is a shorthand for executing the pipeline pd, returning decisions and storing these decisions back in the data set.
>> sdimage(c)
In this example, we used cluster analysis to define meaningful class labels used for training a supervised classifier.
18.2. Clustering with sdcluster command
sdcluster is a high-level command that allows us to perform cluster analysis on a data set with different clustering algorithms. It accepts a data set, an untrained pipeline of the clustering algorithm, and the number of clusters. Optionally, we may provide extra options.
In the 3.0.0 release, sdcluster supports only the sdkmeans and sdkcentres clustering algorithms.
Consistently with all other functions in the perClass toolbox, sdcluster executes the clustering algorithm on each class of the current set of labels:
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> b=sdcluster(a,sdkmeans,2)
[class 'apple' 2 clusters] [class 'banana' 2 clusters] [class 'stone' 2 clusters]
'Fruit set' 260 by 2 sddata, 6 'cluster' groups: [61 38 60 48 20 33]
The result is a data set b with cluster labels. Note that sdcluster does not overwrite the original labels but adds a new 'cluster' property, setting it as current:
>> b'
'Fruit set' 260 by 2 sddata, 6 'cluster' groups: [61 38 60 48 20 33]
sample props: 'lab'->'cluster' 'class'(L) 'ident'(N) 'cluster'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N)
The lab field points to the new 'cluster' labels. The class field contains the original class labels:
>> b.lab
sdlab with 260 entries, 6 groups:
'C1'(61) 'C2'(38) 'C3'(60) 'C4'(48) 'C5'(20) 'C6'(33)
>> b.class
sdlab with 260 entries, 3 groups: 'apple'(100) 'banana'(100) 'stone'(60)
In this way, we may quickly assess the mapping between the classes and clusters:
>> sdconfmat(b.class,b.lab)
ans =
True | Decisions
Labels | C1 C2 C3 C4 C5 C6 | Totals
-----------------------------------------------------------------------
apple | 60 32 8 0 0 0 | 100
banana | 1 6 43 48 0 2 | 100
stone | 0 0 9 0 20 31 | 60
-----------------------------------------------------------------------
Totals | 61 38 60 48 20 33 | 260
18.2.1. Clustering all data irrespective of classes
If we wish to cluster the whole data set, we may simply set a single class to all samples:
>> c=a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> c.lab='unknown'
'Fruit set' 260 by 2 sddata, class: 'unknown'
>> d=sdcluster(c,sdkmeans,2)
[class 'unknown' 2 clusters]
'Fruit set' 260 by 2 sddata, 2 'cluster' groups: 'C1'(102) 'C2'(158)
However, sdcluster provides a faster way with the 'all' option, which executes clustering on all data:
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> d2=sdcluster(a,sdkmeans,2,'all')
[class 'Unknown' 2 clusters]
'Fruit set' 260 by 2 sddata, 2 'cluster' groups: 'C1'(150) 'C2'(110)
Original labels are still available:
>> d2.class
sdlab with 260 entries, 3 groups: 'apple'(100) 'banana'(100) 'stone'(60)
18.2.2. How to cluster an already clustered data set?
Say we would like to cluster the data set d2 from the previous example again, now with a different number of clusters:
>> d3=sdcluster(d2,sdkmeans,3)
??? Error using ==> sdcluster at 59
property 'cluster' is already present in the data set. Choose a different
property for storing clustering result using sdcluster 'lab' option.
sdcluster throws an error because the 'cluster' property is already present in d2. perClass will never remove a set of labels you created in the data set. We may either remove the 'cluster' property manually or simply specify a new name for the property holding the clustering result with the 'lab' option. Here, we store the new clustering result in 'newcluster':
>> d3=sdcluster(d2,sdkmeans,3,'lab','newcluster')
[class 'C1' 3 clusters] [class 'C2' 3 clusters]
'Fruit set' 260 by 2 sddata, 6 'newcluster' groups: [42 45 63 39 37 34]
The data set d3 still contains the 'cluster' and 'class' label sets:
>> d3'
'Fruit set' 260 by 2 sddata, 6 'newcluster' groups: [42 45 63 39 37 34]
sample props: 'lab'->'newcluster' 'class'(L) 'ident'(N) 'cluster'(L) 'newcluster'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N)
Because sdcluster always adds the clustering outcome as new labels, we may easily compare two clustering results:
>> sdconfmat(d3.cluster,d3.newcluster)
ans =
True | Decisions
Labels | C1 C2 C3 C4 C5 C6 | Totals
-----------------------------------------------------------------------
C1 | 42 45 60 2 0 1 | 150
C2 | 0 0 3 37 37 33 | 110
-----------------------------------------------------------------------
Totals | 42 45 63 39 37 34 | 260
Note that the 'C1' and 'C2' names in the rows refer to different clusters than the 'C1' and 'C2' in the columns. The next section shows how to avoid such confusion.
18.2.3. Changing default cluster names
By default, sdcluster uses 'C' as a prefix for cluster names. We may change it using the 'prefix' option:
>> d4=sdcluster(d2,sdkmeans,3,'lab','newcluster','prefix','D')
[class 'C1' 3 clusters] [class 'C2' 3 clusters]
'Fruit set' 260 by 2 sddata, 6 'newcluster' groups: [57 47 42 34 39 41]
>> sdconfmat(d3.cluster,d4.newcluster)
ans =
True | Decisions
Labels | D1 D2 D3 D4 D5 D6 | Totals
-----------------------------------------------------------------------
C1 | 56 46 42 0 6 0 | 150
C2 | 1 1 0 34 33 41 | 110
-----------------------------------------------------------------------
Totals | 57 47 42 34 39 41 | 260
We may now easily distinguish the clusters in 'cluster' and in 'newcluster' properties.
18.2.4. Removing cluster labels
If you wish to manually remove the 'cluster' property from the d2 set in the example above, note that sdcluster set it as the current labels. Therefore, we first need to set a different property (label set) as current and only then remove the 'cluster' property:
>> dd=setlab(d2,'class')
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> dd=rmprop(dd,'cluster')
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
The set dd may now be clustered with the sdcluster default setting, which will add the 'cluster' property.
18.3. Clustering algorithms
This section describes the low-level clustering routines available in perClass. These routines may be used directly, without sdcluster. They provide a trained model (pipeline) which still needs to be applied to data to get clustering decisions.
18.3.1. k-means algorithm
The k-means clustering algorithm minimizes within-cluster variability. It requires the parameter k denoting the desired number of clusters. Starting from a set of k randomly selected examples as cluster prototypes, it iteratively partitions the data set and shifts the prototypes to the means of the new partitions. Note that some of the initial k prototypes may disappear during the optimization; therefore, the resulting number of clusters may be lower.
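The iteration described above can be sketched in a few lines of plain Python. This is an illustration only, not the perClass sdkmeans implementation (the manual's own examples use MATLAB); the function and variable names are ours:

```python
import random

def kmeans(samples, k, iterations=20, seed=0):
    """Naive k-means: samples is a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    protos = rng.sample(samples, k)   # k randomly selected examples as prototypes
    for _ in range(iterations):
        # partition the data: assign each sample to its nearest prototype
        parts = [[] for _ in protos]
        for x in samples:
            j = min(range(len(protos)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, protos[j])))
            parts[j].append(x)
        # empty partitions vanish, so the final number of clusters may be < k
        parts = [p for p in parts if p]
        # shift each prototype to the mean of its partition
        protos = [tuple(sum(col) / len(p) for col in zip(*p)) for p in parts]
    return protos

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(data, 2)))
```

Note the pruning step: dropping empty partitions is why the resulting number of clusters may be lower than the requested k.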
perClass provides a high-speed sdkmeans implementation which scales to very large data sets. By default, sdkmeans works as a classifier: if we train it on a data set, it describes each class by a set of k prototypes and returns a 1-NN classifier.
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> p=sdkmeans(a,10)
[class 'apple' 3 clusters pruned, 7 clusters]
[class 'banana' 5 clusters pruned, 5 clusters]
[class 'stone' 3 clusters pruned, 7 clusters]
1-NN (k-means prototypes) pipeline 2x3 3 classes, 19 prototypes
Note that it provides three outputs, one for each of the classes in a.
To use sdkmeans for clustering, we need to add the 'cluster' option:
>> p=sdkmeans(a,10,'cluster')
[class 'apple' 10 clusters] [class 'banana' 10 clusters] [class 'stone' 10 clusters]
1-NN (k-means prototypes) pipeline 2x30 30 classes, 30 prototypes
As a result, the returned pipeline has 30 outputs, each returning the distance to one of the 30 trained prototypes. Note that clustering is performed per class; use the 'all' option to cluster all data irrespective of classes:
>> p=sdkmeans(a,10,'cluster','all')
[class 'Unknown' 10 clusters]
1-NN (k-means prototypes) pipeline 2x10 10 classes, 10 prototypes
18.3.1.1. Obtaining cluster labels with sdkmeans
To get cluster labels, we need to make p return decisions with sddecide and apply the resulting pipeline to the original data:
>> pd=sddecide(p)
sequential pipeline 2x1 '1-NN (k-means prototypes)+Decision'
1 1-NN (k-means prototypes) 2x10 10 classes, 10 prototypes
2 Decision 10x1 weighting, 10 classes, 1 ops at op 1
>> dec=a*pd
sdlab with 260 entries, 10 groups
>> dec.list
sdlist (10 entries)
ind name
1 C 1
2 C 2
3 C 3
4 C 4
5 C 5
6 C 6
7 C 7
8 C 8
9 C 9
10 C10
To get a data set with the decisions dec set as new labels in one step, use the .* operator:
>> b=a.*pd
'Fruit set' 260 by 2 sddata, 10 classes: [25 37 29 28 9 35 17 27 23 30]
This is roughly analogous to the use of sdcluster:
>> c=sdcluster(a,sdkmeans,10,'all')
[class 'Unknown' 10 clusters]
'Fruit set' 260 by 2 sddata, 10 'cluster' groups: [22 23 34 24 26 20 38 20 25 28]
The only difference of the latter approach is that sdcluster added new 'cluster' labels, while the data set b got the decisions set into the current label set referenced via b.lab.
18.3.1.2. Accessing prototypes derived by sdkmeans
The set of prototypes derived by sdkmeans is stored in the pipeline p:
>> p
1-NN (k-means prototypes) pipeline 2x10 10 classes, 10 prototypes
>> p{1}
ans =
proto: [10x2x10 sddata]
>> proto=p{1}.proto
10 by 2 sddata, 10 classes: [1 1 1 1 1 1 1 1 1 1]
>> proto'
10 by 2 sddata, 10 classes: [1 1 1 1 1 1 1 1 1 1]
sample props: 'lab'->'class' 'class'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N)
Because the k-means algorithm derives (extracts) new prototypes by averaging original observations, the prototypes do not contain any original properties set in the clustered data set. Use the k-centres algorithm if you wish to preserve prototype sample properties.
18.3.2. k-centres algorithm
The k-centres algorithm selects k prototypes from the original data set in such a way that the maximum distance within each cluster is minimized. The main difference with respect to the k-means algorithm is that k-centres prototypes are observations existing in the original data, while k-means extracts new "virtual" prototypes by averaging. Use k-centres if you wish to find typical observations and further use their original properties.
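As an illustration of selecting prototypes from the data itself, here is a plain-Python sketch using farthest-first traversal, a standard approximation scheme for the k-centres objective. This is our own sketch for illustration, not the actual sdkcentres algorithm:

```python
def kcentres(samples, k):
    """Pick k prototype indices from samples by farthest-first traversal."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centres = [0]                     # start from an arbitrary sample
    while len(centres) < k:
        # add the sample farthest from its nearest centre chosen so far
        far = max(range(len(samples)),
                  key=lambda i: min(d2(samples[i], samples[c]) for c in centres))
        centres.append(far)
    return centres                    # indices into the original data set

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(kcentres(data, 2))
```

Because the returned prototypes are indices of existing observations, any sample properties attached to them (such as the id property in the example below) carry over, which is exactly what distinguishes k-centres from k-means.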
In this example, we set a unique sample id to each sample:
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> a.id=1:length(a)
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> a(127).id
ans =
127
Now, we cluster the data set with sdkcentres, asking for 10 clusters:
>> p=sdkcentres(a,10,'cluster','all')
[class 'unknown' 10 comp]
1-NN (k-centers prototypes) pipeline 2x10 10 classes, 10 prototypes
The derived prototypes contain the id property:
>> p{1}.proto'
'Fruit set' 10 by 2 sddata, 10 'cluster' groups: [1 1 1 1 1 1 1 1 1 1]
sample props: 'lab'->'cluster' 'class'(L) 'ident'(N) 'id'(N) 'cluster'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N)
So we may easily identify which samples were selected:
>> p{1}.proto.id
ans =
8
235
246
185
66
207
62
71
174
117
18.3.3. Gaussian mixture model
The Gaussian mixture model, available through the sdmixture function, may be used for data clustering. It has one substantial advantage with respect to the k-means and k-centres algorithms: sdmixture may automatically select the number of clusters in the data.
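The core of mixture-based clustering is the EM algorithm: alternate between computing each component's responsibility for each sample (E-step) and re-estimating the component parameters (M-step). A minimal one-dimensional, two-component sketch in plain Python (for illustration only; unlike sdmixture, this sketch fixes the number of components in advance and uses a crude median-split initialization):

```python
import math

def em_gmm_1d(xs, steps=50):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    xs = sorted(xs)
    half = len(xs) // 2
    # crude initialization: split the sorted data in half
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(steps):
        # E-step: posterior responsibility of each component for each sample
        resp = []
        for x in xs:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and variances from responsibilities
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, 1e-6)
    return mu, var, w

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
mu, var, w = em_gmm_1d(data)
print(sorted(mu))
```

Assigning a sample to the component with the highest responsibility is what the sddecide step below does for the trained mixture pipeline.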
By default, sdmixture operates per class and returns per-class outputs:
>> p=sdmixture(a)
[class 'apple' initialization: 4 clusters EM:.............................. 4 comp]
[class 'banana' initialization: 4 clusters EM:.............................. 4 comp]
[class 'stone' initialization: 1 cluster EM:.............................. 1 comp]
Mixture of Gaussians pipeline 2x3 3 classes, 9 components
We received three outputs for the three classes in a.
sdmixture with the 'cluster' option returns one output for each of the mixture components (clusters) found. The operation is still on a per-class basis:
>> p=sdmixture(a,'cluster')
[class 'apple' initialization: 4 clusters EM:.............................. 4 comp]
[class 'banana' initialization: 4 clusters EM:.............................. 4 comp]
[class 'stone' initialization: 1 cluster EM:.............................. 1 comp]
Mixture of Gaussians pipeline 2x9 9 classes, 9 components
If we wish to cluster the whole data set, we may set a single class:
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> a.lab='unknown'
'Fruit set' 260 by 2 sddata, class: 'unknown'
>> p=sdmixture(a,'cluster')
[class 'unknown' initialization: 3 clusters EM:.............................. 3 comp]
Mixture of Gaussians pipeline 2x3 3 classes, 3 components
>> p.lab
sdlab with 3 entries: 'Cluster 1','Cluster 2','Cluster 3'
Assigning data to clusters:
>> pd=sddecide(p)
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x3 3 classes, 3 components
2 Decision 3x1 weighting, 3 classes
>> b=a.*pd
'Fruit set' 260 by 2 sddata, 3 classes: 'Cluster 1'(72) 'Cluster 2'(80) 'Cluster 3'(108)
We may visualize the clustering results in the feature space:
>> sdscatter(b,pd)
Using pd, any 2D feature vector will be assigned to one of the three clusters. Sometimes, we may wish to distinguish the feature space domain observed when performing the clustering from far-away outliers or previously unseen clusters. We may do this by adding a reject option to the pd pipeline:
>> pr=~sdreject(pd,b)
Weight-based operating point,3 classes,[0.33,0.33,...]
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x3 3 classes, 3 components (sdp_normal)
2 Decision 3x1 weight+reject, 4 decisions, ROC 1 ops at op 1 (sdp_decide)
>> sdscatter(b,pr)
>> pr.list
sdlist (4 entries)
ind name
1 Cluster 1
2 Cluster 2
3 Cluster 3
4 reject
You may find a more detailed discussion of this technique in this blog entry.