Classifiers, table of contents

This section describes perClass classifiers based on the assumption of a Gaussian distribution.

- 13.2.1. Introduction
- 13.2.2. Nearest mean classifier
- 13.2.2.1. Scaled nearest mean
- 13.2.3. Linear discriminant assuming normal densities
- 13.2.4. Quadratic classifier assuming normal densities
- 13.2.5. Gaussian model or classifier
- 13.2.6. Constructing Gaussian model from parameters
- 13.2.7. Generating data based on Gaussian model
- 13.2.8. Gaussian mixture models
- 13.2.8.1. Automatic estimation of number of mixture components
- 13.2.8.2. Choosing number of mixture components manually
- 13.2.8.3. Clustering data using a mixture model
- 13.2.9. Regularization of Gaussian models

# 13.2.1. Introduction ↩

The family of Gaussian classifiers assumes that the observations are generated by a random process with a normal distribution. The density function of a normal distribution is defined by a mean vector and a covariance matrix. If multiple Gaussian models are used, they are additionally weighted by priors.

In the simplest situation, the assumption is that each class is described by a single Gaussian density. This yields the nearest mean, linear, and quadratic classifiers. A more complex setup assumes that each class may be modeled by a mixture of several Gaussians, which results in Gaussian mixture models.

In general, classifiers based on the assumption of normality estimate probability densities. Their use is therefore limited, in practice, to lower-dimensional situations. If the number of samples is too low compared to the dimensionality, or the data exhibit strong subspace structure, normal models have difficulties inverting the covariance matrices.
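The density all of these classifiers evaluate is the multivariate normal. As a language-neutral illustration (plain Python, not perClass code; the function name is ours), the 2-D case looks like this:

```python
import math

def gauss_density(x, mu, cov):
    """Density of a 2-D normal distribution N(mu, cov) at point x."""
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c                                  # determinant of the 2x2 covariance
    inv = [[d / det, -b / det], [-c / det, a / det]]     # its inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Mahalanobis distance (x-mu)' * inv(cov) * (x-mu)
    m = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * m) / (2 * math.pi * math.sqrt(det))

# At the mean of a unit-covariance Gaussian the density is 1/(2*pi):
print(gauss_density([0, 0], [0, 0], [[1, 0], [0, 1]]))   # ~0.1592
```

Note the covariance inverse in the middle: this is exactly the step that fails when the covariance is (near-)singular, which is why high-dimensional or subspace-structured data causes trouble.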

# 13.2.2. Nearest mean classifier ↩

The nearest mean classifier `sdnmean` leverages the assumption of normality: it uses a normal model with an identity covariance matrix for all classes.

The nearest mean is one of the simplest classifiers, useful in situations with few samples and a large number of features.

**>> load fruit**
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
**>> p=sdnmean(a)**
sequential pipeline 2x1 'Nearest mean+Decision'
1 Nearest mean 2x3 unit cov.mat.
2 Decision 3x1 weighting, 3 classes
**>> sdscatter(a,p)**

An alternative formulation of nearest mean using distances to prototypes is described here.
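Under the identity-covariance assumption, the nearest mean decision reduces to assigning a sample to the class with the closest mean in Euclidean distance. A minimal plain-Python sketch (not the perClass implementation; all names are ours):

```python
import math

def nearest_mean_fit(X, y):
    """Estimate the class means from training data: {label: mean vector}."""
    means = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        means[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return means

def nearest_mean_predict(means, x):
    """Assign x to the class with the closest mean (Euclidean distance)."""
    return min(means, key=lambda lb: math.dist(means[lb], x))

means = nearest_mean_fit([[0, 0], [1, 1], [9, 9], [10, 10]],
                         ['apple', 'apple', 'stone', 'stone'])
print(nearest_mean_predict(means, [2, 2]))   # apple
```

Only the class means have to be estimated, which is why the classifier remains usable with few samples in many dimensions.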

As for all Gaussian models, the user may provide a priori known class prior probabilities when training `sdnmean` with the 'priors' option. If not provided, priors are estimated from the training data set.

## 13.2.2.1. Scaled nearest mean ↩

With the 'scaled' option, `sdnmean` uses covariance matrices with a separately estimated variance for each feature.

In this way, the nearest mean classifier takes data scaling into account and its decisions will not change if a multiplicative scaling of features is applied.

**>> p2=sdnmean(tr,'scaled')**
sequential pipeline 2x1 'Nearest mean+Decision'
1 Nearest mean 2x3 scaled diag.cov.mat
2 Decision 3x1 weighting, 3 classes
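The scaled variant divides each feature's squared deviation by that feature's estimated variance before comparing distances to the class means. A plain-Python sketch (our own names, illustration only) showing how this can change a decision when one feature has a much larger spread:

```python
import math

def scaled_nm_predict(means, variances, x):
    """Nearest mean where each feature's squared deviation is divided
    by that feature's estimated variance (diagonal covariance)."""
    def d2(label):
        return sum((xi - mi) ** 2 / vi
                   for xi, mi, vi in zip(x, means[label], variances[label]))
    return min(means, key=d2)

means = {'apple': [0, 0], 'banana': [4, 40]}
variances = {'apple': [1, 100], 'banana': [1, 100]}  # feature 2 spreads 10x wider
x = [1, 28]

# Plain Euclidean nearest mean is dominated by the wide feature 2:
plain = min(means, key=lambda lb: math.dist(means[lb], x))
print(plain)                                   # banana
# Scaling discounts feature 2, so feature 1 decides:
print(scaled_nm_predict(means, variances, x))  # apple
```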

# 13.2.3. Linear discriminant assuming normal densities ↩

`sdlinear` is a linear discriminant based on the assumption of normal densities. The pooled covariance matrix is computed by averaging the per-class covariances, taking class priors into account.

**>> load fruit;**
**>> a=a(:,:,[1 2])**
200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
**>> p=sdlinear(a)**
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x2 single cov.mat.
2 Normalization 2x2
3 Decision 2x1 weighting, 2 classes
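The pooled covariance mentioned above is simply the prior-weighted average of the per-class covariance matrices. A plain-Python sketch of that step (a hypothetical helper, not perClass code):

```python
def pooled_covariance(covs, priors):
    """Prior-weighted average of per-class 2x2 covariance matrices."""
    return [[sum(p * c[i][j] for c, p in zip(covs, priors))
             for j in range(2)] for i in range(2)]

cov_apple  = [[1.0, 0.2], [0.2, 1.0]]
cov_banana = [[3.0, 0.0], [0.0, 0.5]]
pc = pooled_covariance([cov_apple, cov_banana], [0.8, 0.2])
print([[round(v, 3) for v in row] for row in pc])  # [[1.4, 0.16], [0.16, 0.9]]
```

Because all classes share this single covariance matrix, the resulting decision boundary is linear.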

Note the normalization step in the pipeline `p`. It assures that the `sdlinear` soft output is a posterior probability (confidence). Due to this normalization, `sdlinear` requires that two or more classes be present in the input data set.

Confusion matrix estimated on the training set at default operating point:

**>> sdconfmat(a.lab,a*p,'norm')**
ans =
True | Decisions
Labels | apple banana | Totals
-------------------------------------
apple | 0.890 0.110 | 1.00
banana | 0.150 0.850 | 1.00
-------------------------------------

In order to use specific priors, provide them using the 'priors' option:

**>> p=sdlinear(a,'priors',[0.8 0.2])**
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x2 single cov.mat.
2 Normalization 2x2
3 Decision 2x1 weighting, 2 classes
**>> sdconfmat(a.lab,a*p,'norm')**
ans =
True | Decisions
Labels | apple banana | Totals
-------------------------------------
apple | 0.950 0.050 | 1.00
banana | 0.200 0.800 | 1.00
-------------------------------------

Note that by increasing the *apple* class prior we lower the *apple*
error. However, the *banana* error will increase accordingly.

The `sdlinear` classifier may be regularized by adding a small constant to the covariance diagonal. See the section on regularization below for an example of automatic regularization.

# 13.2.4. Quadratic classifier assuming normal densities ↩

`sdquadratic` implements a quadratic discriminant based on the assumption of normal densities. In fact, `sdquadratic` is composed of an `sdgauss` pipeline with an extra normalization step that makes sure the classifier returns posteriors. Due to this class normalization, `sdquadratic` can only be trained on a data set with two or more classes.

The quadratic decision boundary is achieved by estimating a specific covariance matrix for each class.

**>> p=sdquadratic(a)**
sequential pipeline 2x1 'Quadratic discriminant'
1 Gaussian model 2x2 full cov.mat.
2 Normalization 2x2
3 Decision 2x1 weighting, 2 classes

Detailed display of pipeline contents shows the outputs of the first and second step (density and posterior):

**>> p'**
sequential pipeline 2x1 'Quadratic discriminant'
1 Gaussian model 2x3 full cov.mat.
inlab: 'Feature 1','Feature 2'
lab: 'apple','banana','stone'
output: probability density
complab: component labels
2 Normalization 3x3
inlab: 'apple','banana','stone'
lab: 'apple','banana','stone'
output: posterior
3 Decision 3x1 weighting, 3 classes
inlab: 'apple','banana','stone'
output: decision ('apple','banana','stone')

Visualizing the classifier decisions:

**>> sdscatter(a,p)**

Similarly to other normal-based models, the user may fix class priors with the 'priors' option if they are known a priori.
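The normalization step that turns per-class densities into posteriors (the second pipeline step above) is Bayes' rule applied to the class soft outputs. A plain-Python sketch of the idea (our own names, not the perClass internals):

```python
def to_posteriors(densities, priors):
    """Bayes rule: p(class|x) = p(x|class)*prior / sum over classes."""
    joint = [d * p for d, p in zip(densities, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# Two class-conditional densities evaluated at some sample x:
post = to_posteriors([0.12, 0.03], [0.5, 0.5])
print([round(p, 3) for p in post])  # [0.8, 0.2]
```

Dividing by the sum is why at least two classes are needed: with a single class the output would always normalize to 1.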

# 13.2.5. Gaussian model or classifier ↩

`sdgauss` implements a general Gaussian model or classifier with full covariances. By default, `sdgauss` returns a classifier:

**>> p=sdgauss(a)**
sequential pipeline 2x1 'Gaussian model+Decision'
1 Gaussian model 2x3 full cov.mat.
2 Decision 3x1 weighting, 3 classes

Unlike `sdquadratic`, `sdgauss` does not normalize class outputs. It returns a probability density and may therefore be used to build a detector.

In fact, training `sdgauss` on data with only one class returns a one-class classifier that accepts all training examples:

**>> a**
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
**>> p=sdgauss(a(:,:,'apple'))**
sequential pipeline 2x1 'Gaussian model+Decision'
1 Gaussian model 2x1 full cov.mat.
2 Decision 1x1 threshold on 'apple'
**>> sdscatter(a,p)**

# 13.2.6. Constructing Gaussian model from parameters ↩

`sdgauss` may also be used to create a Gaussian pipeline from parameters. We need to provide:

- an `sddata` set with component means
- a cell array with covariance matrices
- a vector of priors with an entry for each component

Example: I'd like to create a model with two components, one with mean at `[0 0]` and the other at `[5 3]`, with the first covariance unit and the second `[1 .5; .5 2]`.

**>> m=sddata([0 0; 5 3])**
2 by 2 sddata, class: 'unknown'
**>> cov={eye(2) [1 .5; .5 2]}**
cov =
[2x2 double] [2x2 double]
**>> prior=[0.5 0.5]**
prior =
0.5000 0.5000
**>> p=sdgauss(m,cov,prior)**
Gaussian model pipeline 2x1

To visualize the output of the pipeline `p`, we need a data set to pass to `sdscatter`. We can use the class means and add two points extending our view:

**>> sdscatter(sddata([+m; -5 -5; 10 10]),p)**

# 13.2.7. Generating data based on Gaussian model ↩

Gaussian models are "generative", which means we can use them to create a data set following the same distribution.

In perClass, we may use the `sdgenerate` command to do that. Using the pipeline `p` from the previous section, we can create a data set with 100 samples *per component*:

**>> b=sdgenerate(p,100)**
200 by 2 sddata, class: 'unknown'
**>> sdscatter(b,p)**
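Conceptually, generating data from a Gaussian component means drawing from its normal distribution. One standard way to do that, sketched in plain Python with a 2x2 Cholesky factor (an illustration of the technique, not perClass code):

```python
import random

def sample_gauss_2d(mu, cov, n, seed=0):
    """Draw n samples from a 2-D normal via a 2x2 Cholesky factor."""
    rng = random.Random(seed)
    a, b = cov[0]
    _, d = cov[1]
    # Cholesky: cov = L * L' with L lower triangular
    l11 = a ** 0.5
    l21 = b / l11
    l22 = (d - l21 ** 2) ** 0.5
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)  # independent standard normals
        out.append([mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2])
    return out

samples = sample_gauss_2d([5, 3], [[1, 0.5], [0.5, 2]], 500)
print(len(samples))  # 500
```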

# 13.2.8. Gaussian mixture models ↩

A Gaussian mixture is a density-estimation approach using a weighted sum of multiple Gaussian components. The `sdmixture` command implements a classifier with a separate mixture model estimated for each of the classes.

The advantage of a mixture classifier is its flexibility. Given enough training data, it may reliably model class distributions with arbitrary shapes or disjoint clusters (modes). Multi-modal data often originate in applications where our top-level classes reflect a composition of underlying states. For example, a medical diagnostic tool aims at the detection of cancer. As the disease may affect different tissues, we observe our "cancer" class as a multi-modal composition of separate tissue clusters. A mixture classifier can naturally describe such a class distribution.

While the parameters of a single Gaussian model may be estimated directly from data, mixture models require iterative optimization based on the EM (expectation-maximization) algorithm.
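To make the EM idea concrete, here is a toy 1-D, two-component EM in plain Python (an illustration of the algorithm, not the perClass implementation; the initialization is deliberately crude):

```python
import math, random

def em_1d(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture (toy sketch)."""
    mu = [min(xs), max(xs)]          # crude initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in xs:
            d = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = d[0] + d[1]
            resp.append([d[0] / s, d[1] / s])
        # M-step: re-estimate weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, w

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(8, 1) for _ in range(200)]
mu, var, w = em_1d(xs)
print(sorted(round(m) for m in mu))  # [0, 8]
```

Each iteration alternates soft assignment of samples to components (E-step) with weighted re-estimation of the parameters (M-step), exactly the loop `sdmixture` runs per class.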

## 13.2.8.1. Automatic estimation of number of mixture components ↩

In perClass, the `sdmixture` command does not need any input arguments apart from the training data:

**>> a**
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
**>> p=sdmixture(a)**
[class 'apple' init:.......... 4 clusters EM:done 4 comp] [class 'banana' init:....
...... 4 clusters EM:done 4 comp] [class 'stone' init:.......... 2 clusters EM:done 2 comp]
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x3 10 components, full cov.mat.
2 Decision 3x1 weighting, 3 classes
**>> sdscatter(a,p)**

By default, `sdmixture` automatically identifies the number of components for each class and then estimates each per-class mixture running 100 iterations of the EM algorithm. The number of iterations can be changed with the 'iter' option.

The estimation of the number of components performs an internal random split of the input data set. Therefore, we may receive a different solution each run. Fixing the seed of the Matlab random number generator makes the `sdmixture` training repeatable.

**>> rand('state',1234); p=sdmixture(a)** % example of setting random seed

Let us visualize the soft outputs of the mixture. We will use the `-p` shorthand to remove the decision step:

**>> sdscatter(a,-p)**

We can see the density of the mixture model for the first class ('apple') comprised of four components. Clicking the arrow buttons on the Figure toolbar, we can see the density estimates for the second and third classes.

The estimation of the number of components is performed on a subset of the input data. By default, 500 samples are used per class. This may be adjusted using the 'maxsamples' option.

The algorithm uses a grid search considering 1:10 clusters by default. The grid may be adjusted with the 'cluster grid' option.

## 13.2.8.2. Choosing number of mixture components manually ↩

We may fix the number of mixture components manually by providing it as the second argument:

**>> p=sdmixture(a,5)**
[class 'apple'EM:done 5 comp] [class 'banana'EM:done 5 comp] [class 'stone'EM:done 5 comp]
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x3 15 components, full cov.mat.
2 Decision 3x1 weighting, 3 classes

If a scalar value is given, the same number of components is used for all classes.

By providing a vector, we may fix a different number of components per class. In our example, we know that the 'stone' class is unimodal:

**>> p=sdmixture(a,[5 5 1])**
[class 'apple'EM:done 5 comp] [class 'banana'EM:done 5 comp] [class 'stone'EM:done 1 comp]
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x3 11 components, full cov.mat.
2 Decision 3x1 weighting, 3 classes

## 13.2.8.3. Clustering data using a mixture model ↩

A mixture model that estimates the number of clusters automatically is a powerful data-description tool. We may use it to cluster a data set, i.e. to identify its distinct modes.
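Clustering with a mixture amounts to assigning each sample to the component with the highest posterior responsibility. In plain Python for a 1-D mixture (hypothetical names, illustration only, not the perClass internals):

```python
import math

def responsibilities(x, mus, vars_, ws):
    """Posterior probability of each mixture component for sample x."""
    d = [w / math.sqrt(2 * math.pi * v) * math.exp(-(x - m) ** 2 / (2 * v))
         for m, v, w in zip(mus, vars_, ws)]
    s = sum(d)
    return [di / s for di in d]

def cluster_label(x, mus, vars_, ws):
    """Hard clustering decision: the most responsible component."""
    r = responsibilities(x, mus, vars_, ws)
    return r.index(max(r))

# Two components at 0 and 8, unit variance, equal weights:
print(cluster_label(7.6, [0.0, 8.0], [1.0, 1.0], [0.5, 0.5]))  # 1
```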

Let us consider a data set `b` with only one class. We create it by removing class labels from the *fruit* data set:

**>> a**
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
**>> b=sddata(+a)**
260 by 2 sddata, class: 'unknown'

Let us train a mixture model twice. First, without any option:

**>> rand('state',10); pc1=sdmixture(b)**
[class 'unknown' init:.......... 8 clusters EM:done 8 comp]
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x1 8 components, full cov.mat.
2 Decision 1x1 threshold on 'unknown'

Second, with 'cluster' option:

**>> rand('state',10); pc2=sdmixture(b,'cluster')**
[class 'unknown' init:.......... 8 clusters EM:done 8 comp]
sequential pipeline 2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians 2x8 8 components, full cov.mat.
2 Decision 8x1 weighting, 8 classes

The `rand` commands make sure both estimated mixtures are identical.

The only difference between the pipelines `pc1` and `pc2` is in the labeling of the mixture components.

The mixture `pc1` models all available data in `b` by one mixture model returning one soft output (see the output dimensionality of 1 in pipeline step 1). Let us visualize the decisions and soft outputs of `pc1`:

**>> sdscatter(b,pc1)**
ans =
1
**>> sdscatter(b,-pc1)**
ans =
2

The `pc2` pipeline, trained with the 'cluster' option, splits the complete mixture into individual components. Therefore, we can see 8 distinct soft outputs. The default decisions will be 'Cluster 1' to 'Cluster 8':

**>> sdscatter(b,pc2)**
ans =
3
**>> sdscatter(b,-pc2)**
ans =
4

Note that we observe only one component at a time in the soft-output Figure 4. The others are available via the arrow buttons. The decisions of `pc2` provide cluster labels.

# 13.2.9. Regularization of Gaussian models ↩

Regularization is a technique for avoiding problems with parameter estimation caused by a small number of data samples in a problem with large dimensionality.

Normal-based classifiers suffer, in such situations, from low-quality estimates of class covariances. This results either in poor classifier performance or in a failure to train on a given data set.

A possible solution is to add a small constant to the diagonal elements of the covariance matrices. In perClass, this can be achieved with the 'reg' option.
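The effect of diagonal regularization can be shown in plain Python (our own helper name, illustration only): adding a constant to the diagonal turns a singular covariance matrix into an invertible one.

```python
def regularize(cov, reg):
    """Add a constant to the diagonal of a covariance matrix."""
    n = len(cov)
    return [[cov[i][j] + (reg if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Perfectly correlated features give a singular covariance (determinant 0):
cov = [[1.0, 1.0], [1.0, 1.0]]
reg_cov = regularize(cov, 0.1)
det = reg_cov[0][0] * reg_cov[1][1] - reg_cov[0][1] * reg_cov[1][0]
print(round(det, 2))  # 0.21
```

With a nonzero determinant the inverse needed by the Gaussian density exists again, at the cost of a slight bias in the estimated covariance.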

In the following example, we want to recognize handwritten digits represented directly by pixels in a 16x16 raster. A quadratic classifier fails to discriminate the digits when trained on our limited training set of 60 samples per class: the full covariances are not reliably estimated.

**>> [tr,ts]=randsubset(a,0.3)**
'Digits' 600 by 256 sddata, 10 classes: [60 60 60 60 60 60 60 60 60 60]
'Digits' 1400 by 256 sddata, 10 classes: [140 140 140 140 140 140 140 140 140 140]
**>> p=sdquadratic(tr)**
sequential pipeline 256x1 'Quadratic discriminant'
1 Gaussian model 256x10 full cov.mat.
2 Normalization 10x10
3 Decision 10x1 weighting, 10 classes
**>> sdtest(ts,p)**
ans =
0.9000

Using the 'reg' option, `sdquadratic` performs an automatic choice of the regularization parameter. The test set error of the resulting classifier is only 9.2%!

**>> p=sdquadratic(tr,'reg')**
..........
reg=0.100000 err=0.07
sequential pipeline 256x1 'Quadratic discriminant'
1 Gaussian model 256x10 full cov.mat.
2 Normalization 10x10
3 Decision 10x1 weighting, 10 classes
**>> sdtest(ts,p)**
ans =
0.0921

The 'reg' option internally splits the provided data into two parts and performs a grid search. One part is used for training the model, the other for evaluating the error. By default, 20% of the data is used for validation. This fraction may be changed with the 'tsfrac' option.

Alternatively, the internal splitting can be avoided by providing a test (validation) set with the 'test' option:

**>> rand('state',42); [tr2,val]=randsubset(tr,0.5)**
**>> p=sdquadratic(tr2,'reg','test',val);**

The regularization parameter may be provided directly:

**>> p=sdquadratic(tr,'reg',0.01)**