perClass Documentation version 5.4 (7-Dec-2018)

Chapter 17: Classifier combining and cascades

This chapter describes combining multiple classifiers.

# 17.1. Classifier combining introduction ↩

Classifier combining leverages multiple classifiers to improve performance. Combining is known under many other names such as classifier fusion, multiple classifier systems, mixtures of experts etc.

The concept of classifier fusion is quite general. In fact, several perClass classifiers can be considered as compositions of elementary learners. This includes mixtures of Gaussians, decision trees, random forests or neural networks.

In this chapter, we describe different approaches based on fusion of any of the basic models into one system. Two major strategies are discussed:

In the first approach, soft outputs of the base models such as posterior probabilities or distances are combined. In the second, we fuse the crisp decisions.

# 17.2. Soft-output combining ↩

perClass supports two major approaches to soft-output classifier combining, namely fixed and trained combiners. Fixed combiners are rules chosen on the basis of our assumptions on the problem. For example, if we know that classifiers trained on different data representations are independent, we might choose product rule.

The second type of classifier combiner is trained. Instead of assuming any fusion strategy, we learn it from examples. Trained combiners simply use per-class outputs of base classifiers as new features.

## 17.2.1. Stacking multiple classifiers ↩

In order to build any combiner, we need to train the base classifiers, and stack them into a single pipeline. Lets consider this example on fruit data:

``````>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
1 Gaussian model         2x3  full cov.mat.
2 Normalization          3x3
3 Decision               3x1  weighting, 3 classes
>> p2=sdparzen(a)
.....sequential pipeline  2x1 'Parzen model+Decision'
1 Parzen model           2x3  260 prototypes, h=0.8
2 Decision               3x1  weighting, 3 classes
``````

The soft outputs of out classifiers are accessible by removing the decision step. This may be done using an unary `-` operator:

``````>> -p1
sequential pipeline       2x3 'Gaussian model+Normalization'
1 Gaussian model          2x3  full cov.mat.
2 Normalization           3x3

>> -p2
Parzen model pipeline     2x3  260 prototypes, h=0.8
``````

### 17.2.1.1. Creating stack pipelines by concatenation ↩

To combine soft outputs, we need to stack multiple classifiers together. This may be achieved using horizontal concatenation operator:

``````>> ps=[-p1 -p2]
stack pipeline            2x6  2 classifiers in 2D space
``````

Applying our stacking pipeline to the 2D data, we obtain 6D data set:

``````>> b=a*ps
'Fruit set' 260 by 6 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
``````

The new features correspond to per-class soft outputs of the base classifiers. Because our training set contains three classes, we have apple, banana and stone output from each base classifier:

``````>> +ps.lab

ans =

apple
banana
stone
apple
banana
stone
``````

### 17.2.1.2. Creating stack pipelines from cell arrays ↩

Classifier stacks may be also created from a cell array of pipeline using `sdstack` command.

``````>> ps=sdstack({-p1 -p2})
stack pipeline            2x6  2 classifiers in 2D space
``````

### 17.2.1.3. Accessing base classifiers from the stack ↩

Details on the stacked pipeline can be obtained with the `'` transpose operator:

``````>> ps'
stacked pipeline        2x6 'stack'
inlab:'1','2'
1 Gaussian model+Normalization      2x3
output: posterior
lab:'apple','banana','stone'
2 Parzen model                      2x3
output: probability density
lab:'apple','banana','stone'
``````

Any pipeline may be extracted from the stack by its index:

``````>> ps(1)
sequential pipeline       2x3 'Gaussian model+Normalization'
1 Gaussian model          2x3  full cov.mat.
2 Normalization           3x3
``````

## 17.2.2. Fixed combiners ↩

Fixed combiner is a rule applied to soft-outputs of the base classifiers. For example, we may wish to average-out errors of based classifiers. Therefore, we adopt a mean combination rule. It will create new soft output set with averages of each per-class output over all classifiers.

In perClas, fixed combiners are implemented through the `sdcombine` function. It takes a stacked pipeline as input and by default uses the mean combiner:

``````>> pc=sdcombine([-p1 -p2])
sequential pipeline       2x1 'stack+Fixed combiner+Decision'
1 stack                  2x6  2 classifiers in 2D space
2 Fixed combiner         6x3  mean rule
3 Decision               3x1  weighting, 3 classes
``````

### 17.2.2.1. Comparable soft output types ↩

It is important to note that all base classifiers must yield comparable soft outputs. Otherwise, the assumptions the combiner is based on do not hold. In our example above, the first base classifier returns posteriors while the second probability densities. These soft-outputs are not comparable.

Here we display soft outputs of our stack pipeline `ps` on first 10 samples in data set `a`:

``````>> +a(1:10)*ps

ans =

0.9963    0.0032    0.0005    0.0190    0.0012    0.0001
0.9827    0.0171    0.0001    0.0155    0.0011    0.0000
0.9651    0.0252    0.0097    0.0167    0.0007    0.0003
0.9989    0.0011    0.0000    0.0191    0.0004    0.0000
0.9971    0.0028    0.0000    0.0117    0.0004    0.0000
0.9982    0.0016    0.0002    0.0185    0.0011    0.0000
0.9877    0.0120    0.0004    0.0172    0.0021    0.0001
0.9987    0.0012    0.0001    0.0221    0.0006    0.0000
0.9786    0.0152    0.0062    0.0161    0.0003    0.0003
0.9884    0.0112    0.0004    0.0173    0.0020    0.0001
``````

Note that the first three columns sum to one (posteriors) while the last three do not (densities).

One way to make output of probabilist model comparable is to normalize them into posteriors with `sdnorm`. Lets us it on the Parzen pipeline `p2`:

``````>> sdnorm(-p2)
sequential pipeline       2x3 'Parzen model+Normalization'
1 Parzen model            2x3  260 prototypes, h=0.8
2 Normalization           3x3
``````

We recreate the stacked pipeline:

``````>> ps2=[-p1 sdnorm(-p2)]
stack pipeline            2x6  2 classifiers in 2D space

>> ps2'
stacked pipeline        2x6 'stack'
inlab:'1','2'
1 Gaussian model+Normalization      2x3
output: posterior
lab:'apple','banana','stone'
2 Parzen model+Normalization        2x3
output: posterior
lab:'apple','banana','stone'

>> pc=sdcombine(ps2)
sequential pipeline       2x1 'stack+Fixed combiner+Decision'
1 stack                   2x6  2 classifiers in 2D space
2 Fixed combiner          6x3  mean rule
3 Decision                3x1  weighting, 3 classes
``````

### 17.2.2.2. Changing fixed combination rule ↩

The combination rule may be specified in the `sdcombine` second parameter:

``````>> pc=sdcombine(ps2,'prod')
sequential pipeline       2x1 'stack+Fixed combiner+Decision'
1 stack                   2x6  2 classifiers in 2D space
2 Fixed combiner          6x3  prod rule
3 Decision                3x1  weighting, 3 classes
``````

Available fixed fusion rules:

• 'mean' - averaging out errors of base models
• 'prod' - product when assuming independence of the base classifiers (e.g. trained in different feature spaces)
• 'min' - choose least sure model
• 'max' - useful to pick the most sure classifier. Classifiers must be properly trained.

## 17.2.3. Trained combiners ↩

Trained combiner is a second-order classifier trained on a data set comprised of soft-outputs of base classifiers.

The advantage of trained combiners is that we may use any based classifier models even having incompatible soft outputs. For the second stage classifier, soft-outputs are mere new features.

``````>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)

1 Gaussian model          2x3  full cov.mat.
2 Normalization           3x3
3 Decision                3x1  weighting, 3 classes

>> p2=sdparzen(a)
.....sequential pipeline       2x1 'Parzen model+Decision'
1 Parzen model            2x3  260 prototypes, h=0.8
2 Decision                3x1  weighting, 3 classes

>> ps=[-p1 -p2]
stack pipeline            2x6  2 classifiers in 2D space
``````

Now we create a training set for the combiner. In this example, we reuse all training data:

``````>> b=a*ps
'Fruit set' 260 by 6 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
``````

The combiner is trained on the set `b`. We use simple linear discriminant:

``````>> pf=sdfisher(b)
sequential pipeline       6x1 'Fisher linear discriminant'
1 LDA                     6x2
2 Gaussian model          2x3  single cov.mat.
3 Normalization           3x3
4 Decision                3x1  weighting, 3 classes
``````

The final classifier is composed of a stack and the combiner:

``````>> pfinal=ps*pf
sequential pipeline       2x1 'stack+Fisher linear discriminant'
1 stack                   2x6  2 classifiers in 2D space
2 LDA                     6x2
3 Gaussian model          2x3  single cov.mat.
4 Normalization           3x3
5 Decision                3x1  weighting, 3 classes
``````

Testing our combined system on the independent set:

``````>> ts
'Fruit set' 20000 by 2 sddata, 3 classes: 'apple'(6667) 'banana'(6667) 'stone'(6666)
>> sdtest(ts,pfinal)

ans =

0.0920

>> sdscatter(a,pfinal)
``````

Note, that reuse of all training data for both stages may result in over-fitting of the combiner. Multiple strategies are possible such as validation set approach or stacked generalization (available in perClass via `sdstackgen` command).

For details, see e.g. P.Paclik, T.C.W.Landgrebe, D.M.J.Tax, R.P.W.Duin, On deriving the second-stage training set for trainable combiners, in proc. of Multiple Classifier Systems conference MCS 2005

# 17.3. Crisp combining of classifier decisions ↩

Second major approach of classifier fusion is to combine decisions of base classifiers, not their soft outputs. This has a major advantage in allowing us to optimize each of the base learner for the task at hand with ROC analysis. For example, in a defect detection system, we may tune each base classifier not to loose certain type of defect.

## 17.3.1. Stacks of classifiers returning decisions ↩

The `sdstack` function and horizontal concatenation can be used also on classifiers returning decisions.

Using our example from above, we will fuse the quadratic discriminant with Parzen classifier:

``````>> p1
1 Gaussian model          2x3  full cov.mat.
2 Normalization           3x3
3 Decision                3x1  weighting, 3 classes
>> p2
sequential pipeline       2x1 'Parzen model+Decision'
1 Parzen model            2x3  260 prototypes, h=0.8
2 Decision                3x1  weighting, 3 classes
``````

Note that we provide classifier pipelines directly, without removing the decision steps:

``````>> pscrisp=[p1 p2]
decision stack pipeline   2x2  2 classifiers in 2D space

>> pscrisp'
stacked pipeline        2x2 'decision stack'
inlab:'1','2'
output: decision
list:'apple','banana','stone'
2 Parzen model+Decision             2x1
output: decision
list:'apple','banana','stone'
``````

## 17.3.2. Crisp combining with 'all agree' rule ↩

The `sdcombine` command takes a stack of classifiers on its input:

``````>> pc=sdcombine(pscrisp)
sequential pipeline       2x1 'decision stack+crisp combiner'
1 decision stack          2x2  2 classifiers in 2D space
2 crisp combiner          2x1  all agree rule
``````

By default, the 'all agree' rule is used. This yields the class output only if all base classifier agree with it. If not, the reject decision is given. We may see the rejected areas in the scatter plot:

``````>> sdscatter(a,pc)
``````

The reject decision can be adjusted by the 'reject' option:

``````>> pc=sdcombine(pscrisp,'reject','stone')
sequential pipeline       2x1 'decision stack+crisp combiner'
1 decision stack          2x2  2 classifiers in 2D space
2 crisp combiner          2x1  all agree rule
>> sdscatter(a,pc)
``````

## 17.3.3. Crisp combining with 'at least' rule ↩

An alternative crisp combination rule is 'at least' N agree.

We build a set of 10 decision trees. Recall, that a `sdtree` performs a random split inside to get a validation set for pruning. We use this to introduce some variability in our 10 base classifiers. Without variability in the models, there is no benefit on fusion.

``````>> P={}; for i=1:10, P{i}=sdtree(a); end

>> ps=sdstack(P)
decision stack pipeline   2x10  10 classifiers in 2D space
``````

The 'at least' combination rule takes an extra argument specifying how many base classifiers need to agree. We also need to set the target class to avoid tie situations:

``````>> pc=sdcombine(ps,'at least',8,'target','apple')
sequential pipeline       2x1 'decision stack+crisp combiner'
1 decision stack          2x10 10 classifiers in 2D space
2 crisp combiner         10x1  on 'apple' at least 8 of 10

>> sdscatter(a,pc)
``````

The 'at least' fusion rule is very useful, for example, in purifying high precision decisions in defect detection applications.

# 17.4. Hierarchical classifiers and cascades ↩

Cascading is a form of classifier combining based on decisions. All examples, that are labeled with a specific decision are passed to different classifier. This approach allows us to design multi-stage systems or use best features for each sub-problem. Classifier hierarchy is built with `sdcascade` command.

Let us take an example where we want to build a detector for fruit, followed by classification of specific fruit type.

``````>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
``````

First, we prepare a training set of the fruit detector, joining 'apple' and 'banana' classes into 'fruit':

``````>> b=sdrelab(a,{'~stone','fruit'})
1: apple  -> fruit
2: banana -> fruit
3: stone  -> stone
'Fruit set' 260 by 2 sddata, 2 classes: 'stone'(60) 'fruit'(200)
``````

Now we train a detector for fruit. We use a mixture model because we know about multi-modality of our 'fruit' class:

``````>> pd=sddetect(b,'fruit',sdmixture)
[class 'fruit' init:.......... 6 clusters EM:done  6 comp]
1: stone  -> non-fruit
2: fruit  -> fruit
sequential pipeline       2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians    2x1  6 components, full cov.mat.
2 Decision                1x1  ROC thresholding on fruit (52 points, current 17)
``````

We visualize classifier decisions on the complete set and show also the ROC plot (stored in the detector `pd`):

``````>> sdscatter(a,pd,'roc')
``````

We choose the operating point where we do not loose much of precious fruit and save it back into the detector `pd` by pressing 's' key.

``````>> Setting the operating point 12 in sdppl object pd
sequential pipeline       2x1 'Mixture of Gaussians+Decision'
1 Mixture of Gaussians    2x1  6 components, full cov.mat.
2 Decision                1x1  ROC thresholding on fruit (52 points, current 12)
``````

Now we can build a classifier separating 'apple' and 'banana' examples. Lets use Parzen classifier:

``````>> p2=sdparzen( a(:,:,1:2) )
....sequential pipeline       2x1 'Parzen model+Decision'
1 Parzen model            2x2  200 prototypes, h=0.6
2 Decision                2x1  weighting, 2 classes
``````

Our detector returns 'fruit' and 'non-fruit' decisions:

``````>> pd.list
sdlist (2 entries)
ind name
1 fruit
2 non-fruit
``````

Now we can compose a cascade which first executes the `pd` detector on all samples. Samples, labeled 'fruit' will be passed on to the second-stage classifier `p2`:

``````>> pc=sdcascade(pd,'fruit',p2)
Classifier cascade pipeline 2x1  2 stages

>> sdscatter(a,pc)
``````

We can observe that our 'apple'/'banana' classifier is protected from all directions by the first stage detector.

## 17.4.1. Accessing operating points in a cascade ↩

The cascade is composed of several classifiers returning decisions:

``````>> pc'
inlab:'Feature 1','Feature 2'
1 Mixture of Gaussians+Decision     2x1
dec:'fruit','non-fruit'
2 Parzen model+Decision             2x1
applied on: 'fruit'
dec:'apple','banana'
``````

We may read-out the operating point or point set directly using the `ops` field:

``````>> pc(1).ops
Thr-based operating set (52 ops) at op 12
``````

## 17.4.2. Changing operating points in a cascade ↩

We may set the current operating point directly by providing a desired index from the ROC plot:

``````>> pc(1).curop=2
Classifier cascade pipeline 2x1  2 stages
``````

This will expand the detector boundary:

``````>> sdscatter(a,pc)
``````