kb13: How to find samples with a specific type of error in a confusion matrix?

Published on: 17-sep-2015 (updated for 4.6)

perClass version used: 4.6 (29-jun-2015)

Problem: Find out which samples suffer from a specific type of error (a specific cell of a confusion matrix).

Solution: Use the `sdconfmatind` function to find indices of samples in a specific cell of a confusion matrix.

Let us assume a two-class fruit data set (`apple` and `banana`) split into a training and a test set:

``````
>> load fruit; a=a(:,:,[1 2])
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
``````

``````
>> [tr,ts]=randsubset(a,0.5)
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
``````

We train a Gaussian model, apply the model to the test set, and obtain the classifier decisions (`dec`):

``````
>> p=sdgauss(tr)
sequential pipeline       2x1 'Gaussian model+Decision'
1 Gaussian model          2x2  full cov.mat.
2 Decision                2x1  weighting, 2 classes

>> dec=ts*p
sdlab with 100 entries, 2 groups: 'apple'(57) 'banana'(43)
``````

The confusion matrix compares the ground-truth labels, stored in the test dataset `ts`, to the decisions `dec`:

``````
>> sdconfmat(ts.lab,ts*p)

ans =

True      | Decisions
Labels    |   apple  banana  | Totals
---------------------------------------
apple     |     46       4   |     50
banana    |     11      39   |     50
---------------------------------------
Totals    |     57      43   |    100
``````
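To make the meaning of each cell concrete, here is a minimal sketch in base MATLAB (no perClass functions) that counts a confusion matrix from two lists of label names. The toy labels `truelab` and `declab` are hypothetical and are not the fruit data; the sketch only illustrates the counting, not how `sdconfmat` is implemented.

``````matlab
% Hypothetical toy data: true labels and decisions as cell arrays of names
truelab = {'apple';'apple';'apple';'banana';'banana';'banana'};
declab  = {'apple';'banana';'apple';'apple';'banana';'banana'};

classes = {'apple','banana'};         % fixed class order for rows and columns

% Map each name to its position in 'classes'
[~,ti] = ismember(truelab,classes);   % row index: true class
[~,di] = ismember(declab,classes);    % column index: decision

% Count the samples falling into each (true class, decision) cell
cm = accumarray([ti di],1,[numel(classes) numel(classes)])
``````

Rows correspond to true classes and columns to decisions, as in the table above, so off-diagonal entries count the errors.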

We would now like to find the 4 `apple` samples that our classifier misclassifies as `banana`. We use the `sdconfmatind` function, providing it with the ground-truth labels, the decisions, and the true and estimated classes that define the confusion matrix cell (here 'apple' and 'banana'):

``````
>> ind=sdconfmatind(ts.lab,ts*p,'apple','banana')

ind =

15
23
25
48
``````
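Conceptually, selecting one confusion matrix cell means keeping the samples whose true label and decision both match the requested pair. The following sketch shows this idea in base MATLAB on hypothetical toy labels (`truelab` and `declab` are made up for illustration); it is not the implementation of `sdconfmatind`.

``````matlab
% Hypothetical toy data: true labels and decisions as cell arrays of names
truelab = {'apple';'apple';'apple';'banana';'banana';'banana'};
declab  = {'apple';'banana';'apple';'apple';'banana';'banana'};

% Indices of samples that are truly 'apple' but were decided as 'banana'
ind = find(strcmp(truelab,'apple') & strcmp(declab,'banana'))
``````

The returned indices refer to positions within the labeled set that was passed in, which is why the `sdconfmatind` indices above refer to the test set `ts`.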

The four test samples are:

``````
>> ts(ind)
'Fruit set' 4 by 2 sddata, class: 'apple'
``````

How do we find these samples in the original data set `a`? It is actually quite easy. When we display details of data set `a` with the transpose operator...

``````
>> a'
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
sample props: 'lab'->'class' 'class'(L) 'ident'(N)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)
``````

... we can see that there is a numerical `ident` property. We stored the index of each sample in this property before splitting the data, so that `tr` and `ts` inherit it:

``````
>> a.ident=1:length(a)
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
``````

We may retrieve the `ident` property on any sample:

``````
>> a(10).ident

ans =

10
``````

Therefore, we can directly read off the original indices of the misclassified test samples:

``````
>> ts(ind).ident

ans =

33
52
55
97
``````
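The reason this works is that an identifier stored per sample travels with the sample through any subsetting. A minimal sketch of the same idea in base MATLAB, using plain arrays instead of an `sddata` object (all variable names and values here are hypothetical):

``````matlab
n = 10;
data  = rand(n,2);          % hypothetical feature matrix
ident = (1:n)';             % identifier = row index in the original data

% Random split: keep the identifiers alongside the test rows
perm     = randperm(n);
ts_rows  = perm(1:n/2);
ts_data  = data(ts_rows,:);
ts_ident = ident(ts_rows);

ind  = [2;4];               % positions inside the test subset (e.g. errors)
orig = ts_ident(ind);       % corresponding rows in the original data

% The rows found via the identifiers are the same samples
same = isequal(data(orig,:),ts_data(ind,:))
``````

Whatever the random split, `same` is true, because `orig` holds the original row numbers of the selected test samples.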