Keywords: confusion matrices

Problem: Find out which samples suffer from a specific type of error (defined by a cell of the confusion matrix)

Solution: Use the `sdconfmatind` function to find indices of samples in a specific cell of a confusion matrix.

Let us assume a two-class fruit dataset (apple and banana), split into a training and a test set:

**>> load fruit; a=a(:,:,[1 2])**

'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)

**>> [tr,ts]=randsubset(a,0.5)**
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)

We train a Gaussian model, apply it to the test set, and obtain the classifier decisions (`dec`):

**>> p=sdgauss(tr)**
sequential pipeline 2x1 'Gaussian model+Decision'
1 Gaussian model 2x2 full cov.mat.
2 Decision 2x1 weighting, 2 classes
**>> dec=ts*p**
sdlab with 100 entries, 2 groups: 'apple'(57) 'banana'(43)

The confusion matrix compares the ground-truth labels, stored in the test dataset `ts`, to the decisions `dec`:

**>> sdconfmat(ts.lab,ts*p)**
ans =
 True      | Decisions
 Labels    |  apple banana  | Totals
---------------------------------------
 apple     |    46      4   |    50
 banana    |    11     39   |    50
---------------------------------------
 Totals    |    57     43   |   100
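
To make the meaning of each cell concrete, here is a minimal sketch in base MATLAB that counts a confusion matrix by hand. It does not use the toolbox; the label lists `truth` and `dec` are made-up toy values (not the fruit data), and the variable names are chosen only for this illustration.

```matlab
% Toy ground-truth labels and decisions (made-up values, not the fruit data)
truth   = {'apple';'apple';'banana';'banana';'apple';'banana'};
dec     = {'apple';'banana';'banana';'apple';'apple';'banana'};
classes = {'apple','banana'};

% cm(i,j) counts the samples whose true class is classes{i}
% and whose decision is classes{j}
n  = numel(classes);
cm = zeros(n);
for i = 1:n
    for j = 1:n
        cm(i,j) = sum(strcmp(truth, classes{i}) & strcmp(dec, classes{j}));
    end
end
disp(cm)
```

Rows correspond to true classes and columns to decisions, so the off-diagonal entries are the errors and each row sums to the number of samples of that true class.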

We would now like to find out which 4 `apple` samples are misclassified as `banana` by our classifier. We use the `sdconfmatind` function, providing it with the ground-truth labels, the decisions, and the true and estimated classes that define the confusion matrix cell (here 'apple' and 'banana'):

**>> ind=sdconfmatind(ts.lab,ts*p,'apple','banana')**
ind =
15
23
25
48
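
Conceptually, selecting a confusion matrix cell means keeping the samples whose true label and decision both match the requested pair. The following base-MATLAB sketch illustrates that idea on made-up toy labels; it is not the toolbox implementation of `sdconfmatind`, and `truth` and `dec` are hypothetical cell arrays introduced for this example.

```matlab
% Toy ground-truth labels and decisions (made-up values)
truth = {'apple';'apple';'banana';'apple';'banana';'apple'};
dec   = {'apple';'banana';'banana';'banana';'apple';'apple'};

% Indices of samples with true class 'apple' that were decided as 'banana'
ind = find(strcmp(truth, 'apple') & strcmp(dec, 'banana'));
disp(ind)
```

The returned indices refer to positions within the label list that was passed in, which in the example above is the test set, not the original data set.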

The four test samples are:

**>> ts(ind)**
'Fruit set' 4 by 2 sddata, class: 'apple'

How do we find these samples in the original data set `a`? It is actually quite easy. When we display details of data set `a` with the transpose operator...

**>> a'**
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
sample props: 'lab'->'class' 'class'(L) 'ident'(N)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N)

... we can see there is a numerical `ident` property. We stored an index of each sample in it (before splitting the data, so that the subsets carry it along) using:

**>> a.ident=1:length(a)**
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)

We may retrieve the `ident` property of any sample:

**>> a(10).ident**
ans =
10

Therefore, we may quickly see the original indices of the misclassified test samples:

**>> ts(ind).ident**
ans =
33
52
55
97
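
The identifier trick can be sketched in base MATLAB without the toolbox: a per-sample index vector is created before the split, the subset keeps the identifiers of the rows it received, and indices found within the subset are mapped back through them. All names and values below (`N`, `test_idx`, `ident_ts`) are hypothetical and serve only to illustrate the mapping.

```matlab
N        = 10;
ident    = (1:N)';           % index of each sample in the original set
test_idx = [2 3 5 8 9]';     % hypothetical rows that ended up in the test set
ident_ts = ident(test_idx);  % the test subset carries its identifiers along

ind  = [2; 4];               % positions of interest within the test subset
orig = ident_ts(ind);        % corresponding positions in the original set
disp(orig)
```

Because the identifiers are assigned before any subset is taken, the mapping stays valid no matter how the subset was drawn or how many times the data is split again.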