perClass Documentation version 5.4 (7-Dec-2018)

kb26: Useful tips for confusion matrices

Keywords: confusion matrices, meta-data

Published on: 15-jan-2013

perClass version used: 3.4 (9-Oct-2012)

# 26.1. Confusion matrices

A confusion matrix is an indispensable tool for understanding the structure of classifier errors.

In this article, we show several practical perClass tools for getting the most out of confusion matrices.

## 26.1.1. Constructing a confusion matrix

Let's consider, for example, a medical data set:

``````
>> a
'medical D/ND' 5762 by 10 sddata, 2 classes: 'no-disease'(4267) 'disease'(1495)
``````

The data set contains measurements from 16 patients:

``````
>> a.patient
sdlab with 5762 entries, 16 groups
``````

We will use the first 8 patients to train a classifier and the remaining ones to estimate its performance:

``````
>> [tr,ts]=subset(a,'patient',1:8)
'medical D/ND' 2920 by 10 sddata, 2 classes: 'no-disease'(2032) 'disease'(888)
'medical D/ND' 2842 by 10 sddata, 2 classes: 'no-disease'(2235) 'disease'(607)

>> sdscatter(tr)
``````

In the scatter plot, we can see that the classes are quite complex. We therefore decide to use a non-parametric Parzen classifier, which does not impose any assumptions on class shape. We train the Parzen classifier on the training subset:

``````
>> p=sdparzen(tr)
Parzen pipeline         10x2  2 classes, 2920 prototypes (sdp_parzen)
``````

We then add a decision output with the `sddecide` function:

``````
>> pd=sddecide(p)
sequential pipeline     10x1 'Parzen+Decision'
1  Parzen                 10x2  2 classes, 2920 prototypes (sdp_parzen)
2  Decision                2x1  weighting, 2 classes, 1 ops at op 1 (sdp_decide)
``````

If executed on new data, the pipeline `pd` returns decisions:

``````
>> dec=ts*pd
sdlab with 2842 entries, 2 groups: 'no-disease'(1527) 'disease'(1315)
``````

We construct the confusion matrix by providing true labels on the test set and the decisions of our classifier on the same samples:

``````
>> sdconfmat(ts.lab,dec)

ans =

True        | Decisions
Labels      |  no-dis  diseas  | Totals
-----------------------------------------
no-disease  |   1479     756   |   2235
disease     |     48     559   |    607
-----------------------------------------
Totals      |   1527    1315   |   2842
``````

The confusion matrix shows how the decisions in columns fit the true classes in rows. For example, we can see that 48 disease samples were misclassified as no-disease.
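To make the counting behind this table concrete, here is a minimal sketch in plain MATLAB (no perClass calls) that tallies a confusion matrix from two label lists. The toy labels and variable names are made up for illustration, and this is not meant to describe how `sdconfmat` works internally.

```matlab
% Hypothetical toy labels (not the medical data set above)
truth = {'no-disease';'no-disease';'disease';'disease';'no-disease'};
dec   = {'no-disease';'disease';'disease';'no-disease';'no-disease'};
names = {'no-disease','disease'};   % row/column order of the matrix

% Map each label to its position in 'names'
[~,ti] = ismember(truth,names);
[~,di] = ismember(dec,names);

% Count how often each (true class, decision) pair occurs
CM = accumarray([ti di],1,[numel(names) numel(names)]);
disp(CM)
```

Each sample adds one count to the cell given by its true class (row) and its decision (column), so the row totals are the class sizes and the column totals are the numbers of decisions per class.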

## 26.1.2. Fixing the order of rows and columns

Notice that the classes and decisions are listed in the order in which they appear in our test set and classifier, respectively. This may not be ideal. For example, in medical problems we consider the 'disease' class as 'positive'. We would therefore like it to come first, so that the upper-left corner of the confusion matrix holds the true positives (correctly detected disease).

The `sdconfmat` command allows us to specify the order of classes and decisions to make sure all our confusion matrices are comparable.

We may simply provide a cell array in the desired order through the 'classes' and 'decisions' options:

``````
>> sdconfmat(ts.lab,dec,'classes',{'disease','no-disease'},'decisions',{'disease','no-disease'})

ans =

True        | Decisions
Labels      |  diseas  no-dis  | Totals
-----------------------------------------
disease     |    559      48   |    607
no-disease  |    756    1479   |   2235
-----------------------------------------
Totals      |   1315    1527   |   2842
``````

If we assign the output of `sdconfmat` to a variable, we obtain a numerical matrix:

``````
>> CM=sdconfmat(ts.lab,dec,'classes',{'disease','no-disease'},'decisions',{'disease','no-disease'})

CM =

559          48
756        1479
``````

With the 'classes' and 'decisions' order fixed, we can be sure that CM(1,1) refers to true positives and CM(2,1) to false positives.
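As a small illustration of why a known layout is useful, the sketch below derives a few common performance measures from the numerical matrix shown above. It uses plain MATLAB only; the variable names are chosen here for illustration.

```matlab
% Numerical matrix from the example above:
% rows = true classes, columns = decisions, both ordered {'disease','no-disease'}
CM = [559 48; 756 1479];

TP = CM(1,1);  FN = CM(1,2);   % true 'disease' samples
FP = CM(2,1);  TN = CM(2,2);   % true 'no-disease' samples

sensitivity = TP/(TP+FN);          % fraction of disease samples detected (about 0.92)
specificity = TN/(TN+FP);          % fraction of no-disease samples kept (about 0.66)
precision   = TP/(TP+FP);          % fraction of 'disease' decisions that are correct (about 0.43)
err         = (FP+FN)/sum(CM(:));  % overall error rate (about 0.28)

fprintf('sens %.3f  spec %.3f  prec %.3f  err %.3f\n', ...
        sensitivity,specificity,precision,err);
```

The indexing is only valid because the order of 'classes' and 'decisions' is fixed; with the default order, the same indices would point at different cells.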

This feature becomes even more important in everyday practice, where some test sets do not contain all the classes. For example, the last patient in our test set does not have any known diseased tissue (lucky man!). The default confusion matrix then has a different shape:

``````
>> sub=subset(ts,'patient',8)
'medical D/ND' 400 by 10 sddata, class: 'no-disease'

>> sdconfmat(sub.lab,sub*pd)

ans =

True        | Decisions
Labels      |  no-dis  diseas  | Totals
-----------------------------------------
no-disease  |    135     265   |    400
-----------------------------------------
Totals      |    135     265   |    400
``````

With the 'classes' and 'decisions' options, we make sure the output is square and correctly ordered:

``````
>> sdconfmat(sub.lab,sub*pd,'classes',{'disease','no-disease'},'decisions',{'disease','no-disease'})

ans =

True        | Decisions
Labels      |  diseas  no-dis  | Totals
-----------------------------------------
disease     |      0       0   |      0
no-disease  |    265     135   |    400
-----------------------------------------
Totals      |    265     135   |    400
``````
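One practical payoff of a fixed layout is that matrices from different subsets can be combined element by element. The sketch below uses only the two matrices printed in this section, entered by hand as plain MATLAB arrays; the result for the remaining seven test patients is derived by subtraction, not taken from a perClass run.

```matlab
% Both matrices use the order {'disease','no-disease'} for rows and columns
CM_all = [559  48; 756 1479];   % whole test set (8 patients)
CM_tom = [  0   0; 265  135];   % last patient only, with the zero 'disease' row

% Because shape and ordering agree, plain matrix arithmetic is meaningful
CM_rest = CM_all - CM_tom;      % the remaining seven test patients
disp(CM_rest)

% Sanity check: per-class totals still match (607 disease, 2235-400 no-disease)
assert(isequal(sum(CM_rest,2),[607; 1835]));
```

With the default 1-by-2 matrix for the single-class subset, this subtraction would not line up with the rows of the full matrix.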

## 26.1.3. Extracting samples in a specific matrix field

How do we find out which samples are false positives in the previous example? We may use the `sdconfmatind` command, providing the true labels and decisions and asking for a specific class/decision combination:

``````
>> ind=sdconfmatind(sub.lab,sub*pd,'no-disease','disease');
>> length(ind)

ans =

265
``````

The samples may be accessed easily by:

``````
>> sub(ind)
'medical D/ND' 265 by 10 sddata, class: 'no-disease'
``````

We may, for example, change their label to easily visualize them in a scatter plot:

``````
>> sub(ind).lab='no-disease FP'
'medical D/ND' 400 by 10 sddata, 2 classes: 'no-disease'(135) 'no-disease FP'(265)
>> sdscatter(sub)
``````
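Conceptually, selecting the samples of one matrix field means finding the positions where a given true label and a given decision coincide. The following plain-MATLAB sketch shows that idea on made-up labels held in cell arrays; it is an illustration of the concept, not the implementation of `sdconfmatind`.

```matlab
% Hypothetical toy labels
truth = {'no-disease';'no-disease';'disease';'no-disease'};
dec   = {'disease';'no-disease';'disease';'disease'};

% Samples whose true class is 'no-disease' but which were decided as 'disease'
ind = find(strcmp(truth,'no-disease') & strcmp(dec,'disease'));
disp(ind')
```

The number of indices returned equals the count in the corresponding field of the confusion matrix.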

## 26.1.4. Visualize data set structure with confusion matrices

We usually think about a confusion matrix as a tool describing true labels and classifier decisions. However, we may also use it with great benefit on any two sets of labels representing the same objects.

For example, we may quickly check the mapping of patients to classes:

``````
>> sdconfmat(ts.patient, ts.lab)

ans =

True      | Decisions
Labels    |  no-dis  diseas  | Totals
---------------------------------------
Irene     |    285      20   |    305
Monica    |    160      61   |    221
Nick      |    297      19   |    316
Olaf      |    336      64   |    400
Paul      |     86     314   |    400
Rob       |    362      38   |    400
Steffany  |    309      91   |    400
Tom       |    400       0   |    400
---------------------------------------
Totals    |   2235     607   |   2842
``````

Note the last patient, "Tom": this is the healthy patient we saw in the previous sections.
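The same kind of cross-tabulation can be sketched in plain MATLAB for any two label lists that describe the same samples. The patient names and labels below are invented for illustration.

```matlab
% Hypothetical toy data: one patient name and one class label per sample
patient = {'Ann';'Ann';'Bob';'Bob';'Bob';'Cy'};
lab     = {'disease';'no-disease';'no-disease';'no-disease';'disease';'no-disease'};

% unique returns the sorted distinct names and, as third output,
% the index of each sample into that list
[pnames,~,pidx] = unique(patient);
[lnames,~,lidx] = unique(lab);

% Rows: patients, columns: classes
T = accumarray([pidx(:) lidx(:)],1,[numel(pnames) numel(lnames)]);
disp(T)
```

Unlike a classifier confusion matrix, such a table is generally not square, and a zero entry simply means that the combination does not occur in the data.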

## 26.1.5. Clean and replace confusion matrix content

I saved one really special option of `sdconfmat` for the last example: we may easily alter the displayed confusion matrix output with string replacement rules.

Imagine we work with a multi-class problem such as digit recognition:

``````
>> load digits
>> a
'Digits' 2000 by 256 sddata, 10 classes: [200  200  200  200  200  200  200  200  200  200]
>> a.lab'
ind name size percentage
1 0     200 (10.0%)
2 1     200 (10.0%)
3 2     200 (10.0%)
4 3     200 (10.0%)
5 4     200 (10.0%)
6 5     200 (10.0%)
7 6     200 (10.0%)
8 7     200 (10.0%)
9 8     200 (10.0%)
10 9     200 (10.0%)
``````

We train a classifier and get a confusion matrix on the whole set:

``````
>> p=sdpca([],10)*sdparzen*sddecide
untrained pipeline 3 steps: sdpca+sdparzen+sdp_decide

>> pd=a*p
.sequential pipeline     256x1 'PCA+Parzen+Decision'
1  PCA                   256x10 44%% of variance (sdp_affine)
2  Parzen                 10x10 10 classes, 2000 prototypes (sdp_parzen)
3  Decision               10x1  weighting, 10 classes, 1 ops at op 1 (sdp_decide)

>> sdconfmat(a.lab,a*pd)

ans =

True      | Decisions
Labels    |       0       1       2       3       4       5       6       7       8       9  | Totals
-------------------------------------------------------------------------------------------------------
0         |    179       1       1       2       3       7       7       0       0       0   |    200
1         |      0     178       0       0       6       0       8       4       1       3   |    200
2         |      1       2     179       5       2       1       1       1       7       1   |    200
3         |      0       1       1     181       1       3       0       3       6       4   |    200
4         |      0      13       2       0     153       1       3       1       4      23   |    200
5         |      5       0       6      14       8     152       2       4       5       4   |    200
6         |      1      12       3       0       1       1     181       0       1       0   |    200
7         |      1       1       0       0       1       0       0     173       2      22   |    200
8         |      0       9       1      11       3       9       0       1     160       6   |    200
9         |      0       0       1       0      11       1       1       6       5     175   |    200
-------------------------------------------------------------------------------------------------------
Totals    |    187     217     194     213     189     175     203     193     191     238   |   2000
``````

The matrix is not very readable, especially if we normalize it by rows so that the entries represent per-class error rates or performances:

``````
>> sdconfmat(a.lab,a*pd,'norm')

ans =

True      | Decisions
Labels    |       0       1       2       3       4       5       6       7       8       9  | Totals
-------------------------------------------------------------------------------------------------------
0         |  0.895   0.005   0.005   0.010   0.015   0.035   0.035   0.000   0.000   0.000   | 1.00
1         |  0.000   0.890   0.000   0.000   0.030   0.000   0.040   0.020   0.005   0.015   | 1.00
2         |  0.005   0.010   0.895   0.025   0.010   0.005   0.005   0.005   0.035   0.005   | 1.00
3         |  0.000   0.005   0.005   0.905   0.005   0.015   0.000   0.015   0.030   0.020   | 1.00
4         |  0.000   0.065   0.010   0.000   0.765   0.005   0.015   0.005   0.020   0.115   | 1.00
5         |  0.025   0.000   0.030   0.070   0.040   0.760   0.010   0.020   0.025   0.020   | 1.00
6         |  0.005   0.060   0.015   0.000   0.005   0.005   0.905   0.000   0.005   0.000   | 1.00
7         |  0.005   0.005   0.000   0.000   0.005   0.000   0.000   0.865   0.010   0.110   | 1.00
8         |  0.000   0.045   0.005   0.055   0.015   0.045   0.000   0.005   0.800   0.030   | 1.00
9         |  0.000   0.000   0.005   0.000   0.055   0.005   0.005   0.030   0.025   0.875   | 1.00
-------------------------------------------------------------------------------------------------------
``````

Many of the entries are small numbers that only obscure the bigger picture.
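Row normalization itself is easy to reproduce on a numerical count matrix. The sketch below uses plain MATLAB and the 2-by-2 matrix from section 26.1.2; it shows the arithmetic only and makes no claim about how the 'norm' option is implemented.

```matlab
% Count matrix from section 26.1.2 (rows = true classes)
CM = [559 48; 756 1479];

% Divide each row by its total; bsxfun also works on older MATLAB releases
CMn = bsxfun(@rdivide,CM,sum(CM,2));
disp(CMn)
```

Each row of `CMn` sums to one, so the diagonal holds the per-class performances and the off-diagonal entries the per-class error rates. In this sketch, a class without any samples would produce 0/0 = NaN and needs separate handling.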

With the 'replace' option of `sdconfmat`, we may specify how some strings in the final matrix will get replaced. For example, we may turn each 0.000 string into a simple dash:

``````
>> sdconfmat(a.lab,a*pd,'norm','replace',{'0.000','  -  '})

ans =

True      | Decisions
Labels    |       0       1       2       3       4       5       6       7       8       9  | Totals
-------------------------------------------------------------------------------------------------------
0         |  0.895   0.005   0.005   0.010   0.015   0.035   0.035     -       -       -     | 1.00
1         |    -     0.890     -       -     0.030     -     0.040   0.020   0.005   0.015   | 1.00
2         |  0.005   0.010   0.895   0.025   0.010   0.005   0.005   0.005   0.035   0.005   | 1.00
3         |    -     0.005   0.005   0.905   0.005   0.015     -     0.015   0.030   0.020   | 1.00
4         |    -     0.065   0.010     -     0.765   0.005   0.015   0.005   0.020   0.115   | 1.00
5         |  0.025     -     0.030   0.070   0.040   0.760   0.010   0.020   0.025   0.020   | 1.00
6         |  0.005   0.060   0.015     -     0.005   0.005   0.905     -     0.005     -     | 1.00
7         |  0.005   0.005     -       -     0.005     -       -     0.865   0.010   0.110   | 1.00
8         |    -     0.045   0.005   0.055   0.015   0.045     -     0.005   0.800   0.030   | 1.00
9         |    -       -     0.005     -     0.055   0.005   0.005   0.030   0.025   0.875   | 1.00
-------------------------------------------------------------------------------------------------------
``````

That helps a bit. However, most of the entries are still small numbers that are not really important for an overall understanding.

The nice thing about the `sdconfmat` 'replace' option is that the pattern may be any regular expression. For example, '0.00\d' means '0.00' followed by one digit. This helps us remove entries smaller than 1%:

``````
>> sdconfmat(a.lab,a*pd,'norm','replace',{'0.00\d','  -  '})

ans =

True      | Decisions
Labels    |       0       1       2       3       4       5       6       7       8       9  | Totals
-------------------------------------------------------------------------------------------------------
0         |  0.895     -       -     0.010   0.015   0.035   0.035     -       -       -     | 1.00
1         |    -     0.890     -       -     0.030     -     0.040   0.020     -     0.015   | 1.00
2         |    -     0.010   0.895   0.025   0.010     -       -       -     0.035     -     | 1.00
3         |    -       -       -     0.905     -     0.015     -     0.015   0.030   0.020   | 1.00
4         |    -     0.065   0.010     -     0.765     -     0.015     -     0.020   0.115   | 1.00
5         |  0.025     -     0.030   0.070   0.040   0.760   0.010   0.020   0.025   0.020   | 1.00
6         |    -     0.060   0.015     -       -       -     0.905     -       -       -     | 1.00
7         |    -       -       -       -       -       -       -     0.865   0.010   0.110   | 1.00
8         |    -     0.045     -     0.055   0.015   0.045     -       -     0.800   0.030   | 1.00
9         |    -       -       -       -     0.055     -       -     0.030   0.025   0.875   | 1.00
-------------------------------------------------------------------------------------------------------
``````
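The effect of such a pattern can be tried on a single line of text with MATLAB's own `regexprep`. Whether `sdconfmat` relies on `regexprep` internally is an assumption here; the sketch only illustrates the pattern. Note that an unescaped `.` in a regular expression matches any character, so the stricter form `0\.00\d` is used below.

```matlab
% One row of normalized values, as text
row = '0.895   0.005   0.010';

% '0\.00\d': a literal '0.00' followed by exactly one digit
out = regexprep(row,'0\.00\d','  -  ');
disp(out)
```

The replacement string has the same width as the matched text (five characters), which keeps the columns of the printed matrix aligned.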

If we want to get rid of entries smaller than 5%, we may specify the pattern '0.0[01234]\d', which means '0.0' followed by any digit from the list [01234] and then by any single digit:

``````
>> sdconfmat(a.lab,a*pd,'norm','replace',{'0.0[01234]\d','  -  '})

ans =

True      | Decisions
Labels    |       0       1       2       3       4       5       6       7       8       9  | Totals
-------------------------------------------------------------------------------------------------------
0         |  0.895     -       -       -       -       -       -       -       -       -     | 1.00
1         |    -     0.890     -       -       -       -       -       -       -       -     | 1.00
2         |    -       -     0.895     -       -       -       -       -       -       -     | 1.00
3         |    -       -       -     0.905     -       -       -       -       -       -     | 1.00
4         |    -     0.065     -       -     0.765     -       -       -       -     0.115   | 1.00
5         |    -       -       -     0.070     -     0.760     -       -       -       -     | 1.00
6         |    -     0.060     -       -       -       -     0.905     -       -       -     | 1.00
7         |    -       -       -       -       -       -       -     0.865     -     0.110   | 1.00
8         |    -       -       -     0.055     -       -       -       -     0.800     -     | 1.00
9         |    -       -       -       -     0.055     -       -       -       -     0.875   | 1.00
-------------------------------------------------------------------------------------------------------
``````

This last solution lets us get a quick insight into the overlap patterns in our problem.
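To check exactly where a threshold pattern draws the line, it can be applied to a few boundary values with plain `regexprep`. This is an illustration only: the values are made up, the dot is escaped for strictness, and `[0-4]` is shorthand for `[01234]`.

```matlab
% Hypothetical formatted entries around the 5% boundary
vals = {'0.000','0.049','0.050','0.115','0.895'};

% '0.0', then a digit from 0 to 4, then any digit: matches 0.000 to 0.049
pat = '0\.0[0-4]\d';
out = regexprep(vals,pat,'  -  ');
disp(out)
```

Values below 0.050 are replaced by the dash, while 0.050 and above are left untouched.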