perClass Documentation version 5.4 (7-Dec-2018)

kb18: How to protect a trained discriminant against outliers?

Published on: 4-oct-2010

perClass version used: 2.2.4 (5-oct-2010)

(Please note that starting with perClass 3.x (June 2011), internal parameters of pipelines are not directly accessible)

Problem: How can I protect a multi-class discriminant against accepting outliers?

Solution: Add a rejection threshold to the discriminant operating point.

Classifiers we train are often executed in environments where new types of measurements appear that were not considered during classifier design. For example, in a fruit sorting problem our classifier distinguishing several types of fruit may also encounter stones, leaves or dirt on the conveyor belt. Accepting stones or dirt as one of the fruit classes results in high sorting error.

In this tutorial, we discuss how to protect a trained multi-class discriminant from accepting such outliers.

The approach we take in this example is adding a reject option to a trained discriminant. This method does not use any outlier examples during training. Note, however, that some outlier examples are still needed for the sake of evaluation.

# 18.1. Fruit data set example ↩

Our data set contains three classes, namely the apple and banana fruit classes and some stones we have observed. Our goal is to discriminate apples from bananas while protecting the decisions against any potential outliers.

``````>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)

>> sdscatter(a)
``````

Let us first split our data set into training and test subsets. As mentioned before, we will not use the stone class during training, only in the testing phase. Using the `randsubset` method, we may randomly sample only some of the classes:

``````>> [tr,ts]=randsubset(a,[50 50 0])
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
'Fruit set' 160 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(60)
``````
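Outside perClass, this per-class sampling can be sketched in plain Python. The `randsubset` function below is a hypothetical stand-in operating on a label list, not the perClass implementation:

```python
import random

def randsubset(labels, per_class_counts, seed=0):
    """Draw a fixed number of samples per class for training; everything
    else (including classes drawn with count 0) goes to the test set."""
    rng = random.Random(seed)  # reproducible split
    classes = sorted(set(labels))
    train_idx, test_idx = [], []
    for cls, count in zip(classes, per_class_counts):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        train_idx += idx[:count]
        test_idx += idx[count:]
    return sorted(train_idx), sorted(test_idx)

# toy labels: 4 apples, 4 bananas, 2 stones
labels = ['apple'] * 4 + ['banana'] * 4 + ['stone'] * 2
# keep 2 apples and 2 bananas for training, no stones (as in [50 50 0] above)
tr_idx, ts_idx = randsubset(labels, [2, 2, 0])
```

Because the stone count is zero, all stone samples end up in the test set, mirroring the `[50 50 0]` split above.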

# 18.2. Training a discriminant ↩

We will now train a model of interest on the two-class fruit problem `tr`. In this example, we use the Parzen classifier:

``````>> p=sdparzen(tr)
...Parzen pipeline         2x2  2 classes, 100 prototypes (sdp_parzen)
``````

To provide decisions, we need to explicitly add a desired operating point using `sddecide`. We will use the default setting with equal weights for the class outputs:

``````>> pd=sddecide(p)
sequential pipeline     2x1 'Parzen+Decision'
1  Parzen                  2x2  2 classes, 100 prototypes (sdp_parzen)
2  Decision                2x1  weighting, 2 classes, 1 ops at op 1 (sdp_decide)
``````
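Conceptually, the Parzen classifier keeps the training samples as prototypes, estimates one kernel density per class, and the decision step assigns the class with the highest weighted output. A minimal one-dimensional sketch in Python (illustrative only, not the perClass implementation; the feature values and bandwidth are made up):

```python
import math

def parzen_density(x, prototypes, h=0.5):
    """Parzen estimate: average of Gaussian kernels centred on the prototypes."""
    k = sum(math.exp(-((x - p) / h) ** 2 / 2) for p in prototypes)
    return k / (len(prototypes) * h * math.sqrt(2 * math.pi))

# toy 1-D features: apples near 1, bananas near 5
prototypes = {'apple': [0.8, 1.0, 1.2], 'banana': [4.8, 5.0, 5.2]}
weights = {'apple': 0.5, 'banana': 0.5}  # equal class weights, as with the default decision

def decide(x):
    """Assign the class with the highest weighted density output."""
    outputs = {c: weights[c] * parzen_density(x, protos)
               for c, protos in prototypes.items()}
    return max(outputs, key=outputs.get)

# note: even a sample far from both classes is forced into one of them,
# which is exactly the outlier problem this tutorial addresses
```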

We visualize the decisions of our two-class discriminant `pd` on the test set `ts`:

``````>> sdscatter(ts,pd)
``````

We may observe that the existing stones (green markers) are assigned to one of the two fruit classes.

``````>> sdconfmat(ts.lab,ts*pd)

ans =

True      | Decisions
Labels    |  apple banana  | Totals
-------------------------------------
apple     |    49      1   |    50
banana    |     0     50   |    50
stone     |     3     57   |    60
-------------------------------------
Totals    |    52    108   |   160
``````
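The bookkeeping behind such a confusion matrix is simple to replicate; a minimal Python sketch (a hypothetical helper, not perClass's `sdconfmat`):

```python
def confmat(true_labels, decisions):
    """Nested dict of counts: rows are true labels, columns are decisions."""
    rows = sorted(set(true_labels))
    cols = sorted(set(decisions))
    table = {t: {d: 0 for d in cols} for t in rows}
    for t, d in zip(true_labels, decisions):
        table[t][d] += 1
    return table

# toy example: stones are forced into the fruit classes, as above
true = ['apple', 'apple', 'banana', 'stone', 'stone']
dec  = ['apple', 'banana', 'banana', 'apple', 'banana']
cm = confmat(true, dec)
```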

# 18.3. Adding reject option to the discriminant ↩

Let us now add the reject option to the operating point in `pd` using the `sdreject` command. `sdreject` adds a threshold on the maximum weighted output of the discriminant in `pd`. The threshold value is selected so that a specified percentage of the training data is rejected.

``````>> pr=sdreject(pd,tr)
Weight-based operating point,2 classes,[0.50,0.50]
sequential pipeline     2x1 'Parzen+Decision'
1  Parzen                  2x2  2 classes, 100 prototypes (sdp_parzen)
2  Decision                2x1  weight+reject, 3 decisions, ROC 1 ops at op 1 (sdp_decide)
``````

The resulting pipeline `pr` returns three decisions:

``````>> pr.list
sdlist (3 entries)
ind name
1 apple
2 banana
3 reject
``````

As we may see on the training set, 1% of the samples are rejected by default:

``````>> sdconfmat(tr.lab,tr*pr)

ans =

True      | Decisions
Labels    |  apple banana reject  | Totals
--------------------------------------------
apple     |    48      1      1   |    50
banana    |     2     48      0   |    50
--------------------------------------------
Totals    |    50     49      1   |   100
``````

When executed on the test set, our new classifier with the reject option `pr` rejects most of the stone samples:

``````>> sdconfmat(ts.lab,ts*pr)

ans =

True      | Decisions
Labels    |  apple banana reject  | Totals
--------------------------------------------
apple     |    46      1      3   |    50
banana    |     0     49      1   |    50
stone     |     0      9     51   |    60
--------------------------------------------
Totals    |    46     59     55   |   160
``````
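The mechanism behind `sdreject` can be sketched as follows: compute the maximum output for each training sample, place the threshold at the quantile matching the desired reject fraction, and reject any new sample whose maximum output falls below it. An illustrative Python version (not perClass code; the Parzen model and data are toy stand-ins):

```python
import math

def parzen_density(x, prototypes, h=0.5):
    """Average of Gaussian kernels centred on the prototypes."""
    k = sum(math.exp(-((x - p) / h) ** 2 / 2) for p in prototypes)
    return k / (len(prototypes) * h * math.sqrt(2 * math.pi))

prototypes = {'apple': [0.8, 1.0, 1.2], 'banana': [4.8, 5.0, 5.2]}

def max_output(x):
    return max(parzen_density(x, p) for p in prototypes.values())

def fit_reject_threshold(train_samples, reject_fraction=0.01):
    """Pick the threshold so that `reject_fraction` of the training
    samples fall strictly below it."""
    outs = sorted(max_output(x) for x in train_samples)
    k = int(len(outs) * reject_fraction)
    return outs[k]

def classify(x, threshold):
    """Assign the highest-density class, or reject if its output is too low."""
    outputs = {c: parzen_density(x, p) for c, p in prototypes.items()}
    best = max(outputs, key=outputs.get)
    return best if outputs[best] >= threshold else 'reject'

train = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
thr = fit_reject_threshold(train)
```

A sample near a class keeps its fruit label, while a far-away sample falls below the density threshold and is rejected.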

Finally, we visualize the decisions of the classifier with reject option on the test set:

``````>> sdscatter(ts,pr)
``````

# 18.4. Building reject curve ↩

Instead of fixing the rejection fraction manually, we may build an entire reject curve relating multiple rejection fractions to performance. This is achieved using the `sdroc` command with the `'reject'` option.

Similarly to standard ROC analysis, we first need to estimate the soft outputs of our trained model:

``````>> out=tr*p
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
``````

Now we invoke the `sdroc` command with the `'reject'` option:

``````>> r=sdroc(out,'reject')
ROC (1001 wr-based op.points, 3 measures), curop: 1
est: 1:frac(reject)=0.00, 2:TPr(apple)=0.98, 3:TPr(banana)=0.96
``````

By default, the fraction of rejected samples and the per-class true positive rates (recalls) are estimated.

To visualize the interactive ROC and scatter plots, use the `sdscatter` command:

``````>> sdscatter(ts,p*r,'roc',r)
``````

Note that we may visualize the test set containing the additional stone examples. Moving the mouse over the ROC plot, we may investigate how the classifier boundary changes with the rejection threshold used.
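Under the hood, such a reject curve can be approximated by sweeping the threshold over the sorted training outputs and recording, for each candidate value, the reject fraction and the per-class recall. A simplified Python sketch (toy model and data, not perClass's `sdroc`):

```python
import math

def parzen_density(x, prototypes, h=0.5):
    """Average of Gaussian kernels centred on the prototypes."""
    k = sum(math.exp(-((x - p) / h) ** 2 / 2) for p in prototypes)
    return k / (len(prototypes) * h * math.sqrt(2 * math.pi))

prototypes = {'apple': [0.8, 1.0, 1.2], 'banana': [4.8, 5.0, 5.2]}
data = [(0.8, 'apple'), (1.0, 'apple'), (1.2, 'apple'),
        (4.8, 'banana'), (5.0, 'banana'), (5.2, 'banana')]

def classify(x, threshold):
    outputs = {c: parzen_density(x, p) for c, p in prototypes.items()}
    best = max(outputs, key=outputs.get)
    return best if outputs[best] >= threshold else 'reject'

def reject_curve(data):
    """List of (threshold, reject fraction, per-class recall) tuples."""
    thresholds = sorted(max(parzen_density(x, p) for p in prototypes.values())
                        for x, _ in data)
    curve = []
    for t in thresholds:
        decisions = [classify(x, t) for x, _ in data]
        frac = decisions.count('reject') / len(data)
        recall = {c: sum(d == c for (_, lab), d in zip(data, decisions) if lab == c)
                     / sum(lab == c for _, lab in data)
                  for c in prototypes}
        curve.append((t, frac, recall))
    return curve

curve = reject_curve(data)
```

The lowest threshold rejects nothing; raising it trades recall on the target classes for a growing reject fraction, which is exactly the trade-off the interactive ROC plot lets you explore.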

# 18.5. What discriminant models can be used for outlier rejection? ↩

Not all statistical models may be used for outlier rejection. Only the models that output probability density or distance can reject outliers. If the discriminant outputs are normalized over a set of classes, the domain information is lost and cannot be recovered. Adding a reject option will result in rejection close to the decision boundary (area of low confidence), not outlier rejection.
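A small numeric illustration of why normalization loses the domain information (plain Python with made-up Gaussian class models): for a sample far from both classes, the raw densities are vanishingly small, yet their normalized ratio can still look perfectly confident.

```python
import math

def gauss(x, mu, sigma=1.0):
    """1-D Gaussian density."""
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

x_outlier = 15.0                  # far from both class means below
p_apple = gauss(x_outlier, 0.0)   # astronomically small density
p_banana = gauss(x_outlier, 5.0)  # tiny as well, but relatively much larger

# after normalization only the ratio survives: the outlier looks like a
# highly confident 'banana', so a threshold on the posterior rejects nothing,
# while a threshold on the raw density p_banana catches it immediately
posterior_banana = p_banana / (p_apple + p_banana)
```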

As an illustration, we may visualize the decisions of a classifier built on top of Parzen with outputs normalized to sum to one (a posteriori probabilities):

``````>> pm=sdnorm(p)
sequential pipeline     2x2 'Parzen+Output normalization'
1  Parzen                  2x2  2 classes, 100 prototypes (sdp_parzen)
2  Output normalization    2x2  (sdp_norm)
>> pr2=sdreject(pm,tr)
sequential pipeline     2x1 'Parzen+Output normalization+Decision'
1  Parzen                  2x2  2 classes, 100 prototypes (sdp_parzen)
2  Output normalization    2x2  (sdp_norm)
3  Decision                2x1  weight+reject, 3 decisions, 1 ops at op 1 (sdp_decide)
>> sdscatter(ts,pr2)
``````

PRSD Studio models appropriate for outlier detection:

• `sdgauss`
• `sdmixture`
• `sdparzen`
• `sdknn` - using the default distance output ("kappa" method). Note, however, that the class-fraction `'classfrac'` method yields normalized outputs and cannot be used for outlier rejection.
• `sdkmeans`
• `sdkcentres`