- 3.1. Quick perClass installation
- 3.2. Importing data
- 3.3. Training a fruit classifier
- 3.4. Decisions and performance estimation
- 3.5. Classifier confidences
- 3.6. Classifier execution out of Matlab
This chapter provides a simple example of importing a data set into perClass, training a classifier, and executing it on new data outside of Matlab.
3.1. Quick perClass installation
Installing perClass is very simple. Add the perclass and data sub-directories from the perClass distribution to your Matlab path. This can be done either via the File/Set Path menu or with the addpath command.
Example: if your perClass distribution is available in c:\software\perClass_Demo:
>> addpath c:\software\perClass_Demo\perclass
>> addpath c:\software\perClass_Demo\data
A quick way to test whether perClass is installed correctly is to run the sdversion command:
>> sdversion
perClass 4.3 (07-May-2014), Copyright (C) 2007-2014, PR Sys Design, All rights reserved
Customer: PR Sys Design Issued: 3-feb-2014
Toolbox with DB,imaging: The license expires on 1-jul-2014.
Installation directory: '/Users/pavel/Desktop/perclass'
For details on perClass installation process see Chapter: Installation.
3.2. Importing data
In this example, we will use the "Fruit" data set. It is stored in the text file fruit.txt in the data sub-directory of the perClass distribution.
To create a perClass data set, we will first need to import it into Matlab.
fruit.txt is a comma-separated text file with each row representing one data sample. The first two columns correspond to two features, the third column contains a string class label.
3.993477,-0.535440,apple
-4.922709,2.519118,stone
-0.052968,-4.946727,apple
2.364367,-5.600644,apple
3.129976,-4.014243,apple
-8.996251,-4.330067,banana
-2.155181,-0.548931,stone
....
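As an aside, a file in this format can be read with a few lines of ordinary code. The following Python sketch is purely illustrative (the parse_fruit helper and the inline sample string are ours, not part of perClass); it splits each row into a feature vector and a label:

```python
import csv
import io

# A few rows in the fruit.txt format: two numeric features, then a class label.
sample = """3.993477,-0.535440,apple
-4.922709,2.519118,stone
-0.052968,-4.946727,apple
"""

def parse_fruit(text):
    """Split each comma-separated row into a feature vector and a string label."""
    data, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        data.append([float(v) for v in row[:-1]])
        labels.append(row[-1])
    return data, labels

data, labels = parse_fruit(sample)
print(labels)    # ['apple', 'stone', 'apple']
print(data[0])   # [3.993477, -0.53544]
```

In perClass itself this bookkeeping is handled by the import step shown below.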
We may import such data using the perClass sdimport command, which allows us to load both the data matrix and the labels easily:
>> a=sdimport('fruit.txt','data',1:2,'lab',3)
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
We obtain the variable a, which is an sddata object.
The data set a contains measurements on 260 objects, each represented by two features. Each object also has a class label corresponding to one of the three classes. The 'apple' and 'banana' classes represent genuine fruit to be extracted from the conveyor belt. The samples labeled 'stone' are outliers that should be rejected.
The data set object is a data matrix augmented with meta-data information such as sample labels or feature names. The samples are stored as data rows and features as columns.
Instead of sdimport, we may also load the data matrix and string labels separately, using Matlab commands or custom scripts, and construct an sddata set simply by:
>> a=sddata(data,sdlab(lab))
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
Here, data is a numerical matrix with samples as rows and features as columns, and lab holds the sample labels in a character array or cell array.
We can assign meaningful feature names to the featlab field of our data set. Say the first feature describes the length and the second the color of our objects:
>> a.featlab=sdlab('length','color')
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
We will visualize the scatter plot of the Fruit data set using:
>> sdscatter(a)
Each marker represents one data sample, marker color and shape encode the class.
Our goal is to build a classifier distinguishing between apples and bananas and discarding any other observation, including the known stones. Such a statistical classifier is built by training it on a set of labeled observations. In order to also test its performance, we need to use data unseen during training. Only then is our performance estimate realistic.
Therefore, we need to keep aside a subset of data for testing. We will split our available data set into training and test subsets. We will use 50% of data for training the classifier and the rest for estimating its performance:
>> [tr,ts]=randsubset(a,0.5)
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30)
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30)
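Conceptually, this is a per-class (stratified) random split: half of each class goes to the training set, the rest to the test set. A minimal Python sketch of the idea (split_per_class is our own illustrative helper, not perClass API):

```python
import random

def split_per_class(labels, fraction=0.5, seed=0):
    """Randomly assign a fraction of each class to the training set
    and the rest to the test set (a stratified split of sample indices)."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        k = int(len(idx) * fraction)
        train.extend(idx[:k])
        test.extend(idx[k:])
    return sorted(train), sorted(test)

# 10 apples and 6 stones -> 5 + 3 training samples, 5 + 3 test samples
labels = ['apple'] * 10 + ['stone'] * 6
tr_idx, ts_idx = split_per_class(labels)
print(len(tr_idx), len(ts_idx))   # 8 8
```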
3.3. Training a fruit classifier
We will train a classifier discriminating between the fruit classes. First, let us extract the subset with samples labeled as 'apple' and 'banana'. In perClass, the third dimension represents classes. Therefore, we may extract a subset simply by listing class names:
>> tr2=tr(:,:,{'apple','banana'})
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
Anywhere in perClass, you may use indices instead of names. Apple and banana are the first and second classes in the list. Therefore, we may get the test subset as:
>> ts2=ts(:,:,1:2)
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
In order to capture the specific shape of the class distributions, we use a feed-forward neural network:
>> p=sdneural(tr2)
sequential pipeline 2x1 'Neural network'
1 Neural network 2x2 10 units
2 Decision 2x1 weighting, 2 classes
The output of the sdneural command is a trained classifier, represented in perClass by a "pipeline". A pipeline is a sequence of operations that may be applied to new data. In the example above, the pipeline p is trained on the data set tr2. Note that, starting with perClass 4, every classifier returns decisions by default.
We may visualize classifier decisions on the training set using sdscatter:
>> sdscatter(tr2,p)
The backdrop color indicates classifier decisions in each position of the feature space.
3.4. Decisions and performance estimation
We may execute the pipeline p on new data using the multiplication operator:
>> dec=ts2*p
sdlab with 100 entries, 2 groups: 'apple'(52) 'banana'(48)
In perClass, decisions or labels are represented by sdlab objects. The ground-truth labels in the test set ts2 are accessible by:
>> ts2.lab
sdlab with 100 entries, 2 groups: 'apple'(50) 'banana'(50)
We may compare true labels and decisions using the confusion matrix:
>> sdconfmat
(ts2.lab,dec)
ans =
True | Decisions
Labels | apple banana | Totals
---------------------------------------
apple | 49 1 | 50
banana | 3 47 | 50
---------------------------------------
Totals | 52 48 | 100
We can see that, on our test set, three banana examples are misclassified as apple. To find these problematic examples, we may use simple logical operations on labels:
>> ind=find(ts2.lab=='banana' & dec=='apple')
ind =
72
76
100
>> ts2(ind)
'Fruit set' 3 by 2 sddata, class: 'banana'
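A confusion matrix of this kind is simply a table of (true label, decision) counts. A small Python illustration of the idea (the confusion_matrix helper and the toy labels are ours, not perClass API):

```python
from collections import Counter

def confusion_matrix(true_labels, decisions, classes):
    """Count (true label, decision) pairs into a nested dict: cm[true][decided]."""
    counts = Counter(zip(true_labels, decisions))
    return {t: {d: counts[(t, d)] for d in classes} for t in classes}

# Toy example: six samples, one error in each direction.
true = ['apple', 'apple', 'apple', 'banana', 'banana', 'banana']
dec  = ['apple', 'apple', 'banana', 'banana', 'banana', 'apple']
cm = confusion_matrix(true, dec, ['apple', 'banana'])
print(cm['apple'])    # {'apple': 2, 'banana': 1}
print(cm['banana'])   # {'apple': 1, 'banana': 2}
```

Rows correspond to true labels and columns to decisions, just as in the sdconfmat output above.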
How do we estimate the mean classification error of our classifier? We may use the sdtest function:
>> sdtest(ts2.lab,dec)
ans =
0.0400
sdtest offers a number of other performance measures. Perhaps, in our application, we prefer sensitivity and precision, considering banana as the target class:
>> sdtest(ts2.lab,dec,'measures',{'sensitivity','banana','precision','banana'})
ans =
0.9400 0.9792
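These two values follow directly from the confusion-matrix counts above: with banana as the target class, 47 bananas are detected, 3 are missed, and 1 apple is falsely decided as banana. A quick Python check of the standard definitions:

```python
# Counts for the 'banana' target class, taken from the confusion matrix above:
# 47 bananas detected, 3 missed, 1 apple falsely decided as banana.
tp, fn, fp = 47, 3, 1

sensitivity = tp / (tp + fn)   # fraction of true bananas that are detected
precision = tp / (tp + fp)     # fraction of banana decisions that are correct

print(round(sensitivity, 4))   # 0.94
print(round(precision, 4))     # 0.9792
```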
3.5. Classifier confidences
Often, we need to know not only the eventual decision of a classifier, but also its level of confidence. How can we inspect the "soft" classifier output?
In perClass, we may see detailed information on any object using the ' (transpose) operator:
>> p'
sequential pipeline 2x1 'Neural network'
1 Neural network 2x2 10 units
inlab: 'length','color'
lab: 'apple','banana'
output: confidence
2 Decision 2x1 weighting, 2 classes
inlab: 'apple','banana'
output: decision ('apple','banana')
We can see that our pipeline is a sequence of two steps, namely the neural network model and the decision step. The first step returns confidences.
We may access each step by its index:
>> p(1)
Neural network pipeline 2x2 10 units
Let us take three test examples and estimate confidence of their classification using our neural network:
>> ts2([10 70 89])
'Fruit set' 3 by 2 sddata, 2 classes: 'apple'(1) 'banana'(2)
The unary plus (+) operator provides a quick shorthand to extract the content of a data set, returning the data matrix D:
>> D=+ts2([10 70 89])
D =
6.0009 1.2879
-5.5805 -7.7396
0.1531 0.5740
Multiplying the data matrix with the entire pipeline gives us decisions:
>> D*p
ans =
1
2
2
>> p.list
sdlist (2 entries)
ind name
1 apple
2 banana
Because we now provide a data matrix as the input, we receive a numerical vector with integer decisions (and not an sdlab object as we did earlier).
Applying only the first pipeline step yields the desired confidence values:
>> D*p(1)
ans =
0.8677 0.1933
0.2150 0.8362
0.0366 0.9205
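The decision step in this pipeline applies weighting to the confidences; with equal class weights this reduces to picking the class with the highest confidence. A small Python illustration of that reduction, using the confidence values printed above (1-based indices match p.list: 1=apple, 2=banana):

```python
# Per-class confidences for the three samples, as printed above.
confidences = [
    [0.8677, 0.1933],
    [0.2150, 0.8362],
    [0.0366, 0.9205],
]

# Pick the index of the highest confidence per sample (1-based, as in p.list).
decisions = [row.index(max(row)) + 1 for row in confidences]
print(decisions)   # [1, 2, 2]
```

This reproduces the integer decisions 1, 2, 2 returned by D*p.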
A quick shortcut to strip the decision step off any pipeline is the unary minus (-) operator:
>> p
sequential pipeline 2x1 'Neural network'
1 Neural network 2x2 10 units
2 Decision 2x1 weighting, 2 classes
>> -p
Neural network pipeline 2x2 10 units
We may visualize the confidences in our feature space using sdscatter:
>> sdscatter(ts2,-p)
ans =
3
Use arrows on the toolbar or cursor keys to move between the two per-class confidence outputs.
3.6. Classifier execution out of Matlab
Eventually, we want to run our classifier outside of Matlab in a machine or a custom application. perClass provides a simple mechanism to export classifiers for execution using perClass Runtime library. This functionality is available in the Pro or Enterprise versions.
>> sdexport(p,'classifier1.ppl')
Exporting pipeline..ok
This pipeline requires perClass runtime version 4.0 (29-mar-2013) or higher.
Apart from the standard C/C++ API, the perClass distribution comes with the sdrun command-line tool, which can execute any exported classifier from the command line without additional programming.
Open the command line for your platform (in MS Windows, press Windows key+R to open the Run dialog and enter cmd). Change the current directory to perclass\interfaces\sdrun\YOUR_PLATFORM and type:
> ./sdrun.exe PATH_TO_EXPORTED_CLASSIFIER\classifier1.ppl
where PATH_TO_EXPORTED_CLASSIFIER is the current directory of your Matlab session (use the pwd command to display it).
It will print basic information on the pipeline classifier1.ppl, such as:
Pipeline name: 'Neural network'
Minimum required runtime version: 4.0 (29-mar-2013)
Input type: double, dimensionality: 2
Output type: int, dimensionality: 1, decisions
Operating point count: 1, current: 1
Possible decisions: 1:apple, 2:banana
We may directly provide the numerical values of input features, with samples separated by semicolons:
> ./sdrun.exe classifier1.ppl -d "6.0009 1.2879; -5.5805 -7.7396; 0.1531 0.5740"
apple
banana
banana
We have received the same decisions as in Matlab. To compute confidences at runtime, export the pipeline returning soft outputs:
>> sdexport(-p,'classifier2.ppl')
Exporting pipeline..ok
This pipeline requires perClass runtime version 4.0 (29-mar-2013) or higher.
> ./sdrun.exe classifier2.ppl -d "6.0009 1.2879; -5.5805 -7.7396; 0.1531 0.5740"
0.867726,0.193307
0.214956,0.836249
0.036589,0.920537
perClass Runtime may be embedded into custom applications using standard C/C++ interface or a separate .Net interface.