This chapter describes pipelines, trainable operations on data.
- 11.1. Introduction
- 11.1.1. Execution on new data
- 11.1.2. Accessing pipeline steps
- 11.1.3. Displaying pipeline details
- 11.1.4. Untrained pipelines
11.1. Introduction
In perClass, processing or transformation of data is described using the concept of a pipeline. Let us take, as an example, training of a linear classifier:
>> load fruit
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> p=sdlinear(a)
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
3 Decision 3x1 weighting, 3 classes
The object p is a pipeline composed of three stages. The first one is a Gaussian model computed in the input 2D feature space. The model describes three classes ('apple', 'banana' and 'stone') and, therefore, provides three corresponding outputs (probability densities). The second stage is a normalization turning the densities into posteriors. Finally, the third stage converts the posteriors into a decision, providing a single integer output.
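To make the three stages more concrete, here is a minimal sketch in plain MATLAB of what the normalization and decision stages do conceptually. The density values are made up for illustration, and the sketch assumes equal class priors and a simple maximum rule; the actual perClass steps may differ in detail (for example, in the weighting applied by the decision step).

```matlab
% Made-up class-conditional densities: 2 samples (rows) x 3 classes (columns).
% These numbers are illustrative only; they are not produced by perClass.
dens = [0.02 0.001 0.10;
        0.30 0.050 0.01];

% Stage 2 (normalization): scale each row so that it sums to one.
% Assumes equal class priors; bsxfun keeps this working on older MATLAB.
post = bsxfun(@rdivide, dens, sum(dens, 2));

% Stage 3 (decision): pick the class with the largest posterior per sample.
[~, dec] = max(post, [], 2);

disp(post);
disp(dec');   % 3 1
```

Each row of post sums to one, and dec holds one integer class index per sample, which mirrors the 2x3, 3x3 and 3x1 sizes shown in the pipeline display.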
Pipelines in perClass are not limited to classification. They describe all types of data processing, including data scaling, feature extraction and selection.
11.1.1. Execution on new data
The pipeline p may be applied to any data set with two features using the multiplication operator *:
>> data=sddata(data)
5 by 2 sddata, class: 'unknown'
>> out=data*p
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2)
The output of our pipeline is an sdlab object with classifier decisions. In perClass 4, all classifiers produce decisions by default.
Pipeline execution is analogous to matrix multiplication. Our pipeline p acts as a matrix with two rows (feature inputs) and one column (the decision output).
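The matrix analogy can be checked with simple size bookkeeping. The sketch below uses plain MATLAB only and does not call perClass; the step sizes are copied from the display of p, and the variable names are hypothetical.

```matlab
% [inputs outputs] of each step, copied from the display of p
steps = [2 3;    % Gaussian model
         3 3;    % Normalization
         3 1];   % Decision

% As with a matrix product, inner dimensions of neighbouring steps must agree.
assert(all(steps(1:end-1, 2) == steps(2:end, 1)));

pipe_size = [steps(1, 1), steps(end, 2)];   % [2 1], the "2x1" in the display
data_size = [5 2];                          % 5 samples, 2 features

% (5 x 2) * (2 x 1) -> (5 x 1): one decision per sample
out_size = [data_size(1), pipe_size(2)];
disp(out_size);
```

This is why a pipeline trained on two features accepts only data with two columns, and why five input samples yield five decisions.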
The multiplication operator is only syntactic sugar; the real work is done by the sdexe function:
>> sdexe(p,data)
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2)
If we execute the pipeline on a raw data matrix, we obtain raw numerical output:
>> data=rand(5,2)*100
data =
41.4248 77.6399
36.8954 4.8470
85.0896 59.0271
79.7602 15.8238
35.0236 93.7622
>> out=data*p
out =
3
1
1
1
3
The mapping between integer decisions and decision names is handled by the pipeline list:
>> p.list
sdlist (3 entries)
ind name
1 apple
2 banana
3 stone
We may use the list object to convert decisions into names and vice versa:
>> p.list(3)
ans =
stone
>> p.list('apple')
ans =
1
>> p.list(out)
ans =
stone
apple
apple
apple
stone
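For readers who want to work with the raw integer decisions directly, the following plain-MATLAB sketch counts decisions per class. It does not use perClass: the names cell array is a hand-written stand-in for p.list, and out repeats the decisions printed above.

```matlab
out   = [3; 1; 1; 1; 3];                 % decisions from the example above
names = {'apple'; 'banana'; 'stone'};    % same order as the entries of p.list

% Index-to-name lookup, analogous to p.list(out)
decided = names(out);

% Number of decisions per class index (classes never chosen get zero)
counts = accumarray(out, 1, [numel(names) 1]);

for k = 1:numel(names)
    fprintf('%-7s %d\n', names{k}, counts(k));
end
```

The counts of 3 for 'apple' and 2 for 'stone' agree with the sdlab summary shown when the same pipeline is applied to the sddata object.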
11.1.2. Accessing pipeline steps
Unless specified explicitly, all pipeline operations refer to the last step:
>> p
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
3 Decision 3x1 weighting, 3 classes
>> p.output
ans =
decision
We may access individual pipeline steps using parentheses ():
>> p(1).output
ans =
probability density
Say, we wish to extract "soft outputs" of our classifier just before turning them into decisions:
>> p(1:2)
sequential pipeline 2x3 'Gaussian model+Normalization'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
>> out=data*p(1:2)
5 by 3 sddata, class: 'unknown'
The output is now a data set, because the second pipeline step returns real-valued output.
A quick shorthand for removing the decision step is the unary minus operator (-):
>> -p
sequential pipeline 2x3 'Gaussian model+Normalization'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
We may, therefore, get classifier soft outputs using:
>> data*-p
5 by 3 sddata, class: 'unknown'
Applying the unary minus to a pipeline that already returns soft output has no effect:
>> --p
sequential pipeline 2x3 'Gaussian model+Normalization'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
11.1.3. Displaying pipeline details
As with data sets and labels, perClass provides a quick shortcut for displaying details about a pipeline: the transpose operator ('):
>> p'
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x3 single cov.mat.
inlab: 'length','color'
lab: 'apple','banana','stone'
output: probability density
2 Normalization 3x3
inlab: 'apple','banana','stone'
lab: 'apple','banana','stone'
output: posterior
3 Decision 3x1 weighting, 3 classes
inlab: 'apple','banana','stone'
output: decision ('apple','banana','stone')
For each step, we can see the input/output labels and the type of output. We can see that our pipeline p expects two input features, namely 'length' and 'color'.
This information may be accessed using the pipeline fields inlab, lab and output:
>> p(1).inlab
sdlab with 2 entries: 'length','color'
>> p(3).output
ans =
decision
11.1.4. Untrained pipelines
Usually, we create pipelines by training them on a data set. However, in some situations, it may be more beneficial to create a pipeline description without a concrete data set. Such a pipeline is called untrained.
An untrained pipeline is created by passing an empty matrix [] as the first argument.
The trained Parzen classifier:
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
>> p=sdparzen(a)
.....sequential pipeline 2x1 'Parzen model+Decision'
1 Parzen model 2x3 260 prototypes, h=0.8
2 Decision 3x1 weighting, 3 classes
The untrained Parzen classifier:
>> u=sdparzen([])
untrained pipeline 'sdparzen'
By multiplying a data set with an untrained pipeline, we train it:
>> p2=a*u
.....sequential pipeline 2x1 'Parzen model+Decision'
1 Parzen model 2x3 260 prototypes, h=0.8
2 Decision 3x1 weighting, 3 classes
Note that the order is always data * pipeline.
Untrained pipelines are useful to separate the definition of a classifier from its training on data. We may provide any parameters when defining an untrained pipeline:
>> u2=sdneural([],'units',20,'iters',1000)
untrained pipeline 'sdneural'
Untrained pipelines are used, for example, by sdcrossval to perform evaluation by cross-validation:
>> sdcrossval(u,a)
10 folds: [1: ....] [2: .....] [3: ....] [4: ....] [5: .....] [6: .....] [7: ....] [8: .....] [9: .....] [10: ....]
ans =
10-fold rotation
ind mean (std) measure
1 0.09 (0.02) mean error over classes, priors [0.3,0.3,0.3]
>> sdcrossval(u2,a)
10 folds: [1: ] [2: ] [3: ] [4: ] [5: ] [6: ] [7: ] [8: ] [9: ] [10: ]
ans =
10-fold rotation
ind mean (std) measure
1 0.08 (0.01) mean error over classes, priors [0.3,0.3,0.3]
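Because untrained pipelines are ordinary objects, several candidates can be collected and evaluated in a loop. The sketch below requires perClass and the fruit data set, and uses only calls that appear in this chapter; the models and results variables are hypothetical names, and it is assumed here that sdcrossval returns its result object when an output is requested, as the ans display above suggests.

```matlab
load fruit                      % provides the data set a used in this chapter

% Untrained candidates; both calls are taken from the examples above.
models = {sdparzen([]), sdneural([], 'units', 20, 'iters', 1000)};

results = cell(size(models));
for k = 1:numel(models)
    % sdcrossval trains and tests models{k} on every fold of a
    results{k} = sdcrossval(models{k}, a);
end
```

Keeping the definitions in one place makes it easy to add or remove a candidate without touching the evaluation code.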