perClass Documentation
version 5.4 (7-Dec-2018)

kb19: PRTools compatibility

Published on: 25-may-2011

perClass version used: 3.0.0 (6-jun-2011)

(Please note that starting with perClass 3.x (June 2011), internal parameters of pipelines are not directly accessible)

Problem: Can I use PRTools with PRSD Studio?

Solution: Yes, PRTools users may benefit from many seamlessly integrated PRSD Studio tools.

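PRTools is an academic toolbox developed in PRLab at TU Delft. Although PRSD Studio is self-contained software without external dependencies, it integrates closely with PRTools, enabling you to use PRSD Studio interactive tools on PRTools datasets, train fast and scalable classifiers, optimize PRTools classifiers with ROC analysis, and quickly deploy classifiers outside Matlab.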

In PRTools, data is stored in dataset objects. Although PRSD Studio directly accepts PRTools dataset objects in most situations, the two object formats may be easily converted if needed.

To convert a PRTools dataset into an sddata object, use the sddata function:

>> a=gendatb
Banana Set, 100 by 2 dataset with 2 classes: [50  50]

>> b=sddata(a)
'Banana Set' 100 by 2 sddata, 2 classes: '1'(50) '2'(50) 

Note that while PRTools supports numerical class names, PRSD Studio data sets always contain string class names. This avoids confusion between class indices and class names. Only the default PRTools label set is converted into the sddata object. If additional sets of labels are stored in the PRTools dataset (either as labels or as custom targets), they need to be transferred manually.

To convert a PRSD Studio data set object into PRTools format, use the dataset function:

>> load medical
>> a
'medical D/ND' 6400 by 11 sddata, 3 classes: 'disease'(1495) 'no-disease'(4267) 'noise'(638) 

>> b=dataset(a)
6400 by 11 dataset with 3 classes: [1495  4267   638]

19.1. Leverage PRSD Studio interactive tools on PRTools datasets

The PRSD Studio interactive scatter plot may be used directly on PRTools dataset objects:

>> a=gendath
Highleyman Dataset, 100 by 2 dataset with 2 classes: [50  50]
>> sdscatter(a)

The scatter plot provides all functionality described in Chapter 6.

For more information, see short videos.

Class overlap may be easily inspected using the interactive feature distribution plot, sdfeatplot:

>> d
884 by 11 dataset with 3 classes: [140  194  550]
>> sdfeatplot(d)

Use up/down cursor keys to switch between features.

19.2. Train fast and scalable classifiers

With PRSD Studio you may always choose which classifier best suits your needs. You may seamlessly switch between training PRTools classifiers on PRSD Studio data sets and leveraging PRSD Studio classifiers without leaving PRTools.

PRSD Studio provides a number of highly scalable algorithms that offer both fast training and fast execution on very large datasets. Let's take an example of k-means clustering on a data set with 180 000 samples. We will use the PRSD Studio command sdkmeans and ask for 100 clusters:

>> d
180000 by 2 dataset with 1 class: [180000]

>> tic; p=sdkmeans(d,100,'cluster'); toc
[class 'unknown' 100 clusters] 
Elapsed time is 3.283098 seconds.

The clustering took 3.3 seconds. The result is a trained model p we may apply to any data to get cluster labels.

>> tic; lab=d*sddecide(p); toc
Elapsed time is 0.346232 seconds.

For comparison, PRTools kmeans clustering of the same dataset took about 26 minutes. Counting both clustering and labeling on the PRSD Studio side, this is a roughly 440-fold speedup:

>> tic; lab=kmeans(d,100); toc
Elapsed time is 1589.395077 seconds.
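
The speedup quoted for this comparison can be checked from the three timings printed in this section. The following is a minimal sketch in plain Matlab; the variable names are chosen here for illustration and the only inputs are the elapsed times shown in the session output.

```matlab
% Timings copied from the session output in this section (seconds).
t_sdkmeans = 3.283098;     % sdkmeans clustering
t_sddecide = 0.346232;     % applying the trained model to get labels
t_prtools  = 1589.395077;  % PRTools kmeans call, which returned the labels

% Speedup when both clustering and labeling are counted on the PRSD Studio side.
speedup_total = t_prtools / (t_sdkmeans + t_sddecide);

% Speedup when only the clustering step is counted.
speedup_train = t_prtools / t_sdkmeans;

fprintf('PRTools kmeans: %.1f minutes\n', t_prtools / 60);
fprintf('speedup (clustering + labeling): %.0f\n', speedup_total);
fprintf('speedup (clustering only):       %.0f\n', speedup_train);
```

With these timings the first ratio comes out near 438, which agrees with the quoted figure after rounding, while the clustering-only ratio is near 484.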

sdkmeans may also be used on multi-class data sets, where it derives k-means prototypes for a k-NN classifier. In this way, we may very quickly train highly scalable and robust non-parametric classifiers as illustrated in this video.
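
The idea of using per-class k-means prototypes for a nearest neighbor classifier can be sketched in plain Matlab as follows. This is a generic illustration under stated assumptions, not the sdkmeans implementation: the toy data, the deterministic initialization, the fixed number of iterations, and the helper sqdist are all made up here.

```matlab
% Toy two-class data in 2D (made up for illustration): two blobs per class.
X1 = [0 0; 0.2 0.1; 0.1 0.3; 4 4; 4.1 3.9; 3.8 4.2];   % class 1
X2 = [0 4; 0.1 4.2; 0.3 3.9; 4 0; 3.9 0.2; 4.2 0.1];   % class 2
k  = 2;                                                 % prototypes per class

% Squared Euclidean distances between rows of A and rows of B.
sqdist = @(A,B) bsxfun(@plus, sum(A.^2,2), sum(B.^2,2)') - 2*A*B';

classes = {X1, X2};
protos = [];   % prototype coordinates, one row per prototype
plab   = [];   % class index of each prototype
for c = 1:numel(classes)
    X = classes{c};
    M = X(round(linspace(1, size(X,1), k)), :);   % simple deterministic start
    for it = 1:20                                 % plain k-means iterations
        [~, idx] = min(sqdist(X, M), [], 2);      % nearest mean per sample
        for j = 1:k
            if any(idx == j)
                M(j,:) = mean(X(idx == j, :), 1); % update mean of cluster j
            end
        end
    end
    protos = [protos; M];
    plab   = [plab; c * ones(k,1)];
end

% 1-NN decision against the prototypes instead of all training samples.
Z = [0.1 0.1; 3.9 0.1];
[~, nearest] = min(sqdist(Z, protos), [], 2);
dec = plab(nearest);
disp(dec');
```

Because new samples are compared with only a handful of prototypes per class rather than with every training sample, the decision cost no longer grows with the size of the training set.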

19.2.1. Training PRSD Studio classifiers on PRTools datasets

Any PRSD Studio classifier may be trained on a PRTools dataset object directly, without conversion to sddata.

In this example, we train a Gaussian mixture model, automatically estimating the number of mixture components from the data, using the PRSD Studio sdmixture command:

>> a
Highleyman Dataset, 2000 by 2 dataset with 2 classes: [1004   996]

>> p=sdmixture(a)
[class '1' initialization: 1 cluster  EM:.............................. 1 comp] 
[class '2' initialization: 5 clusters  EM:.............................. 5 comp] 
Mixture of Gaussians pipeline 2x2  2 classes, 6 components (sdp_normal)

An alternative is to multiply the dataset by an untrained pipeline object. In this example, we scale the data by standardization and then train a feed-forward neural network on a PRTools dataset:

>> pu=sdscale*sdneural 
untrained pipeline 2 steps: sdscale+sdneural

>> p=a*pu
epochs (*100):..........
sequential pipeline     2x2 'standardization+Neural network'
 1  standardization         2x2  (sdp_affine)
 2  Neural network          2x2  (sdp_neural)

Note that to run this example, you don't need Neural Network Toolbox installed.
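
For readers unfamiliar with the scaling step, standardization in its usual sense shifts each feature to zero mean and rescales it to unit standard deviation. The sketch below shows this in plain Matlab on made-up data; whether sdscale uses exactly this normalization by default is an assumption, not something stated in this article.

```matlab
% Toy data: four samples (rows), two features (columns) on different scales.
X = [1 10; 2 20; 3 30; 4 40];

mu = mean(X, 1);          % per-feature mean
sg = std(X, 0, 1);        % per-feature standard deviation

% Subtract the mean and divide by the standard deviation, feature by feature.
Xs = bsxfun(@rdivide, bsxfun(@minus, X, mu), sg);

disp(mean(Xs, 1));        % close to zero for each feature
disp(std(Xs, 0, 1));      % close to one for each feature
```

Since the operation is a shift followed by a per-feature scaling, it is an affine transform of the input.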

19.2.2. Training PRTools classifiers on PRSD Studio data sets

Any PRTools classifier may be trained directly on a PRSD Studio sddata object by multiplying the data with the respective untrained PRTools mapping.

In this example, we train the PRTools Fisher linear discriminant directly on the PRSD Studio medical data set:

>> b
'medical D/ND' 884 by 11 sddata, 3 'tissue' groups: 'disease'(140) 'no-disease'(550) 'muscle'(194) 

>> w=b*fisherc
Fisher, 11 to 3 trained  mapping   --> stacked

We may visualize the classifier decisions in multi-dimensional space using:

>> sdscatter(b,sddecide(w))

Note that invoking fisherc on the sddata object b with a standard function call results in an error, because PRTools does not recognize sddata objects:

>> w=fisherc(b)
{??? Error using ==> isdataset at 24

Dataset expected.

Error in ==> isvaldfile at 32
            isdataset(a);

Error in ==> fisherc at 60
    isvaldfile(a,1,2); % at least 1 object per class, 2 classes
} 

To train with a standard function call, first convert the sddata object into a PRTools dataset:

>> w=fisherc(dataset(b))
Fisher, 11 to 2 trained classifier --> affine

19.3. Optimize PRTools classifiers with ROC analysis

The performance of any PRTools classifier may be directly optimized using the comprehensive set of PRSD Studio ROC analysis tools.

Example 1: Multi-class ROC analysis of a PRTools classifier

In this example, we perform multi-class ROC analysis for a linear discriminant on an 8-class problem:

>> a=gendatm(10000)
Multi-Class Problem, 10000 by 2 dataset with 8 classes: [1255  1230  1296  1211  1246  1263  1260  1239]

>> [tr,ts]=gendat(a,0.5)
Multi-Class Problem, 5002 by 2 dataset with 8 classes: [628  615  648  606  623  632  630  620]
Multi-Class Problem, 4998 by 2 dataset with 8 classes: [627  615  648  605  623  631  630  619]

>> w=ldc(tr)
PR_Warning: getprior: ldc: No priors found in dataset, class frequencies are used instead
Bayes-Normal-1, 2 to 8 trained  mapping   --> normal_map

Soft outputs of the trained classifier:

>> out=ts*w
Multi-Class Problem, 4998 by 8 dataset with 8 classes: [627  615  648  605  623  631  630  619]

ROC analysis using sdroc command:

>> tic; r=sdroc(out), toc
..........
ROC (2000 w-based op.points, 9 measures), curop: 1827
est: 1:err(a)=0.21, 2:err(b)=0.24, 3:err(c)=0.00, 4:err(d)=1.00, 5:err(e)=0.03, 6:err(f)=0.34, 7:err(g)=0.19, 8:err
Elapsed time is 3.885148 seconds.
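
The output above reports "w-based" operating points. One common way to realize such operating points, assumed here for illustration and not confirmed by this article as the sdroc internals, is to multiply the per-class soft outputs by a weight vector and assign each sample to the class with the largest weighted output. The sketch below uses made-up soft outputs, labels, and weights to show how different weight vectors trade per-class errors against each other.

```matlab
% Hypothetical soft outputs: four samples (rows), three classes (columns).
out   = [0.60 0.30 0.10;
         0.40 0.50 0.10;
         0.20 0.30 0.50;
         0.45 0.40 0.15];
truth = [1; 1; 3; 2];          % true class index of each sample

% Each row of W is one candidate operating point (one weight per class).
W = [1 1   1;
     1 1.5 1;
     1.5 1 1];

nclass = size(out, 2);
err = zeros(size(W,1), nclass);   % per-class error at each operating point
for i = 1:size(W,1)
    % Weighted outputs, then decide for the class with the largest value.
    [~, dec] = max(bsxfun(@times, out, W(i,:)), [], 2);
    for c = 1:nclass
        err(i,c) = mean(dec(truth == c) ~= c);
    end
end
disp(err);
```

Each row of err is one point of a multi-class ROC: raising the weight of a class lowers its own error at the expense of the others.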

We may visualize the ROC surface using sddrawroc command:

>> sddrawroc(r);

We may then use a number of methods to select the desired operating point, including constraints and cost-sensitive optimization. For example, we may require that the error on class f is smaller than 20%:

>> r2=constrain(r,'err(f)',0.2)
ROC (705 w-based op.points, 9 measures), curop: 1
est: 1:err(a)=0.27, 2:err(b)=0.18, 3:err(c)=0.00, 4:err(d)=1.00, 5:err(e)=0.20, 6:err(f)=0.19, 7:err(g)=0.20, 8:err

Once ready, we may get decisions at this operating point for new examples:

>> newdata=[1 0; -10 10]

newdata =

     1     0
   -10    10

>> dec=sddata(newdata)*w*r2
sdlab with 2 entries, 2 groups: 'a'(1) 'h'(1) 
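
The constraint step used earlier in this section can be pictured as filtering a table of operating points. The following plain Matlab sketch is a generic illustration with made-up error values; the rule used to pick one point among the survivors (lowest mean error) is an assumption and is not claimed to be what the constrain method does.

```matlab
% Hypothetical per-class error rates at five operating points (rows)
% for three classes (columns); values are made up for illustration.
E = [0.10 0.30 0.25;
     0.15 0.18 0.22;
     0.20 0.12 0.30;
     0.35 0.05 0.10;
     0.12 0.19 0.40];
cls    = 2;       % index of the constrained class
maxerr = 0.2;     % required upper bound on the error of that class

keep = find(E(:,cls) < maxerr);       % operating points meeting the constraint
[~, k] = min(mean(E(keep,:), 2));     % assumed criterion: lowest mean error
best = keep(k);

fprintf('%d of %d operating points kept, selected point %d\n', ...
        numel(keep), size(E,1), best);
```

Applying the constraint shrinks the set of candidate operating points, in the same way that the ROC object in the example went from 2000 to 705 operating points.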

19.4. Quickly deploy classifiers out of Matlab

PRSD Studio offers an easy-to-use set of tools for classifier deployment in custom applications.

Any supported trained PRTools mapping may be converted into a PRSD Studio pipeline using the sdconvert command:

>> b
200 by 2 dataset with 2 classes: [100  100]

>> w=fisherc(b)
Fisher, 2 to 2 trained classifier --> affine

>> p=sdconvert(w)
sequential pipeline     2x2 'Fisher'
 1  Fisher                  2x2  (sdp_affine)
 2  Sigmoid                 2x2  (sdp_sigmoid)

Pipeline objects may be exported for execution outside Matlab using the sdexport command. For information on classifier deployment, see Chapter 14.

One-step conversion of a PRTools mapping into a pipeline with an added operating point is possible using the sddecide function. Below, you can see an example of exporting an AdaBoost classifier, trained in PRTools, for execution outside Matlab.

19.4.1. Supported PRTools mappings

The PRSD deployment technology supports more than 40 of the most common PRTools classifiers, as well as any sequential, stacked (same feature space), or parallel (different feature spaces) combination of them.