Subsections

1. Tutorial

This chapter is a tutorial that will walk you through the basic concepts from a user-level perspective.

We assume you have a copy of the PLearn distribution and a working plearn executable accessible through your PATH. All the files used in this tutorial are in examples/Tutorial/, so you should first cd to this directory.

1.1 The plearn Commands and Help

Usual PLearn executables such as plearn or plearn_light are typically called in command-line fashion.

valhalla:~/PLearn/examples/Tutorial> plearn
plearn 0.92.0  (Jun 21 2005 12:04:50)
Type 'plearn help' for help


valhalla:~/PLearn/examples/Tutorial> plearn help
plearn 0.92.0  (Jun 21 2005 12:04:50)
To run a .plearn script type:                       plearn scriptfile.plearn
To run a command type:                              plearn command [ command arguments ]

To get help on the script file format:              plearn help scripts
To get a short description of available commands:   plearn help commands
To get detailed help on a specific command:         plearn help <command_name>
To get help on a specific PLearn object:            plearn help <object_type_name>
To get help on datasets:                            plearn help datasets

The plearn executable can be invoked either with a PLearn script (more on that later) or with a PLearn command.
To get the list of available commands:

valhalla:~/PLearn/examples/Tutorial> plearn help commands
plearn 0.92.0  (Jun 21 2005 12:04:50)
To run a command, type:  % plearn command_name command_arguments

Available commands are:
FieldConvert    :  Reads a dataset and generates a .vmat file based on the data, but optimized for training.

autorun :  watches files for changes and reruns the .plearn script
help    :  plearn command-line help
htmlhelp        :  Output HTML-formatted help for PLearn
jdate   :  Convert a Julian Date into a JJ/MM/YYYY date
ks-stat :  Computes the Kolmogorov-Smirnov statistic between 2 matrix columns
learner :  Allows to train, use and test a learner
read_and_write  :  Used to check (debug) the serialization system
run     :  runs a .plearn script
server  :  Launches plearn in computation server mode
test-dependencies       :  Compute dependency statistics between input and target variables.
test-dependency :  Compute dependency statistics between two selected columns of a vmat.
vmat    :  Examination and manipulation of vmat datasets


For more details on a specific command, type:
  % plearn help <command_name>

PLearn commands accept a number of arguments that are command specific. Very often the first argument is itself a sub-command...

help is actually a PLearn command itself! Thus we can ask for help on help:

valhalla:~/PLearn/examples/Tutorial> plearn help help
plearn 0.92.0  (Jun 21 2005 12:04:50)
*** Help for command 'help' ***
plearn command-line help
help <topic>
Run the help command with no argument to get an overview of the system.

The help command can give detailed help on any available PLearn command, as well as on any PLearn object class.

There is also an online HTML version of the help provided by the help command: see PLearn help on user-level commands and objects on the PLearn homepage.

1.2 Data Matrices

Machine-learning algorithms learn from data and are then used for prediction on new data. In this tutorial, we'll concentrate on the simplest and most usual form of data samples: vectors in $\sf {I\!R}^d$.

A dataset of $l$ samples is then simply an $l \times d$ matrix of reals. In PLearn such datasets are implemented through the concept of a VMatrix (or VMat in short).

A VMat is essentially:

The inputsize, targetsize, weightsize and extrasize are important pieces of information for learning algorithms, as they specify which part of each row is to be considered the known input (the first inputsize elements), which part is the target to predict (the next targetsize elements), and whether or not they are followed by a sample weight (weightsize is 0 or 1). The extrasize fields can be used to store any extra information.
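To make the row layout concrete, here is a minimal Python sketch that splits a row into its input, target, weight and extra parts according to these sizes. The helper name and the default weight of 1 are our own illustration, not part of PLearn:

```python
def split_row(row, inputsize, targetsize, weightsize, extrasize):
    """Split one dataset row into (input, target, weight, extra) parts.

    The first `inputsize` elements are the input, the next `targetsize`
    the target, then an optional sample weight (weightsize is 0 or 1),
    and finally any extra fields.
    """
    i = inputsize
    t = i + targetsize
    w = t + weightsize
    e = w + extrasize
    assert e == len(row), "sizes must sum to the row length"
    weight = row[t] if weightsize == 1 else 1.0  # assumed default weight of 1
    return row[:i], row[i:t], weight, row[w:e]

# A row of 1d_reg.amat, with inputsize=1, targetsize=1, weightsize=0, extrasize=0:
inp, target, weight, extra = split_row([0.5, 4.0], 1, 1, 0, 0)
```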

For the traditional tasks of statistical machine learning, we have the following conventions regarding datasets and “sizes”:

For ex., let's create a simple data set for 1D regression, i.e. to predict a real $y$ from a real $x$. Open a file 1d_reg.amat with your favorite editor, and enter the following text defining a $5 \times 2$ matrix:

#size: 5 2
#: x y     
#sizes: 1 1 0 0

0    3
0.5  4
1    5
2    6
3    7.5

This represents a $5 \times 2$ matrix whose columns are named x and y, and whose inputsize=1, targetsize=1, weightsize=0, extrasize=0.
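For illustration only, a tiny Python reader for this simple .amat layout (header lines starting with '#', then whitespace-separated numbers) might look like this. It is a sketch based on the example above, not PLearn's actual parser:

```python
def read_simple_amat(text):
    """Parse a minimal .amat: '#'-prefixed header lines, then rows of numbers."""
    fieldnames, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#:"):        # field-names header line
            fieldnames = line[2:].split()
        elif line.startswith("#"):       # other headers (#size:, #sizes:) ignored here
            continue
        else:
            rows.append([float(v) for v in line.split()])
    return fieldnames, rows

names, data = read_simple_amat("""#size: 5 2
#: x y
#sizes: 1 1 0 0
0    3
0.5  4
1    5
2    6
3    7.5
""")
```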

1.3 Viewing Data Matrices

Data matrices can be manipulated with the PLearn command vmat:

valhalla:~/PLearn/examples/Tutorial> plearn help vmat
plearn 0.92.0  (Jun 21 2005 12:04:50)
*** Help for command 'vmat' ***
Examination and manipulation of vmat datasets
Usage: vmat info <dataset>
       Will info about dataset (size, etc..)
   or: vmat fields <dataset> [name_only] [transpose]
       To list the fields with their names (if 'name_only' is specified, the indexes won't be displayed,
       and if 'transpose' is also added, the fields will be listed on a single line)
   or: vmat fieldinfo <dataset> <fieldname_or_num> [--bin]
       To display statistics for that field
   or: vmat bbox <dataset> [<extra_percent>]
       To display the data bounding box (i.e., for each field, its min and max, possibly extended by +-extra_percent ex: 0.10 for +-10% of the data range )
   or: vmat cat <dataset> [<optional_vpl_filtering_code>]
       To display the dataset
   or: vmat sascat <dataset.vmat> <dataset.txt>
       To output in <dataset.txt> the dataset in SAS-like tab-separated format with field names on the first line
   or: vmat view <dataset>
       Interactive display to browse on the data.
   or: vmat stats <dataset>
       Will display basic statistics for each field
   or: vmat convert <source> <destination> [--cols=col1,col2,col3,...]
       To convert any dataset into a .amat, .pmat, .dmat or .csv format.
       The extension of the destination is used to determine the format you want.
       If the option --cols is specified, it requests to keep only the given columns
       (no space between the commas and the columns); columns can be given either as a
       number (zero-based) or a column name (string).  You can also specify a range,
       such as 0-18, or any combination thereof, e.g. 5,3,8-18,Date,74-85
       If .csv (Comma-Separated Value) is specified as the destination file, the
       following additional options are also supported:
         --skip-missings: if a row (after selecting the appropriate columns) contains
                          one or more missing values, it is skipped during export
         --precision=N:   a maximum of N digits is printed after the decimal point
         --delimiter=C:   use character C as the field delimiter (default = ',')
   or: vmat gendef <source> [binnum1 binnum2 ...]
       Generate stats for dataset (will put them in its associated metadatadir).
   or: vmat genvmat <source_dataset> <dest_vmat> [binned{num} | onehot{num} | normalized]
       Will generate a template .vmat file with all the fields of the source preprocessed
       with the processing you specify
   or: vmat genkfold <source_dataset> <fileprefix> <kvalue>
       Will generate <kvalue> pairs of .vmat that are splitted so they can be used for kfold trainings
       The first .vmat-pair will be named <fileprefix>_train_1.vmat (all source_dataset except the first 1/k)
       and <fileprefix>_test_1.vmat (the first 1/k of <source_dataset>
   or: vmat diff <dataset1> <dataset2> [<tolerance> [<verbose>]]
       Will report all elements that differ by more than tolerance (defauts to 1e-6).
       If verbose==0 then print only total number of differences
   or: vmat cdf <dataset> [<dataset> ...]
       To interactively display cumulative density function for each field
       along with its basic statistics
   or: vmat diststat <dataset> <inputsize>
       Will compute and output basic statistics on the euclidean distance
       between two consecutive input points

<dataset> is a parameter understandable by getDataSet:
Dataset specification can be one of:
 - the path to a matrix file (or directory) .amat .pmat .vmat .dmat or plain ascii
 - ...

OK, there are many subcommands here, but let's concentrate on the few you're most likely to use:

valhalla:~/PLearn/examples/Tutorial> plearn vmat info 1d_reg.amat
plearn 0.92.0  (Jun 21 2005 12:04:50)
5 x 2
inputsize: 1
targetsize: 1
weightsize: 0
extrasize: 0


valhalla:~/PLearn/examples/Tutorial> plearn vmat fields 1d_reg.amat
plearn 0.92.0  (Jun 21 2005 12:04:50)
FieldNames:
0: x
1: y


valhalla:~/PLearn/examples/Tutorial> plearn vmat fieldinfo 1d_reg.amat y
plearn 0.92.0  (Jun 21 2005 12:04:50)
[------------------------------------- Computing statistics (5) -------------------------------------]
[....................................................................................................]
Field #1:  y     type: UnknownType
nmissing: 0
nnonmissing: 5
sum: 25.5
mean: 5.09999999999999964
stddev: 1.74642491965729807
min: 3
max: 7.5
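The statistics reported by fieldinfo are easy to reproduce by hand. The following Python sketch computes them for the y column; note that the stddev printed above matches the sample standard deviation (n - 1 denominator), an observation from this output rather than documented behaviour:

```python
import math

y = [3.0, 4.0, 5.0, 6.0, 7.5]   # the y column of 1d_reg.amat
n = len(y)

total = sum(y)
mean = total / n
# sample variance, with an n - 1 denominator, matching the stddev printed above
var = sum((v - mean) ** 2 for v in y) / (n - 1)
stddev = math.sqrt(var)
```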


valhalla:~/PLearn/examples/Tutorial> plearn vmat cat 1d_reg.amat
plearn 0.92.0  (Jun 21 2005 12:04:50)
0 3
0.5 4
1 5
2 6
3 7.5

If you want to browse the data matrix interactively, you can use the command plearn vmat view 1d_reg.amat (this is most useful for huge data sets; plearn needs to be compiled with curses support).

You can also see the points graphically by using the pyplot script pyplot plot_2d 1d_reg.amat

1.4 vmat File Formats

The V in VMatrix stands for Virtual, because VMatrix is a C++ virtual base class of which there are several concrete derived classes (do a plearn help VMatrix if you want to see how many...).

Accordingly, there are several file formats that represent real data matrices, distinguished by their file extension:

extension format description
.amat Simple ascii format
.pmat Simple raw binary format with 1 line ascii header
.dmat Directory containing compressed binary data
  (possibly split in several files for huge data)
.vmat Contains the specification of a C++ VMatrix object
  (in PLearn's ascii serialisation format)
.pymat Python preprocessing code that generates the
  specification of a C++ VMatrix object (a la .vmat)

In addition, several of these tend to have an associated .metadata directory, which contains associated data not held within the file itself (for ex.: fieldnames, inputsize and targetsize, field statistics, etc.).

You can convert from any format to .amat, .pmat, .dmat or .csv with the PLearn command vmat convert:

plearn vmat convert 1d_reg.amat 1d_reg.pmat
plearn vmat view 1d_reg.pmat
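To illustrate what a conversion to .csv amounts to, here is a Python sketch that writes rows with field names on the first line, similar in spirit to `plearn vmat convert data.amat data.csv` (this is our own illustration, not PLearn's converter):

```python
import csv
import io

def rows_to_csv(fieldnames, rows, delimiter=","):
    """Serialize a small matrix as CSV, field names first (like --delimiter=C)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter)
    writer.writerow(fieldnames)
    writer.writerows(rows)
    return buf.getvalue()

text = rows_to_csv(["x", "y"], [[0, 3], [0.5, 4], [1, 5], [2, 6], [3, 7.5]])
```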

1.5 PLearn Objects, Their Serialization and Specification

PLearn is first and foremost a C++ class library. PLearn also provides a mechanism to serialize such objects to and from files (i.e. write a representation of an in-memory object to a file, or later reload such a saved object from that file). PLearn serialization supports both an ASCII human-readable format (plearn_ascii), and a more efficient binary format (plearn_binary).

As a result of this capability, it is also possible to specify a PLearn object by simply writing its ASCII serialized form by hand. This is basically what a .vmat file contains: the ASCII serialised form of a C++ subclass of VMatrix.

For example, create a file selected_rows.vmat with the following content:

SelectRowsVMatrix(
  source = AutoVMatrix( specification = "1d_reg.amat" ),
  indices = [ 1 1 3 0 3 4],
  inputsize =   1,
  targetsize =  0,
  weightsize =  1
);

The serialised form of most PLearn objects, as can be seen here, is:

ObjectName(  
  optionname = optionval
  optionname = optionval
  ...
)

Note that in plearn_ascii format, in general, spaces, newlines, commas and semicolons are ignored (any sequence of those is considered a single separator).

There is typically a one-to-one correspondence between an object's options (in its serialised form) and the fields of the corresponding C++ object. A PLearn object often has many options, but they always have a default value, so there is no need to explicitly set those for which the default value is fine.

The above .vmat specifies an object of type SelectRowsVMatrix, which is a sort of vmat that will select desired rows from another “source” vmat. selected_rows.vmat will thus be an altered view of 1d_reg.amat, for which we also change the values of inputsize, targetsize, weightsize.
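Conceptually, what SelectRowsVMatrix does is simple row indexing into its source. A Python sketch of the idea (not PLearn code), using the 1d_reg.amat data and the indices above:

```python
source = [[0, 3], [0.5, 4], [1, 5], [2, 6], [3, 7.5]]  # rows of 1d_reg.amat
indices = [1, 1, 3, 0, 3, 4]

# The resulting view maps row i to source[indices[i]];
# rows may repeat and appear in any order.
view = [source[i] for i in indices]
```

The six rows of `view` are exactly what `vmat cat` prints for selected_rows.vmat.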

valhalla:~/PLearn/examples/Tutorial> plearn vmat info selected_rows.vmat
plearn 0.92.0  (Jun 22 2005 19:42:18)
6 x 2
inputsize: 1
targetsize: 0
weightsize: 1


valhalla:~/PLearn/examples/Tutorial> plearn vmat cat selected_rows.vmat
plearn 0.92.0  (Jun 22 2005 19:42:18)
0.5 4
0.5 4
2 6
0 3
2 6
3 7.5

Help on any PLearn object can be obtained, as usual, by invoking plearn help objectclass. This will output a commented serialised object, with all its build options and their default values. This help is also available in online HTML form. For ex. try:

plearn help SelectRowsVMatrix

This makes for a good starting point for writing a .vmat (or .plearn), as you can issue:

plearn help SelectRowsVMatrix > mymat.vmat
and then edit the file to your liking (removing unnecessary options that are to keep their default value, etc...)

.vmat is not the only file extension associated with specifications of PLearn objects in serialised form. Here are the other extensions you may encounter:

extension format description
.vmat specification of a subclass of VMatrix in plearn_ascii
  serialization format (with rudimentary macro-processing)
.plearn specification of any PLearn object in plearn_ascii
  format (with rudimentary macro-processing)
.psave serialized PLearn object in plearn_ascii or plearn_binary
  format (does not undergo macro-expansion)
.pymat Python preprocessing code that generates the
  plearn_ascii specification of a VMatrix subclass
.pyplearn Python preprocessing code that generates the
  plearn_ascii specification of any PLearn object

While .vmat and .plearn support some rudimentary macro-processing, this is deprecated in favor of the more powerful Python preprocessing of .pymat and .pyplearn files. We will get back to this later.

1.6 PLearner

The concept of a learning algorithm in PLearn is implemented through the PLearner class. Conceptually a PLearner is an object that:

The meaning and form of the output vector are learner-dependent, but in PLearn we try to respect the following conventions for standard tasks:

For ex. let us create a file linreg.plearn with the following content:

LinearRegressor(
  weight_decay = 1e-6
)

LinearRegressor is a subclass of PLearner and, as such, it can be trained, used and tested with the plearn learner command:

valhalla:~/PLearn/examples/Tutorial> plearn help learner
plearn 0.92.0  (Jun 22 2005 19:42:18)
*** Help for command 'learner' ***
Allows to train, use and test a learner
learner train <learner_spec.plearn> <trainset.vmat> <trained_learner.psave>
  -> Will train the specified learner on the specified trainset and save the resulting trained learner as
     trained_learner.psave

learner test <trained_learner.psave> <testset.vmat> <cost.stats> [<outputs.pmat>] [<costs.pmat>]
  -> Tests the specified learner on the testset. Will produce a cost.stats file (viewable with the plearn stats
     command) and optionally saves individual outputs and costs

learner compute_outputs <trained_learner.psave> <test_inputs.vmat> <outputs.pmat> (or 'learner co' as a shortcut)

learner compute_outputs_on_1D_grid <trained_learner.psave> <gridoutputs.pmat> <xmin> <xmax> <nx> (shortcut: learner cg1)
  -> Computes output of learner on nx equally spaced points in range [xmin, xmax] and writes the list of (x,output)
     in gridoutputs.pmat

learner compute_outputs_on_2D_grid <trained_learner.psave> <gridoutputs.pmat> <xmin> <xmax> <ymin> <ymax> <nx> <ny> (shortcut: learner cg2)
  -> Computes output of learner on the regular 2d grid specified and writes the list of (x,y,output) in gridoutputs.pmat

learner compute_outputs_on_auto_grid <trained_learner.psave> <gridoutputs.pmat> <trainset.vmat> <nx> [<ny>] (shortcut: learner cg)
  -> Automatically determines a bounding-box from the trainset (enlarged by 5%), and computes the output along a
     regular 1D grid of <nx> points or a regular 2D grid of <nx>*<ny> points. (Note: you can also invoke command vmat
     bbox to determine the bounding-box by yourself, and then invoke learner cg1 or learner cg2 appropriately)

learner analyze_inputs <data.vmat> <results.pmat> <epsilon> <learner_1> ... <learner_n>
  -> Analyze the influence of inputs of given learners. The output of each sample in the data VMatrix is computed
     when each input is perturbed, so as to estimate the derivative of the output with respect to the input. This
     is averaged over all samples and all learners so as to estimate the influence of each input. In the results.pmat
     file, are stored the average, variance, min and max of the derivative for all inputs (and outputs).

The datasets do not need to be .vmat they can be any valid vmatrix (.amat .pmat .dmat)

To train this linear regressor on our data-set 1d_reg.amat and save the resulting trained learner as linreg_trained.psave we issue the following command:

plearn learner train linreg.plearn 1d_reg.amat linreg_trained.psave

To get the predictions of the trained learner on new data that was not in the training set (for ex. $x=0.25, x=1.5, x=2.5$), we can create a file 1d_reg_test.amat containing

#size: 3 1
#: x
#sizes: 1 0 0
0.25
1.5
2.5

and issue the commands

valhalla:~/PLearn/examples/Tutorial> plearn learner compute_outputs linreg_trained.psave 1d_reg_test.amat 1d_reg_test_outputs.pmat
plearn 0.92.0  (Jun 22 2005 19:42:18)
[---------------------------------------- Using learner (3) -----------------------------------------]
[....................................................................................................]

valhalla:~/PLearn/examples/Tutorial> plearn vmat cat 1d_reg_test_outputs.pmat
plearn 0.92.0  (Jun 22 2005 19:42:18)
3.58836232959270118
5.3879309848394854
6.82758590903691243

We thus get the predictions output by the learner.
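These numbers can be checked by hand: with such a tiny weight decay, LinearRegressor is essentially ordinary least squares. The following plain-Python sketch (our own ridge formulation, assuming the decay penalizes only the slope, which reproduces the saved weights here; it is not PLearn's implementation) recovers the same bias, slope and predictions:

```python
# Ridge regression with a bias term on the 1d_reg.amat data.
x = [0.0, 0.5, 1.0, 2.0, 3.0]
y = [3.0, 4.0, 5.0, 6.0, 7.5]
decay = 1e-6  # weight_decay from linreg.plearn; nearly negligible here

n = len(x)
mx = sum(x) / n
my = sum(y) / n
sxx = sum((v - mx) ** 2 for v in x)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

slope = sxy / (sxx + decay)   # assumption: decay penalizes the slope only
bias = my - slope * mx

predict = lambda v: bias + slope * v
preds = [predict(v) for v in (0.25, 1.5, 2.5)]
```

The bias and slope match the weights in linreg_trained.psave, and preds matches the contents of 1d_reg_test_outputs.pmat above.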

To see the learnt parameters of the trained learner, we can examine the file linreg_trained.psave:

*1 ->LinearRegressor(
include_bias = 1 ;
cholesky = 1 ;
weight_decay = 9.99999999999999955e-07 ;
output_learned_weights = 0 ;
weights = 2  1  [
3.22844859854334443
1.43965492419742724
]
;
AIC = -2.53047027031051597 ;
BIC = -2.6866951053368755 ;
resid_variance = 1 [ 0.0596271276504959716 ] ;
expdir = "" ;
stage = 0 ;
n_examples = 5 ;
inputsize = 1 ;
targetsize = 1 ;
weightsize = 0 ;
forget_when_training_set_changes = 0 ;
nstages = 1 ;
report_progress = 1 ;
verbosity = 1 ;
nservers = 0  )

We can see that there are many more options in the saved learner than what we specified. In particular the weights option gives us the parameters tuned by the learning (i.e. the regression weights).

For 1D regression problems such as this, we can easily display the predicted output along the real line:

pyplot 1d_regression 1d_reg.amat linreg.plearn

This will train the given learner on the given training set, compute the output prediction along the real line, and plot the result.

1.7 A density estimation example

Let's make a new data matrix spiral.vmat containing:

VMatrixFromDistribution(
  distr = SpiralDistribution(),
  # nsamples=10600,
  nsamples=200,
  inputsize=2,
  targetsize=0,
  weightsize=0);

valhalla:~/PLearn/examples/Tutorial> plearn vmat view spiral.vmat

valhalla:~/PLearn/examples/Tutorial> pyplot plot_2d spiral.vmat

Now let's make a file parzen.plearn containing:

ParzenWindow(
  sigma_square = 0.06;
  outputs_def = "d";
);

and check how well it estimates the density:

valhalla:~/PLearn/examples/Tutorial> pyplot 2d_density spiral.vmat parzen.plearn
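A Parzen window estimator is just an average of Gaussian bumps centered on the training points. Here is a minimal 2D Python sketch of the idea, using an isotropic Gaussian kernel of variance sigma_square; this is our own illustration, and PLearn's ParzenWindow may differ in normalization details:

```python
import math

def parzen_density(point, samples, sigma_square):
    """Average of isotropic 2D Gaussian kernels centered on the samples."""
    norm = 1.0 / (2.0 * math.pi * sigma_square)   # 2D Gaussian normalizer
    total = 0.0
    for sx, sy in samples:
        d2 = (point[0] - sx) ** 2 + (point[1] - sy) ** 2
        total += norm * math.exp(-d2 / (2.0 * sigma_square))
    return total / len(samples)

# The estimated density is high at a training point and near zero far away.
samples = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
near = parzen_density((0.0, 0.0), samples, 0.06)
far = parzen_density((10.0, 10.0), samples, 0.06)
```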

1.8 A classification example

See the older tutorial.

Note that we can make a classification data set by issuing

pypoints 2d_classif.amat

1.9 Running a Full Experiment: PTester

The class PTester is used to wrap the action of running a complete experiment in a single runnable PLearn object. The goals of this class are as follows:

The relationship among the various parts is illustrated in Figure 1.1.

Figure 1.1: Relationship among the classes taking part in the experiment run by PTester. The PLearner must actually be an instance of a class derived from PLearner; likewise, the Splitter must be an instance of a class derived from Splitter. The desired statistics are specified as options of the PTester object, and the experiment results are stored in the experiment directory.
\resizebox{0.85\textwidth}{!}{\includegraphics{Figures/PTesterOverall}}

1.9.1 Process Underlying PTester

The process underlying PTester is illustrated in Figure 1.2.

Figure 1.2: Process Underlying PTester
\resizebox{0.85\textwidth}{!}{\includegraphics{Figures/PTesterProcess}}

1.9.2 Experiment Directory

PTester executes its experiment in a designated experiment directory (often abbreviated expdir, which is also the name of the option used to specify it within the PTester object). This directory should be empty at the beginning of the experiment (if it does not exist, it is created automatically); if it contains the results of a previous experiment, PTester complains loudly and exits immediately.

Note that if you run your experiments from .pyplearn scripts, a synthetic experiment directory of the form expdir_YYYY_MM_DD_HH:MM:SS is created for you automatically, which pretty much guarantees uniqueness of the name.
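If you want the same kind of timestamped unique name in your own scripts, something like the following Python snippet produces it. The helper is ours, and the exact format PLearn uses may differ slightly; expdir_YYYY_MM_DD_HH:MM:SS is what the text above describes:

```python
import time

def make_expdir_name(prefix="expdir", t=None):
    """Build a timestamped experiment-directory name, e.g. expdir_2005_06_22_19:42:18."""
    return time.strftime(f"{prefix}_%Y_%m_%d_%H:%M:%S", time.localtime(t))

# Use a fixed timestamp here so the result is reproducible:
name = make_expdir_name(t=0)
```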

1.9.3 Example

(See the .pyplearn tutorial.)

1.10 Python Preprocessing

See the pyplearn tutorial.