This chapter is a tutorial that will walk you through the basic concepts from a user-level perspective.
We assume you have a copy of the PLearn distribution and a working plearn
executable accessible through your PATH. All the files used in this tutorial are in examples/Tutorial/,
so you should first cd to that directory.
PLearn executables such as plearn or plearn_light are typically invoked in command-line fashion:
    valhalla:~/PLearn/examples/Tutorial> plearn
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    Type 'plearn help' for help

    valhalla:~/PLearn/examples/Tutorial> plearn help
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    To run a .plearn script type:                       plearn scriptfile.plearn
    To run a command type:                              plearn command [ command arguments ]
    To get help on the script file format:              plearn help scripts
    To get a short description of available commands:   plearn help commands
    To get detailed help on a specific command:         plearn help <command_name>
    To get help on a specific PLearn object:            plearn help <object_type_name>
    To get help on datasets:                            plearn help datasets
The plearn executable can be invoked either with a PLearn script (more on that later) or with a PLearn command.
To get the list of available commands:
    valhalla:~/PLearn/examples/Tutorial> plearn help commands
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    To run a command, type: % plearn command_name command_arguments
    Available commands are:
    FieldConvert       : Reads a dataset and generates a .vmat file based on the data, but optimized for training.
    autorun            : watches files for changes and reruns the .plearn script
    help               : plearn command-line help
    htmlhelp           : Output HTML-formatted help for PLearn
    jdate              : Convert a Julian Date into a JJ/MM/YYYY date
    ks-stat            : Computes the Kolmogorov-Smirnov statistic between 2 matrix columns
    learner            : Allows to train, use and test a learner
    read_and_write     : Used to check (debug) the serialization system
    run                : runs a .plearn script
    server             : Launches plearn in computation server mode
    test-dependencies  : Compute dependency statistics between input and target variables.
    test-dependency    : Compute dependency statistics between two selected columns of a vmat.
    vmat               : Examination and manipulation of vmat datasets
    For more details on a specific command, type: % plearn help <command_name>
PLearn commands accept a number of command-specific arguments. Very often the first argument is itself a sub-command...
help is actually a PLearn command! Thus we can ask for help on help!
    valhalla:~/PLearn/examples/Tutorial> plearn help help
    plearn 0.92.0 (Jun 21 2005 12:04:50)

    *** Help for command 'help' ***
    plearn command-line help

    help <topic>

    Run the help command with no argument to get an overview of the system.
The help command can give detailed help on any available PLearn command, as well as on any PLearn object class.
There is an online HTML version of the help provided by the help command; see PLearn help on user-level commands and objects on the PLearn homepage.
Machine-learning algorithms learn from data and are then used for prediction on new data. In this tutorial, we'll concentrate on the simplest and most usual form of data samples: vectors in $\mathbb{R}^n$.
A dataset of $\ell$ such samples is then simply an $\ell \times n$ matrix of reals. In PLearn such datasets are implemented through the concept of a VMatrix (or VMat in short).
A VMat is essentially a matrix of reals (possibly computed on demand rather than stored explicitly, hence the "virtual" explained further below), together with some metadata: the field (column) names and the sizes inputsize, targetsize, weightsize and extrasize.
These sizes are important information for learning algorithms, as they specify which part of each row is to be considered the known input (the first inputsize elements), which part is the target to predict (the next targetsize elements), and whether or not these are followed by a sample weight (weightsize = 0 or 1). The extrasize fields can be used to store any extra information.
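To make the convention concrete, here is a small illustrative Python sketch (this is not PLearn code; the split_row helper and its signature are ours) showing how one row is carved up according to these sizes:

    def split_row(row, inputsize, targetsize, weightsize=0, extrasize=0):
        """Split one dataset row into (input, target, weight, extras)
        following the sizes convention described above."""
        i = inputsize
        t = i + targetsize
        w = t + weightsize
        e = w + extrasize
        inp    = row[:i]
        target = row[i:t]
        weight = row[t:w][0] if weightsize else 1.0   # unweighted samples count as 1
        extras = row[w:e]
        return inp, target, weight, extras

    # One row of the 1d_reg.amat dataset created below (inputsize=1, targetsize=1):
    print(split_row([0.5, 4.0], inputsize=1, targetsize=1))   # ([0.5], [4.0], 1.0, [])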
For the traditional tasks of statistical machine learning, we have the following conventions regarding datasets and “sizes”: for supervised learning (regression or classification), each sample row contains the input followed by the target (the target being the class index in the case of classification), optionally followed by a sample weight; for unsupervised tasks such as density estimation, targetsize is simply 0.
For ex., let's create a simple data set for 1D regression, i.e. to predict a real $y$ from a real $x$.
Open a file 1d_reg.amat
with your favorite editor, and enter the following text defining a $5 \times 2$ data matrix:
    #size: 5 2
    #: x y
    #sizes: 1 1 0 0
    0    3
    0.5  4
    1    5
    2    6
    3    7.5
This represents a matrix whose columns are named x and y, and whose inputsize=1, targetsize=1, weightsize=0, extrasize=0.
Data matrices can be manipulated with the PLearn command vmat:
    valhalla:~/PLearn/examples/Tutorial> plearn help vmat
    plearn 0.92.0 (Jun 21 2005 12:04:50)

    *** Help for command 'vmat' ***
    Examination and manipulation of vmat datasets
    Usage: vmat info <dataset>
           Will info about dataset (size, etc..)
       or: vmat fields <dataset> [name_only] [transpose]
           To list the fields with their names (if 'name_only' is specified, the indexes
           won't be displayed, and if 'transpose' is also added, the fields will be
           listed on a single line)
       or: vmat fieldinfo <dataset> <fieldname_or_num> [--bin]
           To display statistics for that field
       or: vmat bbox <dataset> [<extra_percent>]
           To display the data bounding box (i.e., for each field, its min and max,
           possibly extended by +-extra_percent ex: 0.10 for +-10% of the data range)
       or: vmat cat <dataset> [<optional_vpl_filtering_code>]
           To display the dataset
       or: vmat sascat <dataset.vmat> <dataset.txt>
           To output in <dataset.txt> the dataset in SAS-like tab-separated format
           with field names on the first line
       or: vmat view <dataset>
           Interactive display to browse on the data.
       or: vmat stats <dataset>
           Will display basic statistics for each field
       or: vmat convert <source> <destination> [--cols=col1,col2,col3,...]
           To convert any dataset into a .amat, .pmat, .dmat or .csv format.
           The extension of the destination is used to determine the format you want.
           If the option --cols is specified, it requests to keep only the given columns
           (no space between the commas and the columns); columns can be given either
           as a number (zero-based) or a column name (string). You can also specify
           a range, such as 0-18, or any combination thereof, e.g. 5,3,8-18,Date,74-85
           If .csv (Comma-Separated Value) is specified as the destination file,
           the following additional options are also supported:
           --skip-missings: if a row (after selecting the appropriate columns) contains
                            one or more missing values, it is skipped during export
           --precision=N: a maximum of N digits is printed after the decimal point
           --delimiter=C: use character C as the field delimiter (default = ',')
       or: vmat gendef <source> [binnum1 binnum2 ...]
           Generate stats for dataset (will put them in its associated metadatadir).
       or: vmat genvmat <source_dataset> <dest_vmat> [binned{num} | onehot{num} | normalized]
           Will generate a template .vmat file with all the fields of the source
           preprocessed with the processing you specify
       or: vmat genkfold <source_dataset> <fileprefix> <kvalue>
           Will generate <kvalue> pairs of .vmat that are splitted so they can be used
           for kfold trainings
           The first .vmat-pair will be named <fileprefix>_train_1.vmat (all
           source_dataset except the first 1/k) and <fileprefix>_test_1.vmat
           (the first 1/k of <source_dataset>
       or: vmat diff <dataset1> <dataset2> [<tolerance> [<verbose>]]
           Will report all elements that differ by more than tolerance (defauts to 1e-6).
           If verbose==0 then print only total number of differences
       or: vmat cdf <dataset> [<dataset> ...]
           To interactively display cumulative density function for each field
           along with its basic statistics
       or: vmat diststat <dataset> <inputsize>
           Will compute and output basic statistics on the euclidean distance
           between two consecutive input points

    <dataset> is a parameter understandable by getDataSet:
    Dataset specification can be one of:
      - the path to a matrix file (or directory) .amat .pmat .vmat .dmat or plain ascii
      - ...
OK, there are too many subcommands here, so let's concentrate on the few you're most likely to use:
    valhalla:~/PLearn/examples/Tutorial> plearn vmat info 1d_reg.amat
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    5 x 2
    inputsize: 1  targetsize: 1  weightsize: 0  extrasize: 0

    valhalla:~/PLearn/examples/Tutorial> plearn vmat fields 1d_reg.amat
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    FieldNames:
    0: x
    1: y

    valhalla:~/PLearn/examples/Tutorial> plearn vmat fieldinfo 1d_reg.amat y
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    [------------------------------------- Computing statistics (5) -------------------------------------]
    [....................................................................................................]
    Field #1:  y
    type:         UnknownType
    nmissing:     0
    nnonmissing:  5
    sum:          25.5
    mean:         5.09999999999999964
    stddev:       1.74642491965729807
    min:          3
    max:          7.5

    valhalla:~/PLearn/examples/Tutorial> plearn vmat cat 1d_reg.amat
    plearn 0.92.0 (Jun 21 2005 12:04:50)
    0 3
    0.5 4
    1 5
    2 6
    3 7.5
If you want to browse the data matrix interactively, you can use the
command plearn vmat view 1d_reg.amat
(this is most useful for huge data sets; note that plearn
needs to be compiled with curses support for this to work).
You can also view the points graphically by using the pyplot script:
pyplot plot_2d 1d_reg.amat
The V in VMatrix stands for Virtual, because VMatrix is a C++ virtual
base class of which there are several concrete derived classes (do a
plearn help VMatrix
if you want to see how many...).
Accordingly, there are several file formats that represent real data matrices, distinguished by their file extension:
    extension   | format description
    ------------|----------------------------------------------------------
    .amat       | Simple ascii format
    .pmat       | Simple raw binary format with 1 line ascii header
    .dmat       | Directory containing compressed binary data
                | (possibly split in several files for huge data)
    .vmat       | Contains the specification of a C++ VMatrix object
                | (in PLearn's ascii serialisation format)
    .pymat      | Python preprocessing code that generates the
                | specification of a C++ VMatrix object (a la .vmat)
In addition, several of those tend to have an associated .metadata directory, which contains associated data that is not held within the file itself (for ex.: field names, inputsize and targetsize, field statistics, etc.).
You can convert from any format to .amat, .pmat, .dmat or .csv with the PLearn command vmat convert:
    plearn vmat convert 1d_reg.amat 1d_reg.pmat
    plearn vmat view 1d_reg.pmat
PLearn is first and foremost a C++ library of classes. PLearn also provides a
mechanism to serialize instances of these classes to and from files (i.e. write a
representation of an in-memory object to a file, and later reload such a
saved object from that file). PLearn serialization supports both a
human-readable ASCII format (plearn_ascii) and a more efficient binary format (plearn_binary).
As a result of this capability, it is also possible to specify a PLearn object by simply writing its ASCII serialized form by hand. This is basically what a .vmat file contains: the ASCII serialised form of a C++ subclass of VMatrix.
For example, create a file selected_rows.vmat
with the following content:
    SelectRowsVMatrix(
        source = AutoVMatrix( specification = "1d_reg.amat" ),
        indices = [ 1 1 3 0 3 4 ],
        inputsize = 1,
        targetsize = 0,
        weightsize = 1
        );
The serialised form of most PLearn objects, as can be seen here, is:
    ObjectName(
        optionname = optionval
        optionname = optionval
        ...
        )
Note that in plearn_ascii
format, in general, spaces, newlines, commas and semicolons
are ignored (any sequence of those is considered a single separator).
There is typically a one-to-one correspondence between an object's options (in its serialised form) and the fields of the corresponding C++ object. A PLearn object often has many options, but they always have a default value, so that there is no need to explicitly set those for which the default value is fine.
The above .vmat specifies an object of type SelectRowsVMatrix,
which is a kind of vmat that selects the desired rows from another
“source” vmat. selected_rows.vmat
will thus be an altered view of 1d_reg.amat,
for which we also change the values of inputsize, targetsize and weightsize.
(Row indices are zero-based, so indices = [ 1 1 3 0 3 4 ] picks rows 1, 1, 3, 0, 3 and 4 of 1d_reg.amat, as the output below confirms.)
    valhalla:~/PLearn/examples/Tutorial> plearn vmat info selected_rows.vmat
    plearn 0.92.0 (Jun 22 2005 19:42:18)
    6 x 2
    inputsize: 1  targetsize: 0  weightsize: 1

    valhalla:~/PLearn/examples/Tutorial> plearn vmat cat selected_rows.vmat
    plearn 0.92.0 (Jun 22 2005 19:42:18)
    0.5 4
    0.5 4
    2 6
    0 3
    2 6
    3 7.5
Help on any PLearn object can be obtained, as usual, by invoking
plearn help <objectclass>. This will output a commented serialised object, with all its build options and their default values. This help is also available in online HTML form. For ex. try:
plearn help SelectRowsVMatrix
This makes for a good starting point for writing a .vmat (or .plearn), as you can issue:
plearn help SelectRowsVMatrix > mymat.vmat

and then edit the file to your liking (removing unnecessary options that are to keep their default value, etc.).
.vmat is not the only file extension associated with specifications of PLearn objects in serialised form. Here are the other extensions you may encounter:
    extension   | format description
    ------------|------------------------------------------------------------
    .vmat       | specification of a subclass of VMatrix in plearn_ascii
                | serialization format (with rudimentary macro-processing)
    .plearn     | specification of any PLearn object in plearn_ascii
                | format (with rudimentary macro-processing)
    .psave      | serialized PLearn object in plearn_ascii or plearn_binary
                | format (does not undergo macro-expansion)
    .pymat      | Python preprocessing code that generates the
                | plearn_ascii specification of a VMatrix subclass
    .pyplearn   | Python preprocessing code that generates the
                | plearn_ascii specification of any PLearn object
While .vmat and .plearn support some rudimentary macro-processing, this is deprecated in favor of the power of the Python preprocessing of .pymat and .pyplearn files. We will get back to this later.
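As a preview, here is a hedged sketch of what such Python preprocessing might look like (a sketch only: the exact import line and the convention of defining a main() function that returns the object are assumptions here, so check the .pyplearn tutorial referenced at the end of this chapter):

    # selected_rows.pymat -- illustrative sketch only.
    # Assumptions: the 'pl' object factory is importable as below, and the
    # script must define a main() returning the object specification.
    from plearn.pyplearn import pl

    def main():
        # Same specification as the hand-written selected_rows.vmat above,
        # but built with ordinary Python code (loops, variables, ...).
        return pl.SelectRowsVMatrix(
            source     = pl.AutoVMatrix(specification = "1d_reg.amat"),
            indices    = [1, 1, 3, 0, 3, 4],
            inputsize  = 1,
            targetsize = 0,
            weightsize = 1)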
The concept of a learning algorithm in PLearn is implemented through the PLearner class. Conceptually, a PLearner is an object that can be trained on a training set (a VMat), and that can then compute an output vector for any new input vector, as well as test costs measuring its performance.
The meaning and form of the output vector are learner-dependent, but in PLearn we try to respect the following convention for standard tasks: for regression, the output is simply the predicted target (as in the LinearRegressor example below), while density estimators such as ParzenWindow output the estimated density when their outputs_def option is set to "d" (as we do further below).
For ex. let us create a file linreg.plearn
with the following content:
LinearRegressor( weight_decay = 1e-6 )
LinearRegressor is a subclass of PLearner and, as such, it can be trained,
used and tested with the plearn learner command:
    valhalla:~/PLearn/examples/Tutorial> plearn help learner
    plearn 0.92.0 (Jun 22 2005 19:42:18)

    *** Help for command 'learner' ***
    Allows to train, use and test a learner

    learner train <learner_spec.plearn> <trainset.vmat> <trained_learner.psave>
      -> Will train the specified learner on the specified trainset and save the
         resulting trained learner as trained_learner.psave

    learner test <trained_learner.psave> <testset.vmat> <cost.stats> [<outputs.pmat>] [<costs.pmat>]
      -> Tests the specified learner on the testset. Will produce a cost.stats file
         (viewable with the plearn stats command) and optionally saves individual
         outputs and costs

    learner compute_outputs <trained_learner.psave> <test_inputs.vmat> <outputs.pmat>
      (or 'learner co' as a shortcut)

    learner compute_outputs_on_1D_grid <trained_learner.psave> <gridoutputs.pmat> <xmin> <xmax> <nx>
      (shortcut: learner cg1)
      -> Computes output of learner on nx equally spaced points in range [xmin, xmax]
         and writes the list of (x,output) in gridoutputs.pmat

    learner compute_outputs_on_2D_grid <trained_learner.psave> <gridoutputs.pmat> <xmin> <xmax> <ymin> <ymax> <nx> <ny>
      (shortcut: learner cg2)
      -> Computes output of learner on the regular 2d grid specified and writes
         the list of (x,y,output) in gridoutputs.pmat

    learner compute_outputs_on_auto_grid <trained_learner.psave> <gridoutputs.pmat> <trainset.vmat> <nx> [<ny>]
      (shortcut: learner cg)
      -> Automatically determines a bounding-box from the trainset (enlarged by 5%),
         and computes the output along a regular 1D grid of <nx> points or a regular
         2D grid of <nx>*<ny> points.
         (Note: you can also invoke command vmat bbox to determine the bounding-box
         by yourself, and then invoke learner cg1 or learner cg2 appropriately)

    learner analyze_inputs <data.vmat> <results.pmat> <epsilon> <learner_1> ... <learner_n>
      -> Analyze the influence of inputs of given learners. The output of each sample
         in the data VMatrix is computed when each input is perturbed, so as to estimate
         the derivative of the output with respect to the input. This is averaged over
         all samples and all learners so as to estimate the influence of each input.
         In the results.pmat file, are stored the average, variance, min and max of
         the derivative for all inputs (and outputs).

    The datasets do not need to be .vmat they can be any valid vmatrix (.amat .pmat .dmat)
To train this linear regressor on our dataset 1d_reg.amat
and save the resulting trained learner as linreg_trained.psave,
we issue the following command:
plearn learner train linreg.plearn 1d_reg.amat linreg_trained.psave
To get the predictions of the trained learner on new data that was not in
the training set (for ex. $x = 0.25$, $1.5$ and $2.5$), we can create a file
1d_reg_test.amat
containing
    #size: 3 1
    #: x
    #sizes: 1 0 0
    0.25
    1.5
    2.5
and issue the commands
    valhalla:~/PLearn/examples/Tutorial> plearn learner compute_outputs linreg_trained.psave 1d_reg_test.amat 1d_reg_test_outputs.pmat
    plearn 0.92.0 (Jun 22 2005 19:42:18)
    [---------------------------------------- Using learner (3) -----------------------------------------]
    [....................................................................................................]

    valhalla:~/PLearn/examples/Tutorial> plearn vmat cat 1d_reg_test_outputs.pmat
    plearn 0.92.0 (Jun 22 2005 19:42:18)
    3.58836232959270118
    5.3879309848394854
    6.82758590903691243
We thus get the predictions output by the learner.
To see the learnt parameters of the trained learner, we can examine the file linreg_trained.psave
:
    *1 ->LinearRegressor(
        include_bias = 1 ;
        cholesky = 1 ;
        weight_decay = 9.99999999999999955e-07 ;
        output_learned_weights = 0 ;
        weights = 2 1 [ 3.22844859854334443 1.43965492419742724 ] ;
        AIC = -2.53047027031051597 ;
        BIC = -2.6866951053368755 ;
        resid_variance = 1 [ 0.0596271276504959716 ] ;
        expdir = "" ;
        stage = 0 ;
        n_examples = 5 ;
        inputsize = 1 ;
        targetsize = 1 ;
        weightsize = 0 ;
        forget_when_training_set_changes = 0 ;
        nstages = 1 ;
        report_progress = 1 ;
        verbosity = 1 ;
        nservers = 0 )
We can see that there are many more options in the saved learner than what we specified. In particular the weights option gives us the parameters tuned by the learning (i.e. the regression weights).
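As a quick sanity check (assuming, as the include_bias = 1 option suggests, that the first element of weights is the bias term and the second the coefficient of $x$), these weights reproduce the predictions computed above:

$$\hat{y}(x) = 3.2284 + 1.4397\,x, \qquad \hat{y}(0.25) \approx 3.588, \quad \hat{y}(1.5) \approx 5.388, \quad \hat{y}(2.5) \approx 6.828.$$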
For 1D regression problems such as this, we can easily display the predicted output along the real line:
pyplot 1d_regression 1d_reg.amat linreg.plearn
This will train the given learner on the given training set, compute the output prediction along the real line, and plot the result.
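If you prefer to stay on the command line, the compute_outputs_on_1D_grid sub-command (shortcut cg1) documented in the learner help above produces a similar grid of predictions; the output file name linreg_grid.pmat and the grid of 101 points over [0, 3] below are just illustrative choices:

    plearn learner compute_outputs_on_1D_grid linreg_trained.psave linreg_grid.pmat 0 3 101
    plearn vmat cat linreg_grid.pmat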
Let's make a new data matrix spiral.vmat
containing:
    VMatrixFromDistribution(
        distr = SpiralDistribution(),
        # nsamples=10600,
        nsamples=200,
        inputsize=2,
        targetsize=0,
        weightsize=0);
    valhalla:~/PLearn/examples/Tutorial> plearn vmat view spiral.vmat
    valhalla:~/PLearn/examples/Tutorial> pyplot plot_2d spiral.vmat
Now let's make a file parzen.plearn containing
ParzenWindow( sigma_square = 0.06; outputs_def = "d" ; );
and check how well it estimates the density:
valhalla:~/PLearn/examples/Tutorial> pyplot 2d_density spiral.vmat parzen.plearn
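For the record, and assuming (as the sigma_square option name suggests) that ParzenWindow implements the classical Parzen-windows estimator with an isotropic Gaussian kernel, the density estimated from training points $x_1, \dots, x_N$ would be

$$\hat{p}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left( -\frac{\lVert x - x_i \rVert^2}{2\sigma^2} \right),$$

with $\sigma^2$ = sigma_square = 0.06 and $d = 2$ here: a smaller sigma_square gives a spikier density estimate, a larger one a smoother estimate.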
See the older tutorial.
Note that we can make a classification data set by issuing
pypoints 2d_classif.amat
The class PTester
is used to wrap the action of running a complete
experiment into a single runnable PLearn
object. It proceeds as follows:

- For each split of the data, PTester trains an associated learner (which must be of a class derived from PLearner) on the training set of the split.
- PTester then tests the trained learner on the test-set data of the split.
- Afterwards, it can compute performance statistics and report them.
The relationship among the various parts is illustrated in Figure 1.1.
The process underlying PTester is illustrated in Figure 1.2.
PTester
executes its experiment in a designated experiment
directory (often abbreviated expdir,
which is also the name of the option used
to specify it within the PTester
object). This directory should be
empty at the beginning of the experiment (if it does not exist, it is
created automatically); if it contains the results of a previous
experiment, PTester
complains loudly and exits immediately.
Note that if you run your experiments from .pyplearn
scripts, a
synthetic experiment directory of the form
expdir_YYYY_MM_DD_HH:MM:SS
is created for you automatically, which
pretty much guarantees uniqueness of the name.
(See the .pyplearn
tutorial.)
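Finally, to tie the pieces of this chapter together, here is a hedged .pyplearn sketch of a complete PTester experiment. It is a sketch only: the pl import line, the TrainTestSplitter class with its test_fraction option, the 'mse' cost name and the exact statnames syntax are assumptions not shown elsewhere in this tutorial, so check plearn help PTester and the .pyplearn tutorial before relying on them.

    # experiment.pyplearn -- hedged sketch of a complete experiment (illustrative only).
    # Assumed, not taken from this tutorial: the import line below, the
    # TrainTestSplitter class and its test_fraction option, the 'mse' cost
    # name, and the exact statnames syntax.
    from plearn.pyplearn import pl

    def main():
        return pl.PTester(
            expdir    = "expdir_linreg",   # experiment directory (see the expdir discussion above)
            dataset   = pl.AutoVMatrix(specification = "1d_reg.amat"),
            splitter  = pl.TrainTestSplitter(test_fraction = 0.25),   # assumed splitter
            learner   = pl.LinearRegressor(weight_decay = 1e-6),
            # Statistics to compute over the test results ('mse' is assumed to be
            # the name of LinearRegressor's test cost).
            statnames = ["E[test1.E[mse]]"])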