
6. Appendix A: File Formats

6.1 The .plearn and .psave Formats

6.1.1 Generalities on mixing ascii and binary

Before reading any element, the reader in most cases skips the following characters: space, tab, newline, carriage return, comma and semicolon. They act purely as separators and are otherwise ignored. Binary-serialized data should always start with a non-printable ASCII character.
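As an illustration of the rule above, here is a minimal Python sketch of how a reader might skip separators and then decide whether the next element is ASCII or binary. The names `SEPARATORS`, `skip_separators` and `looks_binary` are hypothetical and are not part of PLearn; the only assumption beyond the text is that "non-printable" means outside the range 0x20 to 0x7E.

```python
# Separator bytes that the text says are skipped before an element.
SEPARATORS = b" \t\n\r,;"


def skip_separators(buf: bytes, pos: int = 0) -> int:
    """Return the index of the first byte at or after pos that is not a separator."""
    while pos < len(buf) and buf[pos] in SEPARATORS:
        pos += 1
    return pos


def looks_binary(buf: bytes, pos: int = 0) -> bool:
    """Guess whether the next element is binary (its first byte is non-printable)."""
    pos = skip_separators(buf, pos)
    if pos >= len(buf):
        return False  # nothing left to read
    byte = buf[pos]
    # Assumption: printable ASCII is 0x20..0x7E; anything else is a binary header.
    return not (0x20 <= byte <= 0x7E)
```

For example, `looks_binary(b"  \x07\x01\x00\x00\x00")` is true because the first non-separator byte is 0x07, whereas `looks_binary(b" 4 [ 1.2 ]")` is false.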

6.1.2 TVec and TMat

TVec and TMat will be serialized differently depending on the implicit_storage flag of the PStream they are being written to.

If implicit_storage is set, serialization does not write the whole structure of the TVec or TMat; it saves only the size information and the elements, as a 1D or 2D sequence (see 6.1.4 and 6.1.5). For example:

4 [ 1.2 3.5 2.8 5.2 ]

3 2 [
0.1    0.2
0.3    0.4
0.5    0.6
]

If implicit_storage is false, then the complete structure of the TVec or TMat with the pointer to its storage (possibly shared with others) will be written explicitly. This corresponds to true, deep serialization.

For example:

TVec( 4 0 
*1->Storage(4 [ 1.2 3.5 2.8 5.2 ]) )

TMat( 3 2 2 0 
*2->Storage(6 [ 0.1 0.2 0.3 0.4 0.5 0.6 ] ) )

For TVec, the fields are: length, offset, followed by the storage pointer. For TMat, the fields are: length, width, mod, offset, followed by the storage pointer.

This makes it possible to preserve the structure, including storage shared between several views. For example, if we had a submatrix viewing the second column of the previous TMat, we would have:

TMat( 3 1 2 1
*2 )
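To make the role of these fields concrete, the following Python sketch shows how a view could be resolved against its storage. It assumes that mod is the row stride, i.e. the number of storage elements between the starts of two consecutive rows; this is consistent with the examples above but is not stated explicitly in the text. The function names `tvec_view` and `tmat_view` are hypothetical.

```python
def tvec_view(storage, length, offset):
    """Elements seen by a TVec(length offset) view of storage."""
    return storage[offset:offset + length]


def tmat_view(storage, length, width, mod, offset):
    """Rows seen by a TMat(length width mod offset) view of storage.

    Assumption: element (i, j) lives at storage[offset + i * mod + j].
    """
    return [
        [storage[offset + i * mod + j] for j in range(width)]
        for i in range(length)
    ]


# The shared storage from the example above.
storage = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

full = tmat_view(storage, 3, 2, 2, 0)           # TMat( 3 2 2 0 ... )
second_column = tmat_view(storage, 3, 1, 2, 1)  # TMat( 3 1 2 1 *2 )
```

Under that assumption, `second_column` is `[[0.2], [0.4], [0.6]]`: both views read from the same six stored values, and only the four size fields differ.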

6.1.3 Binary PLearn format for base types

To allow ASCII and binary to be mixed in a file, a non-printable ASCII character is used as a one-byte header to identify any binary portion. Table 6.1 gives the header codes for all basic types.

Note that char is treated the same as signed char, and long the same as int, i.e. 4 bytes long, which holds on platforms where long is 32 bits.


Table 6.1: Binary-header codes for base types

Base type        Byte order      Header byte   Number of bytes to follow
char             -               0x01          1
signed char      -               0x01          1
unsigned char    -               0x02          1
short            little-endian   0x03          2
short            big-endian      0x04          2
unsigned short   little-endian   0x05          2
unsigned short   big-endian      0x06          2
int              little-endian   0x07          4
int              big-endian      0x08          4
unsigned int     little-endian   0x0B          4
unsigned int     big-endian      0x0C          4
long             little-endian   0x07          4
long             big-endian      0x08          4
unsigned long    little-endian   0x0B          4
unsigned long    big-endian      0x0C          4
float            little-endian   0x0E          4
float            big-endian      0x0F          4
double           little-endian   0x10          8
double           big-endian      0x11          8
PRInt64          little-endian   0x16          8
PRInt64          big-endian      0x17          8
PRUint64         little-endian   0x18          8
PRUint64         big-endian      0x19          8
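The header codes can be turned into a small lookup table. The sketch below, written with Python's standard struct module, encodes and decodes a single binary value for a subset of Table 6.1 (the 64-bit types are left out). `BASE_TYPE_FORMATS`, `write_base` and `read_base` are hypothetical names for illustration, not PLearn functions.

```python
import struct

# Header byte -> struct format for a subset of Table 6.1.
# "<" selects little-endian, ">" selects big-endian, both with standard sizes.
BASE_TYPE_FORMATS = {
    0x01: "<b",              # char / signed char (byte order is irrelevant)
    0x02: "<B",              # unsigned char
    0x03: "<h", 0x04: ">h",  # short
    0x05: "<H", 0x06: ">H",  # unsigned short
    0x07: "<i", 0x08: ">i",  # int (and long)
    0x0B: "<I", 0x0C: ">I",  # unsigned int (and unsigned long)
    0x0E: "<f", 0x0F: ">f",  # float
    0x10: "<d", 0x11: ">d",  # double
}


def write_base(header: int, value) -> bytes:
    """Serialize one value as header byte followed by its raw bytes."""
    return bytes([header]) + struct.pack(BASE_TYPE_FORMATS[header], value)


def read_base(buf: bytes, pos: int = 0):
    """Decode one value at pos; return (value, position after the value)."""
    fmt = BASE_TYPE_FORMATS[buf[pos]]
    (value,) = struct.unpack_from(fmt, buf, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)
```

For instance, `write_base(0x07, 1)` yields the five bytes `07 01 00 00 00`, while `write_base(0x08, 1)` yields `08 00 00 00 01`.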


6.1.4 Ascii PLearn format for a sequence

We consider both one-dimensional sequences (array, vector, ...), which have only a length, and two-dimensional sequences, which have a length and a width.

Ascii-serialized one-dimensional sequences will have the following format:

length [ ... ... ... ]

with the elements of the sequence separated by a single space.

However, on reading, several variations of this format are recognized:

Ascii-serialized two-dimensional sequences will have the following format:

length width [
... ... ...
... ... ...
]

with the elements of each row separated by a tab, and the rows separated by a newline.

However, on reading, blanks, commas and semicolons between elements are skipped entirely, so you may lay out the data as you wish.

2D sequences are used exclusively for TMats. Note that it is also possible to make a 1D sequence of 1D sequences, but that is different from a 2D sequence.
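The ASCII layout described in this section can be sketched in Python as follows. The writers follow the canonical format (single spaces for 1D; tabs and newlines for 2D), and the readers accept any mix of blanks, commas and semicolons between elements. All function names are hypothetical, and the sketch assumes the elements are plain numbers.

```python
import re


def write_seq_1d(values):
    """Format a 1D sequence as: length [ e1 e2 ... ]"""
    return "%d [ %s ]" % (len(values), " ".join(repr(v) for v in values))


def write_seq_2d(rows):
    """Format a 2D sequence: tab between elements, newline between rows."""
    width = len(rows[0]) if rows else 0
    lines = ["\t".join(repr(v) for v in row) for row in rows]
    return "%d %d [\n%s\n]" % (len(rows), width, "\n".join(lines))


def _tokens(text):
    # Brackets are their own tokens; blanks, commas and semicolons separate.
    return re.findall(r"[\[\]]|[^\s,;\[\]]+", text)


def read_seq_1d(text):
    toks = _tokens(text)
    length = int(toks[0])
    if len(toks) < length + 3 or toks[1] != "[" or toks[2 + length] != "]":
        raise ValueError("malformed 1D sequence")
    return [float(t) for t in toks[2:2 + length]]


def read_seq_2d(text):
    toks = _tokens(text)
    length, width = int(toks[0]), int(toks[1])
    count = length * width
    if len(toks) < count + 4 or toks[2] != "[" or toks[3 + count] != "]":
        raise ValueError("malformed 2D sequence")
    flat = [float(t) for t in toks[3:3 + count]]
    return [flat[i * width:(i + 1) * width] for i in range(length)]
```

With these helpers, `write_seq_1d([1.2, 3.5, 2.8, 5.2])` produces `4 [ 1.2 3.5 2.8 5.2 ]`, and `read_seq_1d("4 [ 1.2, 3.5; 2.8\t5.2 ]")` reads back the same four values despite the irregular separators.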


6.1.5 Binary PLearn format for a sequence

We consider both one-dimensional sequences (array, vector, ...), which have only a length, and two-dimensional sequences, which have a length and a width.

The following table gives the corresponding header byte:

Type of sequence   Byte order      Header byte
one-dimensional    little-endian   0x12
one-dimensional    big-endian      0x13
two-dimensional    little-endian   0x14
two-dimensional    big-endian      0x15

Everything that follows is expected to be in the byte order implied by the header byte.

The first header-byte is followed by an element-type byte giving the nature of the elements in the sequence. It can be either the byte identifying a base-type given in Table 6.1 (the endianness must match), or '0' = 0x30 to indicate a sequence of booleans (1 byte per boolean) or 0xFF to indicate a generic sequence.

The header bytes are followed by one (for 1D sequences) or two (for 2D sequences) 4-byte ints giving the length (and, for 2D, the width) of the sequence. The total header size is therefore 6 bytes for 1D sequences and 10 bytes for 2D sequences.

This header is followed by a dump of the elements of the sequence (in row-major order for 2D). Note that a sequence of a base type may also be saved as a generic sequence (with the element-type byte 0xFF).
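The header layout just described (sequence header byte, element-type byte, then one or two 4-byte ints, then the raw elements) can be sketched in Python for sequences of doubles. This is only an illustration under the description above: the function names are hypothetical, and only the double element type (0x10 little-endian, 0x11 big-endian) is handled.

```python
import struct

# (number of dimensions, struct byte-order prefix) -> sequence header byte
SEQ_HEADERS = {(1, "<"): 0x12, (1, ">"): 0x13, (2, "<"): 0x14, (2, ">"): 0x15}
# Element-type byte for double, per byte order.
DOUBLE_CODE = {"<": 0x10, ">": 0x11}


def write_double_seq_1d(values, order="<"):
    """6-byte header (header byte, element type, length) then the doubles."""
    header = bytes([SEQ_HEADERS[(1, order)], DOUBLE_CODE[order]])
    header += struct.pack(order + "i", len(values))
    return header + struct.pack("%s%dd" % (order, len(values)), *values)


def write_double_seq_2d(rows, order="<"):
    """10-byte header (header byte, element type, length, width), row-major data."""
    length = len(rows)
    width = len(rows[0]) if rows else 0
    flat = [v for row in rows for v in row]
    header = bytes([SEQ_HEADERS[(2, order)], DOUBLE_CODE[order]])
    header += struct.pack(order + "ii", length, width)
    return header + struct.pack("%s%dd" % (order, len(flat)), *flat)


def read_double_seq(buf):
    """Decode a 1D (list) or 2D (list of rows) sequence of doubles."""
    by_code = {code: key for key, code in SEQ_HEADERS.items()}
    ndim, order = by_code[buf[0]]
    if buf[1] != DOUBLE_CODE[order]:
        raise ValueError("element type is not double in the matching byte order")
    dims = struct.unpack_from("%s%di" % (order, ndim), buf, 2)
    count = dims[0] if ndim == 1 else dims[0] * dims[1]
    flat = list(struct.unpack_from("%s%dd" % (order, count), buf, 2 + 4 * ndim))
    if ndim == 1:
        return flat
    width = dims[1]
    return [flat[i * width:(i + 1) * width] for i in range(dims[0])]
```

A little-endian 1D sequence of two doubles therefore starts with the six bytes `12 10 02 00 00 00` and is followed by 16 bytes of element data.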

Type of sequence                           Header byte   Followed by
Generic on little-endian                   0x12          size as 4-byte little-endian int,
                                                         then binary serialization of the elements
Generic on big-endian                      0x13          size as 4-byte big-endian int,
                                                         then binary serialization of the elements
Sequence of a base-type on little-endian   0x14          size as 4-byte little-endian int,
                                                         base-type given by header byte in previous
                                                         table, followed by binary dump of elements
Sequence of a base-type on big-endian      0x15          size as 4-byte big-endian int,
                                                         base-type given by header byte in previous
                                                         table, followed by binary dump of elements