6. Appendix A: File Formats

6.1 The .plearn and .psave Formats

6.1.1 Generalities on mixing ascii and binary

The following characters are, in most cases, skipped before any element is read: space, tab, newline, carriage return, comma and semicolon. They are essentially ignored. Binary-serialized data should always start with a non-printable ascii character.
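The skipping rule can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that "non-printable" means any byte outside the printable ascii range 0x20..0x7E. Because tab, newline and carriage return are themselves control characters, separators have to be skipped before the next byte is classified.

```python
# Separator characters listed in the text: space, tab, newline,
# carriage return, comma and semicolon.
SEPARATORS = b" \t\n\r,;"


def skip_separators(data, pos=0):
    """Return the index of the first byte at or after pos that is not a separator."""
    while pos < len(data) and data[pos] in SEPARATORS:
        pos += 1
    return pos


def looks_binary(byte):
    """Assumption: 'non-printable' means outside the printable range 0x20..0x7E."""
    return byte < 0x20 or byte > 0x7E


def next_element_kind(data, pos=0):
    """Skip separators, then classify the next element as 'binary', 'ascii' or 'end'."""
    pos = skip_separators(data, pos)
    if pos >= len(data):
        return pos, "end"
    return pos, "binary" if looks_binary(data[pos]) else "ascii"
```

A reader built this way can dispatch to an ascii or a binary parser one element at a time, which is what makes mixed files possible.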

6.1.2 TVec and TMat

TVec and TMat will be serialized differently depending on the implicit_storage flag of the PStream they are being written to.

If implicit_storage is set, serialization does not write the whole structure of the TVec or TMat; it saves only the size information and the elements, as a 1D or 2D sequence (see 6.1.4 and 6.1.5). For example:

4 [ 1.2 3.5 2.8 5.2 ]

3 2 [
0.1    0.2
0.3    0.4
0.5    0.6
]

If implicit_storage is false, then the complete structure of the TVec or TMat with the pointer to its storage (possibly shared with others) will be written explicitly. This corresponds to true, deep serialization.

For example:

TVec( 4 0 
*1->Storage(4 [ 1.2 3.5 2.8 5.2 ]) )

TMat( 3 2 2 0 
*2->Storage(6 [ 0.1 0.2 0.3 0.4 0.5 0.6 ] ) )

For TVec, the fields are length and offset, followed by the storage pointer. For TMat, they are length, width, mod and offset, followed by the storage pointer.

This preserves the sharing structure. For example, a submatrix viewing the second column of the previous TMat would be written as:

TMat( 3 1 2 1
*2 )
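To see how a view and its storage relate, here is a small Python sketch. It assumes that element (i, j) of a TMat sits at index offset + i * mod + j of the shared storage. That indexing rule is inferred from the examples above, not quoted from the PLearn sources, and view_elements is a hypothetical helper.

```python
def view_elements(storage, length, width, mod, offset):
    """Return the rows seen by a TMat-like view over a flat storage list.

    Assumption: element (i, j) of the view lives at storage[offset + i * mod + j].
    """
    return [
        [storage[offset + i * mod + j] for j in range(width)]
        for i in range(length)
    ]


# Storage *2 from the example above.
storage = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# TMat( 3 2 2 0 *2 ): the full 3 x 2 matrix.
full = view_elements(storage, length=3, width=2, mod=2, offset=0)

# TMat( 3 1 2 1 *2 ): the second column, sharing the same storage.
second_column = view_elements(storage, length=3, width=1, mod=2, offset=1)
```

Under this assumption, the submatrix picks storage indices 1, 3 and 5, that is, the values 0.2, 0.4 and 0.6.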

6.1.3 Binary PLearn format for base types

To allow mixing of ascii and binary in a file, a non-printable ascii character is used as a one-byte header to identify any binary portion. Table 6.1 gives the header codes for all base types.

Note that char is treated the same as signed char, and long the same as int, i.e. 4 bytes long, which is the case on current architectures.

**Table 6.1:** Binary-header codes for base types
Base type	Byte order	Header byte	Number of bytes to follow
char	-	0x01	1
signed char	-	0x01	1
unsigned char	-	0x02	1
short	little-endian	0x03	2
short	big-endian	0x04	2
unsigned short	little-endian	0x05	2
unsigned short	big-endian	0x06	2
int	little-endian	0x07	4
int	big-endian	0x08	4
unsigned int	little-endian	0x0B	4
unsigned int	big-endian	0x0C	4
long	little-endian	0x07	4
long	big-endian	0x08	4
unsigned long	little-endian	0x0B	4
unsigned long	big-endian	0x0C	4
float	little-endian	0x0E	4
float	big-endian	0x0F	4
double	little-endian	0x10	8
double	big-endian	0x11	8
PRInt64	little-endian	0x16	8
PRInt64	big-endian	0x17	8
PRUint64	little-endian	0x18	8
PRUint64	big-endian	0x19	8

Booleans are represented the same way in binary mode as in ascii mode: with the character 0 (for false) or 1 (for true). There is no header byte.
A date (PDate) is written with the header byte 0xFE followed by a binary-serialized double (with the appropriate double header) representing the date in YYYYMMDD format.
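As an illustration of the header scheme, the following Python sketch decodes a single binary value using the standard struct module. It covers only part of Table 6.1 (the 64-bit types are left out), and the names BASE_TYPE_FORMATS, read_base_value and read_date are hypothetical, not part of PLearn.

```python
import struct

# Header byte -> struct format, for a subset of Table 6.1.
# '<' is little-endian and '>' is big-endian in Python's struct module.
BASE_TYPE_FORMATS = {
    0x01: "<b", 0x02: "<B",   # signed / unsigned char (byte order is irrelevant)
    0x03: "<h", 0x04: ">h",   # short
    0x05: "<H", 0x06: ">H",   # unsigned short
    0x07: "<i", 0x08: ">i",   # int (and long, per the note above)
    0x0B: "<I", 0x0C: ">I",   # unsigned int (and unsigned long)
    0x0E: "<f", 0x0F: ">f",   # float
    0x10: "<d", 0x11: ">d",   # double
}


def read_base_value(data, pos=0):
    """Decode one header-prefixed base-type value; return (value, next_pos)."""
    fmt = BASE_TYPE_FORMATS[data[pos]]
    (value,) = struct.unpack_from(fmt, data, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)


def read_date(data, pos=0):
    """Decode a PDate: 0xFE, then a header-prefixed double holding YYYYMMDD."""
    if data[pos] != 0xFE:
        raise ValueError("not a PDate header")
    value, pos = read_base_value(data, pos + 1)
    yyyymmdd = int(value)
    return (yyyymmdd // 10000, yyyymmdd // 100 % 100, yyyymmdd % 100), pos
```

For instance, the five bytes 0x07 0x2A 0x00 0x00 0x00 decode to the int 42, and the same value written big-endian is 0x08 0x00 0x00 0x00 0x2A.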

6.1.4 Ascii PLearn format for a sequence

We consider both one-dimensional sequences (array, vector, ...), which only have a length, and two-dimensional sequences, which have a length and a width.

Ascii-serialized one-dimensional sequences will have the following format:

length [ ... ... ... ]

with the elements of the sequence separated by a single space.

However, on reading, several variations of this format are recognized:

The elements may be separated by any number of blanks (space, tab, newline) and/or commas or semicolons.
The length may be omitted.
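A tolerant reader for these variations might look like the Python sketch below. It is not the PLearn parser: read_ascii_sequence is a hypothetical name, the elements are assumed to be numbers, and rejecting a declared length that does not match the element count is a choice made here, not documented behaviour.

```python
import re

# A token is either a bracket or a run of characters containing no blank,
# comma, semicolon or bracket; everything else acts as a separator.
_TOKEN = re.compile(r"[^\s,;\[\]]+|[\[\]]")


def read_ascii_sequence(text):
    """Parse 'length [ e1 e2 ... ]' or '[ e1 e2 ... ]' into a list of floats."""
    tokens = _TOKEN.findall(text)
    declared = None
    if tokens and tokens[0] != "[":
        declared = int(tokens.pop(0))  # optional leading length
    if not tokens or tokens[0] != "[" or tokens[-1] != "]":
        raise ValueError("expected a bracketed sequence")
    elements = [float(tok) for tok in tokens[1:-1]]
    if declared is not None and declared != len(elements):
        raise ValueError("declared length does not match element count")
    return elements
```

With this sketch, "4 [ 1.2 3.5 2.8 5.2 ]" and "[1.2,3.5;2.8 5.2]" parse to the same four elements.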

Ascii-serialized two-dimensional sequences will have the following format:

length width [
... ... ...
... ... ...
]

with the elements of each row separated by a tab, and the rows separated by a newline.

However, on reading, blanks, commas and semicolons between elements are skipped entirely, so you may format the data as you wish.
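The 2D layout can be sketched in Python as a writer and a matching tolerant reader. Both function names are hypothetical. The number formatting (Python's repr) is an assumption, since the text does not say how PLearn prints floating-point values.

```python
def write_ascii_matrix(rows):
    """Format rows as 'length width [', tab-separated rows, then ']'."""
    length = len(rows)
    width = len(rows[0]) if rows else 0
    lines = ["%d %d [" % (length, width)]
    lines += ["\t".join(repr(x) for x in row) for row in rows]
    lines.append("]")
    return "\n".join(lines)


def read_ascii_matrix(text):
    """Parse 'length width [ ... ]', ignoring blanks, commas and semicolons."""
    for sep in ",;":
        text = text.replace(sep, " ")
    head, _, body = text.partition("[")
    length, width = (int(tok) for tok in head.split())
    values = [float(tok) for tok in body.replace("]", " ").split()]
    if len(values) != length * width:
        raise ValueError("element count does not match length * width")
    # Rebuild the rows from the row-major list of values.
    return [values[r * width:(r + 1) * width] for r in range(length)]
```

Because the reader relies only on the declared length and width, the row breaks in the input carry no meaning; they are there for human readers.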

2D sequences are used exclusively for TMats. Note that a 1D sequence of 1D sequences is also possible, but it is not the same thing as a 2D sequence.

6.1.5 Binary PLearn format for a sequence

We consider both one-dimensional sequences (array, vector, ...), which only have a length, and two-dimensional sequences, which have a length and a width.

The following table gives the corresponding header bytes:

Type of sequence	byte-order	Header byte
one-dimensional	little-endian	0x12
one-dimensional	big-endian	0x13
two-dimensional	little-endian	0x14
two-dimensional	big-endian	0x15

Everything that follows is assumed to be in the byte order implied by the header byte.

The first header-byte is followed by an element-type byte giving the nature of the elements in the sequence. It can be either the byte identifying a base-type given in Table 6.1 (the endianness must match), or '0' = 0x30 to indicate a sequence of booleans (1 byte per boolean) or 0xFF to indicate a generic sequence.

The header bytes are followed by one (for 1D sequences) or two (for 2D sequences) 4-byte ints giving the length (and, for 2D, the width) of the sequence. The total header size is therefore 6 bytes for 1D sequences and 10 bytes for 2D sequences.

This header is followed by a dump of the elements of the sequence (in row-major order for 2D). Note that a sequence of a base type may also be saved as a generic sequence (with the element-type byte 0xFF).
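The header layout described above (sequence header byte, element-type byte, one or two 4-byte ints, then the elements) can be sketched in Python with the struct module. The sketch handles only little-endian sequences of three base types. The function names are hypothetical, and generic (0xFF) and boolean sequences are not covered.

```python
import struct

# Element-type byte -> struct code, for a few little-endian base types of Table 6.1.
LITTLE_ENDIAN_ELEMENTS = {0x07: "i", 0x0E: "f", 0x10: "d"}


def write_binary_sequence(values, element_type, width=None):
    """Serialize a little-endian 1D (width is None) or 2D sequence of a base type.

    For 2D, values must already be flattened in row-major order.
    """
    code = LITTLE_ENDIAN_ELEMENTS[element_type]
    if width is None:
        header = struct.pack("<BBi", 0x12, element_type, len(values))
    else:
        length = len(values) // width if width else 0
        header = struct.pack("<BBii", 0x14, element_type, length, width)
    return header + struct.pack("<%d%s" % (len(values), code), *values)


def read_binary_sequence(data):
    """Decode what write_binary_sequence produced; return (shape, values)."""
    kind, element_type = data[0], data[1]
    code = LITTLE_ENDIAN_ELEMENTS[element_type]
    if kind == 0x12:
        (length,) = struct.unpack_from("<i", data, 2)
        shape, count, start = (length,), length, 6
    elif kind == 0x14:
        length, width = struct.unpack_from("<ii", data, 2)
        shape, count, start = (length, width), length * width, 10
    else:
        raise ValueError("only little-endian headers 0x12 and 0x14 are handled")
    values = struct.unpack_from("<%d%s" % (count, code), data, start)
    return shape, list(values)
```

A 1D sequence of three ints therefore occupies 6 + 3 * 4 = 18 bytes, and a 3 x 2 sequence of doubles occupies 10 + 6 * 8 = 58 bytes.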

Type of sequence	Header byte	Followed by
Generic, on little-endian	0x12	size as 4-byte little-endian int, then binary serialization of the elements
Generic, on big-endian	0x13	size as 4-byte big-endian int, then binary serialization of the elements
Sequence of a base type, on little-endian	0x14	size as 4-byte little-endian int, base type given by its header byte from Table 6.1, followed by a binary dump of the elements
Sequence of a base type, on big-endian	0x15	size as 4-byte big-endian int, base type given by its header byte from Table 6.1, followed by a binary dump of the elements