6. PLearn coding guidelines and philosophy

Several people wrote significant parts of PLearn, and if you take a closer look at the code, you will see a number of clearly different coding styles and philosophies. However, as of this writing (09/2000), the overall design and organization of most of the library is still to be blamed (or praised...) on me. So these remarks are my personal view of things, and do not necessarily reflect the opinion of everybody on the PLearn developer team, but I hope it will help you understand the reasons why things are the way they are, and hopefully have you choose to keep them that way...

Pascal

6.1 A few words on C++

Agreed, C++ can be a very complex language. The main reason is that it is extremely feature-rich, but that is also what makes C++ so powerful and expressive, and thus appropriate for a machine-learning library. Yet I insist on the “can be”: it doesn't always have to be. It depends a great deal on which features you choose to use and when.

People who discover C++ tend to first be overwhelmed by its wealth of features, and then seem to want to use them all at once in even the simplest piece of code (complex templates, deep multiple inheritance trees, exceptions, multiple nested namespaces; add multi-threading on top of that and you're sure to write the most unreadable, unportable, compiler-bug-triggering, error-prone code ever). Finally, after great intellectual effort, they discover that even their compiler (not to mention their debugger!) has trouble understanding it all and, if they manage to have it swallow the code, they realise that no other compiler will (portability, anybody?). This is still quite true as of this writing (09/2000) and was even more so a few years ago. Tools will keep improving until some day, hopefully, they all behave perfectly by the book, according to the standard, but until that blessed day comes, beware... Many people then give up, frustrated, and decide to go back to C, which is a shame. C++ is a much better language than C, especially for writing object-oriented code, and it does make the programmer's life much easier... as long as you keep things simple.

So please, especially if you're a beginner, keep this in mind when writing C++ code: having so many “cool” features in the language doesn't mean that you must use them all at once. Choose wisely and, if in doubt, always prefer the simplest solution...

6.2 Design goals and priorities

Any project implicitly or explicitly sets some goals, and these directly influence the way code is written. With PLearn, one of the founding goals was to be able to describe complex machine-learning experiments by assembling simple building blocks directly in C++, without resorting to a layer of home-grown dedicated language (experience had shown us that it is hard to grow and maintain such a language, which always appears too limited anyway). Obviously we also want these experiments to run efficiently (hence the choice of C++ rather than a higher-level interpreted language).

Any system should ideally be simple to understand and use, lightning fast, and extremely general. Yet there is always a tradeoff to be made between these 3 highly desirable characteristics. Here is the priority I gave them in the design of the library; it follows logically from the project's primary goals:

  1. readability, simplicity, ease of use (and portability)
  2. computational efficiency
  3. genericity

6.3 Usage of C++ features in PLearn

As I mentioned earlier, moderation is good in everything, including in moderation... ;)

Function and method prototypes without parameter names

C++ code is typically divided between .h files which contain class layout, function and method prototypes, and .cc files which contain the actual implementation. Ideally, it should be possible to understand what a method or function does by looking it up only in the .h file. Comments are part of achieving this, but having a meaningful name for the parameters of the function also helps a great deal.

C++ allows you to omit parameter names in prototypes (and only give their types). This defeats the purpose of clarity, and is thus considered bad practice by the author and in PLearn in general. Except for possible default values (which are to appear only in the .h), the prototype in the .h file should be identical to the definition in the .cc file and include parameter names.

(Ex: people usually have trouble understanding what float* f(float*, int, int, char, char*, float); is supposed to do, and defining a new type for each argument is not the right way of making this more understandable... giving them a meaningful name is.)
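
As an illustration of the guideline above, here is a minimal sketch of a prototype and its matching definition. The function weightedSum and its parameters are invented for this example and are not part of PLearn; the point is only that the names alone tell the reader what each argument is for.

```cpp
#include <cassert>

// Hard to read: the reader must open the .cc file to learn what each
// argument means.
//     float weightedSum(const float*, const float*, int, float);

// Prototype as it would appear in the .h file: every parameter is named,
// and the default value is given here only.
float weightedSum(const float* values, const float* weights, int length,
                  float bias = 0.0f);

// Definition as it would appear in the .cc file: same parameter names,
// default value not repeated.
float weightedSum(const float* values, const float* weights, int length,
                  float bias)
{
    assert(length >= 0);  // a negative length would be a caller bug
    float sum = bias;
    for (int i = 0; i < length; ++i)
        sum += values[i] * weights[i];
    return sum;
}
```

A call such as weightedSum(values, weights, 3) then uses the default bias, while weightedSum(values, weights, 3, 1.0f) overrides it.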

Basic data types

Conceptually, people usually think of 3 simple basic data types: integers, reals, and booleans (possibly 4 if you add characters). C++ has them in many flavours, including signed and unsigned, several precisions, etc. These all have their use from a low-level hardware perspective (where things would have been much better had the standard given them fixed byte sizes...), but to the mathematically minded library user they are an annoyance. So throughout most of PLearn, unless otherwise dictated by low-level precision or space considerations, we use only 3 types that correspond to the 3 concepts:

Also, we encourage people not to define a new type if it conceptually corresponds to one of those three concepts. In particular, I for one (and I'm surely not alone) dislike having to write
namespace::subnamespace::classname::interiorclass::length_type
when the damn thing is just an integer, if you get my point. Please use int: it saves the user keystrokes and code lookup time, and eases understanding (i.e.: genericity-- but simplicity++ and ease_of_use++, see the section on design priorities above).

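The use of unsigned int types is also a source of annoyance to me, and of potentially nasty bugs. Ex: for (unsigned int i=10; i>=0; --i) never terminates, because an unsigned i is always >= 0: decrementing it past zero wraps around to a huge positive value.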

So again, unless you really need the extra bit of precision, use int (also saves a few keystrokes).
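
The unsigned pitfall mentioned above can be sketched as follows. The two helper functions are hypothetical names chosen for this example; the first shows the wrap-around that keeps the unsigned loop alive forever, the second shows the same countdown written with a plain int.

```cpp
#include <cassert>
#include <climits>

// With an unsigned counter, "i >= 0" is always true, so
//     for (unsigned int i = 10; i >= 0; --i) { ... }
// never terminates: decrementing 0 wraps around to UINT_MAX.
unsigned int decrementUnsignedZero()
{
    unsigned int i = 0;
    --i;  // well-defined wrap-around: i is now UINT_MAX
    return i;
}

// The same countdown with a plain int terminates as expected and
// visits start, start-1, ..., 0.
int countDownIterations(int start)
{
    assert(start >= 0);
    int iterations = 0;
    for (int i = start; i >= 0; --i)
        ++iterations;
    return iterations;
}
```

Most compilers can warn about the always-true comparison, but only if the relevant warnings are switched on, which is one more reason to default to int.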

A kind of string type is also usually seen as part of the set of basic types, but we'll discuss this in the section on the standard library.

Namespaces

Namespaces are most useful to prevent name clashes between different libraries. So ultimately, all of PLearn is to reside in the PLearn namespace. However gdb currently seems to have trouble coping with them, so the namespace directives are currently surrounded by ugly #ifdef USENAMESPACE which we usually keep undefined.

Also, for now, I do not encourage the use of sub-namespaces to organize the code within PLearn (with or without #ifdefs). It's already hard enough to get the organization right in terms of concepts, class hierarchies, and files, without introducing yet another hierarchy of things (which besides, would go mostly untested as we always compile with USENAMESPACE undefined, for now anyway).
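
A minimal sketch of the #ifdef pattern described above. Only the USENAMESPACE macro and the PLearn namespace name come from the text; the function square and the trailing using directive are assumptions made for illustration, and the actual PLearn headers may arrange this differently.

```cpp
#include <cassert>

#ifdef USENAMESPACE
namespace PLearn {
#endif

// Hypothetical library function, used only to illustrate the pattern.
int square(int x)
{
    return x * x;
}

#ifdef USENAMESPACE
} // end of namespace PLearn
using namespace PLearn;  // lets the code below call square() unqualified
#endif
```

With USENAMESPACE undefined, square lives in the global namespace and gdb sees a plain function; with it defined, the same source compiles into namespace PLearn.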

Exceptions and runtime errors

Exceptions can be a nice and useful feature, allowing you to build sophisticated error recovery mechanisms and the like... But designing a consistent error-recovery scheme with an appropriate exception class hierarchy is a complex task. Besides, in PLearn we typically have no use for a sophisticated error recovery mechanism: a runtime error is always a sign of a bug somewhere, and the policy in PLearn is to never try to second-guess the programmer. All we want is for the program to abort immediately with a somewhat meaningful message, and for the debugger to be able to trace the call. Unfortunately, as of this writing, exceptions are poorly supported by debuggers (and they can create a nightmare in multi-threaded code).

So essentially we don't use exceptions in PLearn, but a very simple runtime error mechanism: error("my meaningful error message"); will result in a call to function errormsg that simply prints out the message and exits the program. Thus it is easy to set a breakpoint in errormsg in the debugger and trace what happened. This is a no-fuss solution that does the job. Notice that the errormsg function can easily be modified to throw an exception if you wish to do a proper error recovery (in case brutally exiting the program is not an acceptable behaviour).

Exceptions can also be useful for other things, but for typical runtime-errors, please use error(errormsg).
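
The mechanism described above can be sketched roughly as follows. This is not the PLearn source: the names error and errormsg come from the text, but their exact signatures, the message format, the exit code, the enclosing namespace errorsketch and the caller elementAt are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

namespace errorsketch {  // hypothetical namespace, only to avoid name clashes

// Prints the message and exits. Setting a debugger breakpoint here lets
// you inspect the call stack at the point of failure.
void errormsg(const char* message)
{
    std::fprintf(stderr, "ERROR: %s\n", message);
    std::exit(1);
}

// Entry point used by library code to report a runtime error.
void error(const char* message)
{
    errormsg(message);
}

// Hypothetical caller: returns the element at position index, and goes
// through error() if the index is out of range.
int elementAt(const int* data, int length, int index)
{
    if (index < 0 || index >= length)
        error("elementAt: index out of range");
    return data[index];
}

} // end of namespace errorsketch
```

Because every runtime error funnels through the single function errormsg, replacing its body with a throw statement would be enough to switch the whole library to exception-based recovery.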

Templates

Templates are one of the most powerful features of C++. But they are also the most complex, and the one with which compilers and debuggers have the most trouble (all but the simplest template code is hardly portable across compilers because of inconsistencies between them, and it was much worse a few years back!). The early versions of PLearn deliberately did not use any template code at all (many other library designers out there for whom portability was a major concern made the same choice).

As the compilers improved, I started allowing myself to use simple templates for things where they were really appropriate (i.e. smart pointers and generic containers). And I would recommend that everybody stick to this. Please refrain from using templates as much as possible: it will make your code easier to write, to read, to debug, to port, to understand, and also faster to compile. It's usually easy to later “templatize” working and well-tested non-templated code if really needed. But it's always annoying to have to “de-templatize” complex template code because the compiler on your new target platform cannot understand it (chances are that you won't either).

Multiple inheritance and complex class hierarchies

Multiple inheritance poses a number of technical problems, and a multiple inheritance tree is also usually more difficult to understand conceptually. Therefore, PLearn uses only single inheritance and I would like to keep it that way. The only kind of “multiple inheritance” that we have is for inheriting interfaces (à la Java), i.e. abstract classes with only pure virtual methods.
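
To make the distinction above concrete, here is a small sketch of interface-only multiple inheritance. All class names are invented for this example and do not correspond to PLearn classes: there is one implementation parent carrying state, plus two Java-style interfaces that carry none.

```cpp
#include <cassert>
#include <string>

// Java-style interface: no data members, only pure virtual methods
// (plus a virtual destructor so objects can be deleted through it).
class Describable
{
public:
    virtual ~Describable() {}
    virtual std::string description() const = 0;
};

class Resettable
{
public:
    virtual ~Resettable() {}
    virtual void reset() = 0;
};

// The single implementation base class: it carries state and code.
class Counter
{
public:
    Counter() : count(0) {}
    virtual ~Counter() {}
    void increment() { ++count; }
    int count;
};

// One implementation parent (Counter) plus two interfaces.
class EpochCounter : public Counter, public Describable, public Resettable
{
public:
    virtual std::string description() const { return "EpochCounter"; }
    virtual void reset() { count = 0; }
};
```

Since the interfaces hold no data and no implementation, the usual multiple-inheritance headaches (duplicated base-class state, ambiguous method resolution) do not arise.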

Also, we often use concrete classes, and in general prefer flat class hierarchies to very deep ones, as they are easier to comprehend.

const

const is number one on my list of C++ annoyances. But unfortunately there is no way to really do without it, so try to use it consistently, and try not to get too frustrated in case of code constipation, pardon me, const problems... there is always a (hopefully clean) way around them.

public, private, protected

There are probably too many class members that are public in PLearn. But, as we love our potential library users (they are mostly us for now anyway), we tend to avoid paranoia, and to trust them for not doing dirty things with our not-so-private members. Hell, they have access to the source code anyway!

6.4 Usage of the C++ standard library in PLearn

In early versions of PLearn, we did not use much of the standard library (as no compilers yet agreed on a standard), except for iostreams. Now that there is a well established standard, and that all compiler makers are working towards conforming to it, we are slowly moving PLearn to using more of the standard library facilities.

Strings

Many places in PLearn still use char* to represent strings, but they'll slowly be changed into using the std::string class instead. Please use string from now on. Feel free to change any usage of char* you meet into string.
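
As a rough illustration of such a conversion, the sketch below rewrites a char*-style helper with std::string. The functions makeFilename and legacyLength are hypothetical and not taken from PLearn; c_str() is shown as the bridge to older code that still expects a const char*.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Old style, for comparison: the caller has to know who owns the buffer.
//     char* makeFilename(const char* basename, const char* extension);

// Hypothetical helper written with std::string: no manual memory
// management, and concatenation reads naturally.
std::string makeFilename(const std::string& basename,
                         const std::string& extension)
{
    return basename + "." + extension;
}

// Stand-in for an old function that has not been converted yet.
int legacyLength(const char* text)
{
    return static_cast<int>(std::strlen(text));
}
```

A converted caller can still feed unconverted code with legacyLength(makeFilename("model", "txt").c_str()), so the migration can proceed one function at a time.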

A number of useful additional functions for user-friendly string manipulation can be found in file PLearnCore/stringutils.h. A brief (and certainly not up to date) description of it, as well as a pointer to a quick overview of the basic string operations, can be found here.

Streams

Several pieces of old PLearn code still use the C stream library (FILE* ...), but the standard C++ stream facilities are the officially approved way to go for new code.

Standard containers and algorithms

It's now OK to use STL containers wherever appropriate. Two other generic containers were previously developed for PLearn: Array is heavily used, and is a base class for a number of other specialised array types, so it is not likely to vanish any time soon (although I may have it derive from std::vector one of these days). The main advantage of Array over std::vector is that runtime bound-checking can be turned on or off with a compilation flag (BOUNDCHECK), and there's also a user-friendly (but inefficient) syntax to build arrays from simple elements using the & operator. Hash may also be progressively abandoned in favour of std::map, hash_map (is this one part of the C++ standard?) and the like...
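
The idea of bound-checking controlled by a compilation flag can be sketched as below. This is emphatically not the PLearn Array class: CheckedArray is a hypothetical minimal container built on std::vector, and only the flag name BOUNDCHECK comes from the text. The & construction syntax is not reproduced here.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical minimal array, for illustration only.
template <class T>
class CheckedArray
{
public:
    explicit CheckedArray(int length) : elements()
    {
        assert(length >= 0);
        elements.resize(length);
    }

    int length() const { return static_cast<int>(elements.size()); }

    T& operator[](int i)
    {
#ifdef BOUNDCHECK
        // Compiled in only when BOUNDCHECK is defined, so optimized
        // builds pay nothing for the test.
        if (i < 0 || i >= length())
        {
            std::fprintf(stderr, "CheckedArray: index %d out of range\n", i);
            std::exit(1);
        }
#endif
        return elements[i];
    }

private:
    std::vector<T> elements;
};
```

Compiling with -DBOUNDCHECK during development catches out-of-range accesses immediately; leaving it off for production runs removes the check entirely.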

6.5 Naming conventions

The following naming conventions are used throughout PLearn. They are mostly inspired by the Java naming conventions. Anybody who uses or wishes to extend PLearn should be aware of them (as it makes understanding of the code easier) and try to respect them (as it will make the understanding of their code easier to other people who will have the privilege to dig into it).

To make it short and simple:

Remarks:

A few reasonable exceptions are tolerated throughout the code (such as function P for probability instead of a lowercase p, or a member variable K for a kernel matrix...) But exceptions that don't serve any purpose should not be!

6.6 Final word

The PLearn library is far from perfect: it still has a lot of rough edges (my to-do list is growing every day), and there are several things that I would do differently if I were to start all over again. But it is nevertheless already a very usable tool that, for the most part, I feel meets its primary design goals. Besides, I consider good code design an iterative process: one starts with an initially rough version and iteratively refines it in the light of real-world experience. The code base is not carved in stone, it is an evolving being, and the source code is there so that you can tweak it and adapt it to your needs, and hopefully help make it better.