Subsections

5. Managing software growth

A large library and collection of tools in constant development like PLearn is comparable to a vine which keeps growing branches endlessly. As developers adapt it for diverse uses, leading it simultaneously in different directions, this growth will manifest as an overwhelming tendency to:

add new classes and files (that's fine!)
extend existing methods of base classes by adding extra arguments (possibly with default values)
add new member fields to existing base classes
add new methods to existing base classes
add new dependencies between existing files/classes (by adding new #include directives, due for example to the addition of a method taking a novel type of argument which needs to be "included").
add dependencies to new external libraries/tools

This process is natural in the course of development, and is desirable if we want to keep the code base adaptable. However if left unchecked it will lead to its logical undesirable outcome: a huge collection of files, base classes with hundreds of member fields and hundreds of methods (many for obscure special purposes), very long compile and link times and huge executables (due to all the added dependencies) and an installation nightmare on new environments (due to the dependency on a large number of exotic "external libraries").

The aim of this chapter is to raise awareness of these issues, by shedding light on them from several angles. This will hopefully help developers get a better grip on them and take them properly into account when making design decisions (in particular decisions that imply adding a #include in an existing file).

5.1 A few words on the build system

Believe it or not, but the pymake system is designed to optimally link together only the object files that are strictly necessary for a given executable program, and among them to re-compile only those that really need re-compilation.

Now in working with PLearn, many experience the very long compilatin and linking of a huge number of files. I insist that this is not due to the build system, but only to what you, directly or indirectly, include in your executable.

So while it is useful for some purposes to have a plearn executable that includes almost everything, it is clearly not this plearn.cc that you should be compiling linking and working with when you are developing new algorithms to carry experiments on a more limited subject field.

Suggestion:

Make a copy of plearn.cc into mylearn.cc and edit it to include only the few classes you (or your script) will need.
Don't include something like plearn_inc.h which includes a large number of files you probably don't need. Rather copy-paste the content of plearn_inc.h into your mylearn.cc and comment out all the things you won't need.
Don't commit your mylearn.cc under version control (so that it remains yours and seperate from the others' mylearn).

Using such a mylearn should make the list of files linked together and correpsonding link time a little shorter.

Now when using pymake's parallel compilation facility, the compilation time for plearn projects is usually reasonable. But since linking cannot be parallelized, link time can be problematic, especially when the object files being linked are not on your local machine (due to NFS sluggishness). So it is highly recommended, if you want link time to be acceptable, that linking does not go through NFS (i.e. link on the machine where the files are physically located).

TODO: write about pymake -dependency

The only remaining way to improve the issue of compilation and link time and executable size is to have a sensible dependency graph between files, so that unnecessary files don't get indirectly "included". This can only be achieved by developer awareness and proper restraint when the urge rises to add a #include in an existing file. The following section will try to give a few hints as to how this can be achieved.

Remark 1: You may have had the feeling that if only plearn was made into a proper library or set of libraries (static .a or dynamic .so) the compilation and link time problems would somehow vanish. This is simply untrue: such a system would necessarily be suboptimal compared to the pymake approach, as illustrated in the following example. Suppose we have a number of object files A,B,C,D,E,F,G bound together in a library, and suppose that B,C,D,E,F,G all depend on A but are otherwise independent of each other (ex: they are all direct subclasses of A). Now suppose that your executable only needs G directly. In addition suppose, we had to make a slight modifitation in A.h If you go the library route, regenerating your executable will imply first regenerating the out-of-date library, which means recompiling A,B,C,D,E,F,G and rebuilding the library archive. The linking phase will then build your executable from A and G. But with the pymake approach, only the necessary files A and G will be recompiled (and bound together in the executable), which is a more optimal use of ressources. Having the ability to generate proper libraries may be desirable for other considerations though.

Remark 2, limitation of pymake: for efficiency reasons, pymake doesn't do a full C preprocessor pass, and thus doesn't currently understand the preprocessor logic of #define #ifdef #if ... #else #endif. All it understands is C and C++ comments. Thus if for ex. you #include something within a #ifdef SOMEDEF, whether SOMEDEF happens to be defined or not in the current context, pymake will still conclude there is a dependency link with the #included file. So in short if you want pymake to ignore a #include the only current way is to comment it out.

5.2 How to limit compilation and link dependencies

5.2.1 Compilation dependency versus link dependency

For the rest of our discussion, it is important to distinguish between two different kinds of dependencies:

Saying that a file Y has a compilation dependency on a file X means that whenever X changes, Y is to be considered out-of-date and will have to be recompiled if we need an up-to-date version of Y. Ex: Y #includes X or includes another file that includes another file that includes X, etc...In short compilation dependencies affect which files will have to be "recompiled" (and thus affect re-compilation time).
Saying that a file Y has a link dependency on a file X means that whenever we need to link with the corresponding object file "X.o" we will also have to link with "Y.o". In short link dependencies affect which files will have to be linked together to produce an executable (and thus directly affects link time).

These two concepts are related, yet subtly different, as will become apparent shortly.

5.2.2 How dependencies tend to creep in, and ways around them

Let us begin with a few remarks. Old-fashioed C libraries tended to put one function per .c file (i.e. per compilation unit). This resulted in libraries with very fine granularity, and only the necessary functions were included and linked in a given executable. But with classes in object oriented C++, the granularity cannot go below the class level (one class per compilation unit). And as a minimum, a class carries with it the compilation and link dependencies implied by all its methods and all their argument types.

Now when people learn C++ object-oriented programming, especially when their background programming experience is a language like Java, there is a tendency to want to make everything a method within a class, because it appears elegant, and the OO way. But it is important to pause and reflect on the consequences and alternatives, especially when adding a method implies adding a #include which adds compile and link dependencies.

An example will better illustrate the alternative possible choices:

Suppose we have two independent classes A (files A.h, A.cc) and B (B.h, B.cc), and we want to add some operation f that requires instances of A and B as arguments (or return value). There are essentially 3 ways to implement such an operation:

As a new method of A, taking a B as argument A::f(B)
As a new method of B, taking an A as argument B::f(A)
As a regular function of A and B f(A,B)

Now the consequences in terms of introduced file dependencies are very different.

Adding A::f(B) implies making A depend on B (i.e. you'll no longer be able to link with A without also linking with B, and any change to B.h will at least trigger a recompile of A.cc).
Similarly, adding B::f(A) implies making B depend on A.
On the other hand f(A,B) can be put in a separate file (compilation unit) possibly together with other functions of A and B. This does not create any direct compilation or link dependency between classes A and B, and the file containing f needs only be included (and consequently compiled and linked with) if the specific funcitonality f is needed.

So whenever it appears at first obvious that we need to add a method A::f(B) to a given class A, and that adding such a method forces us to add an #include "B.h" directive, it should trigger a red light in the mind of the developer. The red light is an invitation to consider the other two alternatives. The following considerations should then weigh in the design decision.

Obviously, if A::f(B) is to be a virtual method designed to be redefined in sub-classes of A, then there is no discussing it, it should be A::f(B). But if there is no reason to expect that f needs to be virtual in order to be redefined in sub-classes, then the alternatives can and should be considered.
If f is a rarely needed functionality with a large implementation code, it is probably best left as a separate function in its own file (posibly with other similar functions).
It may be nicer to group f(A,B) with other functions relating to a same topic in a separate file, rather than unreasonably increase the number of methods of A.
But if A is meant to almost always work together with B (i.e. if it makes no sense having an A without also manipulating some kind of B), or if A already depends on B due to some other reason (such as having a member variable of type B) then A::f(B) is probably a reasonable design choice.
When hesitating between A::f(B) and B::f(A) the choice should be to make the least basic, higher level, least used class depend on the most basic, lower level, more widely used class, rather than the other way round.

TODO: talk about use of forward declaration to reduce compilation dependency, but how it doesn't reduce link dependency.

5.3 Regarding external library dependencies

External dependencies tend to creep in in the codebase, and if unchecked, end up causing a nighightmare for installation, link times, and memory usage of the running software. So it is a good policy to try and limit this, based on the following three guideleines

If something similar already exists within PLearn, prefer using that.
If it can easily be done without introducing a new dependency, then do it that way (even if it feels a little less cool.
Prefer using an external library that is already used in PLearn (and approved) than a introducing new one.

A few remarks regarding specific external libraries

Use PLearn's intrusive smart pointers (PP, ...) for all PLearn objects. Use boost's shared_ptr for non PLearn classes.
Use PLearn's PStream and serialization mechanism rather than any other C++ streams or serialization system (including std::stream).

5.4 Evolving software in a backward-compatible way

TODO: talk about

Tolerant, evolution-friendly, serialization format
class versioning
new version of class (filename scheme)