
5. Managing software growth

A large library and collection of tools in constant development, like PLearn, is comparable to a vine that keeps growing branches endlessly. As developers adapt it for diverse uses, leading it simultaneously in different directions, this growth will manifest as an overwhelming tendency to:

  - add ever more classes and files;
  - add ever more member fields and methods to existing (especially base) classes;
  - add ever more #include dependencies between files;
  - add ever more dependencies on external libraries.

This process is natural in the course of development, and is desirable if we want to keep the code base adaptable. However, if left unchecked, it will lead to its logical but undesirable outcome: a huge collection of files; base classes with hundreds of member fields and hundreds of methods (many for obscure special purposes); very long compile and link times and huge executables (due to all the added dependencies); and an installation nightmare on new environments (due to the dependency on a large number of exotic "external libraries").

The aim of this chapter is to raise awareness of these issues, by shedding light on them from several angles. This will hopefully help developers get a better grip on them and take them properly into account when making design decisions (in particular decisions that imply adding a #include in an existing file).

5.1 A few words on the build system

Believe it or not, the pymake system is designed to link together only the object files that are strictly necessary for a given executable program and, among those, to recompile only the ones that actually need recompilation.

Now, when working with PLearn, many people experience very long compilation and linking of a huge number of files. I insist that this is not due to the build system, but only to what you, directly or indirectly, include in your executable.

So while it is useful for some purposes to have a plearn executable that includes almost everything, it is clearly not this plearn.cc that you should be compiling, linking, and working with when you are developing new algorithms to carry out experiments in a more limited subject area.

Suggestion: rather than compiling and linking the full plearn.cc, create your own minimal main file, say a mylearn.cc, that #includes only the headers of the classes you actually need for your current experiments, and build that with pymake.

Using such a mylearn should make the list of files linked together, and the corresponding link time, a little shorter.
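
As an illustration, a mylearn.cc can be as small as the following sketch; the header names and the learner interface are placeholders for whatever classes your experiments actually use:

    // mylearn.cc -- a stripped-down main for day-to-day experiments.
    // List only the headers of the classes you actually instantiate:
    // every class not mentioned here (directly or indirectly) is a class
    // that pymake will neither compile nor link into the executable.
    #include "MyFavoriteLearner.h"     // hypothetical learner used in your experiments
    #include "MyDataSource.h"          // hypothetical data access class

    int main()
    {
        MyFavoriteLearner learner;     // hypothetical class
        MyDataSource data("my_dataset");  // hypothetical class and dataset name
        learner.train(data);           // hypothetical interface
        return 0;
    }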

Now when using pymake's parallel compilation facility, the compilation time for plearn projects is usually reasonable. But since linking cannot be parallelized, link time can be problematic, especially when the object files being linked are not on your local machine (due to NFS sluggishness). So it is highly recommended, if you want link time to be acceptable, that linking does not go through NFS (i.e. link on the machine where the files are physically located).

TODO: write about pymake -dependency

The only remaining way to reduce compilation time, link time, and executable size is to maintain a sensible dependency graph between files, so that unnecessary files don't get indirectly "included". This can only be achieved by developer awareness and proper restraint when the urge arises to add a #include in an existing file. The following section will try to give a few hints as to how this can be achieved.

Remark 1: You may have had the feeling that if only PLearn were made into a proper library or set of libraries (static .a or dynamic .so), the compilation and link time problems would somehow vanish. This is simply untrue: such a system would necessarily be suboptimal compared to the pymake approach, as illustrated in the following example. Suppose we have a number of object files A, B, C, D, E, F, G bound together in a library, and suppose that B, C, D, E, F, G all depend on A but are otherwise independent of each other (e.g. they are all direct subclasses of A). Now suppose that your executable only needs G directly. In addition, suppose we have to make a slight modification to A.h. If you go the library route, regenerating your executable will imply first regenerating the out-of-date library, which means recompiling A, B, C, D, E, F, G and rebuilding the library archive. The linking phase will then build your executable from A and G. But with the pymake approach, only the necessary files A and G will be recompiled (and bound together in the executable), which is a more economical use of resources. Having the ability to generate proper libraries may still be desirable for other reasons, though.

Remark 2, a limitation of pymake: for efficiency reasons, pymake does not perform a full C preprocessor pass, and thus does not currently understand the preprocessor logic of #define, #ifdef, #if ... #else, #endif. All it understands is C and C++ comments. Thus if, for example, you #include something inside an #ifdef SOMEDEF block, pymake will conclude that there is a dependency on the #included file whether or not SOMEDEF happens to be defined in the current context. In short, if you want pymake to ignore a #include, the only current way is to comment it out.
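
For instance, in a fragment like the following (SOMEDEF and the header name are just illustrative), pymake records the dependency on fancy_lib.h even if SOMEDEF is never defined:

    // pymake sees this #include and records a dependency on fancy_lib.h,
    // regardless of whether SOMEDEF is defined anywhere.
    #ifdef SOMEDEF
    #include "fancy_lib.h"     // hypothetical header
    #endif

    // The only way to make pymake ignore the dependency is to comment it out:
    // #include "fancy_lib.h"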

5.2 How to limit compilation and link dependencies

5.2.1 Compilation dependency versus link dependency

For the rest of our discussion, it is important to distinguish between two different kinds of dependencies:

  1. Compilation dependency: a file X.cc has a compilation dependency on a header Y.h if Y.h must be read, directly or through other #includes, in order to compile X.cc; any change to Y.h then triggers a recompilation of X.cc.
  2. Link dependency: an object file X.o has a link dependency on Y.o if X.o references symbols defined in Y.o, so that Y.o (and whatever it in turn depends on) must be linked into any executable that contains X.o.

These two concepts are related, yet subtly different, as will become apparent shortly.
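
A minimal, made-up example makes the distinction concrete: below, A.cc has a compilation dependency on B.h, and A.o has a link dependency on B.o.

    // B.h
    int twice(int x);                 // declaration only

    // B.cc
    #include "B.h"
    int twice(int x) { return 2 * x; }

    // A.cc
    #include "B.h"                    // compilation dependency: any change to B.h
                                      // triggers a recompilation of A.cc
    int quadruple(int x) { return twice(twice(x)); }
    // link dependency: A.o references the symbol twice(), so B.o must be
    // linked into any executable that contains A.o.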

5.2.2 How dependencies tend to creep in, and ways around them

Let us begin with a few remarks. Old-fashioned C libraries tended to put one function per .c file (i.e. per compilation unit). This resulted in libraries with very fine granularity, so that only the necessary functions were included and linked into a given executable. But with classes in object-oriented C++, the granularity cannot go below the class level (one class per compilation unit). At a minimum, a class carries with it the compilation and link dependencies implied by all of its methods and all of their argument types.

Now, when people learn C++ object-oriented programming, especially when their prior programming experience is in a language like Java, there is a tendency to want to make everything a method of a class, because it appears elegant and "the OO way". But it is important to pause and reflect on the consequences and alternatives, especially when adding a method implies adding a #include that adds compile and link dependencies.

An example will better illustrate the possible alternatives:

Suppose we have two independent classes A (files A.h, A.cc) and B (files B.h, B.cc), and we want to add some operation f that takes instances of A and B as arguments (or as return value). There are essentially 3 ways to implement such an operation:

  1. as a method of class A: A::f(B);
  2. as a method of class B: B::f(A);
  3. as a freestanding function f(A,B) defined in its own compilation unit.

Now the consequences in terms of introduced file dependencies are very different.

  1. Adding A::f(B) implies making A depend on B (i.e. you'll no longer be able to link with A without also linking with B, and any change to B.h will at least trigger a recompile of A.cc).
  2. Similarly, adding B::f(A) implies making B depend on A.
  3. On the other hand, f(A,B) can be put in a separate file (compilation unit), possibly together with other functions involving A and B. This does not create any direct compilation or link dependency between classes A and B, and the file containing f only needs to be included (and consequently compiled and linked with) if the specific functionality f is needed, as sketched below.
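
To make option 3 concrete, here is a minimal sketch of how f can live in its own compilation unit without tying A and B together (the file names and the value() accessors are made up for the illustration):

    // A_B_ops.h -- a separate compilation unit for operations involving both A and B
    #ifndef A_B_OPS_H
    #define A_B_OPS_H

    class A;                      // forward declarations suffice here:
    class B;                      // no #include "A.h" or "B.h" in this header

    double f(const A& a, const B& b);

    #endif

    // A_B_ops.cc
    #include "A_B_ops.h"
    #include "A.h"                // the full class definitions are needed only here
    #include "B.h"

    double f(const A& a, const B& b)
    {
        // hypothetical computation using the public interfaces of A and B
        return a.value() + b.value();
    }

Only executables that actually call f will pull in A_B_ops.o; everything else can keep compiling and linking A or B in isolation.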

So whenever it at first appears obvious that we need to add a method A::f(B) to a given class A, and adding such a method forces us to add an #include "B.h" directive, a red light should go on in the mind of the developer. That red light is an invitation to consider the other two alternatives. The following considerations should then weigh in the design decision.

TODO: talk about use of forward declaration to reduce compilation dependency, but how it doesn't reduce link dependency.
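
As a preview of that point, the basic pattern looks as follows (class names are illustrative): because A.h only mentions B by reference, a forward declaration can replace the #include "B.h", so files that include A.h no longer acquire a compilation dependency on B.h; but since A.cc still calls into B, the link dependency of A.o on B.o remains.

    // A.h
    #ifndef A_H
    #define A_H

    class B;                      // forward declaration: no #include "B.h" needed here

    class A {
    public:
        void f(const B& b);       // a reference (or pointer) to B is enough in the interface
    };

    #endif

    // A.cc
    #include "A.h"
    #include "B.h"                // the full definition of B is needed only in the .cc

    void A::f(const B& b)
    {
        b.doSomething();          // hypothetical const member function of B:
    }                             // A.o still references B's code, so the
                                  // link dependency on B.o remains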

5.3 Regarding external library dependencies

External dependencies tend to creep into the codebase and, if unchecked, end up causing a nightmare for installation, link times, and memory usage of the running software. So it is a good policy to try and limit this, based on the following three guidelines:

A few remarks regarding specific external libraries
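
Whichever library is involved, one general technique that limits the spread of such a dependency is to confine the external library behind a thin wrapper, so that its header is #included in exactly one compilation unit (all names below are purely illustrative):

    // fancy_wrapper.h -- the only interface the rest of the code sees
    #ifndef FANCY_WRAPPER_H
    #define FANCY_WRAPPER_H

    // No external header is included here, so files using the wrapper
    // acquire no compilation dependency on the external library.
    double fancy_transform(double x);

    #endif

    // fancy_wrapper.cc -- the only file that includes the external header
    #include "fancy_wrapper.h"
    #include <fancy_lib.h>        // hypothetical external library header

    double fancy_transform(double x)
    {
        return fancy_lib_compute(x);   // hypothetical external function
    }

Only executables that link fancy_wrapper.o drag in the external library; the rest of the code base stays free of it.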

5.4 Evolving software in a backward-compatible way

TODO: talk about