Saturday, April 29, 2006

Build flags declared bad for your health

Too many evils...

We keep getting bitten by this at work. We have several customers using the same codebase, with the only difference being how they are compiled. This isn't a case of one customer needing optimisation, and another not (we have that too). It's completely different code behaviours defined through build flags.

So, let me say it right now. Changing the behaviour of code through a compile time options and then shipping those variants to different customers is bad. Take that compile time option and make it run time. Otherwise, you can be sure that you will ship the wrong binary. We've done it so many times, it's becoming silly.

Your goal should be to have anyone check out, compile and ship your code to any customer using the same commands

There are several ways you can get rid of these options. My first preference is to get rid of it completely. Decide on a way the code will work and go with it. Why do you have the option at all? Is it faster in some way? Why isn't it good enough for all of your customers?

Next in order, I would get rid of the compile time flag completely and make it a run time configuration option. Make your system more configurable! You do have to be careful that you are providing something useful though. Why do you need to decide at all? Is it possible for the code to decide for you? (see previous post on configuration options).

Finally, if you absolutely must have a compile time option (for performance reasons or otherwise), build and ship all variants to all customers. Delay decisions as long as possible. It may take you another 2-3 hours to build both variants, but it will save you a round trip to your customer.

All of these will save you the embarrassment of yet again having your customer come back to you and say, "Don't you know how to compile your code?".

I hate hearing that question.

Inlines are Evil!

We just ran into an interesting problem with inline functions and shared libraries. Consider the following piece of code:

// inlineContainer.h
class inlineContainer {
     inline void setA(int a) {
         _a = a;      
     }

     int _a;
     int _b;
     void setB(int b);
};

Now, consider the following piece of code (test.cc):

#include "inlineContainer.h"

void testMe() {
   inlineContainer mine;

   mine.setB(1);
}

The file inlineContainer.cc contains the implementation for setB. It is compiled into a shared library. Additionally, nothing in your test.cc file calls setA. Most people would expect that when you compile test.o, it wouldn't have any symbols from inlineContainer defined.

They'd be wrong.

The C++ standard says that the compiler should include a non-inline version of a function, just in case it can't inline it all of the time. GCC doesn't wait for the function to be called before it creates a copy. It creates a copy as soon as the inline is seen. So, test.o ends up with a copy of setA, albeit a weak one.

Still, that wouldn't be too much of a problem in itself, however, the shared library will also have a weak copy. This leads us to the actual problem, what happens when we want to change the implementation of setA? Even though we have a copy of the new version in the shared library, unless we recompile test.cc, we will more than likely end up with an interestingly hard to find bug.

The problem is the linker. When there are multiple copies of the symbol that are of equal "weakness", the first one seen by the linker is used. Since the shared library will be seen after the old one in test.o, the linker will use that one.

So, we end up with a very strange bug. It won't happen every time the function is called, only in the situations where the compiler decides the function shoudln't (or can't) be inlined. That means that invocations separated by a couple of lines will have vastly different behaviours.

Of course, this all breaks the C++ "one definition rule", which requires an inline function to be defined with identical meaning in every translation unit. Additionally, compilers and linkers are not required to detect violations of the rule. GCC is obviously one of the ones that can't. ref: http://en.wikipedia.org/wiki/One_Definition_Rule

How do we fix it?

First off, don't expose inlines to code that isn't going to be part of the same shared library. It isn't part of your exposed interface, don't let them see it. Do this through private, friends, any favourite way you have. One way would be to put the inlines inside a #if that is only presented to code that is allowed to see them.

However, that won't solve the problem for code that has already been delivered. That is harder. In fact, I'm not sure it's even possible. Testing with G++ produces interesting results:

Consider two files:

// test.cc
#include 
#include 
#include 

class inlineContainer {
 public:
   void setA(int a);
   void setB(int b);
   int _a;
   int _b;
};

inline void inlineContainer::setA(int a) {
   _a = 1;
}

void inlineContainer::setB(int b) {
   _b = b;
}

void test_func(inlineContainer &a);

int main(int argc, char **argv){
   inlineContainer funky;

   test_func(funky);
   std::cerr <<>

class inlineContainer {
 public:
   void setA(int a);
   void setB(int b);
 private:
   int _a;
   int _b;
};

void inlineContainer::setA(int a) {
   _a = a;
}

void test_func(inlineContainer &a) {
 a.setA(15);
}

# compilation instructions
g++ -fPIC -c test2.cc
g++ -shared test2.o -o libtest2.so
g++ test.cc libtest2.so -o testA
g++ test.cc test2.o -o testB

Now, what happens when we run testA?

[jpollock@pollock ~]$ ./testA
1
1

When we run testB?

[jpollock@pollock ~]$ ./testB
15
5

As you can see, the shared library version ALWAYS uses the copy produced in the main executable, even though it is defined as weak. The staticly linked version uses the newer version in test2.cc, since it is a normal, strong symbol.

It becomes even more interesting if we change test.cc to remove the call to setA. When we recompile an re-run, we see that while setA is defined in testA, the version in the shared library is the one that is used.

That indicates that regardless of the weakness of the symbol, if it's in the main executable and called, it will always be used in preference to the one in the shared library.

Scary, huh? That means we can't override that symbol. Even if we ignore the problems presented by versions that are actually inlined, we still cannot override that symbol!

That indicates to me that the only way we can change setA is to either hack the name of setA to ensure that the new version of the library has a different name, or redeliver everything that contains setA as a weak (or otherwise) symbol.

Of course, you could be smart and not expose inline functions outside of your library.