An Exercise in Exasperation
Sorry, that's enough of that. But you can see how much this thing irritated me, once I had finally tracked the sucker down. :-)
A strange bug has been plaguing users of Pueblo/UE 2.50 and 2.51; it seems
that whenever they closed a window, the program crashed. The location of
the crash was somewhere in the code that handled unloading the support
modules, of which there are only two in Pueblo/UE: the World module
(which basically handles everything related to connecting to worlds and
interpreting their output), and the Sound module (which naturally enough
handles sound notifications and requests). Moving or deleting the Sound
module so that it could no longer be loaded caused the problem to
disappear, which reinforced the idea that there was a problem in the
Sound module.
Then, finally, working my way towards version 2.52 (adding a few
requested features), the crashing bug suddenly appeared on my
test machines, and I was able to track it down slowly, step by step.
It appeared that the problem was that the program was trying to free the
Sound module multiple times, instead of just once. However, nothing in
the code seemed wrong - it uses a reference count, which is
incremented when a module is loaded and decremented when the
module is unloaded, and when it reaches zero the module is freed
and removed from the list of modules. But for some unknown
reason it wasn't being removed.
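
To make the intended behaviour concrete, here is a minimal sketch in C of a reference-counted module list of the kind described above. It is not Pueblo/UE's actual code: the Module struct and the load_module, unload_module and module_count functions are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical module record; the real Pueblo/UE structures are not shown here. */
typedef struct Module {
    char name[32];
    int refcount;            /* how many loads are still outstanding */
    struct Module *next;
} Module;

static Module *g_modules = NULL;   /* head of the list of loaded modules */

/* Load: reuse the existing entry or create a new one, then bump the count. */
Module *load_module(const char *name)
{
    Module *m;
    for (m = g_modules; m != NULL; m = m->next)
        if (strcmp(m->name, name) == 0)
            break;
    if (m == NULL) {
        m = (Module *)calloc(1, sizeof *m);
        if (m == NULL)
            return NULL;
        strncpy(m->name, name, sizeof m->name - 1);  /* calloc keeps it terminated */
        m->next = g_modules;
        g_modules = m;
    }
    m->refcount++;
    return m;
}

/* Unload: drop the count; at zero, unlink the entry from the list and free it.
   Returns 1 if the module was actually freed, 0 otherwise. */
int unload_module(const char *name)
{
    Module **link;
    for (link = &g_modules; *link != NULL; link = &(*link)->next) {
        Module *m = *link;
        if (strcmp(m->name, name) == 0) {
            assert(m->refcount > 0);
            if (--m->refcount > 0)
                return 0;        /* still in use elsewhere */
            *link = m->next;     /* remove from the list first... */
            free(m);             /* ...then free, exactly once */
            return 1;
        }
    }
    return 0;                    /* not in the list: nothing to free */
}

/* Number of modules currently in the list. */
int module_count(void)
{
    int n = 0;
    Module *m;
    for (m = g_modules; m != NULL; m = m->next)
        n++;
    return n;
}
```

In this sketch, a third unload_module("Sound") after two loads and two unloads finds nothing in the list and frees nothing. The bug was the opposite situation: the entry stayed in the list after being freed, so it got freed again.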
Thy Stack Is Messy
A corruption of local variables means only one thing: something messed up the stack. And with high-level languages, there are typically only two things that can mess up the stack: either you're writing past the edge of a local array (or messing with pointers), or there's a mismatch of function calling conventions. The first case was quickly discarded - simple code examination proved that the function didn't have any local arrays to overwrite, and it wasn't doing anything particularly silly with pointers. So it had to be some sort of calling convention mismatch with that one call that freed the module. The trouble is that calling convention mismatches usually cause an immediate crash; they don't often result in minor stack corruption like this.
Follow the Macro Slick Road
Onto a hot lead at last, I tracked the call through, following a long chain of macro definitions. The call was to a function pointer of type ChMainHandler, which was a typedef'd function pointer. The typedef itself was carried out by the CH_TYPEDEF_LIBRARY macro, which turned out to be straightforward enough; it just declared the type with the API macro attached - and that turned out to resolve to __stdcall, the basic Win32 calling convention.
Now on to the other end. The entry point (because there is only one) in the Sound module is defined simply as ChMain. That turns out to be a macro that defines the function parameters, and itself uses another macro internally - this time CH_GLOBAL_LIBRARY. That one turned out to expand to a whole slew of other macros - C_NAMING DLL_EXPORT chparam CDECL EXPORT - but there was the interesting one: CDECL, which expanded to __cdecl, the standard C calling convention.
So finally it was confirmed - there was a calling convention mismatch, which
was almost certainly the cause of the problems. And with the mismatch this
way around, it is basically in "stealth" mode - the only difference
between the two conventions is who cleans up the stack afterwards. The main
code was calling the module; the module thought it was a __cdecl function, so
it let the main code clean up the stack; after it returned, the main code
thought it had called a __stdcall function, so it assumed the stack had already
been cleaned. The result: nobody cleaned the stack, and it left junk in
all the local variables. But it didn't cause a crash (the eventual crash was
for different reasons).
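
The "who cleans up" bookkeeping can be shown with a toy model in C. This is only a simulation with a counter standing in for the stack; it does not touch the real machine stack, and the Convention enum and simulate_call function are hypothetical names made up for this sketch.

```c
/* Toy model of argument cleanup under the two conventions.
   Nothing here uses real calling-convention keywords. */
enum Convention { CONV_CDECL, CONV_STDCALL };

/* Simulates one call with nargs arguments.
   caller_thinks: the convention the caller believes the callee uses.
   callee_is:     the convention the callee was actually built with.
   Returns how many argument slots are left on the simulated stack:
     0  = balanced,
     >0 = nobody cleaned up (the case described above),
     <0 = both sides cleaned up (the opposite mismatch). */
int simulate_call(enum Convention caller_thinks,
                  enum Convention callee_is,
                  int nargs)
{
    int depth = 0;

    depth += nargs;                  /* caller pushes the arguments */

    if (callee_is == CONV_STDCALL)   /* a __stdcall callee pops them on return */
        depth -= nargs;

    if (caller_thinks == CONV_CDECL) /* a caller expecting __cdecl pops them itself */
        depth -= nargs;

    return depth;
}
```

With a caller expecting __stdcall and a callee built as __cdecl, simulate_call(CONV_STDCALL, CONV_CDECL, 3) leaves 3 slots behind, which corresponds to the quiet leftover junk. The reverse mismatch returns -3: both sides pop, which on a real stack tends to be far less forgiving.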
The basic problem was CH_GLOBAL_LIBRARY expanding to include CDECL (well, actually it doesn't really matter which convention is used, so long as both sides agree - but __stdcall is more consistent). When I tracked that one down, and stared at it hard enough, I noticed that there was another definition right next to it that used the API macro instead - which is what I wanted! The decision of which to use was conditional on whether or not the CH_ARCH_32 macro was defined. This macro is defined when compiling for 32-bit Windows (9x/NT4 and up), rather than for 16-bit Windows (3.11/NT3 and down). The code had been written to use API in 16-bit and CDECL in 32-bit, precisely the wrong way around!
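
As a rough sketch of the shape of that conditional in C (the real header is not reproduced here, so the exact preprocessor test is an assumption), the selection can be modelled with strings standing in for the two macro expansions. BUGGY_CONVENTION, FIXED_CONVENTION and the two accessor functions are hypothetical names used only for this illustration.

```c
/* Assumption: the build defines CH_ARCH_32 for 32-bit Windows targets.
   It is defined here by hand so the sketch stands alone. */
#ifndef CH_ARCH_32
#define CH_ARCH_32 1
#endif

/* Selection as described: the 32-bit build ends up with CDECL.
   Strings stand in for the real macro expansions. */
#if defined(CH_ARCH_32)
#define BUGGY_CONVENTION "CDECL"
#else
#define BUGGY_CONVENTION "API"
#endif

/* The same test with one '!' added: the 32-bit build now gets API. */
#if !defined(CH_ARCH_32)
#define FIXED_CONVENTION "CDECL"
#else
#define FIXED_CONVENTION "API"
#endif

/* Accessors so the outcome of each selection can be inspected at run time. */
const char *buggy_convention(void) { return BUGGY_CONVENTION; }
const char *fixed_convention(void) { return FIXED_CONVENTION; }
```

With CH_ARCH_32 defined, buggy_convention() yields "CDECL" and fixed_convention() yields "API", so the entry point's convention lines up with the API-based function pointer typedef on the calling side.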
And now we come to the final conclusion, the thing that fixed the problem, and why I'm so frustrated (it's always the simplest things that are the most frustrating):
The only thing required to fix it was to add a single exclamation mark!
Hence the title of this page. What made it even more annoying is that I'm positive it was in code that I haven't touched (ever), so it's a bug that must somehow have been inherited from the Pueblo 2.01 source. Yet it couldn't have been present in 2.01, or it would have exhibited the same problems... sigh.
Well, there you have it, anyway, the full story in most of its gory detail. One programmer's quest in search of the missing exclamation mark. Hope you have more fun reading this than I had experiencing it :-)