The Examination of the Excised Exclamation
An Exercise in Exasperation
The Explanation
Sorry, that's enough of that. But you can see how much this thing irritated
me, once I had finally tracked the sucker down. :-)
Background
A strange bug has been plaguing users of Pueblo/UE 2.50 and 2.51: whenever
they close a window, the program crashes. The location of
the crash was somewhere in the code that handled unloading the support
modules, of which there are only two in Pueblo/UE: the World module
(which basically handles everything related to connecting to worlds and
interpreting their output), and the Sound module (which naturally enough
handles sound notifications and requests). Moving or deleting the Sound
module so that it could no longer be loaded caused the problem to
disappear, which reinforced the idea that there was a problem in the
unloading code.
I had seen some stuff in there before that looked a little suspicious;
a convoluted dependency chain is set up internally. The World module
loads the Sound module, which in turn loads the World module. I
altered the load order to try to alleviate the problems, and it
appeared to work on all my test machines. Shortly thereafter, the
program was released as version 2.50.
One user reported the return of the crashing bug, which I found highly
peculiar - it still wasn't bothering me, or anyone else that I asked.
Still, before I could figure out exactly why that one user was having
problems when nobody else was, I had to release 2.51 to sort out a
few other problems.
The release of 2.51 sparked off problems for a few more users;
apparently the crashing bug was more widespread now, for no readily
apparent reason. There was still no sign of it on my test machines.
The whole thing was quite frustrating.
Reappearance
Then, finally, working my way towards version 2.52 (adding a few
requested features), the crashing bug suddenly appeared on my
test machines, and I was able to track it down slowly, step by step.
The problem appeared to be that the program was trying to free the Sound
module multiple times, instead of just once. However, nothing in
the code seemed wrong - it uses a reference count, which is
incremented when a module is loaded and decremented when the
module is unloaded, and when it reaches zero the module is freed
and removed from the list of modules. But for some unknown
reason it wasn't being removed.
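
The reference-counting scheme described above can be sketched roughly as follows. This is only an illustration, not the Pueblo code: ModuleRegistry and all of its members are hypothetical names, and a counter stands in for actually freeing a library.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a module list; none of these names come
// from the Pueblo source.
class ModuleRegistry {
public:
    // Load: bump the count, creating the entry on first load.
    int load(const std::string& name) { return ++counts_[name]; }

    // Unload: drop the count; at zero, free the module and erase it.
    // Returns true only when this call actually freed the module.
    bool unload(const std::string& name) {
        auto it = counts_.find(name);
        if (it == counts_.end()) return false;  // unknown or already freed
        if (--it->second > 0) return false;     // still in use elsewhere
        ++frees_;                               // stands in for the real free
        counts_.erase(it);                      // remove from the module list
        return true;
    }

    bool isLoaded(const std::string& name) const {
        return counts_.count(name) != 0;
    }
    int freeCount() const { return frees_; }

private:
    std::map<std::string, int> counts_;  // module name -> reference count
    int frees_ = 0;                      // how many times a module was freed
};
```

In a scheme like this, the erase step is what guarantees a module is freed only once. If that step never runs, the entry stays in the list and a later unload can free the same module again.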
This was also quite frustrating. In the course of trying to sort
it out I had even completely rewritten most of the unloading
code, but to no avail. After getting it to log everything it
was doing, however, I noticed that after freeing the module the
reference count printed to the log had been corrupted, even
though the next time around it was printed correctly. Further
experimentation revealed that *all* local variables in that
one function got corrupted after freeing the module, which
meant that the code to remove it from the list was never being
executed.
Thy Stack Is Messy
A corruption of local variables means only one thing: something
messed up the stack. And with high-level languages, there are
typically only two things that can mess up the stack: either you're
writing past the edge of a local array (or messing with pointers), or
there's a mismatch of function calling conventions. The first case was
quickly discarded - simple code examination proved that it didn't have any
local arrays to overwrite, and it wasn't doing anything particularly silly
with pointers. So it had to be some sort of calling convention mismatch
with that one call that freed the module. The trouble is that calling
convention mismatches usually cause an immediate crash; they don't often
result in minor stack corruption like this.
Follow the Macro Slick Road
On to a hot lead at last, I tracked the call through, following a long
chain of macro definitions. The call was to a function pointer of type
ChMainHandler, which was a typedef'd function pointer. The typedef
itself was carried out by the CH_TYPEDEF_LIBRARY macro, which
turned out to be straightforward enough; it just declared the type with
the API macro attached - and that turned out to resolve to
__stdcall, the basic Win32 calling convention.
Now on to the other end. The entry point (because there is only one) in the
Sound module is defined simply as ChMain. That turns out to be a
macro that defines the function parameters, and itself uses another macro
internally - this time CH_GLOBAL_LIBRARY. That one turned out to
expand to a whole slew of other macros - C_NAMING DLL_EXPORT chparam
CDECL EXPORT - but there was the interesting one: CDECL, which
expanded to __cdecl, the standard C calling convention.
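
One way to shortcut this kind of macro chase is to let the preprocessor report what a macro finally expands to, using two-level stringification. The sketch below uses made-up stand-ins for the chain (every MY_ name is hypothetical, not from the Pueblo headers), and the conventions are plain identifiers rather than the real __stdcall and __cdecl keywords so that it compiles on any compiler.

```cpp
#include <cstring>

// Two-level stringification: EXPAND_STR fully expands its argument
// first, then STR turns the resulting tokens into a string literal.
#define STR(x) #x
#define EXPAND_STR(x) STR(x)

// Hypothetical stand-ins for the real macro chain. In real Win32 code
// the last link would be __stdcall or __cdecl.
#define MY_STDCALL          stdcall_convention
#define MY_CDECL            cdecl_convention
#define MY_API              MY_STDCALL
#define MY_TYPEDEF_LIBRARY  MY_API      // used when declaring the pointer type
#define MY_GLOBAL_LIBRARY   MY_CDECL    // used when defining the entry point

inline const char* typedefConvention() { return EXPAND_STR(MY_TYPEDEF_LIBRARY); }
inline const char* globalConvention()  { return EXPAND_STR(MY_GLOBAL_LIBRARY); }

// True when both ends of the call agree on the convention.
inline bool conventionsAgree() {
    return std::strcmp(typedefConvention(), globalConvention()) == 0;
}
```

With these stand-in definitions, conventionsAgree() returns false, which mirrors the mismatch between the typedef and the entry point.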
Eureka!
So finally it was confirmed - there was a calling convention mismatch, which
was almost certainly the cause of the problems. And this particular way
around, the mismatch is basically in "stealth" mode - the only difference
between the two conventions is who cleans up afterwards. The main code was
calling the module, the module thought it was a __cdecl so it would
let the main code clean up the stack; after it returned, the main code
thought it had called a __stdcall, so thought the stack had already
been cleaned. The result: nobody cleaned the stack, and it left junk in
all the local variables. But it didn't cause a crash (the eventual crash was
for different reasons).
If it had been the other way around, it would have caused an immediate crash:
both sides would have tried to clean the stack, with the result that
the main code would return to an invalid address and die. If the mismatch had
been with any other calling convention it would also have crashed immediately,
because the other conventions pass parameters differently (in a different
order, or in registers). But no, it was the single most difficult way around to
detect :-P
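
The who-cleans-up bookkeeping for both directions can be modelled with a little arithmetic. This is a simplified model, not real stack manipulation: leftoverBytes and the Convention enum are hypothetical names, and the only thing tracked is how many argument bytes remain on the stack after the call.

```cpp
enum class Convention { Cdecl, Stdcall };

// Argument bytes still on the stack after one call, measured from just
// before the arguments were pushed. 0 means balanced, a positive value
// means nobody cleaned up, a negative value means the arguments were
// popped twice.
inline int leftoverBytes(Convention callerBelieves,
                         Convention calleeIs,
                         int argBytes) {
    int onStack = argBytes;                   // caller pushes the arguments
    if (calleeIs == Convention::Stdcall)      // stdcall: callee pops on return
        onStack -= argBytes;
    if (callerBelieves == Convention::Cdecl)  // cdecl: caller pops after return
        onStack -= argBytes;
    return onStack;
}
```

For a call with 8 bytes of arguments, leftoverBytes(Convention::Stdcall, Convention::Cdecl, 8) gives 8 (the stealth case, where nobody cleans), while swapping the two gives -8 (both sides clean). Matching conventions give 0.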
Solving
The basic problem was CH_GLOBAL_LIBRARY expanding to include
CDECL (well, actually it doesn't really matter which way around they
are, so long as they both agree - but __stdcall is more consistent).
When I tracked that one down, and stared at it hard enough, I noticed that
there was another definition right next to it that used the API
macro instead - which is what I wanted! The decision of which to use was
conditional on whether or not the CH_ARCH_32 macro was defined. This
macro is defined when compiling for 32-bit Windows (9x/NT and up), rather
than for 16-bit Windows (3.11 and down). The code had been written to
use API in 16-bit and CDECL in 32-bit, precisely the
wrong way around!
Conclusion
And now we come to the final conclusion, the thing that fixed the problem,
and why I'm so frustrated (it's always the simplest things that are the
most frustrating):
The only thing required to fix it was to add a single exclamation
mark!
Hence the title of this page. What made it even more annoying is that
I'm positive it was in code that I haven't touched (ever), so it's a
bug that must somehow have been inherited from the Pueblo 2.02 source.
Yet it couldn't have been present in 2.01, or it would have exhibited
the same problems... sigh.
Well, there you have it, anyway, the full story in most of its gory
detail. One programmer's quest in search of the missing exclamation
mark. Hope you have more fun reading this than I had experiencing
it :-)