[UE Logo] - News
---
Client
Documentation
FAQ
Privacy Policy
Incoming Features
Source Code FAQ
Feedback
World Lists
Community

SourceForge Logo Pueblo Now!
   

 

 The Examination of the Excised Exclamation


An Exercise in Exasperation

The Explanation

Sorry, that's enough of that. But you can see how much this thing irritated me, once I had finally tracked the sucker down.  :-)

Background

A strange bug has been plaguing users of Pueblo/UE 2.50 and 2.51; it seems that whenever they closed a window, the program crashed. The location of the crash was somewhere in the code that handled unloading the support modules, of which there are only two in Pueblo/UE: the World module (which basically handles everything related to connecting to worlds and interpreting their output), and the Sound module (which naturally enough handles sound notifications and requests). Moving or deleting the Sound module so that it could no longer be loaded caused the problem to disappear, which reinforced the idea that there was a problem in the unloading code.

I had seen some stuff in there before that looked a little suspicious; a convoluted dependency chain is set up internally. The World module loads the Sound module, which in turn loads the World module. I altered the load order to try and alleviate the problems, and it appeared to work on all my test machines. Shortly thereafter, the program was released as version 2.50.

One user reported the return of the crashing bug, which I found highly peculiar - it still wasn't bothering me, or anyone else that I asked. Still, before I could figure out exactly why that one user was having problems when nobody else was, I had to release 2.51 to sort out a few other problems.

The release of 2.51 sparked off problems for a few more users; apparently the crashing bug was more widespread now, for no readily apparent reason. There was still no sign of it on my test machines. The whole thing was quite frustrating.

Reappearance

Then, finally, working my way towards version 2.52 (adding a few requested features), the crashing bug suddenly appeared on my test machines, and I was able to track it down slowly, step by step. It appeared that the problem was it was trying to free the Sound module multiple times, instead of just once. However, nothing in the code seemed wrong - it uses a reference count, which is incremented when a module is loaded and decremented when the module is unloaded, and when it reaches zero the module is freed and removed from the list of modules. But for some unknown reason it wasn't being removed.

This was also quite frustrating. In the course of trying to sort it out I had even completely rewritten most of the unloading code, but to no avail. After getting it to log everything it was doing, however, I noticed that after freeing the module the reference count printed to the log had been corrupted, even though the next time around it was printed correctly. Further experimentation revealed that *all* local variables in that one function got corrupted after freeing the module, which meant that the code to remove it from the list was never being executed.

Thy Stack Is Messy

A corruption of local variables means only one thing: something messed up the stack. And with high-level languages, there's typically only two things that can mess up the stack: either you're writing past the edge of a local array (or messing with pointers), or there's a mismatch of function calling conventions. The first case was quickly discarded - simple code examination proved that it didn't have any local arrays to overwrite, and it wasn't doing anything particularly silly with pointers. So it had to be some sort of calling convention mismatch with that one call that freed the module. The trouble is that calling convention mismatches usually cause an immediate crash, they don't often result in minor stack corruption like this.

Follow the Macro Slick Road

On to a hot lead at last, I tracked the call through, following a long chain of macro definitions. The call was to a function pointer of type ChMainHandler, which was a typedef'd function pointer. The typedef itself was carried out by the CH_TYPEDEF_LIBRARY macro, which turned out to be straightforward enough; it just declared the type with the API macro attached - and that turned out to resolve to __stdcall, the basic Win32 calling convention.

Now on to the other end. The entrypoint (because there is only one) in the Sound module is defined simply as ChMain. That turns out to be a macro that defines the function parameters, and itself uses another macro internally - this time CH_GLOBAL_LIBRARY. That one turned out to expand to a whole slew of other macros - C_NAMING DLL_EXPORT chparam CDECL EXPORT - but there was the interesting one: CDECL, which expanded to __cdecl, the standard C calling convention.

Eureka!

So finally it was confirmed - there was a calling convention mismatch, which was almost certainly the cause of the problems. And with this particular way around, the mismatch is basically in "stealth" mode - the only difference between the two conventions is who cleans up afterwards. The main code was calling the module, the module thought it was a __cdecl so it would let the main code clean up the stack; after it returned, the main code thought it had called a __stdcall, so thought the stack had already been cleaned. The result: nobody cleaned the stack, and it left junk in all the local variables. But it didn't cause a crash (the eventual crash was for different reasons).

If it had been the other way around, it would have caused an immediate crash; they both would have tried to clean the stack, with the result that the main code would return to an invalid address and die; if the mismatch had been with any other calling convention it would also have crashed immediately, because the other conventions require that parameters are passed in a different order. But no, it was the single most difficult way around to detect :-P

Solving

The basic problem was CH_GLOBAL_LIBRARY expanding to include CDECL (well, actually it doesn't really matter which way around they are, so long as they both agree - but __stdcall is more consistent). When I tracked that one down, and stared at it hard enough, I noticed that there was another definition right next to it that used the API macro instead - which is what I wanted! The decision of which to use was conditional on whether or not the CH_ARCH_32 macro was defined. This macro is defined when compiling for 32-bit Windows (9x/NT4 and up), rather than for 16-bit Windows (3.11/NT3 and down). The code had been written to use API in 16-bit and CDECL in 32-bit, precisely the wrong way around!

Conclusion

And now we come to the final conclusion, the thing that fixed the problem, and why I'm so frustrated (it's always the simplest things that are the most frustrating):

The only thing required to fix it was to add a single exclamation mark!

Hence the title of this page. What made it even more annoying is that I'm positive it was in code that I haven't touched (ever), so it's a bug that must somehow have been inherited from the Pueblo 2.02 source. Yet it couldn't have been present in 2.01, or it would have exhibited the same problems..... sigh.


Well, there you have it, anyway, the full story in most of its gory detail. One programmer's quest in search of the missing exclamation mark. Hope you have more fun reading this than I had experiencing it :-)


uecasm@users.sf.net
Website copyright © 2001-05 by Ultra Enterprises.
Chaco, Pueblo, and the Chaco logo were trademarks of Andromedia, Inc. (About)
Ultra Enterprises, UE, Pueblo/UE, and the UE logo are trademarks of Gavin Lambert. (About)