Looks a lot like https://bugzilla.roxen.com/bugzilla/show_bug.cgi?id=5072. Here's a bit of what I wrote in that ticket:
Since the previous check didn't catch anything, that means refs are changed while the object is in the gc queue. That's an indication that it has something to do with the clearing of weak references, probably from a mapping since that data structure has the most complex weak ref handling. I still don't understand the chain of events that leads up to the failure, though. Knowing which object is involved could give a hint, and it's the best I can think of right now. The gc internal debug logging would provide the really useful info, but it's not practical to turn on in a production environment since it can easily log half a gig or more for a single gc run.
This is from a fresh 7.8? More specifically, one from cvs after Nov 28th, when I added some debug for [bug 5072]? Can you run with a version compiled using --with-rtldebug? It's not that much slower.
Anyone (mast?) have an idea where to dig for the program_id mentioned?
The id is bigger than PROG_DYNAMIC_ID_START, so it's a program without a fixed id. It's also the fourth program that got registered by low_allocate_program, so it's probably a C program. You could set a breakpoint there and check current_program_id to find out which it is.
Or do I need to add more logging to the gc.c file?
What would be useful is debug output whenever gc_mark_enqueue runs: log the pointer (data) and the number of refs (*(INT32 *) data). But that's probably not feasible if your server has a big memory footprint; not even GC_VERBOSE logs that.
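A minimal sketch of that kind of logging, not actual gc.c code. It assumes INT32 is a 32-bit signed integer and that the refcount is the first field of the thing, as the cast above implies. The helper names format_enqueue_record and log_enqueue are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for Pike's INT32; assumed to be a 32-bit signed integer. */
typedef int32_t INT32;

/* Format one log record for a thing about to go on the mark queue.
 * Relies on the refcount being the first field of the thing, which is
 * what the *(INT32 *) data cast implies. Returns the snprintf result. */
int format_enqueue_record(char *buf, size_t size, void *data)
{
  return snprintf(buf, size, "gc_mark_enqueue: data=%p refs=%ld",
                  data, (long) *(INT32 *) data);
}

/* Hypothetical hook: would be called at the top of gc_mark_enqueue
 * with that function's data argument. */
void log_enqueue(FILE *out, void *data)
{
  char buf[96];
  format_enqueue_record(buf, sizeof buf, data);
  fprintf(out, "%s\n", buf);
}
```

Each record is one short line, but with one line per enqueued thing the volume still scales with the number of things the gc visits, which is the feasibility concern above.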
It would probably be interesting to find out when this object was created.
Yes, that could give a clue, but the really valuable info is what kind of things were referencing the object, e.g. some weak mapping as suspected above.
I think it should be possible to add another field to the object which logs the time of creation. The backtrace from that moment would be interesting too. Does anyone have a better idea than storing a backtrace for every object as soon as it is created, and then discarding it (to conserve space) as soon as the object gets a reference?
--with-dmalloc-c-stack-trace does that, but then you're running with dmalloc, which is probably too slow. It should be possible to rip out the location tracking from dmalloc so that it only logs this. Then it could be fast enough. But afaik there's no ready-made define for that.
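For comparison, the cheaper scheme asked about above (a creation timestamp plus a C backtrace that is dropped on the first reference) could look roughly like this. It is only a sketch under several assumptions: it uses glibc's backtrace() from execinfo.h rather than anything in dmalloc or Pike, the struct and function names are hypothetical, and the refcount is started at 0 purely for illustration.

```c
#include <execinfo.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define MAX_FRAMES 16

/* Hypothetical extra bookkeeping attached to an object. */
struct tracked_object {
  int32_t refs;        /* refcount; starts at 0 in this sketch */
  time_t created;      /* time of creation */
  void **creation_bt;  /* saved return addresses, or NULL once discarded */
  int bt_len;          /* number of saved frames */
};

/* Record the creation time and the C backtrace at creation. */
void track_creation(struct tracked_object *o)
{
  void *frames[MAX_FRAMES];
  int i, n = backtrace(frames, MAX_FRAMES);

  o->refs = 0;
  o->created = time(NULL);
  o->bt_len = 0;
  o->creation_bt = n > 0 ? malloc(n * sizeof(void *)) : NULL;
  if (o->creation_bt) {
    for (i = 0; i < n; i++) o->creation_bt[i] = frames[i];
    o->bt_len = n;
  }
}

/* Add a reference and drop the saved backtrace to conserve space. */
void track_add_ref(struct tracked_object *o)
{
  o->refs++;
  free(o->creation_bt);
  o->creation_bt = NULL;
  o->bt_len = 0;
}
```

While the backtrace is still present, backtrace_symbols_fd() from the same header can print the saved frames. The per-object cost is one malloc and up to MAX_FRAMES pointers until the first reference arrives.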