Good point.
I was thinking of a way to keep the immediate-destruct semantic: threads could do a "micro gc" on their own thread-local data on each evaluator callback call (i.e. a bit like the current destruct_objects_to_destruct calls). These micro gc's would have to run very quickly, though, which probably rules out the generational gc approach with mark-and-sweep for young data (refcount-based garbing remains basically equally efficient regardless of how often it runs, while mark-and-sweep does not).
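To make the idea concrete, here's a minimal sketch (in Python, with invented names - Pike's actual implementation is in C and works differently) of what such a per-thread micro gc could look like: each thread collects its own zero-ref objects in a thread-local list and destructs them on the next evaluator callback, so the cost stays proportional to the garbage actually produced rather than to the amount of live young data:

```python
import threading

class Obj:
    """Toy refcounted object with a destructor hook (hypothetical)."""
    def __init__(self, name, on_destruct):
        self.name = name
        self.refs = 1
        self.on_destruct = on_destruct

# Each thread keeps its own list of objects whose refcount hit zero,
# so the micro gc never needs to take any locks.
_local = threading.local()

def _pending():
    if not hasattr(_local, "pending"):
        _local.pending = []
    return _local.pending

def sub_ref(obj):
    obj.refs -= 1
    if obj.refs == 0:
        # Queue the destruct for the next micro gc on this thread
        # instead of running it in the middle of whatever we're doing.
        _pending().append(obj)

def micro_gc():
    # Runs on each evaluator callback: walks only this thread's own
    # zero-ref objects. Unlike mark-and-sweep, running it very often
    # costs nothing extra - an empty list is a no-op.
    pending = _pending()
    while pending:
        obj = pending.pop()
        obj.on_destruct(obj.name)
```

The key property is that a mutex lock held only on the stack gets destructed (and thus released) on the very next micro gc after its last ref goes away, preserving the immediate-destruct semantic for thread-local data.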
This would mean that the immediate-destruct semantic works as long as the data is thread local, which is true in the mutex-lock-on-the-stack scenario. It's also true in most cases where e.g. arrays are built on the stack using +=. However, cases like
my_map[key] += ({another_value});
would not be destructive on the array values if my_map is shared. But that case is hopeless anyway in a multi-cpu world, since the array value can always get refs from other threads asynchronously (prohibiting that would require locking, which would be much worse).
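The single-ref check that makes this decision can be sketched like so (again Python with invented names, just to illustrate the principle): mutate in place only when we hold the sole ref, otherwise leave the shared value intact and build a copy:

```python
class Array:
    """Toy refcounted array value (hypothetical model)."""
    def __init__(self, items, refs=1):
        self.items = list(items)
        self.refs = refs

def append(arr, value):
    # Single-ref destructive optimization: mutating in place is only
    # safe when nothing else - no other variable, no other thread -
    # can see the array.
    if arr.refs == 1:
        arr.items.append(value)   # destructive, no copy
        return arr
    # Shared, e.g. reachable through a mapping another thread holds:
    # drop our ref on the original and return a modified copy.
    arr.refs -= 1
    return Array(arr.items + [value])
```

This is why the my_map case above can't be destructive: the array value sits with refs > 1 (the mapping plus, potentially, other threads), so += has to take the copying path.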
To allow destructive updates in such cases, it'd be necessary to introduce some kind of new construct so that the pike programmer can explicitly allow (but not always expect) destructive updates regardless of extra refs.
However, ditching mark-and-sweep for young data comes at a cost. The paper I linked to has measured that a purely refcounting collector is 20-30% slower when the number of concurrent threads gets above 3 on a 4-cpu box (see pages 80-81). This slowdown is measured over the total throughput of a benchmark, so it's not just "the gc itself".
Note that this is a comparison between two gc's where the only difference is the mark-and-sweep for young data - the purely refcounting collector in this case is still a whole lot more efficient than the current one in pike, due to the drastically lowered refcount update frequency. I haven't seen any comparisons between the delayed-update refcount gc and an immediate-update one like pike currently uses, but I suspect that the difference is substantial there already.
So the options we're considering here are either keeping the immediate-destruct semantic and some single-ref destructive optimizations, at a cost of (conservatively speaking) at least 15% overall performance in high-concurrency server apps, or giving that semantic up in favor of the hybrid gc with mark-and-sweep for young data. I don't think the single-ref destructive optimizations can make up for that performance hit (and in the longer run they can be achieved anyway with new language constructs). Still, keeping the immediate-destruct semantic is worth something from a compatibility view.