Yes, the fact that the threads in both our server cases are mostly independent is a key reason why a multi-cpu capable pike is a worthwhile improvement.
Even in these cases there are clear benefits to offloading the freeing to another thread (which presumably can run on another cpu most of the time). Consider the work needed to handle a thing allocated on the heap:
1. Allocate the thing.
2. Write references to it on the stack and in other things.
3. Increment and decrement its refcounter.
4. Free it.
With mark-and-sweep gc, item 3 disappears completely and item 4 is offloaded to a different cpu. This obviously saves time, even if the thing is completely thread local.
But consider the 32-core/128-thread case: basically any shared data can cause a rather severe scalability problem, since a single mutex can really mess things up when 127 threads are waiting for it.
Yes, that's why I consider any such lock unacceptable, and that's why lock-free (and preferably also memory-fence-free) algorithms are so sexy. The pike core must not have any lock or other hotspot that every thread is forced to access with any significant frequency. There might be such things in the OS (which we can't do anything about), and there might be such things in the pike app, but that's up to the pike app to solve - the core can only provide adequate tools to do it.