The thing is, at least Opera Mini and most web servers are not really highly concurrent inside a single process. Instead, multiple processes (2-4 per core, say) are run.
So from the point of view of the server, it's actually rather single-threaded.
That's of course partly because it's harder to write highly efficient, perfectly scaling multithreaded applications, especially in Pike. As in: in Pike/Perl/Python it's currently impossible, in C/Java/C# it's very hard.
But consider the 32-core/128-thread case: basically any shared data can cause a rather severe scalability problem, since any mutex can really mess things up if 127 threads are waiting for it.
And even if we are not waiting, simply reading the mutex, and the cache-coherency traffic needed for that, can slow things down.
This means that you can often just as well use multiple processes with a shared-memory (SHM) chunk, or simply use some kind of RPC.
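To make the multiple-processes-plus-SHM idea a bit more concrete, here is a rough sketch in Python (not Pike, and not how any particular server does it). It assumes a Unix system with fork(). The function name count_in_processes, the slot layout and the 64-byte stride (a guessed cache-line size) are all made up for illustration. Each worker only ever writes to its own slot, so there is no mutex for anyone to wait on.

```python
import mmap
import os
import struct

SLOT = struct.Struct("q")   # one signed 64-bit result per worker
STRIDE = 64                 # assumed cache-line size, so slots do not share a line


def count_in_processes(n_workers, items_per_worker):
    """Fork n_workers processes; each publishes one result into its own SHM slot."""
    # Anonymous mapping: on Unix it is MAP_SHARED by default, so forked
    # children and the parent see the same memory.
    shm = mmap.mmap(-1, STRIDE * n_workers)
    pids = []
    for idx in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Child: compute privately, then write the result once.
            status = 1
            try:
                start = idx * items_per_worker
                total = sum(range(start, start + items_per_worker))
                SLOT.pack_into(shm, idx * STRIDE, total)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        pids.append(pid)
    # Parent: wait for every child, then read the slots without any locking.
    for pid in pids:
        os.waitpid(pid, 0)
    results = [SLOT.unpack_from(shm, i * STRIDE)[0] for i in range(n_workers)]
    shm.close()
    return results
```

Calling count_in_processes(4, 250) returns one partial sum per worker. The only synchronization is the parent waiting for the children to exit, which is about as far from "127 threads waiting on one mutex" as you can get.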
I guess what I am saying is that there should not be _too_ large a penalty for the (rather common) more-or-less single-threaded case.