I'd like to start a discussion on performance issues where I feel we can do a lot better with reasonable effort. My test case was initially the XSLT module in our CMS, but using tools such as Shark (an OS X sampling profiler) I've come across some interesting results that I've since reproduced in isolation.
A fundamental observation is of course that CPUs and memory subsystems have different performance properties today compared to when core parts of Pike were written. One particular detail is the frequent use of modulo (or rather integer division) in hash tables, which on x86_64 (if I recall correctly) has a latency in the 100+ cycle range. Examples of where this is used are mappings, identifier lookup, gc passes, etc. A very naive replacement of %= in find_shared_string_identifier() (which typically scores high in an object-oriented test case like XSLT) improved benchmark runtime by roughly 3-7% on my Core Duo laptop. My replacement was a very simple pointer shift/XOR modulo 2^n.
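To illustrate the kind of change I mean, here's a minimal C sketch (the names are invented for the example, not the actual Pike code); it assumes the hash table size is kept at a power of two:

#include <stdint.h>

/* Old style: one integer division (div/idiv) per lookup. */
static inline size_t bucket_mod(const void *ptr, size_t hashsize)
{
  return (uintptr_t)ptr % hashsize;
}

/* New style: drop the alignment bits, fold with XOR, mask.  The division
   disappears entirely, but hashsize_pow2 must be a power of two. */
static inline size_t bucket_mask(const void *ptr, size_t hashsize_pow2)
{
  uintptr_t h = (uintptr_t)ptr >> 3;   /* low bits are zero from alignment */
  h ^= h >> 16;                        /* mix higher bits into the index   */
  return h & (hashsize_pow2 - 1);
}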
Next, enabling some additional Shark probes for misaligned memory writes, together with the integer division probe, gave very odd results for code I thought would be trivial. It turns out that this:
void fn() { array arg = ({ 1, 2, 3 }); }
...copies arg recursively using a temporary mapping. Ouch! Consider a real-world example like the following:
string quote_text(string txt, void|array from, void|array to)
{
  from = from || ({ "<", ">", "&" });
  to = to || ({ "&lt;", "&gt;", "&amp;" });
  return replace(txt, from, to);
}
...which semantically uses the inlined arrays as constants if the caller doesn't pass from/to. In my mind this would just be a refcount bump, but Grubba pointed out that arrays are not copy-on-write, so that isn't currently possible. A similar test with mappings still goes through copy_svalues_recursively_no_free, so apparently the c-o-w property isn't taken advantage of there either. Adding c-o-w to arrays and using it in these situations would be really nice.
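Just to sketch what I'm after (invented names, not Pike's real array struct or API): with c-o-w, aliasing a constant array is only a refcount bump, and the copy happens lazily if and when somebody actually writes to it:

#include <stdlib.h>
#include <string.h>

struct cow_array {
  int refs;        /* shared reference count           */
  size_t size;
  int *items;      /* placeholder for the real svalues */
};

/* Reading/aliasing a constant array: no copy, just another reference. */
static struct cow_array *share(struct cow_array *a)
{
  a->refs++;
  return a;
}

/* Writing: copy only if someone else still holds a reference. */
static struct cow_array *make_writable(struct cow_array *a)
{
  if (a->refs == 1)
    return a;                                  /* sole owner, mutate in place */

  struct cow_array *copy = malloc(sizeof *copy);
  copy->refs = 1;
  copy->size = a->size;
  copy->items = malloc(a->size * sizeof *copy->items);
  memcpy(copy->items, a->items, a->size * sizeof *copy->items);
  a->refs--;                                   /* release our share of the old one */
  return copy;
}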
I also found examples like the zeroing in real_allocate_array(), where gcc (4.2.1 in my case) generated quite lousy code. A rewrite to a pointer-based loop with item-chunked initialization was faster. I don't know whether the compiler has to play it safe with the repeated dereferencing of fields with unknown aliasing effects inside unions, but I'll gladly help rewrite such code if people agree it's beneficial.
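For reference, this is roughly the shape of the rewrite I mean, sketched with a fake svalue struct rather than the actual Pike code: hoist the item pointer into a local once and store whole items, instead of re-reading the array fields and writing field by field on every iteration:

#include <stddef.h>

/* Stand-in for Pike's svalue; the real layout differs. */
struct fake_svalue {
  short type;
  short subtype;
  union { long integer; void *ptr; } u;
};

static void zero_items(struct fake_svalue *base, size_t count)
{
  const struct fake_svalue zero = { 0, 0, { 0 } };
  struct fake_svalue *p = base;
  struct fake_svalue *end = base + count;

  /* One struct-sized store per item through a local pointer, so there is
     nothing the compiler has to conservatively re-read between iterations. */
  while (p < end)
    *p++ = zero;
}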
Finally, has anyone tried compiling with Clang/llvm-gcc yet? I tested with llvm-gcc last year, and aside from a miscompilation in one place it was a tad slower overall compared to gcc. Maybe recent versions are better?
Can we put together a plan of action for these items, or do we need the elusive 7.9 branch open first?