I'd like to start a discussion on performance issues where I feel we
can do a lot better with reasonable effort. My test case was initially
the XSLT module in our CMS, but with tools such as Shark (an OS X
sampling profiler) I've found some interesting results that I've now
reproduced in isolation.
A fundamental observation is of course that CPUs and memory subsystems
have different performance properties today compared to when core
parts of Pike were written. One particular detail is the frequent use
of modulo (or rather integer division) in hash tables; on x86_64 an
integer division has, if I recall correctly, a latency in the 100+
cycle range. Examples of where this is used are mappings, identifier
lookup, gc passes etc. A very naive replacement of %= in
find_shared_string_identifier() (which typically scores high in an
object-oriented test case like XSLT) improved benchmark runtime by
~3-7% on my Core Duo laptop. My replacement was a very simple pointer
shift/XOR hash taken mod 2^n.
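To make the idea concrete, here is a rough C sketch of the kind of
replacement I mean. The names (hash_pointer, htable_size) are made up
for illustration, and it assumes the table size is a power of two so
the modulo collapses into a mask:

  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative only: bucket selection without an integer division.
   * Assumes htable_size is a power of two, so "% htable_size" becomes
   * "& (htable_size - 1)". The shift/XOR steps mix the pointer bits so
   * the alignment zeroes in the low bits don't cluster the buckets. */
  static inline size_t hash_pointer(const void *p, size_t htable_size)
  {
    uintptr_t v = (uintptr_t)p;
    v ^= v >> 9;    /* fold higher bits down         */
    v ^= v >> 3;    /* mix away the alignment zeroes */
    return (size_t)(v & (htable_size - 1));
  }

Whether that particular bit mix is good enough for Pike's tables would
of course need benchmarking, but the point is that the division
disappears entirely.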
Next, enabling some additional Shark probes on misaligned memory
writes together with the int div probe gave very weird results for
code that I thought would be trivial. It turns out that this:
void fn()
{
  array arg = ({ 1, 2, 3 });
}
...copies arg recursively using a temporary mapping. Ouch! Consider a
real-world example like the following:
string quote_text(string txt, void|array from, void|array to)
{
  from = from || ({ "<", ">", "&" });
  to = to || ({ "&lt;", "&gt;", "&amp;" });
  return replace(txt, from, to);
}
...which semantically uses the inlined arrays as constants if the
caller doesn't pass from/to. In my mind this would just be a refcount
bump, but Grubba pointed out that arrays are not copy-on-write, so that
isn't currently possible. A similar test with mappings still goes
through copy_svalues_recursively_no_free, so apparently the c-o-w
property isn't taken advantage of there either. Adding c-o-w to arrays
and using it in these situations would be really nice.
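To be explicit about what I'm picturing, here is a rough C sketch. The
struct and function names are made up and don't reflect Pike's real
array layout; it's just the refcount-bump-until-written idea:

  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical sketch only -- not Pike's actual array struct or
   * API. Element type and error handling are simplified. */
  struct cow_array {
    int refs;      /* shared reference count */
    size_t size;   /* number of elements     */
    int *items;    /* element storage        */
  };

  /* With copy-on-write, handing the array to someone is just a
   * refcount bump -- this is all quote_text() above would need. */
  static struct cow_array *cow_share(struct cow_array *a)
  {
    a->refs++;
    return a;
  }

  /* Only a destructive write splits off a private copy, and only
   * when someone else still holds a reference. */
  static struct cow_array *cow_make_writable(struct cow_array *a)
  {
    if (a->refs == 1)
      return a;                       /* sole owner, write in place */
    struct cow_array *copy = malloc(sizeof *copy);
    copy->refs = 1;
    copy->size = a->size;
    copy->items = malloc(a->size * sizeof *copy->items);
    memcpy(copy->items, a->items, a->size * sizeof *copy->items);
    a->refs--;
    return copy;
  }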
I also found examples like the zeroing in real_allocate_array() where
gcc (v4.2.1 in my case) generated quite lousy code. A rewrite to a
pointer-based loop with item-chunked initialization was faster. I don't
know if the compiler has to play it safe with the repeated dereferencing
of fields with unknown aliasing effects inside unions, but I'll gladly
help rewrite such code if people agree it's beneficial.
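To show the kind of rewrite I mean, here is a simplified C sketch; the
struct is a stand-in for the real svalue layout and the field names
are made up, but it captures the indexed-vs-pointer difference:

  #include <stddef.h>

  /* Simplified stand-in for Pike's svalue; field names are made up. */
  struct sval {
    short type;
    short subtype;
    union { long integer; void *ptr; } u;
  };

  /* Indexed, field-by-field zeroing: with unknown aliasing the
   * compiler may recompute the element address and store each field
   * separately on every iteration. */
  static void zero_items_indexed(struct sval *item, size_t size)
  {
    for (size_t e = 0; e < size; e++) {
      item[e].type = 0;
      item[e].subtype = 0;
      item[e].u.integer = 0;
    }
  }

  /* Pointer-based, whole-item initialization: build one zeroed item
   * and copy it out while walking a single pointer. */
  static void zero_items_pointer(struct sval *item, size_t size)
  {
    const struct sval zero = { 0, 0, { 0 } };
    for (struct sval *p = item, *end = item + size; p < end; p++)
      *p = zero;
  }

The second variant is roughly the shape that came out faster for me
with gcc 4.2.1.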
Finally, has anyone tried compiling with Clang/llvm-gcc yet? I tested
with llvm-gcc last year and, aside from a miscompilation in one place,
it was a tad slower overall compared to gcc. Maybe recent versions are
better?
Can we make a plan of action on these items, or do we need the elusive
7.9 branch open first?