I'd like to start a discussion on performance issues where I feel we can do a lot better with reasonable effort. My test case was initially the XSLT module in our CMS but with tools such as Shark (an OS X sampling profiler) I've encountered some interesting results that I've now reproduced in isolation.
A fundamental observation is of course that CPUs and memory subsystems have different performance properties today than when core parts of Pike were written. One particular detail is the frequent use of modulo (or rather integer division) in hash tables, which on x86_64 (if I recall correctly) has a latency in the 100+ cycle range. Examples of where this is used are mappings, identifier lookup, gc passes etc. A very naive replacement of %= in find_shared_string_identifier() (which typically scores high in an object-oriented test case like XSLT) improved benchmark runtime ~3-7% on my Core Duo laptop. My replacement was a very simple pointer shift/XOR mod 2^n.
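To illustrate the kind of replacement I mean, here is a minimal sketch (the helper names are hypothetical, not the actual find_shared_string_identifier() code): with a power-of-two table size the division becomes a bitwise AND, and a pointer can be mixed into a bucket index with shifts and XORs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic scheme: one integer division per lookup. */
static size_t bucket_mod(uintptr_t h, size_t hashsize)
{
    return h % hashsize;
}

/* Sketch of the cheap variant: hashsize_pow2 must be a power of two.
 * The low bits of a pointer are poor hash input (allocations are
 * aligned), so shift/XOR them together before masking. */
static size_t bucket_mask(uintptr_t p, size_t hashsize_pow2)
{
    uintptr_t h = (p >> 4) ^ (p >> 11);
    return h & (hashsize_pow2 - 1);
}
```

The mask variant trades a 20-100+ cycle divide for a couple of single-cycle ALU ops, at the cost of requiring 2^n table sizes.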
Next, enabling some additional Shark probes on misaligned memory writes together with the int div probe gave very weird results for code that I thought would be trivial. It turns out that this:
void fn() { array arg = ({ 1, 2, 3 }); }
...copies arg recursively using a temporary mapping. Ouch! Consider a real-world example like the following:
string quote_text(string txt, void|array from, void|array to)
{
  from = from || ({ "<", ">", "&" });
  to = to || ({ "&lt;", "&gt;", "&amp;" });
  return replace(txt, from, to);
}
...which semantically uses the inlined arrays as constants if the caller doesn't pass from/to. In my mind this would just be a refcount bump but Grubba pointed out that arrays are not copy-on-write so that isn't currently possible. A similar test with mappings still passes copy_svalues_recursively_no_free so apparently the c-o-w property isn't taken advantage of there either. Adding c-o-w to arrays and using it in these situations would be really nice.
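For reference, copy-on-write sharing along these lines could look roughly like the sketch below (a minimal illustration in C; cow_array and friends are hypothetical names, not Pike's struct array). The "copy" is just a refcount bump; a real copy only happens when someone writes through a shared reference.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical copy-on-write array. */
struct cow_array {
    int refs;
    size_t size;
    int *items;
};

static struct cow_array *cow_new(size_t size)
{
    struct cow_array *a = malloc(sizeof *a);
    a->refs = 1;
    a->size = size;
    a->items = calloc(size, sizeof *a->items);
    return a;
}

/* O(1): the cheap "copy" that ({ 1, 2, 3 }) assignment could use. */
static struct cow_array *cow_copy(struct cow_array *a)
{
    a->refs++;
    return a;
}

/* Called before any destructive operation on the array. */
static struct cow_array *cow_make_writable(struct cow_array *a)
{
    if (a->refs == 1)
        return a;                  /* sole owner: mutate in place */
    struct cow_array *b = cow_new(a->size);
    memcpy(b->items, a->items, a->size * sizeof *a->items);
    a->refs--;                     /* detach from the shared copy */
    return b;
}
```

The catch Grubba points out is that all existing code that mutates arrays would have to call the make-writable step first, which is the invasive part.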
I also found examples like the zeroing in real_allocate_array() where gcc (v4.2.1 in my case) generated quite lousy code. A rewrite to a pointer-based loop with item-chunked initialization was faster. Don't know if the compiler has to play safe with the repeated dereferencing of fields with unknown aliasing effects inside unions but I'll gladly help rewrite such code if people agree it's beneficial.
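The shape of the rewrite I have in mind looks like this (illustrative only; struct sval is a stand-in for Pike's struct svalue): instead of a memset followed by a loop poking the type field of every element, build one fully initialized element and assign it whole, so the compiler emits simple sequential stores and never has to re-read fields it might consider aliased.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct svalue; layout is illustrative. */
struct sval {
    short type;     /* 0 plays the role of T_INT here */
    short subtype;
    long value;
};

static const struct sval sval_int_zero = { 0, 0, 0 };

/* Pointer-based loop doing whole-struct stores. */
static void init_svals(struct sval *item, size_t n)
{
    struct sval *end = item + n;
    while (item < end)
        *item++ = sval_int_zero;
}
```

With a single constant source value the compiler is free to vectorize or unroll this, whereas the field-at-a-time version forced it to play safe.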
Finally, anyone try compiling with Clang/llvm-gcc yet? I tested with llvm-gcc last year and aside from miscompilation in one place it was a tad slower overall compared to gcc. Maybe recent versions are better?
Can we make a plan for action on these items or do we need the elusive 7.9 branch open first?
We really really want a 7.9 anyway.
But since that's not likely to happen I'll bring up some performance issues as well:
1: In the Opera Mini code the function using most CPU time is the one indexing an object (using an identifier id, not string).
It might be worth looking into why this is the case. One thing that could be fixed is that this:
final constant A = 10;
bool foo(int i) { return i == A; }
actually indexes out 'A' each time the code is run. So using enums and other symbolic 'constants' is rather bad for performance.
Note that:
switch(i) { case A: return true; } return false;
is actually optimized above, which would perhaps have been a bug if 'A' was not final. :-)
2: The machinecode generation code is really really really bad.
It would help to inline things like
o index a string with an integer
o index an array with an integer
o index an object with the same index you just indexed it with (keep a small cache for indexing, have assembler code for checking the case, then a 'slow' fallback version)
o 'simple' for loops
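The object-indexing cache in that list is essentially a monomorphic inline cache. A sketch in C (all names hypothetical; the real thing would live in the machine-code glue, with the fast path emitted as assembler): remember the last (program, name) pair resolved at this site and reuse the answer when the same program shows up again.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a compiled program's identifier table. */
struct program { const char **names; int num; };

/* One cache per indexing call site. */
struct inline_cache {
    const struct program *cached_prog;  /* program seen last time */
    const char *cached_name;            /* pointer compare works for shared strings */
    int cached_id;                      /* resolved identifier number */
};

/* The 'slow' fallback: scan the identifier table. */
static int slow_lookup(const struct program *p, const char *name)
{
    for (int i = 0; i < p->num; i++)
        if (!strcmp(p->names[i], name))
            return i;
    return -1;
}

static int index_cached(struct inline_cache *ic,
                        const struct program *p, const char *name)
{
    if (ic->cached_prog == p && ic->cached_name == name)
        return ic->cached_id;           /* fast path: two pointer compares */
    int id = slow_lookup(p, name);
    ic->cached_prog = p;
    ic->cached_name = name;
    ic->cached_id = id;
    return id;
}
```

The win is that in typical object-oriented code the same call site keeps seeing objects of the same program, so the fast path hits almost every time.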
Well. I guess that's enough for now, although I initially did write a long text about lessons learned from compiling javascript to assembler. :-)
o index an object with the same index you just indexed it with (keep a small cache for indexing, have assembler code for checking the case, then a 'slow' fallback version)
I made an attempt at this a while ago but gave up when I noticed that refcounting rules for objects and functions are really messy and practically impossible to shortcut in the current model.
And yes, more revolutionary ideas would be fun to discuss sometimes. I glanced at the LLVM tutorial for integrating its SSA/code-gen libraries, and if MacRuby can use it, can't Pike?
I've got a patch for copy_svalues_recursively_no_free() that I'd like to commit. The biggest gain is that the temporary mapping is only allocated when recursing into arrays/mappings/multisets with complex elements, and we now avoid placing simple elements in the mapping where they serve no purpose. Comments welcome on any breakage that this would lead to.
(Two small bonus changes; Grubba's fix to real_allocate_array() wasn't that memory-friendly, and there was an odd add-ref/sub-ref/add-ref sequence when copying empty arrays.)
Index: svalue.c
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.8/src/svalue.c,v
retrieving revision 1.260
diff -u -w -r1.260 svalue.c
--- svalue.c	27 May 2010 23:17:09 -0000	1.260
+++ svalue.c	8 Jul 2010 00:19:42 -0000
@@ -1896,49 +1896,53 @@
 {
   ONERROR err;
   int allocated_here = 0;
-  if (!m) {
-    m = allocate_mapping(num);
-    allocated_here = 1;
-    SET_ONERROR(err, do_free_mapping, m);
-  }
   while(num--)
   {
     struct svalue *tmp;
+    int from_type = from->type;

     check_svalue_type (from);
     check_refs(from);

-    if ((tmp = low_mapping_lookup(m, from))) {
+    if (from_type == T_ARRAY ||
+        from_type == T_MAPPING ||
+        from_type == T_MULTISET) {
+      /* Recursive data */
+      if (m && (tmp = low_mapping_lookup(m, from))) {
         *to = *tmp;
         if (tmp->type <= MAX_REF_TYPE) add_ref(tmp->u.dummy);
-      to++;
-      from++;
-      continue;
+      } else {
+#define ALLOC_DUPL_MAPPING(type_hint)                           \
+        do if (!m && (type_hint) & BIT_COMPLEX) {               \
+            m = allocate_mapping(num && 1);                     \
+            allocated_here = 1;                                 \
+            SET_ONERROR(err, do_free_mapping, m);               \
+          } while (0)
+
+        if (from_type == T_ARRAY) {
+          struct array *ar = from->u.array;
+          ALLOC_DUPL_MAPPING(ar->type_field);
+          to->u.array = copy_array_recursively(ar, m);
+        } else if (from_type == T_MAPPING) {
+          struct mapping *ma = from->u.mapping;
+          ALLOC_DUPL_MAPPING(m_ind_types(ma) | m_val_types(ma));
+          to->u.mapping = copy_mapping_recursively(ma, m);
+        } else {
+          struct multiset *mu = from->u.multiset;
+          ALLOC_DUPL_MAPPING(multiset_ind_types(mu) | multiset_val_types(mu));
+          to->u.multiset = copy_multiset_recursively(mu, m);
         }
+        to->type = from_type;

-    switch(from->type)
-    {
-    default:
+        /* Only store pair if there's a chance we'll look at it later */
+        if (m && (!allocated_here || num))
+          mapping_insert(m, from, to);
+      }
+    } else {
       *to=*from;
-      if(from->type <= MAX_REF_TYPE) add_ref(from->u.array);
-      break;
-
-    case T_ARRAY:
-      to->u.array=copy_array_recursively(from->u.array,m);
-      to->type=T_ARRAY;
-      break;
-
-    case T_MAPPING:
-      to->u.mapping=copy_mapping_recursively(from->u.mapping,m);
-      to->type=T_MAPPING;
-      break;
-
-    case T_MULTISET:
-      to->u.multiset=copy_multiset_recursively(from->u.multiset,m);
-      to->type=T_MULTISET;
-      break;
+      if (from_type <= MAX_REF_TYPE) add_ref(from->u.array);
     }
-    mapping_insert(m, from, to);
+    to++;
     from++;
   }
Index: array.c
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.8/src/array.c,v
retrieving revision 1.227
diff -u -w -r1.227 array.c
--- array.c	1 Jul 2010 09:16:05 -0000	1.227
+++ array.c	8 Jul 2010 00:19:43 -0000
@@ -80,7 +80,6 @@
 		  ptrdiff_t extra_space)
 {
   struct array *v;
-  ptrdiff_t e;

   if(size+extra_space == 0)
   {
@@ -116,9 +115,11 @@
   INIT_PIKE_MEMOBJ(v);
   DOUBLELINK (first_array, v);

-  MEMSET(v->real_item, 0, sizeof(struct svalue) * size);
-  for(e=0;e<size;e++) {
-    v->item[e].type=T_INT;
+  {
+    struct svalue *item = ITEM(v);
+    struct svalue *item_end = item + v->size;
+    while (item < item_end)
+      *item++ = svalue_int_zero;
   }

   return v;
@@ -2432,12 +2433,14 @@
 #endif

   if (!a->size) {
-    add_ref(&empty_array);
-    return array_set_flags(&empty_array, a->flags);
+    ret = (a->flags & ARRAY_WEAK_FLAG) ? &weak_empty_array : &empty_array;
+    add_ref(ret);
+    return ret;
   }

   ret=allocate_array_no_init(a->size,0);

+  if (m) {
   aa.type = T_ARRAY;
   aa.subtype = 0;
   aa.u.array = a;
@@ -2445,6 +2448,7 @@
   bb.subtype = 0;
   bb.u.array = ret;
   low_mapping_insert(m, &aa, &bb, 1);
+  }

   ret->flags = a->flags & ~ARRAY_LVALUE;

Index: mapping.c
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.8/src/mapping.c,v
retrieving revision 1.216
diff -u -w -r1.216 mapping.c
--- mapping.c	31 May 2010 10:18:26 -0000	1.216
+++ mapping.c	8 Jul 2010 00:19:43 -0000
@@ -2153,6 +2153,7 @@

   ret=allocate_mapping(MAP_SLOTS(m->data->size));

+  if (p) {
   aa.type = T_MAPPING;
   aa.subtype = 0;
   aa.u.mapping = m;
@@ -2160,6 +2161,7 @@
   bb.subtype = 0;
   bb.u.mapping = ret;
   mapping_insert(p, &aa, &bb);
+  }

   ret->data->flags = m->data->flags;

Index: multiset.c
===================================================================
RCS file: /pike/data/cvsroot/Pike/7.8/src/multiset.c,v
retrieving revision 1.119
diff -u -w -r1.119 multiset.c
--- multiset.c	28 Nov 2009 13:36:20 -0000	1.119
+++ multiset.c	8 Jul 2010 00:19:44 -0000
@@ -3745,6 +3745,7 @@
   add_ref (new.msd2 = msd);
   SET_ONERROR (uwp, free_tree_build_data, &new);

+  if (p) {
   aa.type = T_MULTISET;
   aa.subtype = 0;
   aa.u.multiset = l;
@@ -3752,6 +3753,7 @@
   bb.subtype = 0;
   bb.u.multiset = new.l;
   mapping_insert(p, &aa, &bb);
+  }

   node = low_multiset_first (msd);
   pos = 0;
On Thu, Jul 8, 2010 at 2:30 AM, Jonas Walldén @ Pike developers forum 10353@lyskom.lysator.liu.se wrote:
I've got a patch for copy_svalues_recursively_no_free() that I'd like to commit. The biggest gain is that the temporary mapping is only allocated when recursing into arrays/mappings/multisets with complex elements, and we now avoid placing simple elements in the mapping where they serve no purpose. Comments welcome on any breakage that this would lead to.
Do you have any numbers showing it gets faster?
The first program below illustrates the overhead of array copying and zeroing:
int fn1() { array arg = ({ "a", "b", "c" }); return sizeof(arg); }
int fn2() { array arg = allocate(4711); return sizeof(arg); }
void main()
{
  int dummy;
  int t0 = gethrvtime();
  for (int i = 0; i < 100000; i++) dummy += fn1();
  int t1 = gethrvtime();
  for (int i = 0; i < 100000; i++) dummy += fn2();
  int t2 = gethrvtime();
  werror("fn1: %d ms\n", (t1 - t0) / 1000);
  werror("fn2: %d ms\n", (t2 - t1) / 1000);
}
Sample test run (I repeated many times and it was +/- 2 msec):
BEFORE                   AFTER
---------------------    ---------------------
fn1:  159 ms             fn1:  36 ms    -77%
fn2: 1194 ms             fn2: 777 ms    -34%
Next, a large XSL transform which exercises a lot of complex data structures (again repeatable across several runs):
BEFORE                   AFTER
---------------------    ---------------------
Parse XML:  82 ms        Parse XML:  80 ms
Parse XSL:   1 ms        Parse XSL:   1 ms
Transform: 289 ms        Transform: 246 ms   -15%
Total:     373 ms        Total:     328 ms
Another heavy XSL transform (less data, more complex rules):
BEFORE                   AFTER
---------------------    ---------------------
Parse XML:  17 ms        Parse XML:  17 ms
Parse XSL:   4 ms        Parse XSL:   4 ms
Transform: 194 ms        Transform: 176 ms    -9%
Total:     217 ms        Total:     198 ms
All compiled with gcc 4.2.1 and run on a 2 GHz quad-core x86_64 Xeon with a 1.33 GHz system bus and 667 MHz DDR2 RAM.
So, something like this then:
--- /dev/null
+++ Tools.pmod/Shoot.pmod/ArrayCopy.pike	2010-07-08 13:21:47.908766180 +0200
@@ -0,0 +1,17 @@
+#pike __REAL_VERSION__
+inherit Tools.Shoot.Test;
+
+constant name="Array Copy";
+
+void perform()
+{
+  int dummy;
+  for (int i = 0; i < 100000; i++)
+    dummy += arraycopy();
+}
+
+int arraycopy()
+{
+  array arg = ({ "a", "b", "c" });
+  return sizeof(arg);
+}
--- /dev/null
+++ Tools.pmod/Shoot.pmod/ArrayZero.pike	2010-07-08 13:16:35.179720588 +0200
@@ -0,0 +1,17 @@
+#pike __REAL_VERSION__
+inherit Tools.Shoot.Test;
+
+constant name="Array Zero";
+
+void perform()
+{
+  int dummy;
+  for (int i = 0; i < 100000; i++)
+    dummy += arrayzero();
+}
+
+int arrayzero()
+{
+  array arg = allocate(4711);
+  return sizeof(arg);
+}
Feel free to add the rest of your tests to the Shootout before adding the optimizations, so Pikefarm can track how the different platforms react to them.
In general: if people have performance tests, add them to the benchmark. It's immensely helpful to track regressions, and it is only run by Pikefarm, so you do not risk breaking anything important by checking in all kinds of experimental stuff there.
Good suggestion, I wasn't aware Pikefarm ran benchmarks all the time. Are there any tools on the site to chart the results?
Unfortunately not. It's been on the list of things that someone(tm) should do since forever.
Someone(tm) could probably hack together some basic graphs based on benchmark.txt in an hour, but getting something that feels complete probably takes a day or two.
On Fri, Jul 09, 2010 at 08:45:02AM +0000, Peter Bortas @ Pike developers forum wrote:
Unfortunately not. It's been on the list of things that someone(tm) should do since forever.
Someone(tm) could probably hack together some basic graphs based on benchmark.txt in an hour, but getting something that feels complete probably takes a day or two.
are the results logged somewhere? (where?)
greetings, martin.
On the result page for each _recent_ build there is a link to "benchmark.txt". Unfortunately that file is deleted together with all other result files when it gets old.
http://pike.ida.liu.se/development/pikefarm/7.8.xml
-> Click status light for (welcome.linkeo.intr:build 1094) =>
http://pike.ida.liu.se/development/pikefarm/result.xml?id=1094_261&pike=...
-> Click "benchmark.txt" =>
http://pike.ida.liu.se/generated/pikefarm/7.8/1094_261/benchmark.txt
If you want to automate it without all the clicking the URL consists of
base    = http://pike.ida.liu.se/generated/pikefarm/
pikever = 7.8/
buildnr = 1094_
hostnr  = 261
file    = /benchmark.txt
You'll still have to scrape the main page for host details and new hosts now and then though.
This weekend I did some hacking for fun and created something that fetched all benchmarks and showed them in a table.
However, the server I was going to use it on was running old cruft. As soon as I have an environment to set it up on I can show it.
Nice! Does "_261" mean that it's based on the Darwin build box? Also, what determines the column order?
Yes, 261 is welcome.linkeo.intra.
I have the other hosts downloaded and parsed as well, but I could not display them all at once, so at the moment it's hardcoded to show that host.
Column order? Probably the order the list of directories is returned from the readdir API. I guess you want the column header to only display the build number, sorted with the highest build number to the left. No problem to fix that.
Unfortunately a benchmark file does not have any date information. That would be nice. To get that I need to scrape the result page as well.
Next to fix is that the server should scrape the page for new benchmarks periodically. At the moment it just does that at startup. I was too tired to fix that.
Here be fancy javascript. In my FF 3.6 I just get a title "ShootPike" and nothing else..?
Yes, it works. I was just about to mention the error message about the console when it disappeared.
Sometimes debugging code is included in the commits. Probably the most common bug in frontend coding next to spellos. :-)
In the last episode (Jul 12), Martin Stjernholm, Roxen IS @ Pike developers forum said:
Here be fancy javascript. In my FF 3.6 I just get a title "ShootPike" and nothing else..?
There's a call to console.log() which means Tor has firebug installed :) After enabling firebug, I see that the raw data gets fetched but the table isn't built. I don't know enough YUI 3 yet to know what causes it to fail. Tor?
In the Firebug HTML tab, do you see something like this:
<body class="yui-skin-sam">
  <h1>ShootPike</h1>
  <div id="table" class="yui-dt">
    <div class="yui-dt-mask" style="display: none;"></div>
    <table summary="">
      <colgroup>
      <thead>
      <thead>
      <tbody>
    </table>
etc...
In the last episode (Jul 12), Tor Edvardsson @ Pike developers forum said:
In Firebug HTML tab, do you something like this
<body class="yui-skin-sam">
  <h1>ShootPike</h1>
  <div id="table" class="yui-dt">
    <div class="yui-dt-mask" style="display: none;"></div>
    <table summary="">
      <colgroup>
      <thead>
      <thead>
      <tbody>
    </table>
etc...
Initially all I saw was the <div id="table"></div> block, but it looks like it's working now.
It's a good start. Besides the obvious things, it'd be cool to see some sort of visualization when a measurement has gone up or down a lot. Maybe through the background color, e.g. a continuous scale where half the time is deep green, equal time is white, and double the time is deep red.
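The suggested scale is easy to sketch (purely illustrative; the dashboard itself would do this in its own language, and the exact clamping points are my assumption): fade white toward green as the new/old time ratio drops toward 0.5, and white toward red as it climbs toward 2.0.

```c
#include <assert.h>

/* Map ratio = new_time / old_time to an RGB background color:
 * 1.0 -> white, <= 0.5 -> saturated green, >= 2.0 -> saturated red. */
static void ratio_to_rgb(double ratio, int *r, int *g, int *b)
{
    if (ratio < 1.0) {
        /* Faster: fade white -> green as ratio goes 1.0 -> 0.5. */
        double t = ratio < 0.5 ? 0.0 : (ratio - 0.5) / 0.5;
        *r = *b = (int)(255 * t);
        *g = 255;
    } else {
        /* Slower: fade white -> red as ratio goes 1.0 -> 2.0. */
        double t = ratio > 2.0 ? 0.0 : 2.0 - ratio;
        *g = *b = (int)(255 * t);
        *r = 255;
    }
}
```

Note the scale is deliberately asymmetric in ratio space (0.5 and 2.0 are both "2x"), which matches the green/red endpoints suggested above.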
That would be nice and probably not impossible to do.
I was thinking of also including a graph that pops up if you click the test name.
Thanks. I wanted to double-check the numbers before saying that you should switch from "total" to "user" in the scraping code. The total value includes the launch time of Pike itself and is irrelevant to the particular benchmark that is run.
One thing about the benchmarks: the runtime is more or less constant for most of them, since the benchmark framework tries to normalize it.
It's more relevant to count the number of X per second (the last column in the benchmark output is the number of X).
Both the "total" and "user" figures compensate for that though, so they are meaningful. Looking at the number of runs otoh can give a very low number of significant digits for long-running tests, e.g. "1 runs/s".
Well, yes, I was more thinking along the lines of dividing by the number of runs. But if it's already done, it's all good.
The patch looks ok to me, except that this bit appears unnecessary:
+	/* Only store pair if there's a chance we'll look at it later */
+	if (m && (!allocated_here || num))
+	  mapping_insert(m, from, to);
The inserts that matter take place just after allocation in the copy_*_recursively functions. (The corresponding mapping_insert in the old code was apparently just as unnecessary.)
Exactly what I planned, though after Mast's note I might want to wait for that to be resolved as well.
As it turns out from the testsuite, there's one problem with this optimization: When the duplicate-tracking mapping is allocated, it's not propagated upward so that the same instance in another part of the tree is detected.
I can see two ways to fix it: Either by making a double indirection for the mapping pointer, or by going back to allocating the mapping at the top. The latter isn't as silly as it first appears, since only a single fixed-size mapping struct is allocated as long as no inserts are done during the process. It might beat the overhead of the extra indirection and the additional checks required.
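The double-indirection variant could look roughly like this (types and names are stand-ins, not the actual copy_svalues_recursively_no_free() signature): every recursion level receives a struct mapping **, so whichever level first needs the duplicate-tracking mapping allocates it and every other level sees it immediately.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for Pike's struct mapping. */
struct mapping { int entries; };

/* Stand-in for one recursion level of the copy. The caller passes the
 * address of a shared mapping pointer, initially NULL. */
static void copy_level(struct mapping **m, int needs_tracking)
{
    if (needs_tracking && !*m) {
        /* Allocated once, visible to all other recursion levels. */
        *m = calloc(1, sizeof **m);
    }
    if (*m)
        (*m)->entries++;   /* stand-in for mapping_insert() */
}
```

The cost is one extra dereference per access plus the NULL checks, which is what has to be weighed against always allocating one fixed-size mapping at the top.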
Right, I had considered your first case already and it is safe. Passing the mapping by reference would however be good anyway since that would avoid repeating the test for complex content but I never completed that solution since ONERROR setup/cleanup got messy.
What happened to the idea to change the hash tables sizes to powers of 2? As far as I remember this was mentioned on the list long ago. How well are the hash values distributed by the current hashing algorithm? Maybe it needs to be replaced by a different one? Are there any other side-effects aside from breaking compatibility for hash()?
arne
I've brought that up earlier and I still get thumbs-down from other people regarding such changes in the 7.8 branch. In my opinion a variant which is overall faster but where the slowest cases happen to shift to new places is a reasonable optimization even in the current branch but I'm not making the call.
One very easy fix that I'd like to push through is to remove an odd restriction that identifier caching only happens in objects with >= 9 identifiers. (See FIND_FUNCTION_HASH_TRESHOLD in program.c.) I don't see any rationale for why smaller objects shouldn't benefit from that cache at all; if it's a hot path it should be cached. Granted the risk for collisions increase with more identifier entries in total but I guess that is offset by getting more cache hits during a particular program action before the working set changes.
Yes I thumbed it down.
7.8 is supposed to be stable, after all. That means bug fixes are supposed to happen there, nothing else. And optimizations aren't bug fixes. Although this rule is broken frequently (by myself as well), it still ought to be kept in mind. Non-bugfixes can be fine if they don't affect any existing functionality (that's why I e.g. thought it was ok to add Standards.JSON), but optimizations aren't like that.
Sure it's annoying that 7.9 never opens up, but bending this rule isn't the answer to that problem. And it is after all possible to commit stuff without doing it in 7.8: the pikex git repo works, for instance; my hacks are on the mast/sneak-7.9 branch.
When it comes to breaking the bugfixes-only rule, it should be done with care at least. Small well-localized optimizations that are fairly simple to verify, like the copy_value() optimization, are easier to swallow. Tweaking hash algorithms isn't like that - it might give O(n) in some situation out there. It's unlikely, but even a slim risk is imo not acceptable in a supposedly stable version. We know what we have, but if we don't know what we'll get then it should not happen in 7.8.
/.../ remove an odd restriction that identifier caching only happens in objects with >= 9 identifiers. /.../
I guess the reasoning behind that is that the hash table doesn't give any appreciable speedup when the identifiers are very few, and so it's better to not pollute it with those. Whether it's correct or not I don't know. The proper thing to do would be to test both with a lower limit and without it.
Anyway, that's a tweak I'd find easier to accept.
I understand your reasoning but disagree that the worst-case effects of this type of change warrants a 7.9. But if we get a 7.9 soon I'd be more positive; maybe we can develop there and backport if it turns out to be an improvement we want to distribute asap.
Regarding the identifier cache, is the cost for a cache miss proportional to the number of identifiers in an object? That could be a possible reason for excluding them.
That's more of a theoretical problem at this point IMHO. We could go with the GIT repository we have today and no one would notice any problems with it.
For a list of known remaining problems, so that you can determine the noticeability for yourself, please see http://pike-svn.lysator.liu.se/twiki/bin/view/Main/RepositoryConversion and http://pike-svn.lysator.liu.se/twiki/bin/view/Main/AutomaticComparisonResults.
However, we are also waiting for the release of git 1.7.2, since the current stable git does not work with the converted repository.
does working on 7.9 in git now require the repository to be converted? a 7.9 branch could be opened now and then once the conversion is ready, a single rebase should be enough to integrate the new work with the existing history.
greetings, martin.
You are free to make your own branches right away, and rebase them against 7.9 once it exists.
Regarding the identifier cache, is the cost for a cache miss proportional to the number of identifiers in an object? That could be a possible reason for excluding them.
It depends; for a "fixed" program (ie fully compiled) it's a binary search (cf program.c:low_find_shared_string_identifier()), for programs that are being compiled it's a linear search (cf program.c:really_low_find_shared_string_identifier()). Having the threshold for the lookup cache at 9 identifiers seems reasonable.
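A condensed sketch of the two strategies (plain strings instead of struct identifier entries, purely illustrative of the complexity difference, not the actual program.c code):

```c
#include <assert.h>
#include <string.h>

/* Fixed (fully compiled) program: sorted table, binary search, O(log n).
 * Roughly what low_find_shared_string_identifier() gets to do. */
static int find_sorted(const char **tab, int n, const char *name)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        int c = strcmp(tab[mid], name);
        if (c == 0) return mid;
        if (c < 0) lo = mid + 1; else hi = mid;
    }
    return -1;
}

/* Program still being compiled: unsorted, linear scan, O(n).
 * Roughly what really_low_find_shared_string_identifier() must do. */
static int find_linear(const char **tab, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (!strcmp(tab[i], name)) return i;
    return -1;
}
```

This also shows why a lookup cache threshold is a judgment call: below a handful of identifiers the binary search is already only two or three compares, so the cache buys little.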
We know what we have
I can't let this one pass. :-) How do you know when the existing implementation gives repeated collisions? I assume program IDs are pretty unstable, particularly in Roxen where module loading can happen anytime. And since collisions cannot be diagnosed from the application level I doubt anyone can say they depend on the current behavior.
I know the existing implementation has been in active use in many installations for a considerable time, and there are no known unresolved cases of bad hashing in it. That's as good as it can get (short of a theoretical proof), and that's what code stability is all about.