I think I managed to fix the last issue. I was somehow confusing things and removed the locals from the stack before unlinking the stack frame. This of course broke trampolines. I also ended up rebasing the branch to get rid of the reverts I did at some point.
The current state passes the testsuite (the same tests as 8.1 at least). Performance-wise it is roughly where 8.1 is, except for map/automap being significantly faster. There are some slowdowns currently, which are due to me removing some fast paths from the F_CALL_OTHER opcode. I will look into that.
I re-added most of the tracing code; however, some of it is unfinished and DTrace is probably broken. I have also not looked at PROFILING yet, so that is probably not right either.
Sidenote: Profiling unfortunately does not work properly when fork()ing because timers change. It might even crash when running with debug mode because of that. But that is probably just a bug we need to fix.
What's currently left on my list before proposing to merge it into 8.1/8.3:
* Make sure the map/automap optimizations do not break in pathological cases (e.g. objects being destructed or similar).
* Maybe think about the API again (e.g. callsite_execute and callsite_return could be merged; the same goes for callsite_init/callsite_set_args).
Otherwise I played around with adding frame caching to apply_array, which looks promising performance-wise. However, it takes some attention to make sure the stack traces are always correct. This would be a good test case for caching frames in general.
Anyway, feedback welcome, as usual,
Arne
On 02/22/17 09:37, Arne Goedeke wrote:
I am not quite sure, since I did not have the time to look into it yet. My feeling is that callsite_reset() is currently broken, probably when trampolines are used. It's probably easy to fix. I was also planning to write a couple of tests which try to cover all possible paths of the function call API. Having to run the full testsuite can be quite annoying.
I also started adding some benchmarks for function calls to the pike-benchmark repo. That might make it easier to tweak specific optimizations.
Arne
On 02/21/17 22:12, Martin Karlgren wrote:
Hi Arne,
Alright. Any idea what the crash might be related to?
I’ve pushed the marty/call_frames branch now. As mentioned, something breaks when precompiled bytecode is decoded, so many testsuite tests will segfault (since they are precompiled).
Compiling --with-mc-stack-frames and running the very nice Debug.generate_perf_map() (previously implemented by TobiJ) should enable perf to extract what’s needed. I’ve used https://github.com/jrfonseca/gprof2dot and http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html for visualisation.
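For reference, the map file perf reads for JIT-generated code is plain text: one line per symbol, with a hex start address, a hex size, and a name. Below is a rough Python sketch of resolving an address against such a file. It assumes Debug.generate_perf_map() writes that usual perf-<pid>.map layout; the sample entries and the helper names parse_perf_map and resolve are made up for illustration and are not Pike output.

```python
import bisect

def parse_perf_map(text):
    """Parse perf map text into a sorted list of (start, size, name)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Format: START SIZE NAME, with START and SIZE in hex (no 0x prefix).
        # The name is everything after the second field and may contain spaces.
        start, size, name = line.split(None, 2)
        entries.append((int(start, 16), int(size, 16), name))
    entries.sort()
    return entries

def resolve(entries, addr):
    """Return the name of the symbol covering addr, or None if there is none."""
    starts = [e[0] for e in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, size, name = entries[i]
    return name if addr < start + size else None

# Hypothetical sample data, only to show the layout.
SAMPLE = """\
7f0000001000 40 example.pike:10:foo
7f0000001040 80 example.pike:20:bar
"""
entries = parse_perf_map(SAMPLE)
```

Calling resolve(entries, addr) with an address sampled by perf then gives the function name that ends up as a node in the graph, or None for addresses outside every mapped range.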
/Marty
On 21 Feb 2017, at 20:31 , Arne Goedeke el@laramies.com wrote:
Hi Marty,
thanks!
Yes, low_mega_apply still needs to be refactored. It is slightly more "complicated" because of APPLY_STACK, where the return value will overwrite the function on the stack. I want to fix the last crash in the testsuite before refactoring that. If you are interested in working on those, just let me know so we don't both do it ;)
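To illustrate the APPLY_STACK case, here is a toy Python model, not the actual interpreter code. It assumes the function sits directly below its arguments on the stack; apply_stack is a made-up name for this sketch.

```python
def apply_stack(stack, nargs):
    """Call the function that sits just below its nargs arguments.

    The arguments are popped and the return value is written into the
    slot that held the function, so the stack shrinks by nargs.
    """
    fun_pos = len(stack) - nargs - 1
    if fun_pos < 0:
        raise IndexError("not enough values on the stack")
    fun = stack[fun_pos]
    args = stack[fun_pos + 1:]
    result = fun(*args)
    del stack[fun_pos + 1:]   # pop the arguments
    stack[fun_pos] = result   # the return value overwrites the function
    return stack

# Hypothetical stack contents: one unrelated value, the function, two arguments.
stack = ["below", lambda a, b: a + b, 2, 3]
apply_stack(stack, 2)
```

The point is that the caller cannot simply push the result on top once the frame is gone: the result has to land in the function's slot.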
Adding more perf support would be great, do you have your code in a branch somewhere? I would be interested to have a look at it.
Arne
On 02/20/17 23:47, Martin Karlgren wrote:
Hi Arne,
That’s awesome!
I’d love to help (with the limited spare time I have.) I guess low_mega_apply should be refactored to make use of the new API too?
Speaking of faster calls, I’ve incidentally been poking around a bit with machine code function calling conventions lately. For profiling purposes (i.e. Linux perf) I’ve added minimal call frame information to Pike functions in the amd64 machine code generator. I’ve gotten to the point where I can start Roxen and get proper stack traces from perf, but the testsuite still fails – it seems related to decoding of dumped bytecode, and I haven’t been able to sort out why. Anyways, the good thing is that ready-made visualisation tools built on perf output can be used to profile Pike code, and the interaction between Pike code and C functions is more apparent. Examples from a very simple Roxen site being hit by apachebench: http://marty.se/dotgraph.png (nodes with a “perf-17628.map” header represent Pike functions) and http://marty.se/flamegraph.svg (time on horizontal axis, stack depth on vertical axis).
Hopefully this can be used to weed out where we should start looking for optimisation candidates eventually.
/Marty