Would it be OK to merge the split string code to pike 8.1?
I have done the merge to a private branch, and now wonder if it would be OK to push it upstream.
It splits strings into two parts: the pike_string structure and the actual string data.
There are at least two advantages that are available right now:
1: "Static" strings (C-strings and static C-data) can be converted to pike strings without allocating any memory for the string data. This saves a measurable amount of string memory.
Also, these strings are faster to create, since there is no need to allocate storage for the actual string data. This is useful in, for example, GTK2, where there is an actual measurable speedup in the module init. Well. Microseconds, but it's faster. :)
2: It actually improves the speed of some string operations, probably because the string metadata is closer together in memory (and thus more likely to be cached).
The disadvantages are, at least:
1: One more pointer in the string struct. For everything except short strings this is, however, balanced by the fact that malloc is no longer used to allocate them, so there is one size_t less in them (malloc overhead).
2: String data is individually malloc'd (see 1), so here we get the overhead back, except for short strings. :)
So, basically: 8 bytes more overhead for strings.
As for the speed of string creation/free:ing:
Without patch:
| String Creation . . . . . . . . . . . . . . . . . . 2207k/s
| String Creation (existing)  . . . . . . . . . . . . 8335k/s
With patch:
| String Creation . . . . . . . . . . . . . . . . . . 3069k/s
| String Creation (existing)  . . . . . . . . . . . . 8771k/s
Seen this way the split-method is faster for basic string operations. The numbers, however, vary so much that it's basically a wash.
Another advantage of this method is that it is reasonably trivial to add new string types. As an example that I am seriously considering: a string that is a tail part of another string.
The tail-string would be used for this somewhat unfortunate but rather common coding pattern:
sscanf( str, "%2c%s", part, str );
and, in a similar manner:
len = file->write(str); str = str[written..]
These are both O(n^2), but if tail-of strings are added they could both become roughly O(n), reallocating the data only when more than 50% of it is wasted.
Unfortunately the same can not be done for random substrings, since strings are guaranteed to be 0-terminated. But that is probably not much of an issue in practice.
Also, code could be added to "trim" all tail-of strings when gc:ing.
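A tail-of string could look roughly like this. The names are made up for illustration, and it assumes the parent buffer stays alive and 0-terminated:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch of a tail-of string type (not Pike's code).
 * It borrows the parent's buffer at an offset; since the parent is
 * 0-terminated, every tail is too -- which is exactly why this works
 * for suffixes but not for arbitrary substrings. */
struct tail_string {
    const char *parent;  /* full 0-terminated buffer, owned elsewhere */
    size_t offset;       /* where this tail starts */
};

static const char *tail_ptr(const struct tail_string *t) {
    return t->parent + t->offset;
}

/* Dropping a prefix is O(1): bump the offset instead of copying. */
static void tail_advance(struct tail_string *t, size_t n) {
    t->offset += n;
}
```

Trimming at gc time would then mean copying `parent + offset` into a fresh buffer when too much of the parent is wasted.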
There is at least one slightly incompatible change to the C-level API:
push_constant_text and the related macros (which create a static pike_string the first time they are called, and then use that from then on) now require actual static lifetime of the argument passed to them.
This is already the case for all code in the pike tree, but external modules might break this assumption.
Well, if the argument of push_constant_text() doesn't have static lifetime, then how can it be a constant at all? I fail to see a sane usecase that would break from that...
I agree, but consider something like this:
PIKECLASS FooConnection {
  foo_handle *con;
  INIT { .. set con .. }
  EXIT { .. release con .. }
  PIKEFUN string library_version() {
    push_constant_text( foo_library_version( THIS->con ) );
  }
}
In this case it might be expected that the value returned from foo_library_version will not change, but perhaps it is free:d when exit is called on the class.
But, yes, push_constant_text should not be used for this case, no.
THIS->con is not constant, so then foo_library_version(THIS->con) can't be assumed to be constant either, no. Should be push_text().
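The constraint can be shown with a stripped-down model of the caching such a macro expands to. This is a hypothetical sketch (cslot, constant_text are invented names), not the real Pike macro:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a push_constant_text-style call-site cache.
 * The first call stores the pointer it was given; every later call
 * reuses it. So the argument must outlive all of them, i.e. have
 * static storage duration. */
struct cslot { const char *data; size_t len; };

static const struct cslot *constant_text(struct cslot *slot, const char *text) {
    if (!slot->data) {       /* first call: remember `text`, no copy made */
        slot->data = text;
        slot->len = strlen(text);
    }
    return slot;             /* later calls ignore their argument */
}
```

Calling it a second time with a different (or since-freed) pointer silently keeps the first one, which is exactly the failure mode of the foo_library_version example above.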
On 08/18/15 22:13, Per Hedbor () @ Pike (-) developers forum wrote:
Would it be OK to merge the split string code to pike 8.1?
I have done the merge to a private branch, and now wonder if it would be OK to push it upstream.
Could you push your merge as a branch? I recall that the code had a possible memory leak when allocating small strings in case an out-of-memory error happened. There are possibly also some other issues. At the time I realized that many of the APIs (like add_string_constant and others) make it impossible to correctly make use of static strings. As you point out, the current state could break some external modules. Maybe the best alternative is to actually add an API, instead of requiring that the const char* arguments have program lifetime.
There is a somewhat major difference between add_string_constant (not static) and push_constant_text (static + constant. :))
But I can agree that the string function names are rather messy. It is a rather big job to change them all to be consistent, however.
On Sat, Aug 22, 2015 at 11:52 AM, Arne Goedeke el@laramies.com wrote:
On a somewhat related note, I noticed that the string structure no longer has any extra padding bytes available, so adding new string types (substring) is not possible without increasing its size, or decreasing the maximum allowed string length to 2GB again (or going through all the signed/unsigned work and changing it to at least 4GB).
Ok.. I accidentally committed about half of the changes to 8.1 without the other half (I did a 'git push' but aborted almost immediately when I noticed it was pushing to 8.1; however, it seems aborting does not stop the push, and in this case a lot of the commits were not actually included).
To fix this I just did a merge of the string split to 8.1. I assume there are alternative solutions, but there had been other commits done as well, and the merged commits are spread out in time (some are merges), so in order to fix it quickly I basically just did the full merge.
On a somewhat related note, I noticed that the string structure no longer has extra padding bytes available, so adding new string types (substring) is not possible without increasing its size
As it turns out, by shrinking the size_shift field to 2 bits there are 6 bits left for types instead of the previous 2. So a lot of new allocation types could now be added, in theory.
I created the substring type on a branch, it does help with the naive cut-prefix-of-string buffer code case, at least. :)
Making size_shift a bitfield adds a compiler-generated '& 3' to each access of it, more or less. This could cause a noticeable general slowdown; I have not checked whether this is the case.
With patch:
string s = random_string(1024*1024);
set format bench
lambda(string s){ while(strlen(s)) s=s[1..]; }(s);
Compilation: 549ns, Execution: 83.55ms
Without patch:
Compilation: 505ns, Execution: 35.16s
So, for 1MB it's 400+ times faster. Of course, it's the asymptotic complexity that is faster; the actual string_slice substring generation code is somewhat slower (for small values of 'somewhat').
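The complexity difference can be reproduced with a toy model in C. This is an assumed simplification of what `s = s[1..]` costs with and without a substring type, not the actual Pike implementation:

```c
#include <string.h>

/* Toy model: dropping the first character by copying moves O(n) bytes
 * per step, O(n^2) in total; a borrowed tail only advances a pointer,
 * O(1) per step, O(n) in total. */

/* Copying strategy: returns total bytes moved (n + n-1 + ... + 1). */
static unsigned long long drop_all_by_copy(char *s) {
    unsigned long long moved = 0;
    size_t len = strlen(s);
    while (len > 0) {
        memmove(s, s + 1, len);  /* shift remainder plus terminator left */
        moved += len;
        len--;
    }
    return moved;
}

/* Substring strategy: returns the number of O(1) offset bumps (n). */
static unsigned long long drop_all_by_tail(const char *s) {
    unsigned long long steps = 0;
    while (*s) { s++; steps++; }
    return steps;
}
```

For a 4-character string the copy version moves 4+3+2+1 = 10 bytes while the tail version does 4 constant-time steps; at a megabyte that gap becomes the 400x seen in the benchmark above.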
I have now run a somewhat more extensive test (run the benchmark for 20 hours, alternating with and without the substring type).
Even with this long a run of the testsuite there still seems to be a few percentage points of variance, which is impressive. The code is identical except for the change making the string type field a bitfield; I would have expected most tests to be at 100.0% after 20 hours of sampling.
Anyway, these are the results:
There are three tests with a large difference, and one with a smaller but still relevant difference whose cause I cannot see.
1: Append array, operating at 96%. That is a fairly noticeable slowdown compared to most tests. But I can not for the life of me see why it occurs. Bad luck with cache-line alignment of the code? Who knows.
2: Array & string juggling, 36x faster. It would seem that this test does

  while( sscanf(s,"%d %s",x,s)==2 ) sum+=x,ia+=({x});
so, this is indeed using the substring code.
3: array_sscanf tag removal, operating at 86% speed. This is almost certainly due to the change in sscanf (using string_slice to make strings when possible). As it turns out it is not possible in this case, but I had assumed gcc would see that (it is statically true in the code that the string argument is NULL).
Well.
4: Tag removal using sscanf, operating at 442% (closing in on Parser.HTML speed). This is due to the substring optimization, since it removes one tag at a time using sscanf.
--------------------------------------------------------
  Test                                    Relative speed
--------------------------------------------------------
  Ackermann                                : 106%
  Adding element to array (global)         : 103%
  Adding element to array (local)          : 100%
  Adding element to array (private global) : 102%
1 Append array                             :  96%
  Append mapping (+)                       : 100%
  Append mapping (|)                       : 100%
  Append multiset                          : 101%
2 Array & String Juggling                  : 3669%
  Array Copy                               : 102%
  Array Zero                               : 101%
  Binary Trees                             : 100%
  Clone null-object                        : 102%
  Clone object                             : 100%
  Compile                                  : 100%
  Compile & Exec                           : 101%
  Foreach (arr,global)                     : 101%
  Foreach (arr,local)                      :  99%
  Foreach (arr;local;global)               : 100%
  Foreach (arr;local;local)                :  97%
  GC                                       : 104%
  Huffman                                  : 100%
  Huffman (binary)                         : 101%
  Insert in array                          : 100%
  Insert in mapping                        : 102%
  Insert in multiset                       : 102%
  Loops Nested (global)                    : 104%
  Loops Nested (local)                     : 100%
  Loops Nested (local,var)                 : 100%
  Loops Recursed                           : 100%
  Matrix multiplication (100x100)          : 100%
  Read binary INT128                       :  98%
  Read binary INT16                        : 104%
  Read binary INT32                        : 102%
  Replace (parallel)                       :  99%
  Replace (serial)                         : 103%
  Simple arithmentics (globals)            : 102%
  Simple arithmentics (private global)     : 100%
  Simple arithmetics (locals)              : 100%
  Sort equal integers                      : 101%
  Sort ordered integers                    : 100%
  Sort unordered integers                  : 101%
  Sort unordered objects                   : 103%
  String Creation                          :  99%
  String Creation (existing)               :  98%
  String Creation (wide)                   : 100%
  Tag removal u. Parser.HTML               :  99%
  Tag removal u. Regexp.PCRE               :  98%
3 Tag removal u. array_sscanf              :  86%
  Tag removal u. division                  : 101%
  Tag removal u. search                    : 100%
  Tag removal using a loop                 :  98%
4 Tag removal using sscanf                 : 442%
  Upper/lower case shift 0                 :  98%
  Upper/lower case shift 1                 : 101%
  call_out handling                        :  99%
  call_out handling (with id)              :  98%
--------------------------------------------------------
So... I guess the question now is: should I merge this as well? I will look into the array_sscanf code, but sscanf is so macrified that it is sort of hard to solve easily, I think. I guess I could pass in the code to create a substring as a macro.. ;)
Ok, #3 is fixed now. So, the testsuite is on average faster. In some cases significantly so, for obvious reasons.
However, that does not really say much, except that there is no drastic slowdown.
pike-devel@lists.lysator.liu.se