Does anyone know how often in the code we actually depend on the fact that the same string will be at the same address in memory?
Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory.
I.e. something along the lines of: all strings shorter than stringmin will always be optimised down to a single shared copy, and all strings above that *might* have more than one copy, but not necessarily (i.e. they're not fully hashed all the time, to avoid the overhead of rehashing large strings repeatedly when juggling around lots of strings).
All the places that depend on same string = same address would need to be patched. Also, to determine stringmin, some profiling of existing apps would be interesting. Is that statistic available for, say, Roxen, to know the distribution of string length and reference count in a running application?
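To make the idea a bit more concrete, here is a toy C sketch. Everything in it (STRINGMIN, the struct layout, the helper names) is invented purely for illustration; it is not the actual stralloc code, just the shape of what I mean:

  #include <stddef.h>
  #include <stdlib.h>
  #include <string.h>

  #define STRINGMIN 64                 /* threshold; to be tuned from profiling */

  struct toy_string {
    size_t len;
    int shared;                        /* 1 = unique address guaranteed */
    char *str;
  };

  static char *copy_bytes(const char *s, size_t len)
  {
    char *p = malloc(len + 1);
    memcpy(p, s, len);
    p[len] = '\0';
    return p;
  }

  /* Minimal stand-in for the global shared-string table: a linear scan,
   * no hashing and no overflow handling, just to keep the toy small. */
  static struct toy_string *pool[1024];
  static size_t pool_used;

  static struct toy_string *intern(const char *s, size_t len)
  {
    for (size_t i = 0; i < pool_used; i++)
      if (pool[i]->len == len && !memcmp(pool[i]->str, s, len))
        return pool[i];                /* same bytes => same address */
    struct toy_string *t = malloc(sizeof *t);
    t->len = len;
    t->shared = 1;
    t->str = copy_bytes(s, len);
    return pool[pool_used++] = t;
  }

  /* The opportunistic part: short strings are always shared, long strings
   * are merely copied, so no lookup/hashing is paid for them. */
  struct toy_string *make_string(const char *s, size_t len)
  {
    if (len < STRINGMIN)
      return intern(s, len);
    struct toy_string *t = malloc(sizeof *t);
    t->len = len;
    t->shared = 0;                     /* duplicates of this string may exist */
    t->str = copy_bytes(s, len);
    return t;
  }

  /* What the patched places would call before relying on the address for
   * identity (mapping keys, switch, == and friends): */
  struct toy_string *make_shared(struct toy_string *t)
  {
    return t->shared ? t : intern(t->str, t->len);
  }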
On Thu, 29 Mar 2012, Stephen R. van den Berg wrote:
Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory.
I guess the point here is to skip the hashing in cases where the strings are large, come from the network/disk and/or are very unlikely to exist twice. Would it not make more sense to allow for using unhashed strings explicitly, instead? And in that case I think it would be better to have a separate class for that. Otherwise all kinds of code would become much more complex. Think mapping lookup and similar places, where the ptr of the string is used.
Then, of course, all kinds of other places need to be changed to support "new" unhashed strings, otherwise they would be quite useless, except for very special situations where some cycles can be saved.
Arne Goedeke wrote:
On Thu, 29 Mar 2012, Stephen R. van den Berg wrote:
Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory.
I guess the point here is to skip the hashing in cases where the strings are large, come from the network/disk and/or are very unlikely to exist twice. Would it not make more sense to allow for using unhashed strings explicitly, instead? And in that case I think it would be better to have a separate class for that. Otherwise all kinds of code would become much more complex. Think mapping lookup and similar places, where the ptr of the string is used.
That's exactly what I'm asking... How many places are there where we explicitly depend on the fact that the address can be used to define uniqueness? I'm trying to assess if it is doable to fix those.
Then, of course, all kinds of other places need to be changed to support "new" unhashed strings, otherwise they would be quite useless, except for very special situations where some cycles can be saved.
Quite. The reasoning has come full circle, which is why I'm trying to see if it can be retrofitted to make the speedup available to all unaltered Pike programs.
That's exactly what I'm asking... How many places are there where we explicitly depend on the fact that the address can be used to define uniqueness?
All places where strings are compared.
Say, a few thousand places in the code, probably?
Most importantly: Mappings and multisets, identifiers in objects and programs, switch, and string comparison functions.
Especially things like object and program identifiers, switch, mappings and multisets would be dead slow without this optimization.
And hashing a string takes basically no time at all.
Have you measured it to take a lot of CPU somewhere?
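As a toy illustration (not the actual mapping code, just the principle):

  #include <stddef.h>
  #include <string.h>

  struct toy_string { size_t len; char *str; };

  /* With globally shared strings, "same contents" implies "same pointer": */
  int toy_equal_shared(const struct toy_string *a, const struct toy_string *b)
  {
    return a == b;
  }

  /* Without that guarantee, every mapping/multiset lookup, switch case and
   * identifier lookup would need at least a length check plus a memcmp: */
  int toy_equal_unshared(const struct toy_string *a, const struct toy_string *b)
  {
    return a == b || (a->len == b->len && !memcmp(a->str, b->str, a->len));
  }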
-- Per Hedbor
The issue isn't necessarily the hashing but the fact that you need to have this globally synced instead of e.g. creating a thread-local string pool. Still, I agree with you that fundamental properties of mappings etc. are based on string uniqueness. There are other low-hanging fruit that should be targeted first imho.
The issue isn't necessarily the hashing but the fact that you need to have this globally synced instead of e.g. creating a thread-local string pool.
Well. Yes, but as long as we do not have actual threads that can run concurrently in Pike, this is not much of an issue, really.
C-code can (and does) work with non-shared strings, and it is only when the data is moved to the pike layer that it needs to be shared.
This is done without locks since we have the "nice" global interpreter lock, and the fact that there is only one shared string pool is thus not really all that important.
Even if we were to get rid of the global lock, a global hash table for strings probably wouldn't be a significant problem since it can be made lock free.
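Schematically, a toy C11 sketch of the insert path (all names invented; a real find-or-insert would also have to re-check for a racing insert of the same string after the CAS):

  #include <stdatomic.h>
  #include <stddef.h>

  struct entry { const char *str; size_t len; struct entry *next; };

  #define NBUCKETS 4096
  static _Atomic(struct entry *) buckets[NBUCKETS];

  /* Lock-free prepend of a new entry to its bucket. */
  static void bucket_insert(struct entry *e, size_t hash)
  {
    _Atomic(struct entry *) *slot = &buckets[hash % NBUCKETS];
    struct entry *head = atomic_load(slot);
    do {
      e->next = head;                /* head is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(slot, &head, e));
  }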
(i.e. they're not fully hashed all the time, to avoid the overhead of rehashing large strings repeatedly when juggling around lots of strings).
Large strings are not fully hashed. The hash function will consider at most 72 characters. So strings longer than that will not take longer to hash than a string of 72 characters.
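In toy form (the real function, its sampling and its constants differ; this only shows why the cost stops growing with the length):

  #include <stddef.h>

  #define HASH_CHARS 72

  static size_t bounded_hash(const unsigned char *s, size_t len)
  {
    size_t n = len < HASH_CHARS ? len : HASH_CHARS;
    size_t h = len;                  /* this toy also mixes in the length */
    for (size_t i = 0; i < n; i++)
      h = h * 33 + s[i];
    return h;
  }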
Does anyone know how often in the code we actually depend on the fact that the same string will be at the same address in memory?
Often, but it's probably not hard to find a set of gatekeeper functions that cover all the cases.
Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory.
I.e. something along the lines of: all strings shorter than stringmin will always be optimised down to a single shared copy, and all strings above that *might* have more than one copy, but not necessarily (i.e. they're not fully hashed all the time, to avoid the overhead of rehashing large strings repeatedly when juggling around lots of strings).
I did some preparations for this ~6 years ago (cf. commit 3788c640).
Note that it typically isn't the calculation of the hash that is expensive, but the comparison on hash-hit.
All the places that depend on same string = same address would need to be patched. Also, to determine stringmin, some profiling of existing apps would be interesting. Is that statistic available for, say, Roxen, to know the distribution of string length and reference count in a running application?
Note also that in Pike the string hash function is parameterized, and the parameters are changed depending on how the hashtable is balanced. None of the statistics used for balancing the hashtable are currently visible at the Pike level, and they differ somewhat between different versions of Pike.
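Schematically (names and thresholds invented, just to show the shape of it):

  /* How much of the string the hash looks at is itself a runtime parameter,
   * adjusted from table statistics; changing it invalidates every stored
   * hash value and forces a rehash. */
  static unsigned hash_prefix = 64;

  static void check_balance(unsigned longest_chain, unsigned num_strings)
  {
    /* One very deep chain usually means many strings share a long prefix,
     * so widen the hashed part of the string and rebuild the table. */
    if (longest_chain > 16 && longest_chain > num_strings / 8) {
      hash_prefix *= 2;
      /* rehash_all_strings();  (hypothetical helper; every bucket changes) */
    }
  }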
/grubba
Note that it typically isn't the calculation of the hash that is expensive, but the comparison on hash-hit.
Do you have some statistics on this? I'd imagine that most of the time spent comparing on hash hits would be on short to medium length strings, and not on really long ones, since it's unlikely you'll find another string with the exact same length in the hash bucket.
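For concreteness, the per-candidate check on a hash hit looks roughly like this in toy form (the stored hash value and the length are checked before any bytes are compared, so a long string only gets memcmp'd against candidates of exactly the same length):

  #include <stddef.h>
  #include <string.h>

  struct toy_string { size_t hval, len; const char *str; };

  static int bucket_match(const struct toy_string *cand,
                          size_t hval, size_t len, const char *str)
  {
    return cand->hval == hval
        && cand->len == len
        && memcmp(cand->str, str, len) == 0;
  }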
Do you have some statistics on this? I'd imagine that most of the time spent comparing on hash hits would be on short to medium length strings, and not on really long ones, since it's unlikely you'll find another string with the exact same length in the hash bucket.
I do have some statistics from the Opera Mini servers.
We spent around 0.2% of the CPU time in string hashing before the CRC32 change. After that we spend around 0.1% CPU time there.
Almost all of that time is spent traversing the linked list; rescaling just the string-sharing work to 100% gives 10% to crc32 and 1.8% to the string comparisons.
87% is spent in the hash lookup + loop.
If it was not clear: before CRC32, almost 50% of the shared-string time was spent in the hash function. But it was still a very small percentage of the total CPU time used.
I optimized it because it was easy to do, mostly.
We spend significantly more time looking up identifiers in objects, as an example, so if you want to optimize Pike, make this code not call so many slow C functions:
class foo { int cnt; void test() { for( cnt=0; cnt<1000000; cnt++ ); } }
void main() { foo()->test(); }
-- Per Hedbor
Do you have any statistics about what the hit rate of the identifier lookup cache is? How many identifier lookups actually use the binary search?
On Thu, 29 Mar 2012, Per Hedbor wrote:
If it was not clear: before CRC32, almost 50% of the shared-string time was spent in the hash function. But it was still a very small percentage of the total CPU time used.
I optimized it because it was easy to do, mostly.
We spend significantly more time looking up identifiers in objects, as an example, so if you want to optimize Pike, make this code not call so many slow C functions:
class foo { int cnt; void test() { for( cnt=0; cnt<1000000; cnt++ ); } }
void main() { foo()->test(); }
The problem is that there may be work patterns where there's a significant risk of getting very long identical strings, e.g. if a file is read and cached for some time and then read again from another part of the program, or if the same file is read concurrently by different threads.
What's worrying is that it could occur suddenly in systems that have been running smoothly for a long time.