Does anyone know how often in the code we actually depend on the fact that the same string will be at the same address in memory?
Often, but it's probably not hard to find a set of gatekeeper functions that cover all the cases.
Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory.
I.e. something along the lines of: all strings shorter than stringmin are always optimised down to a single reference, while strings above that *might* have more than one reference, but not necessarily (i.e. they're not fully hashed all the time, to avoid the overhead of repeatedly rehashing large strings when juggling lots of strings around).
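To make the idea concrete, here's a minimal Python sketch of opportunistic interning under a threshold. The names (`STRINGMIN`, `make_string`, the dict-based table) are purely illustrative, not Pike's actual stralloc internals:

```python
STRINGMIN = 16       # hypothetical threshold; the real value would come from profiling
_intern_table = {}   # short strings: always exactly one shared instance

def make_string(s):
    """Return a canonical instance for short strings; pass long ones through."""
    if len(s) < STRINGMIN:
        # setdefault returns the already-stored instance on a hit,
        # so all short equal strings share one object.
        return _intern_table.setdefault(s, s)
    # Long strings are not (always) deduplicated: equal contents
    # may live at different addresses.
    return s

a = make_string("foo")
b = make_string("foo")
assert a is b            # short strings: identity comparison stays valid

x = make_string("x" * 100)
y = make_string("x" * 100)
assert x == y            # long strings: only equality is guaranteed
```

The point of the sketch is the invariant split: below the threshold, `same string => same address` still holds; above it, only content equality can be relied on, which is exactly why the gatekeeper functions would need patching.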
I did some preparations for this ~6 years ago (cf. commit 3788c640).
Note that it typically isn't the calculation of the hash that is expensive, but the comparison on a hash hit.
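A quick sketch of why that is (hypothetical `intern_lookup`, not Pike's actual code): on a hash hit the candidate string must still be compared in full, which is O(n) in the string length, whereas the hash is computed once:

```python
def intern_lookup(table, s):
    """Return the canonical copy of s, inserting it if absent.

    table maps hash values to buckets (lists) of strings.
    """
    bucket = table.setdefault(hash(s), [])
    for candidate in bucket:
        # Hash hit: a full O(len(s)) comparison is still required, and
        # this -- not computing the hash -- dominates for large strings.
        if candidate == s:
            return candidate
    bucket.append(s)
    return s

table = {}
big = "x" * 10_000
canonical = intern_lookup(table, big)
# A second lookup with equal contents hits the hash, does the full
# comparison, and hands back the canonical instance.
assert intern_lookup(table, "x" * 10_000) is canonical
```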
All the places that depend on same string = same address would need to be patched. Also, to determine stringmin, some profiling of existing apps would be interesting. Is that statistic available for, say, Roxen, so we can see the distribution of string length and reference count in a running application?
Note also that in Pike the string hash function is parameterized, and the parameters are changed depending on how the hashtable is balanced. None of the statistics used for balancing the hashtable are currently visible at the Pike level, and they differ somewhat between different versions of Pike.
/grubba