I'm a bit curious about the string hashing algorithm used in Pike.
What algorithm is it based on and what complexities does it exhibit? Was it selected at random or is there some rationale behind it..?
The algorithm is basically an improvement on the good old:
int hash(char *x) { int ret=0; while(*x) { ret += (ret<<5) + *x; x++; } return ret; }
The primary improvements are:
1) Only hash the beginning of the string (plus the length of the string).
2) Dynamically calculate how much of the string needs to be hashed to avoid hash duplicates.
3) Replace "5" with different numbers for different characters in the string.
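A minimal sketch of a hash incorporating those three improvements might look like the following. Note that `HASH_PREFIX` and the position-dependent shift are illustrative assumptions here, not Pike's actual constants:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative prefix length; Pike adjusts this dynamically. */
#define HASH_PREFIX 20

unsigned int prefix_hash(const char *s)
{
    size_t len = strlen(s);
    size_t n = len < HASH_PREFIX ? len : HASH_PREFIX;
    unsigned int ret = (unsigned int)len;  /* mix in the full length up front */
    for (size_t i = 0; i < n; i++) {
        /* vary the shift with the position instead of using a fixed 5 */
        ret += (ret << (5 + (i & 3))) + (unsigned char)s[i];
    }
    return ret;
}
```

Because only the first `HASH_PREFIX` bytes are read, hashing a very long string costs the same as hashing a short one; the length term keeps strings that share a prefix but differ in length apart.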
Not sure what you mean by "complexities", but the string handling in Pike is linear except in the most extreme circumstances. As far as rationale goes: there is a reason for everything in Pike :)
Is there a paper on this method of hashing, or some sort of documentation of its properties?
I'd like to improve our current shared-string scheme (at work) in a similar fashion, but I'm reluctant to do so without some sort of proof that it's a "safe" way of doing things. I'm especially interested in the properties of hash(string[0..x] + strlen(string)) (where x<<strlen(string)) vs hash(string).
Anyone who can comment on how x is to be chosen?
First you can document the exact algorithm. Look at DO_HASHMEM in pike_memory.h, called from low_do_hash in stralloc.c.
Then for the empirical approach, modify your pike to log all new strings and then run some big applications and see what the real performance in terms of collisions are compared to a theoretical optimum.
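A quick-and-dirty harness for that kind of measurement could look like this. The hash, the table size, and the sample strings are stand-ins for whatever your logging actually produces:

```c
#include <stddef.h>

#define TABLE_SIZE 1024

/* Toy hash for demonstration; substitute the hash under test. */
unsigned int toy_hash(const char *s)
{
    unsigned int ret = 0;
    while (*s)
        ret += (ret << 5) + (unsigned char)*s++;
    return ret;
}

/* Count how many strings land in an already-occupied bucket.
 * Comparing this count to the birthday-problem expectation for
 * n strings in TABLE_SIZE buckets gives a rough quality measure. */
unsigned int count_collisions(const char **strings, size_t n)
{
    unsigned char used[TABLE_SIZE] = {0};
    unsigned int collisions = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned int b = toy_hash(strings[i]) % TABLE_SIZE;
        if (used[b])
            collisions++;
        else
            used[b] = 1;
    }
    return collisions;
}
```

Feeding the logged strings from a real application run through `count_collisions` and comparing against the theoretical optimum is exactly the empirical test described above.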
The analytical approach is more difficult, since you first have to create a probability model for how strings look and then run that model through the algorithm.
Sounds like fun. At the very least I would like to read the report...
Yes, that may be a fun exercise, but I'm more interested in something where I don't have to spend a day or two testing stuff and even more time analysing the data. A quick and dirty test approach might be possible, but nothing major; we simply have too much to do right now.
Ideally, x should be chosen dynamically to fit your data. In Pike, x is increased if the string hash is found to be inefficient. (By measuring how many iterations it takes to find a string.)
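One way to sketch that adaptation is below. The threshold, sample window, and growth factor are made-up numbers for illustration; Pike's actual logic lives in its string allocator:

```c
#include <stddef.h>

/* Hypothetical bookkeeping for an adaptive prefix length x. */
typedef struct {
    size_t hash_prefix;  /* current x: bytes of each string hashed */
    size_t lookups;      /* lookups since the last check */
    size_t probes;       /* total chain steps over those lookups */
} hash_stats;

/* If lookups take too many probes on average, hash a longer prefix
 * and tell the caller to rehash the table. Returns 1 on growth. */
int maybe_grow_prefix(hash_stats *st)
{
    if (st->lookups < 1000)      /* not enough data yet */
        return 0;
    double avg = (double)st->probes / (double)st->lookups;
    st->lookups = st->probes = 0;
    if (avg > 2.0) {             /* arbitrary inefficiency threshold */
        st->hash_prefix *= 2;    /* hash more of each string */
        return 1;                /* caller should rebuild the table */
    }
    return 0;
}
```

The point is that x never needs to be picked correctly up front: the table monitors its own probe counts and grows x only when the measured collision behaviour demands it.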
pike-devel@lists.lysator.liu.se