I'm a bit curious about the string hashing algorithm used in Pike.
What algorithm is it based on and what complexities does it exhibit? Was it selected at random or is there some rationale behind it..?
The algorithm is basically an improvement on the good old:
int hash(char *x) { int ret=0; while(*x) { ret += (ret<<5) + *x; x++; } return ret; }
The primary improvements are:
1) Only hash the beginning of the string (plus the length of the string).
2) Dynamically calculate how much of the string needs to be hashed to avoid hash duplicates.
3) Replace "5" with different numbers for different characters in the string.
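A minimal sketch of a hash incorporating those three improvements might look like the following. Note that `HASH_PREFIX` and the position-dependent shift are illustrative assumptions here, not Pike's actual constants:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative prefix length; Pike adjusts this dynamically. */
#define HASH_PREFIX 20

unsigned int prefix_hash(const char *s)
{
    size_t len = strlen(s);
    size_t n = len < HASH_PREFIX ? len : HASH_PREFIX;
    unsigned int ret = (unsigned int)len;  /* mix in the full length up front */
    for (size_t i = 0; i < n; i++) {
        /* vary the shift with the position instead of using a fixed 5 */
        ret += (ret << (5 + (i & 3))) + (unsigned char)s[i];
    }
    return ret;
}
```

Because only the first `HASH_PREFIX` bytes are read, hashing a very long string costs the same as hashing a short one; the length term keeps strings that share a prefix but differ in length apart.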
Not sure what you mean by "complexities", but the string handling in Pike is linear except in the most extreme circumstances. As far as rationale goes: there is a reason for everything in Pike :)
Is there a paper on this method of hashing, or some sort of documentation of its properties?
I'd like to improve our current shared-string scheme (at work) in a similar fashion, but I'm reluctant to do so without some sort of proof that it's a "safe" way of doing things. I'm especially interested in the properties of hash(string[0..x] + strlen(string)) (where x<<strlen(string)) vs hash(string).
Anyone who can comment on how x is to be chosen?
First you can document the exact algorithm. Look at DO_HASHMEM in pike_memory.h, called from low_do_hash in stralloc.c.
Then for the empirical approach, modify your pike to log all new strings and then run some big applications and see what the real performance in terms of collisions are compared to a theoretical optimum.
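A quick-and-dirty harness for that kind of measurement could look like this. The hash, the table size, and the sample strings are stand-ins for whatever your logging actually produces:

```c
#include <stddef.h>

#define TABLE_SIZE 1024

/* Toy hash for demonstration; substitute the hash under test. */
unsigned int toy_hash(const char *s)
{
    unsigned int ret = 0;
    while (*s)
        ret += (ret << 5) + (unsigned char)*s++;
    return ret;
}

/* Count how many strings land in an already-occupied bucket.
 * Comparing this count to the birthday-problem expectation for
 * n strings in TABLE_SIZE buckets gives a rough quality measure. */
unsigned int count_collisions(const char **strings, size_t n)
{
    unsigned char used[TABLE_SIZE] = {0};
    unsigned int collisions = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned int b = toy_hash(strings[i]) % TABLE_SIZE;
        if (used[b])
            collisions++;
        else
            used[b] = 1;
    }
    return collisions;
}
```

Feeding the logged strings from a real application run through `count_collisions` and comparing against the theoretical optimum is exactly the empirical test described above.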
The analytical approach is more difficult, since you first have to create a probability model for how strings look and then run that model through the algorithm.
Sounds like fun. At the very least I would like to read the report...
Yes, that may be a fun exercise, but I'm more interested in something where I don't have to spend a day or two testing stuff and even more time analysing the data. A quick and dirty test approach might be possible, but nothing major; we simply have too much to do right now.
Ideally, x should be chosen dynamically to fit your data. In Pike, x is increased if the string hash is found to be inefficient. (By measuring how many iterations it takes to find a string.)
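One way to sketch that adaptation is below. The threshold, sample window, and growth factor are made-up numbers for illustration; Pike's actual logic lives in its string allocator:

```c
#include <stddef.h>

/* Hypothetical bookkeeping for an adaptive prefix length x. */
typedef struct {
    size_t hash_prefix;  /* current x: bytes of each string hashed */
    size_t lookups;      /* lookups since the last check */
    size_t probes;       /* total chain steps over those lookups */
} hash_stats;

/* If lookups take too many probes on average, hash a longer prefix
 * and tell the caller to rehash the table. Returns 1 on growth. */
int maybe_grow_prefix(hash_stats *st)
{
    if (st->lookups < 1000)      /* not enough data yet */
        return 0;
    double avg = (double)st->probes / (double)st->lookups;
    st->lookups = st->probes = 0;
    if (avg > 2.0) {             /* arbitrary inefficiency threshold */
        st->hash_prefix *= 2;    /* hash more of each string */
        return 1;                /* caller should rebuild the table */
    }
    return 0;
}
```

The point is that x never needs to be picked correctly up front: the table monitors its own probe counts and grows x only when the measured collision behaviour demands it.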
pike-devel@lists.lysator.liu.se