There is a problem with the hash function currently used for floats. It multiplies by some number and casts the result to a UINT32. This leads to many floats having the same hash value, 0. An especially bad case:
hash_values(random(Float.MAX))
Also, from those results I would say that random is broken for floats. It will always produce values lying within a few orders of magnitude in the above case. There are many options for good hash functions for floats/ints. Any preferences?
arne
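For illustration, roughly the kind of hash being described, as far as I understand it (the constant is a placeholder borrowed from later in this thread, and the exact cast is an assumption, not the actual Pike source):

#include <stdint.h>

#define NAIVE_MULTIPLIER 16843009  /* placeholder constant for illustration */

static uint32_t naive_float_hash(double f)
{
    /* Once f * NAIVE_MULTIPLIER exceeds the UINT32 range, the conversion
     * is no longer meaningful (formally undefined behaviour in C), and per
     * the report above it collapses many distinct floats to the same hash. */
    return (uint32_t)(f * NAIVE_MULTIPLIER);
}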
Shouldn't random produce a linearly distributed result between 0 and the value? If the value is 1e308, there's a 1% chance that you get a value below 1e306, and a 0.1% chance that you get a value below 1e305. It seems correct to me.
Yes, you are right. For the hashing though, I think it would be sane to do some kind of mixing of all 64 bits into a 32-bit integer, something along the lines of Martin's proposal. Maybe even use the same for 64-bit integers as well.
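A minimal sketch of the kind of mixing I mean, assuming IEEE 754 doubles and using memcpy to reinterpret the bit pattern (not the actual patch):

#include <stdint.h>
#include <string.h>

/* Fold all 64 bits of a double into a 32-bit hash by XOR-ing the
 * high and low halves of its bit pattern. */
static uint32_t fold_double_hash(double f)
{
    uint64_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint32_t)(bits ^ (bits >> 32));
}

/* The same fold works for 64-bit integers. */
static uint32_t fold_int64_hash(uint64_t v)
{
    return (uint32_t)(v ^ (v >> 32));
}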
On Mon, 7 Mar 2011, Mirar @ Pike developers forum wrote:
Shouldn't random produce a linearly distributed result between 0 and the value? If the value is 1e308, there's a 1% chance that you get a value below 1e306, and a 0.1% chance that you get a value below 1e305. It seems correct to me.
Ok, so I finally got down to this. A patch is attached. Maybe someone can have a look at it first, since I changed some other things as well. I think it might be good to also do some mixing for pointers, even though it's quite unlikely to generate datasets where the lower 32 bits of many pointers have the same value. I was playing around with some more complex fixed-width mixing functions, but my tests suggested that they perform badly when using modulo prime in hash tables. Therefore, this patch simply uses XOR. I was also thinking that using hash_svalue on NaN should produce an error, because I don't see how NaN can be used in hash tables in any reasonable way. Opinions?
arne
I think I would still say something like
q = 2000000000 + round(log(f)*1000000);
which has the benefit of being portable and that numbers that are close to each other will hash to the same values.
Special cases are needed for inf and nan, though.
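A sketch of how that proposal might look with the special cases filled in; the handling of zero, sign, inf and NaN here is my own guess, not something specified above:

#include <math.h>
#include <stdint.h>

static uint32_t log_float_hash(double f)
{
    if (isnan(f)) return 0xffffffffu;              /* arbitrary fixed value */
    if (isinf(f)) return f > 0 ? 0xfffffffeu : 1u; /* pin +/-inf to the ends */
    if (f == 0.0) return 2000000000u;              /* log(0) would be -inf */

    double q = 2000000000.0 + round(log(fabs(f)) * 1000000.0);
    uint32_t h = (uint32_t)q;
    return f < 0 ? ~h : h;                         /* keep sign information */
}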
I see the benefit of not relying on the binary representation. But still, this hash only really makes sense for doubles. And then, simply using the first 32 bits would achieve something similar, i.e. hashing close numbers to the same value, in fewer cycles.
Why is hashing close numbers to the same value a benefit? Especially for hash tables, I think it would be good to avoid situations where collisions can easily be produced systematically.
On Fri, 18 Mar 2011, Mirar @ Pike developers forum wrote:
I think I would still say something like
q = 2000000000 + round(log(f)*1000000);
which has the benefit of being portable and that numbers that are close to each other will hash to the same values.
Special cases is needed for inf and nan, though.
I think xor is good. Depending on the binary representation shouldn't be a problem, since the hash value quite clearly isn't usable even outside the process instance:
/.../
*! The hash will be the same for the same value in the running
*! process only (the memory address is typically used as the basis
*! for the hash value).
/.../
It will de facto become less stable for floats though, but I'd say that it's fairly unlikely there's any code out there that depends on that. I'd probably not even bother with #pike 7.8 compat goo.
I too believe that hashing close numbers to the same value would be a weakness rather than a strength. Hash values are not for sorting, rather the opposite.
And not least, xor is a lot faster. I've gathered that the general consensus today is that quick hash functions are much better than mathematically pleasing ones.
E.g. all those prime modulos for hash tables in pike are probably performance killers. We ought to replace them by plain power-of-two sizes and simple masking.
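To spell out the difference being suggested, a bucket lookup with a prime-sized table needs an integer division, while a power-of-two table only needs a mask (sketch, not the mapping.c code):

#include <stdint.h>
#include <stddef.h>

/* Prime-sized table: bucket selection costs an integer division. */
static size_t bucket_mod_prime(uint32_t hash, size_t prime_size)
{
    return hash % prime_size;
}

/* Power-of-two table: bucket selection is a single AND. */
static size_t bucket_masked(uint32_t hash, size_t pow2_size)
{
    return hash & (pow2_size - 1);   /* pow2_size must be a power of two */
}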
I did some tests on different integer mixing functions and power-of-two hash tables. I tried the ones by Thomas Wang and Bob Jenkins, and something I constructed out of MurmurHash (references on Google). They all generate much less uniform distributions than modulo prime. When hashing strings (the content) the story is different; however, that only helps for the shared string table. I think it would be reasonable to conduct some more experiments...
arne
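For reference, one of the mixing functions of the kind I tested, reproduced here from memory (a widely circulated Thomas Wang 32-bit mix; the exact constants may differ from his published version):

#include <stdint.h>

static uint32_t wang_mix32(uint32_t key)
{
    key = ~key + (key << 15);  /* key = (key << 15) - key - 1 */
    key = key ^ (key >> 12);
    key = key + (key << 2);
    key = key ^ (key >> 4);
    key = key * 2057;          /* key = (key + (key << 3)) + (key << 11) */
    key = key ^ (key >> 16);
    return key;
}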
On Sat, 19 Mar 2011, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
And not least, xor is a lot faster. I've gathered that the general consensus today is that quick hash functions are much better than mathematically pleasing ones.
E.g. all those prime modulos for hash tables in pike are probably performance killers. We ought to replace them by plain power-of-two sizes and simple masking.
Interesting. I imagine integers can be tough in general since one cannot say anything about their distribution.
What about power-of-two tables where the hashes are based on pointers (shifted down a couple of bits)? That's the most common case in pike.
/.../ I think it would be reasonable to conduct some more experiments...
Yes, that'd definitely be a good start.
I tried it with the strings in all_constants(). I was only looking at the variance of the bucket size. So this was basically using ptr >> 2. I can dig up the code again and put together some statistics with different key sets to make a start.
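What ptr >> 2 amounts to, roughly (the shift amount and the power-of-two mask here are assumptions for illustration; allocation alignment makes the low pointer bits constant, so they are shifted off before picking a bucket):

#include <stdint.h>
#include <stddef.h>

static size_t pointer_bucket(const void *p, size_t pow2_size)
{
    uintptr_t v = (uintptr_t)p;
    return (size_t)((v >> 2) & (pow2_size - 1));  /* drop alignment bits, mask */
}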
On Sat, 19 Mar 2011, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Interesting. I imagine integers can be tough in general since one cannot say anything about their distribution.
What about power-of-two tables where the hashes are based on pointers (shifted down a couple of bits)? That's the most common case in pike.
/.../ I think it would be reasonable to conduct some more experiments...
Yes, that'd definitely be a good start.
I put together a script that aims to simulate collisions for different hashes. You can download it from http://laramies.com/hashtest.pike and it displays average absolute deviation and variance of the bucket fill. Feel free to change it and comment. I put in some rudimentary key sets, and the results basically say the same thing as I suggested before. I also added the hash from HashMap.java, which seems to be just as good and allows for power-of-two hash table sizes.
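The reported metrics are just the usual statistics over how full each bucket ended up; something along these lines (an illustration of the metric, not the script itself):

#include <stddef.h>
#include <math.h>

/* counts[i] = number of keys that landed in bucket i. */
static void bucket_stats(const size_t *counts, size_t nbuckets, size_t nkeys,
                         double *avg_abs_dev, double *variance)
{
    double mean = (double)nkeys / (double)nbuckets;
    double dev = 0.0, var = 0.0;
    for (size_t i = 0; i < nbuckets; i++) {
        double d = (double)counts[i] - mean;
        dev += fabs(d);
        var += d * d;
    }
    *avg_abs_dev = dev / (double)nbuckets;
    *variance = var / (double)nbuckets;
}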
Seems to me that HashMap.java is doing a good job and is very simple at the same time.
Yes, it does surprisingly well. Also, I did not include it in my initial tests, which is what led to my earlier conclusion. I also tried this with my last patch and random floats/ints, where basically all of these hash functions perform badly. However, I don't think we can do anything about that, and maybe it's also not a very important use case. I would therefore also propose using the one from HashMap.
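For reference, the supplemental hash from java.util.HashMap (JDK 6 era, written down from memory and transcribed to C): it XOR-shifts the value so that the low bits kept by a power-of-two mask also depend on the high bits.

#include <stdint.h>

static uint32_t hashmap_mix(uint32_t h)
{
    h ^= (h >> 20) ^ (h >> 12);
    return h ^ (h >> 7) ^ (h >> 4);
}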
On Sun, 20 Mar 2011, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Seems to me that HashMap.java is doing a good job and is very simple at the same time.
I added the hash function from HashMap and changed mapping.c to use power-of-two hash tables. You can get the patch from http://laramies.com/hash.patch
I benchmarked this against pike 7.9 using pike -x benchmark and a couple of benchmarks we wrote to test our critbit trees. In some benchmarks (like Mapping insert) it leads to about a 10% performance increase; others don't profit as much. I ran the benchmarks on my Atom and a Core i7.
Nice work. I suspect a lot of the time in mapping tests written in pike is spent outside the hashing and bucket lookup functions themselves. I.e. a total increase of 10% implies that the core parts are quite a lot faster.
That's just my guess though. To really make a good comparison one would have to write tests in C, but that's not necessary to convince me. I think your patch looks good for 7.9.
Just a couple of notes: find_next_power maybe fits better in stuff.c, and the new bit fiddling at the end of hash_svalue could use a comment on the origin and rationale behind it.
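My guess at what a find_next_power helper does, rounding a size up to the next power of two; the name matches the note above but the body is an assumption, not the patch itself:

#include <stdint.h>

static uint32_t find_next_power(uint32_t n)
{
    if (n < 2) return 1;
    n--;                 /* handle exact powers of two */
    n |= n >> 1;         /* smear the highest set bit downwards */
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    return n + 1;
}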
I will clean it up a little and add some comments. However, I am traveling right now, so it will probably have to wait until next week. If no one complains before then, I will commit it to 7.9.
I would also propose backporting the XOR for floats and 64-bit integers in hash_svalue to 7.8.
On Sun, 27 Mar 2011, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Nice work. I suspect a lot of the time in mapping tests written in pike is spent outside the hashing and bucket lookup functions themselves. I.e. a total increase of 10% implies that the core parts are quite a lot faster.
That's just my guess though. To really make a good comparison one would have to write tests in C, but that's not necessary to convince me. I think your patch looks good for 7.9.
Backport to 7.8 you say... I sort of skimmed the float hash discussion, could someone rerun the advantage with it for me real quick?
And please also give me your best estimation for chances of regression. :)
So the advantage, to me, is that we don't have everything from Float.MAX/16843009 upwards hashing to zero (and similarly for negative values in the other direction). I admit that it's maybe not too bad in practice, but I can imagine it happening to people.
As for regressions, I think that something using all bits and mod prime should not perform worse. In particular, it should be harder to produce collisions by accident.
The situation for int64 is maybe clearer. Right now the upper 32 bits are not used at all in the hash, and I suspect that collisions are more likely to actually happen there (when using bitmasks or something like that).
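A small illustration of the int64 point (not from the patch): a hash that keeps only the low 32 bits cannot tell apart keys that differ only in the upper word, while the XOR fold can.

#include <stdint.h>

static uint32_t truncate_hash(uint64_t v) { return (uint32_t)v; }
static uint32_t fold_hash(uint64_t v)     { return (uint32_t)(v ^ (v >> 32)); }

/* e.g. bitmask-style keys differing only in the upper word:
 *   truncate_hash(0x0000000100000001ULL) == truncate_hash(0x0000000200000001ULL)
 *   fold_hash(0x0000000100000001ULL)     != fold_hash(0x0000000200000001ULL)
 */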
On Wed, 30 Mar 2011, Peter Bortas @ Pike developers forum wrote:
Backport to 7.8 you say... I sort of skimmed the float hash discussion, could someone rerun the advantage with it for me real quick?
And please also give me your best estimation for chances of regression. :)
Fair enough.
Hm. I thought it was about the hash() function, but I see it doesn't accept anything other than strings anyway. I couldn't quite believe the internal hashing was this flawed... :p
A much better hash function would be to just reinterpret the bit pattern in the float as an integer, I think. Maybe xor'ing the high and low ends if the floats are longer than INT32 (which they typically are on 64-bit architectures).
Don't know if it could be a compat problem changing it though. Strictly speaking it shouldn't be, considering the doc for hash_value, but that's not necessarily the end of the story.