In my quest for a good design for multi-cpu pike I ran across this rather interesting design for a lock free hash table:
http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html
The most complete description of it is a taped talk:
http://video.google.com/videoplay?docid=2139967204534450862
The way it can be resized without locking is pretty cool. Several threads can even share the work to carry out the copying.
His design parameters for the hash table are a bit unorthodox though:
o Size is always a power of two (not prime).
o It's closed.
o Reprobing is linear.
o Keys are never deleted (a "resize" to the same size is necessary).
o It never shrinks.
The first three together are a bit scary if the hash function is bad, but maybe it's just me..
This could be put to good use for the shared string table, and probably in more places too. I think it'd be good to implement shrinking, though.
Avoiding power-of-two is something wikipedia ("Hash table") comments on:
Some older hashes are even worse, requiring table sizes to be a prime number rather than a power of two, again computing the bucket index as the hash value modulo the table size. In general, such a requirement is a sign of a fundamentally weak function; using a prime table size is a poor substitute for using a stronger function.
Add to that that prime sizes mean modulo operations, which are very expensive on modern CPUs. If I recall correctly the cycle count on x86_64 is in the hundreds, which obviously cannot compare to masking and shifting.
A good hash function is of course desirable, and a prime size is surely unnecessary then. But it's nice if the implementation behaves reasonably even for lousy hash functions, especially if the table is used for the ordinary mappings where it can be user supplied.
Here's a trick someone mentioned in the comments to calculate the modulo a bit quicker:
The thing is ... with a hash table, the table stays the same size for quite a while, so what you're taking the modulus with respect to stays the same for quite a while, so you can compute an integer reciprocal of the size when you resize the hash table, and then use the same reciprocal for a long time. So for a hash code of H and a table size of N (with reciprocal R = 2^32/N) you're looking at:
bucket = H - ((R*H)>>32)*N
bucket -= N if bucket >= N
OK, this is about ten times the cost of a simple AND, assuming you can grab just the upper half of a multiply, and cheap conditional execution (both of which I have on ARM). /.../
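Spelled out in C, the quoted trick could look something like this sketch (function and variable names are made up for illustration; R is only recomputed when the table is resized):

```c
#include <stdint.h>

/* Sketch of the reciprocal trick quoted above: precompute
   R = floor(2^32 / N) once per resize, then reduce 32-bit hash codes
   to a bucket index without a division on the fast path. */
static uint32_t mod_by_reciprocal(uint32_t h, uint32_t n, uint32_t r)
{
    /* q = (R*H) >> 32 equals floor(H/N) or is one too small. */
    uint32_t q = (uint32_t)(((uint64_t)r * h) >> 32);
    uint32_t bucket = h - q * n;
    if (bucket >= n)     /* correct the off-by-one case */
        bucket -= n;
    return bucket;
}
```

That's two multiplies, a shift and a conditional subtract per lookup, against a single (but much slower) division for a plain modulo.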
But I'm not sure. It could be a non-issue in practice.
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
A good hash function is of course desirable, and a prime size is surely unnecessary then. But it's nice if the implementation behaves reasonably even for lousy hash functions, especially if the table is
Is it?
used for the ordinary mappings where it can be user supplied.
It basically means that you try to improve on the hash function by adding one extra computational round (the prime modulo). This is a half-hearted fix at best. IMO if someone supplies a bad hash function, they can't expect the system to fix it. It would also mean that this "fix" might make perfectly good hash functions worse.
You're right that it's only putting some extra computation into it, but I don't think there is any way that a prime modulo would make any good hash function worse; the only problem is some wasted effort.
A problem is that I'm not sure of the quality of the hash functions we use internally either. The string hash is a homebrew (afaik), and in several other places pointers are used as hash values.
As for the string hash, I made an attempt to search for published alternatives, but I couldn't find any that hash in O(1) - all of them hash the whole string. I think an O(1) hash function is very important in our case. Anyway, it should be possible to sort this out.
Pointers are worse. They can contain all sorts of periodicity. An obvious one is that the lowest one or two bits always are zero due to alignment, but there might be more due to the malloc implementation and its allocation patterns, and you'd never know about it (well, in the case of all the block_alloc areas we do know a bit about it, but it's different for each and every one of them). A prime modulo is very effective to put all such concerns to rest. Maybe there are other smart ideas on how to make good hashes out of pointers; I haven't checked yet.
One more concern is also how it behaves for the ultimately bad hash function, i.e. one that returns the same value all the time. In that case the hash table is of course O(n) on lookup and store, but it shouldn't try to grow the table all the time due to a reprobe limit. I don't know if he found a clever way to avoid that.
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Pointers are worse. They can contain all sorts of periodicity. An
it's different for each and every one of them). A prime modulo is very effective to put all such concerns to rest. Maybe there are other
Is Pike doing a lot of pointer hashing? Incidentally, using a prime modulo to hash a pointer sounds fine to me, it just means that the modulo is part of the hash function. The only thing I find "bad design" is if the hash-table lookup forces a prime modulo, even if a good hash function has already been used. We should focus on improving the hash functions in general.
One more concern is also how it behaves for the ultimately bad hash function, i.e. one that returns the same value all the time. In that case the hash table is of course O(n) on lookup and store, but it shouldn't try to grow the table all the time due to a reprobe limit. I don't know if he found a clever way to avoid that.
I don't think that was discussed during the talk. However, pondering the problem for a brief moment, I can imagine something like the following:
Keep a relative live count per thread (to avoid thread contention). Increment it when inserting a new item, decrement it when deleting an item (Tombstone). Then whenever it seems necessary to grow the hashtable, quickly sum up all the live counters and you'll have a fairly accurate population count of the total hashtable. If the ratio is below a certain threshold, do not grow the table.
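A minimal single-threaded sketch in C of what those per-thread live counters could look like (the names, thread limit and the one-half threshold are my own assumptions; real code would need per-thread storage and at least atomic reads of the other threads' slots):

```c
#include <stddef.h>

#define MAX_THREADS 8    /* assumed fixed thread count, for illustration */

/* One live count per thread; each thread only ever writes its own
   slot, so inserts and deletes don't contend on a shared counter. */
static long live[MAX_THREADS];

static void count_insert(int tid) { live[tid]++; }  /* new key stored */
static void count_delete(int tid) { live[tid]--; }  /* key tombstoned */

/* Called only when the reprobe limit is hit: sum the counters and
   grow only if the table really is filling up.  A low fill ratio
   means the long reprobe chains come from a bad hash function, not
   from a full table, so growing wouldn't help. */
static int should_grow(size_t table_size)
{
    long total = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        total += live[i];
    return total * 2 >= (long)table_size;  /* grow if at least half full */
}
```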
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Pointers are worse. They can contain all sorts of periodicity. An
it's different for each and every one of them). A prime modulo is very effective to put all such concerns to rest. Maybe there are other
Is Pike doing a lot of pointer hashing?
Well, for everything except integers, floats and objects with lfun::__hash() (i.e. most objects, strings, programs, etc.) used as an index in a mapping (cf svalue.c:hash_svalue())...
Incidentally, using a prime modulo to hash a pointer sounds fine to me, it just means that the modulo is part of the hash function. The only
Multiplying with a suitable prime and then shifting and masking would most likely be of similar strength, but be much cheaper cpu-wise.
thing I find "bad design" is if the hash-table lookup forces a prime modulo, even if a good hash function has already been used. We should focus on improving the hash functions in general.
Agreed.
Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
Incidentally, using a prime modulo to hash a pointer sounds fine to me, it just means that the modulo is part of the hash function. The only
Multiplying with a suitable prime and then shifting and masking would most likely be of similar strength, but be much cheaper cpu-wise.
Silly of me to forget, yes. The mother of all PRNGs: seed_new = seed_old * 67067 (in unsigned 32-bit overflowing arithmetic) should do fine for most applications.
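As a sketch in C (the function name and the choice of taking the top bits are my own; the 67067 multiplier is the one mentioned above):

```c
#include <stdint.h>

/* Sketch of the multiply-then-shift pointer hash discussed above.
   The multiply (in overflowing unsigned 32-bit arithmetic) smears the
   constant low alignment bits of the pointer across the whole word;
   taking the topmost bits as the bucket index then behaves reasonably
   even with a power-of-two table size. */
static uint32_t hash_pointer(const void *p, unsigned table_bits)
{
    /* table_bits must be between 1 and 32 for the shift to be valid */
    uint32_t h = (uint32_t)(uintptr_t)p * 67067u;
    return h >> (32 - table_bits);   /* top table_bits bits as bucket */
}
```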
/.../ We should focus on improving the hash functions in general.
True. Grubbas way of getting rid of periodicities in the pointers sounds reasonable.
/.../ Then whenever it seems necessary to grow the hashtable, quickly sum up all the live counters and you'll have a fairly accurate population count of the total hashtable. If the ratio is below a certain threshold, do not grow the table.
Yes, that ought to work. The cost is then that all threads start summing all those counters more or less every time the reprobe limit is hit, thus leading to frequent cache line synching on the counters. But then again, if the hash function is bad, things will get bad anyway. Maybe one could also consider ways to dynamically raise the reprobe limit.
We'd need a somewhat accurate sizeof() anyway if it should replace the standard mappings(*), and to implement shrinking.
Speaking of counters, that brings me to another issue: Mr Click very conveniently avoids the problem of freeing the old table after a resize, since he leaves it to the Java gc. It's not that simple for us. I guess we'll have to keep similar concurrent access counters for the references to the hash table blocks. Even so, it's still not trivial to free it in a safe way.
That concurrent access counter is also something that will see use in many places (although the general refcounting will probably be disabled for most shared data), so that in itself is something that should be implemented with care. I'm thinking something like this:
An array of counters is allocated for every thread, and there is a function that allocates a counter by returning a free index in those arrays. So every thread has its own counter array in the local cache in r/w mode, while the others only need (more or less) occasional read access. The allocation and freeing of counters should also be lock-free, although that's not strictly necessary.
That should work, I think. One other wisdom from Mr Click is that one should never put concurrent access counters in the same memory blocks that they govern. I don't think we can afford to follow that advice for all refcounters, though.
*) Something I actually consider somewhat optional - the plan is first and foremost to implement an efficient concurrent hash table class that can be used instead of the standard mapping in places where it is necessary. If the standard mappings can be made lock free too then all the better, but that depends on whether their semantics can be kept. The most important hash tables are actually the string hash table and the various other internal hash tables that have to allow concurrent use.
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Speaking of counters, that brings me to another issue: Mr Click very conveniently avoids the problem of freeing the old table after a resize, since he leaves it to the Java gc. It's not that simple for us. I guess we'll have to keep similar concurrent access counters for the references to the hash table blocks. Even so, it's still not trivial to free it in a safe way.
I don't think it's that complicated. After the copy has completed, the old table doesn't contain any references anymore, and can be freed without actually traversing it.
An array of counters is allocated for every thread, and there is a function that allocates a counter by returning a free index in those arrays. So every thread has its own counter array in the local cache in r/w mode, while the others only need (more or less) occasional read access. The allocation and freeing of counters should also be lock-free, although that's not strictly necessary.
Sounds good.
I don't think it's that complicated. After the copy has completed, the old table doesn't contain any references anymore, and can be freed without actually traversing it.
It's not enough to verify that the table isn't useful anymore; one must ascertain that no other thread is still looking inside it. For that a (concurrent) ref counter is required, and something like this is still not safe when releasing it:
if (decrement_counter (cnt) == 0) {
  // Race from here
  void *p = old_array_ptr;
  old_array_ptr = NULL;
  // to here.
  free (p);
}
To sort that out I think it's best to take the same approach as Cliff Click and reason about it as a state machine with states made up of {old_array_ptr, cnt}:
{NULL, 0}
{NULL, >= 1}
{*, 0}
{*, >= 1}
We did it on a whiteboard today - a good exercise.
And of course, cnt (the index/"handle" of the concurrent counter) cannot be stored inside the old hash table block - it has to be stored alongside the pointer to it.
So it's solvable, but still requires some thought.
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
In my quest for a good design for multi-cpu pike I ran across this rather interesting design for a lock free hash table:
Very impressive.
His design parameters for the hash table are a bit unorthodox though:
o Size is always a power of two (not prime).
o It's closed.
o Reprobing is linear.
This is rather cache-friendly, so it seems like a Good Thing to me.
o Keys are never deleted (a "resize" to the same size is necessary).
Keys can actually be deleted, that's what the Tombstone key is for.
o It never shrinks.
The algorithm allows for shrinking the same way it allows for growth, it's just that he hasn't implemented it. The reason why (as he states it) is that you'd need to be careful that you don't get one thread shrinking and another growing and another then shrinking again, and so forth. Other than that, the method for growing and shrinking basically is the same.
This could put to good use for the shared string table, but probably more places too. I think it'd be good to implement shrinking, though.
It looks like a superior hash-table design overall, is there a reason why you wouldn't want to use it to replace any of the existing hash-tables?
This is rather cache-friendly, so it seems like a Good Thing to me.
Yes indeed. Linear reprobe is probably a winner overall.
o Keys are never deleted (a "resize" to the same size is necessary).
Keys can actually be deleted, that's what the Tombstone key is for.
That just means they don't have a valid value. The end user won't see the difference, but the key in the table is never deleted so that the slot can be reused. That's what I meant.
This means that the worst use case is a table with short-lived entries that is fed new keys all the time (deleting and readding the same key is not a problem). Something like that isn't entirely uncommon in real-world caches, but there the turnaround is on a scale of several minutes, so the resize overhead is probably small overall.
The algorithm allows for shrinking the same way it allows for growth, it's just that he hasn't implemented it. The reason why (as he states it) is that you'd need to be careful that you don't get one thread shrinking and another growing and another then shrinking again, and so forth. Other than that, the method for growing and shrinking basically is the same.
Double when full, shrink when 3/4 empty is usually a reasonable policy to avoid that sort of thing.
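The hysteresis in that policy can be sketched in C (the thresholds, the minimum size and the function name are my own assumptions; the point is only that the grow and shrink boundaries don't touch):

```c
#include <stddef.h>

/* Sketch of the policy above: grow when full, shrink only when three
   quarters empty.  Because the two thresholds are far apart, a table
   whose population hovers near one boundary can't be bounced back and
   forth between grow and shrink by different threads. */
static size_t next_size(size_t size, size_t live)
{
    if (live >= size)                     /* full: double */
        return size * 2;
    if (live * 4 <= size && size > 16)    /* 3/4 empty: halve */
        return size / 2;
    return size;                          /* in between: leave it alone */
}
```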
pike-devel@lists.lysator.liu.se