Here are two useful references:
http://bretm.home.comcast.net/hash/
http://burtleburtle.net/bob/hash/
RE computation cost: Generalized integer multiplication and integer
division both used to be expensive. On modern processors a
multiplication is fully pipelined and comparable in latency to an L1
data cache hit (~2 to 4 cycles). Most modern implementations of
division run the divide circuit at a multiple (typically 2x) of the
CPU clock and implement early out. Further, division is nearly
always an autonomous, non-pipelined logic block that, once initiated,
runs asynchronously until it is ready to deliver its result. Bottom
line: on any contemporary out-of-order x86 chip from either Intel or
AMD the cost of either of these instructions is negligible.
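For a rough feel for the numbers, a minimal timing sketch comparing a
dependent chain of multiplies against a dependent chain of divides
(the divisor is read through a volatile so the compiler cannot
strength-reduce the divide to a multiply):

#include <stdio.h>
#include <time.h>

/* Dependent multiply chain: each iteration needs the previous result. */
static unsigned chain_mul(unsigned x, long n)
{
    for (long i = 0; i < n; i++)
        x = x * 2654435761u + 1u;
    return x;
}

/* Dependent divide chain; d is unknown at compile time, so the
   compiler must emit a real divide instruction. */
static unsigned chain_div(unsigned x, long n, unsigned d)
{
    for (long i = 0; i < n; i++)
        x = x / d + 2654435761u;
    return x;
}

int main(void)
{
    const long N = 100000000L;       /* 1e8 iterations */
    volatile unsigned divisor = 7u;  /* defeats constant propagation */

    clock_t t0 = clock();
    unsigned a = chain_mul(0x12345678u, N);
    clock_t t1 = clock();
    unsigned b = chain_div(0x12345678u, N, divisor);
    clock_t t2 = clock();

    /* Print the results so the loops cannot be optimized away. */
    printf("mul: %.2fs  div: %.2fs  (%u %u)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    return 0;
}

Exact numbers vary by microarchitecture; treat the output as a
ballpark, not a benchmark.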
RE general hashing concepts: I find it helpful to think of a hash
table lookup as proceeding in steps. The first step is to generate a
fixed-size digest of the (possibly variable-length) key. On a 32-bit
machine this digest is most commonly a 32-bit (unsigned) integer. The
quality of a digest is related to how much of the entropy in the key
it is able to preserve.
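For example, a minimal sketch of step one using the well-known FNV-1a
digest (just one common choice; the pages linked above survey many
alternatives):

#include <stddef.h>
#include <stdint.h>

/* FNV-1a: a simple, widely known 32-bit digest for variable-length
   keys. Shown only to illustrate "step one". */
static uint32_t fnv1a_32(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t h = 2166136261u;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                 /* fold in one byte of the key */
        h *= 16777619u;            /* FNV prime */
    }
    return h;
}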
The second step is to map the digest into a bucket selector. Nearly
universally this is a modulo operation: take the remainder of the
digest divided by the number of buckets. A particularly poor approach
is to use a power-of-2 number of buckets so as to reduce the modulo
operation to simple masking, because that discards all the entropy in
the masked-out bits. A _much_ better approach is to pick the number
of buckets to be a prime number -- the remainder of division by a
prime is influenced by _every_ bit in the digest.
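A minimal demonstration: two digests that differ only in their high
bits collide under a power-of-2 mask but separate under a prime
modulus:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0x0001ABCDu;   /* differ only in the high 16 bits */
    uint32_t b = 0xFFFEABCDu;

    /* Power-of-2 bucket count: modulo reduces to a mask and the
       high bits vanish -- a and b collide. */
    printf("mask 1024:  %u %u\n", a & 1023u, b & 1023u);

    /* Prime bucket count: every bit influences the remainder. */
    printf("prime 1021: %u %u\n", a % 1021u, b % 1021u);
    return 0;
}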
(When I need to implement a dynamic hash table I always include a
statically initialized array of primes chosen to give me the growth
pattern I desire.)
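For example, a table along these lines, assuming a roughly 2x growth
pattern (the particular primes are only illustrative):

#include <stddef.h>

/* Primes chosen to roughly double at each growth step. */
static const size_t bucket_primes[] = {
    53, 97, 193, 389, 769, 1543, 3079, 6151,
    12289, 24593, 49157, 98317, 196613, 393241
};

/* Return the smallest tabled prime that can hold `needed` buckets,
   or the largest one if the table is exhausted. */
static size_t next_bucket_count(size_t needed)
{
    size_t n = sizeof bucket_primes / sizeof bucket_primes[0];
    for (size_t i = 0; i < n; i++)
        if (bucket_primes[i] >= needed)
            return bucket_primes[i];
    return bucket_primes[n - 1];
}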
/john