Re: [dev] The mysterious 31 from Robert Ransom on 2010-08-04 (dev mail list archive)

From: Robert Ransom <rransom.8774_AT_gmail.com>
Date: Wed, 4 Aug 2010 01:55:07 -0700

On Tue, 3 Aug 2010 23:08:49 -0400
Kris Maglione <maglione.k_AT_gmail.com> wrote:

> On Tue, Aug 03, 2010 at 11:03:50PM -0400, Kris Maglione wrote:
> >There isn't much science to hash functions,
>
> Let me rephrase that, given the nature of this list. There's not
> much science to simple string hash functions.

More correctly, if all you're using a hash function for is assigning
strings to buckets in a small hash table, you can ignore most of the
mathematics involved.

If you have something fancier in mind, such as the Bentley-McIlroy
repeated substring detection algorithm, you may need to consider the
mathematical behaviour of your hash function. (The Bentley-McIlroy
algorithm computes a hash function over each contiguous N-character
substring of a string (which I will call a document here), and saves
every N-th substring (if N=10, it saves the hashes of d[0:9],
d[10:19], ...) in a hash table. If you encounter an N-char substring
which is already in your hash table, you have a repeated substring;
extend the match both backward and forward to find a repeated substring
of maximal length. This process is guaranteed to find all repeated
substrings of length 2N-1 or greater. Computing the hash function over
each N-character substring is tolerably efficient because they use a
polynomial hash function (almost, but not exactly like Mr. Todd's
example) in which one can quickly compute h(d[m:n]) from d[m], d[n],
and h(d[m-1:n-1]).)

Robert Ransom

application/pgp-signature attachment: signature.asc

Received on Wed Aug 04 2010 - 10:55:07 CEST

This archive was generated by hypermail 2.2.0 : Wed Aug 04 2010 - 11:00:04 CEST