Re: [dev] Some misc tools that could help others

From: NRK <nrk_AT_disroot.org>
Date: Thu, 22 Sep 2022 00:21:16 +0600

On Wed, Sep 21, 2022 at 04:24:28PM +0000, Hadrien Lacour wrote:
> The other "interesting" one is genhtab, an alternative to gperf that is much
> simpler to use and produces smaller binaries for big tables (but slower, cf
> genhtab_bench) using a simple chained hashing table (with some tricks due to
> being built AOT) using fnv1a; I'll probably try with XXH3 soon, to see if the
> speed issue can be mitigated without too much binary bloating.

        const uint32_t bkt = hash & (ht.len - 1);

Due to the mul being the last step in fnv1a, the entropy ends up in the
high bits, so you want to be taking the bucket index from the high bits
and not from the low ones.

Alternatively you can change the hash function to mix the high bits with
the low ones just before returning:

        hash ^= hash >> 33; /* for the 64bit version. */
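
For the 32-bit hash used in the snippet above, the analogous fold would
be something along the lines of:

        hash ^= hash >> 16; /* for the 32bit version. */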

But I'd much rather just take the bucket index from the high bits
instead.
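
For a power-of-two table that could look something like this (lg_len
here is just a placeholder for the precomputed log2 of the table
length, it's not a genhtab name):

        const uint32_t bkt = hash >> (32 - lg_len); /* top lg_len bits */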

And for statically generated hash-tables (as well as in general) I tend
to prefer open-addressing rather than chaining. For the probe sequence
I prefer linear probing with robin-hood insertion[0] for statically
generated tables.

As for why: linear probing nicely exploits spatial locality[1] and
robin-hood can reduce the maximum probe length by alleviating
clustering[2]. This leads to both faster worst-case lookups and less
memory usage, since shorter worst-case probes let you get away with a
denser table.
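
As a rough sketch, robin-hood insertion into a linear-probed,
power-of-two table could look like this. The struct, the "hash 0 means
empty slot" convention and the names are made up for illustration, not
taken from genhtab; mask is ht.len - 1 and the hash is assumed to
already have its high bits mixed in:

        typedef struct { uint32_t hash; const char *key; } Slot;

        static void
        rh_insert(Slot *tab, uint32_t mask, Slot ent)
        {
                uint32_t i = ent.hash & mask;
                uint32_t dist = 0; /* ent's distance from its home bucket */

                for (;;) { /* assumes the table is never completely full */
                        uint32_t rdist;

                        if (tab[i].hash == 0) { /* empty, claim it */
                                tab[i] = ent;
                                return;
                        }
                        /* how far the resident is from its own home */
                        rdist = (i - (tab[i].hash & mask)) & mask;
                        if (rdist < dist) {
                                /* resident is "richer": swap, then keep
                                 * probing with the displaced entry */
                                Slot tmp = tab[i];
                                tab[i] = ent;
                                ent = tmp;
                                dist = rdist;
                        }
                        i = (i + 1) & mask;
                        ++dist;
                }
        }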

For run-time tables where you don't know the inputs, double-hashing
might be a better approach[3]. But for statically generated ones I
wouldn't use double-hashing, since you can easily change the hash
function until it's producing good results (e.g. for fnv1a, changing
the starting state or the multiplier, or both[4]).
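
For example, a generator could parameterize the starting state like
this (a sketch; 2166136261 would be the usual initial basis and
16777619 is the standard 32-bit fnv prime):

        static uint32_t
        fnv1a32(const char *s, uint32_t basis)
        {
                uint32_t h = basis;

                for (; *s != '\0'; ++s)
                        h = (h ^ (unsigned char)*s) * 16777619u;
                return h;
        }

and then just retry different basis (and/or prime) values until the
worst-case probe length looks acceptable.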

Also one more trick that can reduce memory usage is to eliminate the
per-string pointer overhead and use an array of arrays, i.e. using
`char [][n]` instead of `char *[]`. This isn't _guaranteed_ to reduce
mem usage though, so you'll have to do some calculation to figure it
out, e.g.:

        /* LEN(a) == sizeof (a) / sizeof (a)[0] */
        size_t i, sum;

        /* approximate mem usage for `char *[]` */
        for (sum = 0, i = 0; i < LEN(input); ++i)
                sum += (strlen(input[i]) + 1) + sizeof (char *);
        /* vs char [][max_input_strlen + 1] */
        sum = LEN(input) * (max_input_strlen + 1);
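
For concreteness, the two layouts side by side (the names and strings
here are just placeholders):

        /* char *[]: one pointer per string plus the string bytes */
        static const char *keys_ptr[] = { "if", "else", "while" };
        /* char [][n]: fixed-size rows, no pointers */
        static const char keys_flat[][6] = { "if", "else", "while" };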

Off the top of my head, these would be some of the major tricks I use
when building statically generated hash-tables. And yeah, I prefer
building them on a case-by-case basis instead of using some pre-made
generator :)

[0]: https://programming.guide/robin-hood-hashing.html
[1]: https://en.wikipedia.org/wiki/Locality_of_reference
[2]: https://en.wikipedia.org/wiki/Primary_clustering
[3]: https://nullprogram.com/blog/2022/08/08/
[4]: https://github.com/jarun/nnn/blob/de3ad1b14618392753051959b2e7ac52f70252a1/src/icons-hash.c#L124-L125

- NRK