Re: [dev] Some misc tools that could help others

From: Hadrien Lacour <>
Date: Wed, 21 Sep 2022 20:08:06 +0000

On Wed Sep 21, 2022 at 8:21 PM CEST, NRK wrote:
> On Wed, Sep 21, 2022 at 04:24:28PM +0000, Hadrien Lacour wrote:
> > The other "interesting" one is genhtab, an alternative to gperf that is much
> > simpler to use and produces smaller binaries for big tables (but slower, cf
> > genhtab_bench) using a simple chained hashing table (with some tricks due to
> > being built AOT) using fnv1a; I'll probably try with XXH3 soon, to see if the
> > speed issue can be mitigated without too much binary bloating.
> const uint32_t bkt = hash & (ht.len - 1);
> Due to mul being the last step in fnv1a, the entropy is stored in the
> high bits, so you want to be masking the high bits and not the low ones.
Huh, that makes sense; I was somehow counting on it having a good distribution
in the low bits as well.

> And for statically generated hash-tables (as well as in general) I tend
> to prefer open-addressing rather than chaining. For probing I prefer
> linear probe with robin-hood insertion[0] for statically generated
> tables.
Never liked the conceptual complexity of open addressing (and with chaining you
can lock a single bucket in read-write multi-threaded scenarios), but it could
well be better, indeed. Another thing I like about chained hashing is that you
can swap the bucket type according to your read/write performance needs
(vector, skip list, AVL tree, etc.).

> As for why: linear probing nicely exploits spatial locality[1] and
> robin-hood can reduce the max probe size by alleviating clustering[2].
> This leads to both faster worst-case lookups as well as less memory
> usage.
In this specific case, there's not much cache improvement since the bucket
contents are stored sequentially, without pointers except for the keys.

> For run-time tables where you don't know the inputs, double-hashing
> might be a better approach[3]. But for statically generated one, I
> wouldn't use double-hashing since you can easily change the hash
> function until it's producing good results (e.g for fnv1a changing the
> starting state or the multiplier, or both[4]).
> Also one more trick that can reduce memory usage is to eliminate pointer
> overhead and use an array of arrays, i.e using `char [][n]` instead of
> `char *[]`. This isn't _guaranteed_ to reduce mem usage though, so
> you'll have to do some calculation to figure it out, i.e:
> /* approximate mem usage for `char *[]` */
> for (sum = 0, i = 0; i < LEN(input); ++i)
>         sum += (strlen(input[i]) + 1) + sizeof (char *);
> /* vs char [][max_input_strlen + 1] */
> sum = LEN(input) * (max_input_strlen + 1);
That's a very special case, but worth thinking about, indeed.

To be honest, performance wasn't the focus, since I can't imagine beating
gperf -lC.
Received on Wed Sep 21 2022 - 22:08:06 CEST

This archive was generated by hypermail 2.3.0 : Wed Sep 21 2022 - 22:12:08 CEST