On Sun, 29 May 2022 13:48:49 -0400
LM <lmemsm_AT_gmail.com> wrote:
Dear LM,
> I like that point. Not a fan of glib and I try to avoid software
> that uses it.
>
> Don't know how good they are, but I've run across several lighter
> utf-8 C libraries:
> https://github.com/cls/libutf
> https://github.com/JuliaStrings/utf8proc
> https://github.com/skeeto/branchless-utf8
> https://github.com/sheredom/utf8.h
> https://github.com/JulienPalard/is_utf8
>
> I wrote my own and use it, so I haven't tested these. Thought they
> were interesting though.
Having dived deep into UTF-8 and Unicode, I can at least say that
libutf8proc has an unsafe UTF-8 decoder, as it doesn't reject overlong
encodings. There are multiple other pitfalls as well.
I can shamelessly recommend my UTF-8 codec[0], which is part of my
library libgrapheme[1]. It also lets you count grapheme clusters
directly (i.e. visible character units made up of one or more
codepoints). libutf8proc offers grapheme cluster counting as well
(among other things, but with the aforementioned unsafe UTF-8 decoder)
and used to be the fastest library out there, but with a few tricks
(much smaller LUTs) I managed to make libgrapheme twice as fast.
I did a lot of benchmarking and tweaking and don't see any more room
for improvement in the codec, given that you need branches for all the
edge cases. The branchless UTF-8 decoder is very interesting, but may
lead to a buffer overrun, as it reads a fixed number of bytes
regardless of where the input ends.
With best regards
Laslo
[0]: https://git.suckless.org/libgrapheme/file/src/utf8.c.html
[1]: https://git.suckless.org/libgrapheme/
Received on Mon May 30 2022 - 10:55:55 CEST