Re: [dev] Different versions of suckless libutf

From: Connor Lane Smith <cls_AT_lubutu.com>
Date: Sat, 18 Jun 2016 08:43:01 +0100

Hi all,

Following this past conversation, I decided to reinstate rune validity
checks in libutf. Since people seem to be using my repo as a
submodule, I decided it was best to cater for that (somewhat
questionable) use case.

> I would have liked to have separated UTF-8 and Unicode support into two
> separate libraries. Unicode has changed the definitions of valid and
> invalid codepoints a number of times, whilst UTF-8 has remained as it
> is, unchanging. Likewise, the current version of Unicode ought not to
> be necessary merely to parse UTF-8 sequences. However, libutf is
> clearly expected to do this, and I think adding another library as
> a dependency would undermine the appeal of a minimalist UTF-8 library.
>
> It's not a very happy situation though, since attempting to catch all
> possible sources of invalid runes, rather than only those that are truly
> malformed UTF-8, would require much more code if it were to detect them
> at the earliest possible opportunity, as is done with things like
> overlong encodings. So my solution has been to treat those as a separate
> class of error, and to detect validity of the rune, as opposed to the
> UTF-8 sequence, as a matter of postprocessing.
>
> As I say, this isn't a happy situation, but I think this is the best
> compromise between those mortal enemies, pragmatism and idealism.

So, to reiterate the above, I've separated out that check, so there
are two distinct classes of error: UTF-8 errors, and Unicode errors.
The former, which are malformed UTF-8 sequences, are detected at the
earliest instant, whilst the latter, which are just invalid according
to the Unicode consortium, are detected only after the rune value has
been unpacked. I think that's the best compromise, such as it is.
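
To illustrate the split, here is a minimal sketch. It is not the code
in libutf: the names decode_utf8 and rune_is_valid are hypothetical,
the decoder is assumed to accept sequences of up to four bytes, and the
postprocessing step is assumed to reject only surrogates and values
above U+10FFFF. The decoder reports malformed UTF-8 (bad lead byte,
truncation, bad continuation byte, overlong form) as soon as the
offending byte is seen, while the rune check runs separately on the
unpacked value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune;	/* hypothetical stand-in for a rune type */

/*
 * Decode one sequence from s[0..n). Returns the number of bytes
 * examined and sets *malformed to 1 for a UTF-8 error, else 0.
 * Only the structure of the encoding is checked here.
 */
static size_t
decode_utf8(const unsigned char *s, size_t n, Rune *r, int *malformed)
{
	/* smallest value that legitimately needs a sequence of each length */
	static const Rune min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	Rune x;

	assert(s != NULL && r != NULL && malformed != NULL);
	*malformed = 1;
	*r = 0xFFFD;	/* replacement character on error */
	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*r = s[0];
		*malformed = 0;
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; x = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; x = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; x = s[0] & 0x07;
	} else {
		return 1;	/* stray continuation or invalid lead byte */
	}
	for (i = 1; i < len; i++) {
		if (i >= n || (s[i] & 0xC0) != 0x80)
			return i;	/* truncated or bad continuation */
		x = x << 6 | (s[i] & 0x3F);
		/* after the second byte we already know if it is overlong */
		if (i == 1 && (x << 6 * (len - 2)) < min[len])
			return i + 1;
	}
	*r = x;
	*malformed = 0;
	return len;
}

/* Unicode-level check, applied to the unpacked value afterwards. */
static int
rune_is_valid(Rune r)
{
	if (r >= 0xD800 && r <= 0xDFFF)
		return 0;	/* surrogate */
	if (r > 0x10FFFF)
		return 0;	/* beyond the Unicode range */
	return 1;
}
```

With this arrangement the bytes ED A0 80 decode without a UTF-8 error
to 0xD800, and it is only rune_is_valid that rejects the result,
whereas C0 80 is rejected by the decoder itself as an overlong form.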

Thanks,
cls
Received on Sat Jun 18 2016 - 09:43:01 CEST
