Re: [dev] Different versions of suckless libutf from Connor Lane Smith on 2016-06-01 (dev mail list archive)

From: Connor Lane Smith <cls_AT_lubutu.com>
Date: Wed, 1 Jun 2016 19:36:11 +0100

On 1 June 2016 at 18:43, Kamil Cholewiński <harry666t_AT_gmail.com> wrote:
> The 95% use case here is handling UTF8-encoded Unicode text. Secure by
> default should be the norm, not a magic flag, not buried in a readme.

Obviously nobody is arguing for magic flags or burying things in a readme.

> If you need to encode an arbitrarily large integer into a stream of
> bytes, then use a library specifically designed for encoding arbitrarily
> large integers into streams of bytes.

Nor is anyone arguing about arbitrarily large integers.

> Yes, we're making up problems.

I think you missed the point.

The problem is not about what needs to be done, but about who needs to
do what. You're saying that libutf, a UTF-8 library, should do Unicode
validation. I am saying that Unicode validation is up to a Unicode
library, and that a UTF-8 library should do nothing but parse UTF-8.
If a UTF-8 stream is invalid, there are two possible sources for the
fault. One is that it contains the bytes 0xFE or 0xFF, or an overlong
sequence, in which case it is the UTF-8 that is at fault. The other is
that it decodes cleanly to something that is not a valid Unicode
character, in which case it is not the UTF-8 that is at fault, but
rather the Unicode -- and whether that is so depends on the current
Unicode standard.
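
To make the first kind of fault concrete, here is a rough sketch of a
decoder that checks the encoding and nothing else. It is not libutf's
code: the name utf8_decode_raw and its return convention are made up
for illustration, and it assumes the original 1-6 byte form of UTF-8,
under which 0xFE and 0xFF are the bytes that can never appear.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not libutf's API.
 * Decode one sequence from s[0..n). Returns the number of bytes
 * consumed and stores the value in *cp, or returns 0 on an
 * encoding-level fault. No Unicode-level checks are made.
 */
static size_t
utf8_decode_raw(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* min[len] is the smallest value that needs len bytes */
	static const uint32_t min[] = {
		0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
	};
	size_t len, i;
	uint32_t c;

	if (n == 0)
		return 0;

	if (s[0] < 0x80) {
		len = 1; c = s[0];
	} else if (s[0] < 0xC0) {
		return 0;               /* stray continuation byte */
	} else if (s[0] < 0xE0) {
		len = 2; c = s[0] & 0x1F;
	} else if (s[0] < 0xF0) {
		len = 3; c = s[0] & 0x0F;
	} else if (s[0] < 0xF8) {
		len = 4; c = s[0] & 0x07;
	} else if (s[0] < 0xFC) {
		len = 5; c = s[0] & 0x03;
	} else if (s[0] < 0xFE) {
		len = 6; c = s[0] & 0x01;
	} else {
		return 0;               /* 0xFE or 0xFF */
	}

	if (len > n)
		return 0;               /* truncated sequence */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;       /* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	if (c < min[len])
		return 0;               /* overlong encoding */

	*cp = c;
	return len;
}
```

Note that this happily returns 0xD800 for the bytes ED A0 80. Under
the split described here, deciding that a surrogate is unacceptable
would be somebody else's job.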

So what we're talking about is what a UTF-8 library should do: should
it validate Unicode, or just UTF-8? The distinction is that if we have
a particular interface for Unicode concerns, like surrogates or
graphemes, then only that interface needs to track the latest Unicode
standard, as Ben explained, whilst the interfaces for handling UTF-8
or UTF-32 alone can be fixed, unchanging. This encourages a separation
of concerns, which makes bugs less likely (as Unicode is a moving
target), and also reduces bit rot (a UTF-8 library will not stop being
valid if Unicode changes).

As I said, the fundamental question is what libutf should actually do.
Is it a UTF-8 library, or a Unicode library? It may be both, but then
there is a strong argument that it should also support everything else
Unicode requires, which is an awful lot. My recent inclination has
been towards supporting only the raw encoding, and then a higher-level
interface would handle the Unicode validation as well as everything
else specific to this version of Unicode. That way we could tackle
UTF-8 without having to bother with all of the other craziness. It
would be a fixed, static library, without having to track the
standard.

Now, if someone wants to deal with Unicode then they can use a
Unicode library rather than just a UTF-8 library, since all the latter
does is encode and decode. Everything that is specific to a Unicode
standard, and which holds no matter which encoding we use -- UTF-8,
UTF-32, UTF-1, UTF-7, etc. -- can be put in a separate library. Thus
Unicode validation lives in one place, not distributed amongst several
interfaces which may be updated at different rates, lag behind the
current Unicode version, and so on. Hence, a separation of concerns.
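
As a sketch of what "one place" might look like, the check below works
on decoded code points, so it does not care which encoding they came
from. The function names are hypothetical and this is only the
simplest possible validation (surrogates and the upper bound of the
code space); a real Unicode library would have far more to say.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not part of libutf.
 * Returns 1 if cp is a Unicode scalar value, 0 otherwise.
 */
static int
unicode_is_scalar(uint32_t cp)
{
	if (cp > 0x10FFFF)
		return 0;       /* beyond the Unicode code space */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;       /* UTF-16 surrogate, not a character */
	return 1;
}

/*
 * Returns the index of the first code point in s[0..n) that fails
 * the check above, or n if every code point passes. The input is
 * already decoded, e.g. a UTF-32 buffer or the output of any UTF-8
 * decoder.
 */
static size_t
unicode_first_invalid(const uint32_t *s, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!unicode_is_scalar(s[i]))
			break;
	return i;
}
```

A UTF-8 front end and a UTF-32 front end could both feed their output
through the same function, so there would be only one copy of the
rule to keep correct.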

So the question is whether libutf is meant to deal only with UTF-8
(which is constant), or other Unicode features too (which are
dynamic). The choice is essentially between stability and
convenience. As I say, I've not yet made up my mind, but I don't think
the problem is made up. Maybe we'll end up deciding to place libutf
somewhere between the two, rejecting surrogates and values over
0x10FFFF but stopping short of supporting the character classes of
this specific Unicode version (which has changed several times already
since I first wrote libutf). At the very least, I think the discussion
is worth having.

cls
Received on Wed Jun 01 2016 - 20:36:11 CEST