Re: [dev] Different versions of suckless libutf from Connor Lane Smith on 2016-05-31 (dev mail list archive)

From: Connor Lane Smith <cls_AT_lubutu.com>
Date: Tue, 31 May 2016 19:33:12 +0100

On 31 May 2016 at 18:43, FRIGN <dev_AT_frign.de> wrote:
> as a quick note, the sbase libutf is probably the most feature-rich one.
> The version by cls suffers from multiple issues, even though it might
> be the most recent.

Strictly speaking they're all by me, since I started it (and sbase) in
the first place. But there we are.

> I am currently working on a new libutf which is much simpler, much
> more secure (de/encoder) and actually gets the grapheme handling right.

One of the reasons I'm not pushing for any particular solution to the
fragmentation problem is that I'm not sure what libutf should actually
do. There are three components that are distinguishable in the Plan 9
API, which are UTF-8 (runetochar, chartorune, utf*, etc), UTF-32
(runestr*), and Unicode (is*rune, etc).

The trouble is I don't think it's necessary for a single library to do
all of these things. UTF-8 is just an encoding of 31-bit integers,
and UTF-32 is another encoding. The stuff specific to Unicode, which
requires the latest Unicode database and all that, is really a
separate issue -- as is the rejection of certain values, like
surrogates or values over 0x10FFFF, both of which are only invalid
because of the braindead UTF-16 encoding. And grapheme handling is
another thing which has nothing actually to do with UTF.
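
To make that split concrete, here is a minimal sketch in C, not code
from any libutf version. The names utf8_encode31 and is_unicode_scalar
are hypothetical. The encoder treats UTF-8 purely as an encoding of
31-bit integers (the original 1 to 6 byte scheme), and the Unicode
range rules live in a separate predicate that a caller may or may not
apply.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a 31-bit integer as UTF-8 using the
 * original 1..6 byte scheme. Returns the number of bytes written to
 * out, or 0 if cp does not fit in 31 bits. No Unicode rules here. */
size_t
utf8_encode31(uint32_t cp, unsigned char out[6])
{
	size_t len, i;

	if (cp > 0x7FFFFFFF)
		return 0;
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800)          len = 2;
	else if (cp < 0x10000)   len = 3;
	else if (cp < 0x200000)  len = 4;
	else if (cp < 0x4000000) len = 5;
	else                     len = 6;

	/* Fill continuation bytes from the end, six bits at a time. */
	for (i = len - 1; i > 0; i--) {
		out[i] = (unsigned char)(0x80 | (cp & 0x3F));
		cp >>= 6;
	}
	/* Lead byte: len one-bits, a zero bit, then the remaining payload. */
	out[0] = (unsigned char)((0xFF << (8 - len)) | cp);
	return len;
}

/* Hypothetical, separate policy check: is cp a Unicode scalar value?
 * Surrogates and values above 0x10FFFF are excluded. */
int
is_unicode_scalar(uint32_t cp)
{
	return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

With this split, utf8_encode31(0xD800, buf) happily produces three
bytes, and it is up to the caller to consult is_unicode_scalar() if
Unicode validity matters.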

So in earlier versions of libutf I was vigilant in rejecting those
values that Unicode says are invalid, but in my latest version on
GitHub I've started only rejecting overlong sequences, since the
others are still (in my view) valid UTF-8 even if they aren't valid
Unicode. Is this the right thing to do? I've not yet made up my mind.
But my feeling is that the API for reading UTF-8 should be separate
from that which deals with Unicode codepoints and graphemes that so
happen to have been encoded in UTF-8. The two are essentially
orthogonal, though are often conflated.
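
As an illustration of "only rejecting overlong sequences", here is a
rough sketch of a buffer decoder. It is an assumption-laden example,
not the code on GitHub: utf8_decode31 and RUNE_ERROR are hypothetical
names, and treating a truncated buffer the same as a malformed one is a
simplification. Surrogates and values above 0x10FFFF decode without
complaint; only structural errors and overlong forms are rejected.

```c
#include <stddef.h>
#include <stdint.h>

#define RUNE_ERROR 0xFFFD /* value reported for malformed input */

/* Hypothetical helper: decode one sequence from s[0..n-1] into *cp.
 * Returns the number of bytes consumed (0 only if n == 0). */
size_t
utf8_decode31(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest value that legitimately needs a sequence of each length. */
	static const uint32_t min[7] = {
		0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
	};
	size_t len, i;
	uint32_t c;

	if (n == 0)
		return 0;
	c = s[0];
	if (c < 0x80) {
		*cp = c;
		return 1;
	}
	if (c < 0xC0 || c >= 0xFE) { /* stray continuation byte, or FE/FF */
		*cp = RUNE_ERROR;
		return 1;
	}
	if (c < 0xE0)      { len = 2; c &= 0x1F; }
	else if (c < 0xF0) { len = 3; c &= 0x0F; }
	else if (c < 0xF8) { len = 4; c &= 0x07; }
	else if (c < 0xFC) { len = 5; c &= 0x03; }
	else               { len = 6; c &= 0x01; }

	for (i = 1; i < len; i++) {
		/* Too short: stop before the byte that is not a continuation. */
		if (i == n || (s[i] & 0xC0) != 0x80) {
			*cp = RUNE_ERROR;
			return i;
		}
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Overlong forms are the only values rejected at this layer. */
	*cp = (c < min[len]) ? RUNE_ERROR : c;
	return len;
}
```

Here "\xED\xA0\x80" decodes to 0xD800 and "\xF4\x90\x80\x80" to
0x110000, while the overlong "\xC0\x80" yields RUNE_ERROR. A
Unicode-aware layer on top could then reject the first two.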

Incidentally, I also changed my latest version to only ever need one
byte of lookahead. For one thing, the Plan 9 version will report a
malformed rune as not full even when it is, which is fixed in my
implementation. Another change, which is only in my latest version,
is that it always reads the fewest bytes needed to determine that the
sequence is malformed. One benefit of this is that if you're reading
with fgetc(), you can then ungetc() the byte that showed that the
sequence was malformed (say, because it was too short). That matters
because you are only guaranteed (by POSIX) to be able to ungetc() a
single byte.
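
The fgetc()/ungetc() point can be sketched roughly as follows. This is
an illustration only, not the libutf code: fgetrune_sketch, RUNE_ERROR
and RUNE_EOF are hypothetical names. It covers just the "too short"
case from the paragraph above, and it checks for overlong forms only
after the whole sequence has been read, which is later than an
implementation that stops at the first possible byte would.

```c
#include <stdint.h>
#include <stdio.h>

#define RUNE_ERROR 0xFFFD      /* value reported for malformed input */
#define RUNE_EOF   0xFFFFFFFFu /* hypothetical end-of-stream sentinel */

/* Hypothetical helper: read one UTF-8 sequence from fp. A byte that
 * reveals a short sequence is pushed back, so at most one ungetc()
 * is ever pending. */
uint32_t
fgetrune_sketch(FILE *fp)
{
	static const uint32_t min[7] = {
		0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
	};
	uint32_t c;
	int b, len, i;

	if ((b = fgetc(fp)) == EOF)
		return RUNE_EOF;
	if (b < 0x80)
		return (uint32_t)b;
	if (b < 0xC0 || b >= 0xFE) /* stray continuation byte, or FE/FF */
		return RUNE_ERROR;
	if (b < 0xE0)      { len = 2; c = b & 0x1F; }
	else if (b < 0xF0) { len = 3; c = b & 0x0F; }
	else if (b < 0xF8) { len = 4; c = b & 0x07; }
	else if (b < 0xFC) { len = 5; c = b & 0x03; }
	else               { len = 6; c = b & 0x01; }

	for (i = 1; i < len; i++) {
		if ((b = fgetc(fp)) == EOF)
			return RUNE_ERROR;
		if ((b & 0xC0) != 0x80) {
			/* Not a continuation byte: it starts the next
			 * sequence, so hand it back to the stream. */
			ungetc(b, fp);
			return RUNE_ERROR;
		}
		c = (c << 6) | (uint32_t)(b & 0x3F);
	}
	return c < min[len] ? RUNE_ERROR : c;
}
```

Given the input bytes E2 41 E2 82 AC, successive calls return
RUNE_ERROR, then 0x41 (the 'A' that was pushed back), then 0x20AC.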

That may not be relevant for sbase, of course, but I'm just saying
there's a reason for the slight difference in complexity between the
version in sbase and the latest version on my GitHub.

cls
Received on Tue May 31 2016 - 20:33:12 CEST
