Re: [dev] Different versions of suckless libutf

From: Ben Woolley <tautolog_AT_gmail.com>
Date: Tue, 31 May 2016 23:42:14 -0700

> On May 31, 2016, at 11:33 AM, Connor Lane Smith <cls_AT_lubutu.com> wrote:
>
>> On 31 May 2016 at 18:43, FRIGN <dev_AT_frign.de> wrote:
>> as a quick note, the sbase libutf is probably the most feature-rich one.
>> The version by cls suffers from multiple issues, even though it might
>> be the most recent.
>
> Strictly speaking they're all by me, since I started it (and sbase) in
> the first place. But there we are.
>
>> I am currently working on a new libutf which is much simpler, much
>> more secure (de/encoder) and actually gets the grapheme handling right.
>
> One of the reasons I'm not pushing for any particular solution to the
> fragmentation problem is that I'm not sure what libutf should actually
> do. There are three components that are distinguishable in the Plan 9
> API, which are UTF-8 (runetochar, chartorune, utf*, etc), UTF-32
> (runestr*), and Unicode (is*rune, etc).
>
> The trouble is I don't think it's necessary for a single library to do
> all of these things. All UTF-8 is is an encoding of 31-bit integers,
> and UTF-32 is another encoding. The stuff specific to Unicode, which
> requires the latest Unicode database and all that, is really a
> separate issue -- as is the rejection of certain values, like
> surrogates or values over 0x10FFFF, both of which are only invalid
> because of the braindead UTF-16 encoding. And grapheme handling is
> another thing which has nothing actually to do with UTF.

I am pretty sure you are aware of this already, but the UTF-8 RFC (RFC 3629) makes the Unicode restrictions part of the UTF-8 definition itself: it limits the range to U+0000..U+10FFFF and prohibits encoding the surrogates. Even the title is "UTF-8, a transformation format of ISO 10646"; it does not call it a general-purpose transformation format for 31-bit integers. I haven't looked at other definitions, if they exist. Maybe they say something else.
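
To make the two restrictions concrete, here is a minimal sketch of them as a check that sits apart from any byte-level decoding. The function names are hypothetical and not taken from any version of libutf.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any libutf API. */

/* Nonzero if cp lies in the UTF-16 surrogate range U+D800..U+DFFF. */
static int is_surrogate(uint32_t cp)
{
	return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Nonzero if cp is a value that RFC 3629 permits UTF-8 to encode. */
static int is_rfc3629_value(uint32_t cp)
{
	return cp <= 0x10FFFF && !is_surrogate(cp);
}
```

A decoder that treats UTF-8 as a plain integer encoding could call is_rfc3629_value() on its result when the caller wants the stricter behaviour, which keeps the pieces separate while still shipping them in one package.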

But anyway, I am wondering why you feel pressure to generalize it further. Is it more of a design aesthetic thing? I can see that. Personally, I could see having separate functions, but I think they should be packaged together: anyone who really wants only the general pieces can easily rip them out when needed, whereas I suspect nearly everyone who consumes the interface expects it all together. If you want to be the one who makes it available in pieces for the sake of availability, that is a valid choice. But you seem to be unsure of what to do. Me? I put them together. I have faced this exact decision before, and that is how I went. I hope this helps you in some way. :)

> So in earlier versions of libutf I was vigilant in rejecting those
> values that Unicode say are invalid, but in my latest version on
> github I've started only rejecting overlong sequences, since the
> others are still (in my view) valid UTF-8 even if they aren't valid
> Unicode. Is this the right thing to do? I've not yet made up my mind.
> But my feeling is that the API for reading UTF-8 should be separate
> from that which deals with Unicode codepoints and graphemes that so
> happen to have been encoded in UTF-8. The two are essentially
> orthogonal, though are often conflated.
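
For readers following along, a decoder of the kind described above (rejecting overlong sequences but nothing Unicode-specific) might look roughly like this. It is a sketch under my own assumptions, not the code from any libutf version: the name utf8_decode_raw is hypothetical, and it accepts the original 1..6 byte forms so that every 31-bit value is representable.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: decode one sequence from s[0..n-1] into *cp,
 * treating UTF-8 as a plain encoding of 31-bit integers.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated, malformed, or overlong. Surrogates and values above
 * 0x10FFFF are deliberately accepted.
 */
static size_t utf8_decode_raw(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest value that legitimately needs a sequence of each length. */
	static const uint32_t min[7] = {
		0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
	};
	size_t len, i;
	uint32_t v;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
	else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
	else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
	else if ((s[0] & 0xFC) == 0xF8) { len = 5; v = s[0] & 0x03; }
	else if ((s[0] & 0xFE) == 0xFC) { len = 6; v = s[0] & 0x01; }
	else
		return 0; /* stray continuation byte, 0xFE, or 0xFF */

	if (n < len)
		return 0; /* truncated */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0; /* not a continuation byte */
		v = (v << 6) | (uint32_t)(s[i] & 0x3F);
	}
	if (v < min[len])
		return 0; /* overlong: a shorter form exists */
	*cp = v;
	return len;
}
```

With this, the bytes ED A0 80 decode to 0xD800 and F4 90 80 80 decode to 0x110000, while C0 80 is rejected as overlong. Whether the first two should be accepted is exactly the question being asked.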
>
> Incidentally, I also changed my latest version to only ever need one
> byte of lookahead. For one thing, the Plan 9 version will say that a
> rune is not full even if it is, if it is malformed, which is fixed in
> my implementation. But another thing, which is only in my latest
> version, is that it always reads the fewest bytes needed to determine
> that the sequence is malformed. One benefit of this is that if you're
> reading with fgetc(), you can then ungetc() a byte that showed that
> the sequence was malformed (say, it was too short), and you are only
> guaranteed (by POSIX) to be able to ungetc() a single byte.
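
The fgetc()/ungetc() point can be illustrated with a small reader. This is only a sketch with a hypothetical name (utf8_fget); it covers the "too short" case mentioned above, handles lead bytes for 2 to 4 byte sequences only, and omits the overlong checks, so it is not the algorithm from any libutf version.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch: read one UTF-8 sequence from f.
 * Returns 1 and stores the value in *cp on success, 0 on a malformed
 * sequence, and EOF at end of input. When a sequence is cut short by a
 * byte that is not a continuation byte, that single byte is pushed back
 * so the next call starts from it.
 */
static int utf8_fget(FILE *f, uint32_t *cp)
{
	int c, need;
	uint32_t v;

	if ((c = fgetc(f)) == EOF)
		return EOF;
	if (c < 0x80) {
		*cp = (uint32_t)c;
		return 1;
	}
	else if ((c & 0xE0) == 0xC0) { need = 1; v = (uint32_t)(c & 0x1F); }
	else if ((c & 0xF0) == 0xE0) { need = 2; v = (uint32_t)(c & 0x0F); }
	else if ((c & 0xF8) == 0xF0) { need = 3; v = (uint32_t)(c & 0x07); }
	else
		return 0; /* not a lead byte handled here; it stays consumed */

	while (need-- > 0) {
		c = fgetc(f);
		if (c == EOF)
			return 0; /* truncated by end of input */
		if ((c & 0xC0) != 0x80) {
			ungetc(c, f); /* one byte of pushback is all we need */
			return 0;
		}
		v = (v << 6) | (uint32_t)(c & 0x3F);
	}
	*cp = v;
	return 1;
}
```

Given the input bytes E2 82 41, the first call reports a malformed sequence and pushes back 0x41, and the second call returns 'A'. Because the reader stops at the first offending byte, it never has to push back more than one.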
>
> That may not be relevant for sbase, of course, but I'm just saying
> there's a reason for the slight difference in complexity between the
> version in sbase and the latest version on my github.
>
> cls
>
Received on Wed Jun 01 2016 - 08:42:14 CEST
