Re: [dev] Different versions of suckless libutf

From: Connor Lane Smith <cls_AT_lubutu.com>
Date: Wed, 1 Jun 2016 09:51:23 +0100

On 1 June 2016 at 07:42, Ben Woolley <tautolog_AT_gmail.com> wrote:
> I am pretty sure you are aware of this already, but the UTF-8 RFC
> defines Unicode quirks as part of the UTF-8 definition. Even the title
> is "UTF-8, a transformation format of ISO 10646". It does not call it a
> general purpose transformation format of 31-bit integers. I didn't
> glance at other definitions, if they exist. Maybe they say something
> else.

I may have been a bit loose with the terminology. However, UTF-8 was
originally defined as an encoding of "character values in the range
(0, 0x7FFFFFFF)", the same range as UCS-4, which went beyond the mere
0xFFFF limit of Unicode at the time (then equivalent to UCS-2). Only
later were the Unicode restrictions put in place, at the same time
that the Unicode limit was increased to 0x10FFFF and UTF-16 was made
to use surrogates as a crutch. (And the fact that the surrogates were
made to pollute the character space demonstrates why UTF-16 is not
only useless, but worse than useless.)

It's true that it isn't defined as a general-purpose format for 31-bit
integers, but rather as an encoding of "character values" that happen
to be 31-bit integers. However, the fact remains that it *is* just an
encoding of
31-bit integers. Those integers are (almost) always unpacked and only
then checked for Unicode validity. It seems to me that if you have a
char32_t, you should be able to check whether that character is a
Unicode character with some function like isvalidrune(). Plan 9 seems
to have no way to do this, although my earlier libutf versions had
runelen() return 0 for invalid Unicode.
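
To be concrete, here is a sketch of the kind of check I mean, under
the current (post-2003) rules; the name and the Rune typedef are
illustrative, not taken from any existing libutf:

#include <stdint.h>

typedef uint32_t Rune;	/* illustrative; wide enough for 31 bits */

/* Is r a valid Unicode scalar value under the current rules? */
int
isvalidrune(Rune r)
{
	if (r > 0x10FFFF)
		return 0;	/* beyond today's Unicode limit */
	if (r >= 0xD800 && r <= 0xDFFF)
		return 0;	/* a UTF-16 surrogate polluting the space */
	return 1;
}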

> But anyway, I am wondering why you seem to have mental pressure to
> generalize it more. Is it more of a design aesthetic thing? I can see
> that. Personally, I could see having separate functions, but I think
> they should be packaged together, because if someone really wanted to
> rip out the general pieces, they can easily do that when needed.

It probably is mostly about aesthetics. One frustration is the
dependence on the Unicode standard, since the standard keeps changing
which values are valid and which are not (in 1996 and 2003), while
the actual UTF-8
format hasn't changed one bit since 1993. So I feel that the UTF-8
codec itself should ignore those political issues and simply deal with
UTF-8 proper. You can check whether a value is valid Unicode once
you've got it from the UTF-8 stream, and do so with the same function
as you would if you were reading UTF-16 or UTF-32. Or any other format
people might use, like UTF-1 or UTF-7 (or not).
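
For illustration, a "UTF-8 proper" decoder in that spirit might look
like the sketch below; the name and return convention are mine, and
overlong-form checking is left out for brevity:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune;	/* illustrative 31-bit-capable rune type */

/* Unpack one UTF-8 sequence from the n bytes at s into *r, without
 * consulting the Unicode standard at all. Returns the number of
 * bytes consumed, or 0 on malformed input. */
size_t
utf8decode(const char *s, size_t n, Rune *r)
{
	const unsigned char *u = (const unsigned char *)s;
	Rune c;
	size_t i, len;

	if (n == 0)
		return 0;
	if (u[0] < 0x80) {	/* 0xxxxxxx: plain ASCII */
		*r = u[0];
		return 1;
	}
	if ((u[0] & 0xE0) == 0xC0)      { len = 2; c = u[0] & 0x1F; }
	else if ((u[0] & 0xF0) == 0xE0) { len = 3; c = u[0] & 0x0F; }
	else if ((u[0] & 0xF8) == 0xF0) { len = 4; c = u[0] & 0x07; }
	else if ((u[0] & 0xFC) == 0xF8) { len = 5; c = u[0] & 0x03; }
	else if ((u[0] & 0xFE) == 0xFC) { len = 6; c = u[0] & 0x01; }
	else
		return 0;	/* stray continuation byte, or 0xFE/0xFF */
	if (n < len)
		return 0;
	for (i = 1; i < len; i++) {
		if ((u[i] & 0xC0) != 0x80)	/* tails must be 10xxxxxx */
			return 0;
		c = (c << 6) | (u[i] & 0x3F);
	}
	*r = c;		/* up to 31 bits; Unicode never consulted */
	return len;
}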

This interface (reading and validating a UTF-8 rune) may well need
to be available as one function, but I feel that it should be a
wrapper
for a more fundamental UTF-8 decoder, because the latter is 'forever
and always', whereas the former depends on whichever version of
Unicode we're on. But even if you do think that the fundamental
decoder should validate Unicode in the sense of forbidding surrogates
and so on, the is*rune() and to*rune() functions, anything that would
properly handle graphemes according to Unicode, and anything involving
canonicalisation or the other incredibly complicated aspects of the
Unicode standard, have nothing to do with UTF.
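
Put as code, the layering I have in mind is just composition of the
two sketches above (again, the names are illustrative):

/* One-call convenience: decode UTF-8 proper, then apply whichever
 * Unicode rules happen to be current. */
size_t
utf8torune(const char *s, size_t n, Rune *r)
{
	size_t len = utf8decode(s, n, r);

	if (len == 0 || !isvalidrune(*r))
		return 0;
	return len;
}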

(Anyway, UTF-8 is really just a framing protocol for 6-bit data, with
sync and roll flags. :p)
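
Taking that joke half-seriously, an encoder makes the framing plain;
this sketch is illustrative only and assumes r fits in 31 bits:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune;

/* Frame r's bits as UTF-8: a lead byte whose run of 1s gives the
 * frame length, then 6-bit payload bytes each tagged 10xxxxxx. */
size_t
utf8encode(Rune r, char *s)
{
	size_t len, i;

	if (r < 0x80) {		/* plain ASCII needs no framing */
		s[0] = r;
		return 1;
	}
	if (r > 0x7FFFFFFF)
		return 0;	/* a frame carries at most 31 bits */
	for (len = 2; r >> (5 * len + 1); len++)
		;		/* smallest frame: len bytes hold 5*len+1 bits */
	s[0] = ((0xFF << (8 - len)) & 0xFF) | (r >> (6 * (len - 1)));
	for (i = 1; i < len; i++)
		s[i] = 0x80 | ((r >> (6 * (len - 1 - i))) & 0x3F);
	return len;
}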

cls
Received on Wed Jun 01 2016 - 10:51:23 CEST
