Re: [dev] Different versions of suckless libutf

From: Ben Woolley <tautolog_AT_gmail.com>
Date: Wed, 1 Jun 2016 09:03:43 -0700

>> On Jun 1, 2016, at 1:51 AM, Connor Lane Smith <cls_AT_lubutu.com> wrote:
>>
>> On 1 June 2016 at 07:42, Ben Woolley <tautolog_AT_gmail.com> wrote:
>> I am pretty sure you are aware of this already, but the UTF-8 RFC
>> defines Unicode quirks as part of the UTF-8 definition. Even the title
>> is "UTF-8, a transformation format of ISO 10646". It does not call it a
>> general purpose transformation format of 31-bit integers. I didn't
>> glance at other definitions, if they exist. Maybe they say something
>> else.
>
> I may have been a bit loose with the terminology. However, UTF-8 was
> originally defined as an encoding of "character values in the range
> (0, 0x7FFFFFFF)," as it was for UCS-4, which went beyond the mere
> 0xFFFF limit of Unicode at the time (which was equivalent to UCS-2).
> Only later were the Unicode restrictions put in place, at the same
> time that the Unicode limit was increased to 0x10FFFF and UTF-16 was
> made to use surrogates as a crutch. (And the fact that these were made
> to pollute the character space demonstrates why UTF-16 is not only
> useless, but worse than useless.)
>
> It's true that it isn't defined as a general-purpose format for 31-bit
> integers, but rather of "character values" that happen to be 31-bit
> integers. However, the fact remains that it *is* just an encoding of
> 31-bit integers. Those integers are (almost) always unpacked and only
> then checked for Unicode validity. It seems to me that if you have a
> char32_t, you should be able to check whether that character is a
> Unicode character with some function like isvalidrune(). Plan 9 seems
> to have no way to do this, although my earlier libutf versions had
> runelen() return 0 for invalid Unicode.
>
>> But anyway, I am wondering why you seem to have mental pressure to
>> generalize it more. Is it more of a design aesthetic thing? I can see
>> that. Personally, I could see having separate functions, but I think
>> they should be packaged together, because if someone really wanted to
>> rip out the general pieces, they can easily do that when needed.
>
> It probably is mostly about aesthetics. One frustration is the
> dependence on the Unicode standard, since they keep changing what
> values are valid or invalid (in 1996 and 2003), when the actual UTF-8
> format hasn't changed one bit since 1993. So I feel that the UTF-8
> codec itself should ignore those political issues and simply deal with
> UTF-8 proper. You can check whether a value is valid Unicode once
> you've got it from the UTF-8 stream, and do so with the same function
> as you would if you were reading UTF-16 or UTF-32. Or any other format
> people might use, like UTF-1 or UTF-7 (or not).
>
> This interface (reading and validating a UTF-8 rune) probably ought to
> be available as one function, but I feel that it should be a wrapper
> for a more fundamental UTF-8 decoder, because the latter is 'forever
> and always', whereas the former depends on whichever version of
> Unicode we're on. But even if you do think that the fundamental
> decoder should validate Unicode in the sense of forbidding surrogates
> etc., the is*rune() and to*rune() functions, and anything that would
> properly handle graphemes according to Unicode, or anything involving
> canonicalisation or any of the other incredibly complicated aspects of
> the Unicode standard, are nothing to do with UTF.
>
> (Anyway, UTF-8 is really just a framing protocol for 6-bit data, with
> sync and roll flags. :p)

Thanks for the background information. Now that I am more informed, it seems that the issue is that the pollution of the name UTF-8 makes the code less self-documenting, because people may assume that a transformation function for characters also validates those as Unicode characters.

I see two things to do:
1. There could be a new name for the transformation that stands apart from UTF-8, since the name UTF-8 has drifted from that original meaning.
2. There can be clear documentation about when validation occurs, despite the names.

Maybe call the transform CTF-8, where C is character. Then UTF-8 is just a wrapper around CTF-8. Then a comment explaining what CTF stands for and what it does should be sufficient.

Then there can be libctf and libutf living in cognitive harmony.

Instead of CTF, it could be called ur-UTF in liburutf. Then it would literally be urtext, which would be mildly amusing at least to me.
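
Whatever the name ends up being, here is a rough C sketch of the layering I have in mind. It is not taken from any libutf version. The names ctf8_decode and utf8_decode are made up for illustration. isvalidrune is the name suggested above, but its body here is only my guess at a minimal check (the 0x10FFFF limit plus surrogates, nothing about noncharacters). Rejecting overlong forms in the lower layer is also my assumption, not something the 1993 definition settles for us.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "CTF-8" layer: decode one value in the original 1- to
 * 6-byte format, covering 0..0x7FFFFFFF. Returns the number of bytes
 * consumed, or 0 if the input is truncated or malformed. It knows
 * nothing about Unicode. */
size_t
ctf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
	/* smallest value that needs a sequence of each length */
	static const uint32_t min[7] = {
		0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
	};
	uint32_t r;
	size_t len, i;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) { len = 1; r = s[0]; }
	else if (s[0] < 0xC0) return 0; /* stray continuation byte */
	else if (s[0] < 0xE0) { len = 2; r = s[0] & 0x1F; }
	else if (s[0] < 0xF0) { len = 3; r = s[0] & 0x0F; }
	else if (s[0] < 0xF8) { len = 4; r = s[0] & 0x07; }
	else if (s[0] < 0xFC) { len = 5; r = s[0] & 0x03; }
	else if (s[0] < 0xFE) { len = 6; r = s[0] & 0x01; }
	else return 0; /* 0xFE and 0xFF are never lead bytes */

	if (n < len)
		return 0; /* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0; /* not a continuation byte */
		r = (r << 6) | (s[i] & 0x3F);
	}
	if (r < min[len])
		return 0; /* overlong form (assumption: rejected here) */
	*out = r;
	return len;
}

/* Unicode policy, independent of the byte format. Assumed minimal
 * rule: within 0..0x10FFFF and not a surrogate. The same check would
 * apply to values read from UTF-16 or UTF-32. */
int
isvalidrune(uint32_t r)
{
	return r <= 0x10FFFF && (r < 0xD800 || r > 0xDFFF);
}

/* "UTF-8" as a thin wrapper: CTF-8 decoding plus the Unicode check.
 * Returns 0 if either layer rejects the input; *out may still have
 * been written in that case. */
size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
	size_t len = ctf8_decode(s, n, out);

	if (len != 0 && !isvalidrune(*out))
		return 0;
	return len;
}
```

With this split, the bytes ED A0 80 decode fine at the lower layer (to 0xD800) but are rejected by the wrapper, and only isvalidrune would need touching if the Unicode rules change again.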

> cls
>
Received on Wed Jun 01 2016 - 18:03:43 CEST
