On Sat, 10 Jan 2015 18:55:01 -0500
random832_AT_fastmail.us wrote:
> Why would someone want to use the decimal value of the UTF-8 bytes,
> rather than the unicode codepoint?
Because, sadly, that's how it's specified in the tr document.
> Why are you using decimal for a syntax that _universally_ means octal?
It was an example extending this "decimal" idea to UTF-8, but I totally
agree with you that octal is the saner way to go.
> UTF-8 is an encoding of Unicode. No-one actually thinks of the character
> as being "C3B6" - it's 00F6, even if it happens to be encoded as C3 B6
> or F6 00 whatever. Nobody thinks of UTF-8 sequences as a single integer
> unit.
Well, I do, since I wrote the algorithm; however, what you probably mean
is how they're expressed as input.
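To make the distinction concrete: the code point U+00F6 (ö) becomes the byte sequence C3 B6 only after applying the UTF-8 encoding rules. A minimal sketch of that encoding step (not the actual tr code, just the standard bit-shuffling from the Unicode spec):

```c
#include <stddef.h>

/* Minimal sketch: encode a Unicode code point as UTF-8.
 * Returns the number of bytes written to out (0 on invalid input). */
size_t
utf8_encode(unsigned long cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {
		out[0] = 0xC0 | (unsigned char)(cp >> 6);
		out[1] = 0x80 | (unsigned char)(cp & 0x3F);
		return 2;
	} else if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogates are not valid code points */
		out[0] = 0xE0 | (unsigned char)(cp >> 12);
		out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
		out[2] = 0x80 | (unsigned char)(cp & 0x3F);
		return 3;
	} else if (cp < 0x110000) {
		out[0] = 0xF0 | (unsigned char)(cp >> 18);
		out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
		out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
		out[3] = 0x80 | (unsigned char)(cp & 0x3F);
		return 4;
	}
	return 0;
}
```

So 0x00F6 encodes to {0xC3, 0xB6}: the bytes are an artifact of the encoding, not the identity of the character.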
> The sensible thing to do would be to extend the syntax with \u00F6 (and
> \U00010000 for non-BMP characters) the way many other languages have
> done it) This also avoids repeating the mistake of variable-length
> escapes - \u is exactly 4 digits, and \U is exactly 8.
Since they're fixed-length, they could be implemented.
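The fixed length is exactly what makes the parser trivial: after the backslash you know you need precisely 4 hex digits for \u or 8 for \U, with no lookahead heuristics. A hedged sketch of what such a parser could look like (function names are mine, not from any actual patch):

```c
#include <stddef.h>

/* hypothetical helper: value of a hex digit, or -1 if not one */
static int
hexval(int c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/* Parse a fixed-length \uXXXX or \UXXXXXXXX escape; s points just
 * past the backslash.  Returns 1 on success, storing the code point
 * in *cp.  The digit count is fixed by the leading letter, so the
 * variable-length-escape ambiguity never arises. */
int
parse_uni_escape(const char *s, unsigned long *cp)
{
	size_t i, n;
	unsigned long v = 0;
	int d;

	if (s[0] == 'u')
		n = 4;
	else if (s[0] == 'U')
		n = 8;
	else
		return 0;
	for (i = 1; i <= n; i++) {
		if ((d = hexval((unsigned char)s[i])) < 0)
			return 0;
		v = (v << 4) | (unsigned long)d;
	}
	*cp = v;
	return 1;
}
```

With this, "\u00F6" yields the code point 0x00F6 directly, and the encoder is then free to emit whatever byte sequence the locale's encoding demands.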
> GNU actually has a history of being behind the curve on UTF-8/multibyte
> characters, so it's not a great example of "what POSIX requires". Cut is
> another notable command with the same problem.
No wonder it's behind <.<. They can't even maintain their codebases
properly.
Cheers
FRIGN
--
FRIGN <dev_AT_frign.de>
Received on Sun Jan 11 2015 - 11:22:04 CET