Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

From: FRIGN <>
Date: Sun, 11 Jan 2015 11:22:04 +0100

On Sat, 10 Jan 2015 18:55:01 -0500 wrote:

> Why would someone want to use the decimal value of the UTF-8 bytes,
> rather than the unicode codepoint?

Because, sadly, that is how it is specified in the tr document.

> Why are you using decimal for a syntax that _universally_ means octal?

It was an example to extend this "decimal" idea to UTF-8, but I totally
agree with you that octal is a saner way to go.
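
For illustration, a minimal sketch of reading such an octal escape, assuming
the usual convention of one to three octal digits after the backslash. The
name parse_octal is hypothetical and not taken from sbase:

```c
#include <stddef.h>

/*
 * Hypothetical helper (not sbase code): parse up to three octal digits
 * from s, which points just past the backslash.
 * Stores the value in *out and returns the number of digits consumed;
 * returns 0 and leaves *out untouched if s does not start with an
 * octal digit.
 */
static size_t
parse_octal(const char *s, unsigned *out)
{
	size_t i;
	unsigned v = 0;

	for (i = 0; i < 3 && s[i] >= '0' && s[i] <= '7'; i++)
		v = v * 8 + (unsigned)(s[i] - '0');
	if (i > 0)
		*out = v;
	return i;
}
```

Note that the escape is variable-length: "\12x" consumes two digits while
"\101" consumes three, so the parser has to look at what follows.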

> UTF-8 is an encoding of Unicode. No-one actually thinks of the character
> as being "C3B6" - it's 00F6, even if it happens to be encoded as C3 B6
> or F6 00 whatever. Nobody thinks of UTF-8 sequences as a single integer
> unit.

Well, I do, since I wrote the algorithm. However, what you probably mean is
how they are expressed as input.
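
To make the codepoint/encoding distinction concrete, here is a rough sketch
of turning a codepoint into its UTF-8 bytes. It is not the algorithm from the
patch; utf8_encode is a hypothetical name chosen for this example:

```c
#include <stddef.h>

/*
 * Hypothetical helper (not sbase code): encode codepoint cp as UTF-8
 * into buf, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies above U+10FFFF.
 */
static size_t
utf8_encode(unsigned long cp, unsigned char *buf)
{
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogates are not valid codepoints */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

With this, U+00F6 comes out as the two bytes C3 B6, which is the encoded
form mentioned above.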

> The sensible thing to do would be to extend the syntax with \u00F6 (and
> \U00010000 for non-BMP characters), the way many other languages have
> done it. This also avoids repeating the mistake of variable-length
> escapes - \u is exactly 4 digits, and \U is exactly 8.

If they're fixed length, they could be implemented.
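
A sketch of what fixed-length parsing could look like, assuming the caller
has already seen "\u" or "\U" and passes 4 or 8 as the digit count. Both
function names are hypothetical and nothing here is in sbase yet:

```c
#include <stddef.h>

/* Value of hex digit c, or -1 if c is not a hex digit (assumes ASCII). */
static int
hexval(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical helper: parse exactly n hex digits from s into *out.
 * Returns 0 on success, -1 if fewer than n hex digits are present
 * (a terminating NUL counts as a non-digit, so short input is safe).
 */
static int
parse_hex_fixed(const char *s, size_t n, unsigned long *out)
{
	unsigned long v = 0;
	size_t i;
	int d;

	for (i = 0; i < n; i++) {
		d = hexval((unsigned char)s[i]);
		if (d < 0)
			return -1;
		v = (v << 4) | (unsigned long)d;
	}
	*out = v;
	return 0;
}
```

Because the length is fixed, there is no need to look ahead: the parser
consumes exactly n digits or rejects the escape.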

> GNU actually has a history of being behind the curve on UTF-8/multibyte
> characters, so it's not a great example of "what POSIX requires". Cut is
> another notable command with the same problem.

No wonder it's behind <.<. They can't even maintain their codebases.



Received on Sun Jan 11 2015 - 11:22:04 CET
