Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

From: <random832_AT_fastmail.us>
Date: Sat, 10 Jan 2015 18:55:01 -0500

On Fri, Jan 9, 2015, at 18:39, FRIGN wrote:
> C3B6 is 'ö' and makes sense to allow specifying it as \50102 (in the pure
> UTF-8-sense of course, nothing to do with collating).

Why would someone want to use the decimal value of the UTF-8 bytes,
rather than the Unicode code point?

Why are you using decimal for a syntax (backslash followed by digits)
that _universally_ means octal?

UTF-8 is an encoding of Unicode. No one actually thinks of the character
as being "C3B6" - it's U+00F6, whether it happens to be encoded as C3 B6
(UTF-8), F6 00 (UTF-16LE), or whatever. Nobody thinks of UTF-8 sequences
as a single integer unit.
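
To make the distinction concrete, here is a small Python sketch (not
part of the patch, just an illustration) that prints the code point of
'ö' next to two of its encodings, and shows where the 50102 in the
quoted proposal comes from: it is the two UTF-8 bytes read as one
big-endian integer.

```python
ch = "ö"

codepoint = ord(ch)                     # the Unicode code point
utf8 = ch.encode("utf-8")               # UTF-8 byte sequence
utf16le = ch.encode("utf-16-le")        # UTF-16 little-endian byte sequence
as_int = int.from_bytes(utf8, "big")    # UTF-8 bytes misread as one integer

print(f"code point: U+{codepoint:04X}")
print("UTF-8:     ", utf8.hex(" ").upper())
print("UTF-16LE:  ", utf16le.hex(" ").upper())
print("UTF-8 bytes as a decimal integer:", as_int)
```

The code point is the same no matter which encoding is used; the
integer 50102 only exists for one particular encoding.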

The sensible thing to do would be to extend the syntax with \u00F6 (and
\U00010000 for non-BMP characters), the way many other languages have
done it. This also avoids repeating the mistake of variable-length
escapes: \u is exactly 4 hex digits, and \U is exactly 8.
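
A minimal sketch of what fixed-width parsing could look like, written
in Python for brevity. The function name unescape() and the WIDTH table
are made up for this illustration; this is not sbase code and says
nothing about how tr(1) would actually implement it.

```python
import string

HEX_DIGITS = set(string.hexdigits)
WIDTH = {"u": 4, "U": 8}  # assumed widths: \u takes 4 hex digits, \U takes 8


def unescape(s):
    """Expand \\uXXXX and \\UXXXXXXXX escapes; copy everything else as-is."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s) and s[i + 1] in WIDTH:
            n = WIDTH[s[i + 1]]
            digits = s[i + 2 : i + 2 + n]
            # Fixed width: too few digits or a non-hex digit is an error,
            # so there is never any doubt about where the escape ends.
            if len(digits) != n or not all(c in HEX_DIGITS for c in digits):
                raise ValueError(f"bad escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 2 + n
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(unescape(r"\u00F6"))
print(unescape(r"\u00F61"))  # the trailing '1' is a separate character
```

Because the width is fixed, "\u00F61" is unambiguously 'ö' followed by
'1', which is exactly the ambiguity variable-length escapes suffer from.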

> Well, probably I misunderstood the matter. Sometimes this stuff gets
> above my head. ;)
> At the end of the day, you want software to work as expected:
>
> GNU tr:
> $ echo ελληνική | tr [α-ω] [Α-Ω]
> ®®®®®®®®®
>
> our tr:
> $ echo ελληνικη | ./tr [α-ω] [Α-Ω]
> ΕΛΛΗΝΙΚΗ

And that's fine. I think POSIX actually _requires_ it to work the way
yours does, and GNU fails to comply. As a data point, OS X and FreeBSD
both behave the same way as sbase for this test case.
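
For illustration, the character-wise behaviour can be sketched in a few
lines of Python. The helpers expand_range() and tr() are hypothetical
names for this example and only model the range mapping, not any real
tr implementation.

```python
def expand_range(first, last):
    """Return every character from first to last, ordered by code point."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


def tr(text, set1, set2):
    """Map each character in set1 to the character at the same index in set2."""
    table = {ord(a): b for a, b in zip(set1, set2)}
    return text.translate(table)


lower = expand_range("α", "ω")  # U+03B1 .. U+03C9
upper = expand_range("Α", "Ω")  # U+0391 .. U+03A9

word = "ελληνικη"
print(tr(word, lower, upper))

# A byte-oriented tool sees twice as many units as there are characters.
print(len(word), "characters,", len(word.encode("utf-8")), "bytes")
```

Working on code points, both ranges have the same length and line up
one to one; a tool that walks the input byte by byte has no such range
to work with.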

GNU has a history of being behind the curve on UTF-8/multibyte
characters, so it's not a great example of "what POSIX requires". cut(1)
is another notable command with the same problem.
Received on Sun Jan 11 2015 - 00:55:01 CET
