Re: [dev] [sbase] [PATCH-UPDATE] Rewrite tr(1) in a sane way from Markus Wichmann on 2015-01-10 (dev mail list archive)

From: Markus Wichmann <nullplan_AT_gmx.net>
Date: Sat, 10 Jan 2015 22:47:09 +0100

On Sat, Jan 10, 2015 at 08:51:03PM +0100, FRIGN wrote:
> On Sat, 10 Jan 2015 02:52:09 +0100
> "Dmitrij D. Czarkoff" <czarkoff_AT_gmail.com> wrote:
>
> > > +#define UPPER "A-Z"
> > > +#define LOWER "a-z"
> > > +#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
> >
> > These definitions hugely misrepresent corresponding character classes.
>
> I interpreted the character classes by default for the C locale. What do
> you mean by hugely misrepresenting? They are just fragments to build the
> classes later on.
>

You wanted to be Unicode compatible, right? Because in that case I
expect [:alpha:] to be the class of all characters in General Category L
(that is, Lu, Ll, Lt, Lm, or Lo). That includes a few more characters
than just A-Z and a-z. And I don't see you add any other character to
that class later.

So, what I'm saying is, you can't have it both ways: Either you support
Unicode or not.

Regarding implementation: That is going to be tricky, considering that
the characters fitting the various classes are strewn across the Unicode
code range. And of course, a set that contains code points from further
back in the range would routinely use up way more memory, because it
occupies more of the map.

I really don't see a way to achieve this without including a database of
sorts into tr itself. Because other than that, the only thing available
is the character classification functions from C99 (iswalpha() et al.),
which only provide you with one bit of information: Whether a given
codepoint is in a category... wait, this can work! If we had a variable
iterate from 1 to the Unicode maximum (U+10FFFF) and call iswalpha() on
every value, we'd get the set of all alphabetic characters. Can this
work for us?
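
A rough sketch of that loop, to make the idea concrete. The names
build_class() and in_class() are hypothetical and not part of the patch;
the sketch also assumes that wchar_t values are Unicode code points
(true where __STDC_ISO_10646__ is defined) and it stops early if wchar_t
cannot hold the whole range:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define UNICODE_MAX 0x10FFFFUL

/*
 * Hypothetical helper: returns a bitmap with one bit per code point,
 * set wherever pred() reports membership. *count receives the number
 * of members found. The caller frees the result.
 */
static unsigned char *
build_class(int (*pred)(wint_t), size_t *count)
{
	size_t nbytes = UNICODE_MAX / CHAR_BIT + 1;
	unsigned char *map;
	unsigned long cp;

	assert(pred != NULL && count != NULL);
	*count = 0;
	if (!(map = calloc(nbytes, 1)))
		return NULL;

	for (cp = 1; cp <= UNICODE_MAX; cp++) {
		/* Assumption: stop where wchar_t can no longer hold cp. */
		if (cp > (unsigned long)WCHAR_MAX)
			break;
		if (pred((wint_t)cp)) {
			map[cp / CHAR_BIT] |= 1 << (cp % CHAR_BIT);
			(*count)++;
		}
	}
	return map;
}

/* Hypothetical helper: tests whether cp is set in a bitmap built above. */
static int
in_class(const unsigned char *map, unsigned long cp)
{
	return (map[cp / CHAR_BIT] >> (cp % CHAR_BIT)) & 1;
}
```

Something like build_class(iswalpha, &n) would then give the [:alpha:]
set. The result depends on the active LC_CTYPE locale, so the caller
would presumably have to run setlocale(LC_CTYPE, "") first to get more
than the C locale's idea of a letter. The bitmap itself costs about
136 KiB per class, whichever code points end up in it.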

Ciao,
Markus
Received on Sat Jan 10 2015 - 22:47:09 CET
