Re: [dev] [PATCH][RFC] Add a basic version of tr

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Wed, 15 Jan 2014 22:32:28 +0100

On Wed, Jan 15, 2014 at 09:36:07PM +0100, Szabolcs Nagy wrote:
> * Silvan Jegen <s.jegen_AT_gmail.com> [2014-01-15 20:43:54 +0100]:
> > Note, though, that GNU's tr does not seem to handle Unicode at all[1]
> > while this version of tr, according to "perf record/report", seems to
> > spend most of its running time in the Unicode handling functions of glibc.
>
> multi-byte string decoding is known to be slow in glibc
>
> eg see the utf8 decoding benchmark in
> http://www.etalabs.net/compare_libcs.html

I installed musl libc and used musl-gcc to compile this tr implementation
(no changes to the code were necessary). Using the same input file I get
the following numbers for three runs:

real 0m2.690s
user 0m2.597s
sys 0m0.187s

real 0m2.644s
user 0m2.590s
sys 0m0.143s

real 0m2.648s
user 0m2.543s
sys 0m0.200s

That's actually quite impressive.


> > By no means was this any serious benchmarking but eliminating the function
> > pointer did not seem to make an obvious difference.
>
> note that recent gcc (4.7?) can do function pointer inlining
> if it can infer that the function is in the same tu
> (and with lto it can probably do cross-tu inlining)
>
> > +void
> > +handleescapes(char *s)
> > +{
> > + switch(*s) {
> > + case 'n':
> > + *s = '\x0A';
> > + break;
> > + case 't':
> > + *s = '\x09';
> > + break;
> > + case '\\':
> > + *s = '\x5c';
>
> what's wrong with '\n' etc here?

I am not sure what you mean. My interpretations:

1. Why no '\n' in the case statements?

I don't think that's possible but I could be wrong.


2. Why are you escaping '\n'?

Because I assume that the user wants to replace or delete the newlines
(or tabs) in the input when putting '\n' (or '\t') into the first
character set argument.


> btw a fully posix conformant tr implementation is available here:
> http://git.musl-libc.org/cgit/noxcuse/tree/src/tr.c

Looks interesting but I would need to take a longer look (and I caught
a cold so that has to wait...). I noticed that it uses the threadsafe
version of the mbtowc function. Do you think that is advisable in
general?
