Re: [dev] off topic - awk versions performance comparison from Robert Ransom on 2010-08-11 (dev mail list archive)

From: Robert Ransom <rransom.8774_AT_gmail.com>
Date: Wed, 11 Aug 2010 04:01:27 -0700

On Wed, 11 Aug 2010 06:41:30 -0400
Kris Maglione <maglione.k_AT_gmail.com> wrote:

> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
> > Hi everyone,
> >
> > I was playing around with awk and ran into some surprising data for the
> > function match(s, r), which returns the position of the first match of
> > regular expression r in string s. Here's the test I ran:
> >
> > $ yes | head -10000 | tr -d '\n' >/tmp/big
> > $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
> >
> > real 0m0.056s
> > user 0m0.053s
> > sys 0m0.000s
> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
> >
> > real 0m5.695s
> > user 0m5.140s
> > sys 0m0.553s
> >
> > The difference is almost exactly 100x, which is the size difference
> > between the input files. It seems ridiculous that the amount of time
> > taken to match the first character of a string grows linearly with the
> > size of the string. The time it takes to load the contents of the file
> > does not contribute significantly to this increase.
>
> You don't make sense. The second test performs the match two
> orders of magnitude more times, so it should be two orders of
> magnitude slower. It's not comparing a single string that is 100
> times as long, but rather running the same test 100 times for
> either 10⁴ or 10⁶ identical strings.

No, he stripped out the newlines with tr.

Robert Ransom

application/pgp-signature attachment: signature.asc

Received on Wed Aug 11 2010 - 13:01:27 CEST

This archive was generated by hypermail 2.2.0 : Wed Aug 11 2010 - 13:12:02 CEST