[dev] off topic - awk versions performance comparison from Joseph Xu on 2010-08-11 (dev mail list archive)

From: Joseph Xu <josephzxu_AT_gmail.com>
Date: Wed, 11 Aug 2010 06:14:55 -0400

Hi everyone,

I was playing around with awk and ran into some surprising data for the
function match(s, r), which returns the position of the first match of
regular expression r in string s. Here's the test I ran:

$ yes | head -10000 | tr -d '\n' >/tmp/big
$ yes | head -1000000 | tr -d '\n' >/tmp/bigger
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real 0m0.056s
user 0m0.053s
sys 0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real 0m5.695s
user 0m5.140s
sys 0m0.553s

The difference is almost exactly 100x, which is the size difference
between the input files. It seems ridiculous that the amount of time
taken to match the first character of a string grows linearly with the
size of the string. The time it takes to load the contents of the file
does not contribute significantly to this increase.

That was GNU awk. Trying p9p awk:

$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real 0m0.024s
user 0m0.023s
sys 0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real 0m1.856s
user 0m1.857s
sys 0m0.000s

This is a lot faster compared to GNU awk but there's still a 77x
difference between the two input files.

Finally, trying Kernighan's One True Awk (from
http://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz)

$ time nawk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real 0m0.001s
user 0m0.000s
sys 0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real 0m0.015s
user 0m0.010s
sys 0m0.003s

The time difference here is actually only due to loading a larger file:

$ time nawk '{ }' /tmp/big

real 0m0.001s
user 0m0.000s
sys 0m0.000s
$ time nawk '{ }' /tmp/bigger

real 0m0.016s
user 0m0.013s
sys 0m0.003s

So at least nawk's performance makes sense. To make things a little more
confusing, I tried matching on a non-existent pattern:

$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real 0m0.005s
user 0m0.003s
sys 0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real 0m0.085s
user 0m0.080s
sys 0m0.003s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real 0m0.006s
user 0m0.003s
sys 0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real 0m0.077s
user 0m0.077s
sys 0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real 0m0.007s
user 0m0.007s
sys 0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real 0m0.612s
user 0m0.607s
sys 0m0.003s

Now GNU awk and p9p awk are showing on the order of 10x differences
between the two files, while nawk shows 100x difference. Obviously the
regex algorithm that nawk uses is different from the other two, and
there's a trade off here, though nawk still looks superior.

Last test:

$ echo -n 'z' >>/tmp/big
$ echo -n 'z' >>/tmp/bigger

Now the pattern does match, but at the very end of the string.

$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real 0m0.056s
user 0m0.057s
sys 0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real 0m5.861s
user 0m5.270s
sys 0m0.590s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real 0m0.057s
user 0m0.057s
sys 0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real 0m4.760s
user 0m4.756s
sys 0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real 0m0.007s
user 0m0.007s
sys 0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real 0m0.612s
user 0m0.610s
sys 0m0.003s

In this case all the numbers make sense, assuming a linear scan is going on.

Can someone shed some light on these vastly different numbers? Why do
GNU awk and p9p awk perform so poorly on the easy case where they just
have to match the first character?
Received on Wed Aug 11 2010 - 12:14:55 CEST

This archive was generated by hypermail 2.2.0 : Wed Aug 11 2010 - 12:24:02 CEST