Re: [dev] off topic - awk versions performance comparison

From: Kris Maglione <maglione.k_AT_gmail.com>
Date: Wed, 11 Aug 2010 07:10:53 -0400

On Wed, Aug 11, 2010 at 04:01:27AM -0700, Robert Ransom wrote:
>On Wed, 11 Aug 2010 06:41:30 -0400
>Kris Maglione <maglione.k_AT_gmail.com> wrote:
>
>> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
>> > Hi everyone,
>> >
>> > I was playing around with awk and ran into some surprising data for the
>> > function match(s, r), which returns the position of the first match of
>> > regular expression r in string s. Here's the test I ran:
>> >
>> > $ yes | head -10000 | tr -d '\n' >/tmp/big
>> > $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
>> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
>> >
>> > real 0m0.056s
>> > user 0m0.053s
>> > sys 0m0.000s
>> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
>> >
>> > real 0m5.695s
>> > user 0m5.140s
>> > sys 0m0.553s
>> >
>> > The difference is almost exactly 100x, which is the size difference
>> > between the input files. It seems ridiculous that the amount of time
>> > taken to match the first character of a string grows linearly with the
>> > size of the string. The time it takes to load the contents of the file
>> > does not contribute significantly to this increase.
>>
>> You don't make sense. The second test performs the match two
>> orders of magnitude more times, so it should be two orders of
>> magnitude slower. It's not comparing a single string that is 100
>> times as long, but rather running the same test 100 times for
>> either 10⁴ or 10⁶ identical strings.
>
>No, he stripped out the newlines with tr.

He might have, if he used GNU tr. However, in that case, my
results are nowhere as dramatic as his.

-- 
Kris Maglione
There are two ways to write error-free programs; only the third works.
	--Alan J. Perlis
Received on Wed Aug 11 2010 - 13:10:53 CEST

This archive was generated by hypermail 2.2.0 : Wed Aug 11 2010 - 13:12:03 CEST