Re: [dev] off topic - awk versions performance comparison

From: Joseph Xu <josephzxu_AT_gmail.com>
Date: Wed, 11 Aug 2010 15:06:13 -0400

On 08/11/2010 07:10 AM, Kris Maglione wrote:
> On Wed, Aug 11, 2010 at 04:01:27AM -0700, Robert Ransom wrote:
>> On Wed, 11 Aug 2010 06:41:30 -0400
>> Kris Maglione <maglione.k_AT_gmail.com> wrote:
>>
>>> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
>>>> Hi everyone,
>>>>
>>>> I was playing around with awk and ran into some surprising data for the
>>>> function match(s, r), which returns the position of the first match of
>>>> regular expression r in string s. Here's the test I ran:
>>>>
>>>> $ yes | head -10000 | tr -d '\n' >/tmp/big
>>>> $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
>>>> $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
>>>>
>>>> real 0m0.056s
>>>> user 0m0.053s
>>>> sys 0m0.000s
>>>> $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
>>>>
>>>> real 0m5.695s
>>>> user 0m5.140s
>>>> sys 0m0.553s
>>>>
>>>> The difference is almost exactly 100x, which is the size difference
>>>> between the input files. It seems ridiculous that the amount of time
>>>> taken to match the first character of a string grows linearly with the
>>>> size of the string. The time it takes to load the contents of the file
>>>> does not contribute significantly to this increase.
>>>
>>> You don't make sense. The second test performs the match two
>>> orders of magnitude more times, so it should be two orders of
>>> magnitude slower. It's not comparing a single string that is 100
>>> times as long, but rather running the same test 100 times for
>>> either 10⁴ or 10⁶ identical strings.
>>
>> No, he stripped out the newlines with tr.
>
> He might have, if he used GNU tr. However, in that case, my
> results are nowhere as dramatic as his.
>

I was using GNU tr. The input files were single lines with 10000 or 1000000 y's, so I was doing 100 matches in each case (from the for loop) on the same line. I guess I should have made that more explicit, sorry. I'm interested in what results you're getting.
Received on Wed Aug 11 2010 - 21:06:13 CEST

This archive was generated by hypermail 2.2.0 : Wed Aug 11 2010 - 21:12:02 CEST