Re: [dev] alternatives to find for querying the filesystem

From: Andrew Gwozdziewycz <web_AT_apgwoz.com>
Date: Thu, 12 Dec 2013 17:14:28 -0500

On Thu, Dec 12, 2013 at 3:36 PM, Troels Henriksen <athas_AT_sigkill.dk> wrote:
> Troels Henriksen <athas_AT_sigkill.dk> writes:
>
>> Andrew Gwozdziewycz <web_AT_apgwoz.com> writes:
>>
>>> Assume that each filter halves the fileset of, say, 256 files (my /etc
>>> directory on this OSX machine has just 247 files). That's less than
>>> 512 calls with a few filters. Is that really so bad on modern
>>> hardware?
>>
>> If you have only 256 files, you can do almost anything and it'll still
>> be fast. Think tens of thousands, at the very least.
>>
>> I like the general idea of this, but I'd advise you not to go overboard
>> with the tool splitting. You can easily have a single "filegrep" tool
>> that includes all queries about size, age and the like, paired with
>> another tree-walking tool.
>
> Actually, on second thought, there's no way to make this fast. You
> *must* be able to perform filtering during tree traversal, or you may
> end up traversing huge subdirectories unnecessarily.

Is it possible for find to eliminate huge subdirectories? It seems to
have the exact same problem. If your "query" is at the file level,
ain't nothing you can do. You could modify walk with a way to exclude
directories of course.

     walk -e var/run/ -e bin/
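
To make the exclusion idea concrete, here is a minimal sketch of what
such a walk could do. It is written in Python only for illustration. The
walk tool and its -e flag are hypothetical, and the function name and
the "exclude" parameter below are my own assumptions. The point is that
excluded directories are pruned during traversal, so they are never
opened at all.

```python
import os

def walk(root, exclude=()):
    """Yield every path under root, never descending into excluded dirs.

    `exclude` holds directory paths relative to root, e.g. "var/run/".
    This mirrors the hypothetical `walk -e DIR` invocation; it is a
    sketch, not an existing tool.
    """
    # Normalise "var/run/" to "var/run" so comparisons are exact.
    excluded = {os.path.normpath(e) for e in exclude}
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Prune in place: os.walk skips directories removed from dirnames.
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(rel, d)) not in excluded
        ]
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)
```

Calling walk("/", exclude=["var/run/", "bin/"]) would then correspond to
the command line above.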

Again, part of this is that you potentially know the file system
better than find does, so you can construct a pipeline that eliminates
files more quickly. Of course, that means you're going to end up
typing a lot every time:

      walk /var | ownerp -u ~root | sizep +1M

And, now that I think about it, walk never has to stat the files,
which means you can do a fair amount of filtering before making any
system calls that aren't directory related.
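
A rough sketch of that ordering, again in Python purely for
illustration: a name filter that is plain string work, followed by a
size filter that is the first stage to stat anything. The names namep
and sizep and their parameters are hypothetical, and I am assuming that
"+1M" would mean "larger than 1 MiB".

```python
import fnmatch
import os

def namep(paths, pattern):
    """Keep paths whose basename matches a glob; no system calls needed."""
    for path in paths:
        if fnmatch.fnmatch(os.path.basename(path), pattern):
            yield path

def sizep(paths, min_bytes):
    """Keep paths larger than min_bytes; first stage that stats a file."""
    for path in paths:
        try:
            size = os.lstat(path).st_size
        except OSError:
            continue  # vanished or unreadable between walk and filter
        if size > min_bytes:
            yield path
```

Chained as sizep(namep(paths, "*.log"), 1 << 20), only the files that
survive the name filter are ever passed to lstat.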

Also, just for kicks I ran a comparison:

$ time find / | grep 'bin' > /dev/null
real 0m8.122s
user 0m3.101s
sys 0m2.519s

$ time find / -regex 'bin' | grep
real 0m18.795s
user 0m3.394s
sys 0m3.401s

This is on a very recent MacBook Pro, so SSD and all that jazz. I ran
each command a few times, and the numbers were similar on every run.
It's possible that the difference is caused by find's regex engine
(which might suck horribly).
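
One caveat about the comparison itself: as far as I know, both the GNU
and BSD find manuals describe -regex as a match against the whole path,
whereas grep searches for the pattern anywhere in the line. So the two
commands may not be selecting the same files. The sketch below uses
Python's re module as a stand-in to show the difference; the sample
paths are made up, and the regex dialects of find, grep, and Python are
not identical.

```python
import re

# Made-up sample paths, in the form that "find /" would print them.
paths = ["/bin", "/usr/bin/env", "/etc/passwd", "bin"]

# grep 'bin' keeps any line that contains the pattern somewhere.
grep_like = [p for p in paths if re.search(r"bin", p)]

# A whole-path match keeps a path only if the pattern covers all of it.
find_like = [p for p in paths if re.fullmatch(r"bin", p)]

# To select the same paths as the grep, the pattern needs wildcards.
find_equiv = [p for p in paths if re.fullmatch(r".*bin.*", p)]
```

Under that reading, a whole-path pattern of 'bin' would select nothing
below /, and '.*bin.*' would be the closer equivalent to the grep.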

So, I'm perhaps *not* convinced that this would always be slower.

-- 
http://apgwoz.com
Received on Thu Dec 12 2013 - 23:14:28 CET
