Re: [dev] another text user interface for sam

From: Connor Lane Smith <cls_AT_lubutu.com>
Date: Wed, 2 Mar 2016 00:12:58 +0000

On 1 March 2016 at 20:35, Marc André Tanner <mat_AT_brain-dump.org> wrote:
> As an example, swapping two words with
>
> ,x/[a-zA-Z]+/{
> g/fred/ v/...../ c/jim/
> g/jim/ v/..../ c/fred/
> }
>
> which is mentioned in both the sam tutorial and the cheatsheet would
> no longer work.

That's true; I suppose it depends on how useful that is in general.
It might be nice to have both behaviours really, one 'seq' and one
'par' (cf. ';' and '&' in sh). The former would suffice for most use
cases, and would be much faster. The drawback is that we'd still have
to implement the parallel version anyway (though we could leave that
for later).
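
To make the difference concrete, here's a toy C sketch of the two
semantics applied to the fred/jim swap (the names and structure are
invented for illustration, nothing to do with sam's internals):

    /* Toy illustration of 'seq' vs 'par' semantics, using the
     * fred/jim swap.  All names here are invented for the example;
     * this is not how sam implements anything. */
    #include <stdio.h>
    #include <string.h>

    static const char *rule_from[] = { "fred", "jim" };
    static const char *rule_to[]   = { "jim", "fred" };

    /* 'seq': each command sees the result of the previous one. */
    static const char *apply_seq(const char *word)
    {
        for (int i = 0; i < 2; i++)
            if (strcmp(word, rule_from[i]) == 0)
                word = rule_to[i];
        return word;
    }

    /* 'par': every command matches against the original text. */
    static const char *apply_par(const char *word)
    {
        for (int i = 0; i < 2; i++)
            if (strcmp(word, rule_from[i]) == 0)
                return rule_to[i];
        return word;
    }

    int main(void)
    {
        printf("seq: fred -> %s, jim -> %s\n",
            apply_seq("fred"), apply_seq("jim"));  /* fred -> fred! */
        printf("par: fred -> %s, jim -> %s\n",
            apply_par("fred"), apply_par("jim"));  /* fred -> jim */
        return 0;
    }

Under 'seq' the second guard sees the jim the first command just
created, so the swap undoes itself; under 'par' both guards are
anchored to the original text.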

On 1 March 2016 at 20:56, Marc André Tanner <mat_AT_brain-dump.org> wrote:
> Misunderstanding on my part I guess. These are just limitations of
> the regex(3) API but have no particular relevance for *structural*
> regular expressions. I.e. it is possible to build a structural
> regexp library on top of this crippled interface; it probably just
> won't be particularly efficient.

Fair enough. It has those limitations in common with every other
regular expression library API I know of, though. Of course, it would
be theoretically possible to build structural regular expressions on
top of regex(3), but it would be an abomination.
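
To make that concrete, here is roughly what looping over x/re/-style
matches on top of plain POSIX regex(3) would look like; only a sketch
of the obvious approach, not anything sam or vis does. The interface
demands a contiguous NUL-terminated buffer and forgets all automaton
state between calls, so each iteration starts cold at the new offset:

    /* Sketch: looping x/re/-style over a buffer with POSIX regex(3).
     * Workable, but the API wants a contiguous NUL-terminated buffer
     * and restarts the automaton on every call. */
    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        const char *text = "one two three";
        regex_t re;
        regmatch_t m;
        int off = 0, flags = 0;

        if (regcomp(&re, "[a-z]+", REG_EXTENDED) != 0)
            return 1;
        while (regexec(&re, text + off, 1, &m, flags) == 0) {
            printf("match at %d..%d\n",
                off + (int)m.rm_so, off + (int)m.rm_eo);
            off += m.rm_eo > m.rm_so ? (int)m.rm_eo : 1; /* skip empty */
            flags = REG_NOTBOL;  /* ^ must not match at the new offset */
        }
        regfree(&re);
        return 0;
    }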

> For others wanting to take a look, I guess you are referring to
> the following papers:
>
> - Two-pass greedy regular expression parsing
> - Optimally Streaming Greedy Regular Expression Parsing

Yes indeed.

> What is in your opinion the current state of the art technique/algorithm?

Well, I think the 'two-pass' approach, described in the first of those
papers, is particularly interesting. There's been a big resurgence of
interest in Thompson's VM-style approach to regular expressions,
probably mostly because of Russ Cox's excellent write-up on it. That's
fantastic, because it's clearly superior to the backtracking approach.
But all of those benefits are thrown out the window as soon as you
start looping over the matches of a sufficiently nondeterministic
expression. Consider the following two
structural regular expressions, which in this file have identical
matches:

    $ time ssam 'x/./' rc.1 >/dev/null
    real 0m0.020s
    user 0m0.008s
    sys 0m0.004s

    $ time ssam 'x/(.*\n)+\n|./' rc.1 >/dev/null
    real 0m9.730s
    user 0m9.672s
    sys 0m0.048s

The matches are identical, and yet the latter takes vastly longer --
quadratic in the input size, in fact. I was originally going to use
my dictionary file as the example, but it never finished and I had to
kill sam. Yet, using the 'two-pass' approach, this could be done in
linear time. The 'lean log' they use in their paper would take a bit
more space, about one bit per byte for the above example, but I think
if we do it per block we should be able to cut that down massively,
perhaps to as little as one bit per state per block, which would
obviously be much more manageable.
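
For a sense of scale, one bit per byte is 128 KiB of log per MiB of
text, while one bit per state per block (assuming, say, 32 NFA states
and 4 KiB blocks; both numbers invented) is on the order of 1 KiB per
MiB. Here's a minimal sketch of the storage side of such a log, with
the NFA simulation that would actually decide each bit left out:

    /* Minimal sketch of the storage side of a 'lean log': one bit is
     * appended per input byte on the forward pass, and the bits are
     * read back in reverse on the second pass.  The NFA simulation
     * that would decide each bit is left out. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct bitlog {
        uint8_t *bits;
        size_t len;  /* bits appended so far */
        size_t cap;  /* capacity in bits; always a multiple of 8 */
    };

    static void bitlog_append(struct bitlog *l, int bit)
    {
        if (l->len == l->cap) {
            l->cap = l->cap ? 2 * l->cap : 4096;
            l->bits = realloc(l->bits, l->cap / 8);
            if (!l->bits)
                abort();
        }
        if (bit)
            l->bits[l->len / 8] |= 1 << (l->len % 8);
        else
            l->bits[l->len / 8] &= ~(1 << (l->len % 8));
        l->len++;
    }

    static int bitlog_get(const struct bitlog *l, size_t i)
    {
        return (l->bits[i / 8] >> (i % 8)) & 1;
    }

    int main(void)
    {
        struct bitlog l = { 0, 0, 0 };
        const char *input = "ab\n";

        for (const char *p = input; *p; p++)
            bitlog_append(&l, *p == '\n'); /* stand-in for a real bit */
        for (size_t i = l.len; i-- > 0; ) /* backward pass */
            printf("%d", bitlog_get(&l, i));
        printf("\n");
        free(l.bits);
        return 0;
    }

Per block, the idea would then be to checkpoint the set of live NFA
states at each block boundary, so that a block's bits can be dropped
and recomputed on demand instead of being kept for the whole file.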

I haven't thought so much about the 'optimally streaming' part yet. I
don't think it would be necessary to go to the lengths they do in
their paper, since we don't actually need strict optimality, but it
would be nice if we could reduce the amount of data we have to read in
order to get a match. One nice benefit of this is that it would
strike off the bug identified in ssam(1): "Ssam consumes all of
standard input before running the script." If we were to stream data
through a fairly lazy regular expression engine, this would hopefully
be pretty easy to fix in a future ssam implementation.
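
As a sketch of the shape this could take (the pattern and block size
are invented for the example): drive the matcher one block at a time
and return as soon as the match is decided, leaving the rest of
standard input unread.

    /* Sketch of the lazy, streaming behaviour: feed a matcher one
     * block at a time and stop reading the moment the match is
     * decided.  The hand-rolled DFA below recognises the toy pattern
     * a*b anchored at the start of input; a real engine could be
     * driven the same way. */
    #include <stdio.h>

    int main(void)
    {
        char buf[4096];
        size_t n, i, pos = 0;

        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            for (i = 0; i < n; i++, pos++) {
                if (buf[i] == 'b') {
                    printf("a*b matches bytes 0..%zu\n", pos + 1);
                    return 0;  /* done: stdin is not drained */
                }
                if (buf[i] != 'a') {
                    printf("no match at byte %zu\n", pos);
                    return 1;
                }
            }
        }
        printf("no match: input exhausted\n");
        return 1;
    }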

Thanks,
cls