Re: [dev] sple - A simple PDF links/emails extractor.

From: Jason Woofenden <>
Date: Thu, 7 May 2015 10:09:26 -0400

Well, surprise, surprise, I was thinking about sed instead of
falling asleep for a bit last night.

I was excited about finally finding a way to print out multiple
(possibly transformed) matches per line, but my solution was messy.
Also, it was a little buggy (zero-length outputs broke it.)

Here's the cleanest thing I could find today:

        sed -ne 's/THING/\n\0\n/g' -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'

So to print out all matches, each on its own line (even if
there are multiple matches on the same line), you do your normal
s/// with \n added on either side of what you want printed, then
you add these sed args:

        -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
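A quick way to see the recipe at work is to run it on a few made-up lines. This is only a sketch, assuming GNU sed (the \n in the replacement and the \| in the pattern are not portable). The sample text and the variable names are invented for illustration, and it uses & (the POSIX spelling of "the whole match") where the command above writes \0.

```shell
# Second expression, copied verbatim from the recipe above.
extract='s/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'

# Hypothetical input: two numbers on line 1, none on line 2, one on line 3.
out=$(printf '%s\n' \
        'order 12 shipped, order 345 pending' \
        'nothing here' \
        'id 7' |
      sed -ne 's/[0-9][0-9]*/\n&\n/g' -e "$extract")

# Prints 12, 345 and 7, each on its own line.
printf '%s\n' "$out"
```

The line without any match prints nothing, because -n suppresses the default output and the p flag only fires when the second substitution succeeds.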

So for extracting links from a pdf, that would be:

        pdftohtml -stdout foo.pdf | sed -ne 's/href="\([^"]\+\)"/\n\1\n/g' -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
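If you'd rather try this without a PDF at hand, the sed half can be wrapped in a small shell function and fed a hand-written line of HTML. A sketch, assuming GNU sed; the function name extract_links and the sample input are made up here, and real pdftohtml output will of course look different.

```shell
# Hypothetical helper: reads HTML on stdin, prints one href value per line.
extract_links() {
    sed -ne 's/href="\([^"]\+\)"/\n\1\n/g' \
        -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
}

# Invented sample: two links on the same input line.
links=$(printf '%s\n' \
    '<a href="http://example.com/a">A</a> and <a href="mailto:user@example.com">mail</a>' |
    extract_links)

printf '%s\n' "$links"
```

With the function defined, the pipeline above becomes: pdftohtml -stdout foo.pdf | extract_links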

If you've got GNU sed, you can replace the "' -e '" in the middle
with ";", or for a little speed boost: ";T;" (T skips the second
substitution on lines where the first one matched nothing).
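To check that the three spellings give the same output, here is a small comparison on made-up input. Again a sketch that assumes GNU sed (T is a GNU extension); the variable names and the sample lines are only for illustration.

```shell
mark='s/href="\([^"]\+\)"/\n\1\n/g'
extract='s/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'

# Invented input: one line with two links, one line with none.
sample() {
    printf '%s\n' \
        '<a href="one.html">1</a> <a href="two.html">2</a>' \
        '<p>no links on this line</p>'
}

two_exprs=$(sample | sed -ne "$mark" -e "$extract")   # ' -e ' form
semicolon=$(sample | sed -n "$mark;$extract")         # ";" form
with_skip=$(sample | sed -n "$mark;T;$extract")       # ";T;" form

printf '%s\n' "$two_exprs" "$semicolon" "$with_skip"
```

All three should print one.html and two.html; the ";T;" form just avoids running the second regex on the line that has no link.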

On 2015-05-06 11:19PM, Jason Woofenden wrote:
> Hi Hypsurus,
> I hope you're having fun coding. Don't let me detract from that.
> But if you just need to extract links from pdfs, you can do so with
> existing tools, eg:
> pdftohtml -stdout foo.pdf | sed -ne 's/\(^\|\n\)\n\([^\n]*\)\n[^\n]*/\1\2/gp; t; s/href="\([^"]\+\)"/\n\n\1\n/g; D'
> Sorry if that sed thing is more complex than it needs to be. I'm
> just learning the other sed commands besides s///.
> The extra complexity with the "\n"s is to handle multiple links on
> the same line.
> -- 
> Jason
Received on Thu May 07 2015 - 16:09:26 CEST
