Well, surprise, surprise, I was thinking about sed instead of
falling asleep for a bit last night.
I was excited about finally finding a way to print out multiple
(possibly transformed) matches per line, but my solution was messy.
Also, it was a little buggy (zero-length outputs broke it).
Here's the cleanest thing I could find today:
sed -ne 's/THING/\n\0\n/g' -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
So to print out all matches, each on its own line (even if
there are multiple matches on the same line), you do your normal
s/// and add \n on either side of what you want printed, then you
add these sed args:
-e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
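A minimal sketch of that recipe on made-up input, assuming GNU sed.
Here [0-9]\+ stands in for THING, and the function name
extract_numbers and the sample text are placeholders for
illustration only. I wrote & for the whole match, which is the
standard spelling.

```shell
# Hypothetical helper: print every run of digits on its own line.
# Assumes GNU sed (\n in the replacement, \| and \+ in the pattern).
extract_numbers() {
  # 1st s///: wrap each match in newlines.
  # 2nd s///: keep only the text between newline pairs, and print the
  #           line only if something was kept (p flag plus -n).
  sed -ne 's/[0-9]\+/\n&\n/g' \
      -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
}

# Sample input: one line with two matches, one line with none.
printf 'a 12 b 345 c\nno digits here\n' | extract_numbers
```

That should print 12 and 345 on separate lines, and nothing for the
second input line since it has no match.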
So for extracting links from a pdf, that would be:
pdftohtml -stdout foo.pdf | sed -ne 's/href="\([^"]\+\)"/\n\1\n/g' -e 's/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
If you've got GNU sed, you can replace the "' -e '" in the middle
with ";", or for a little speed boost: ";T;".
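For what it's worth, here is a sketch of the one-script form with
";T;", again assuming GNU sed. T jumps to the end of the script when
the preceding s/// changed nothing, so lines without a match never
reach the second s///. The function name extract_hrefs and the
sample HTML are made up; a real run would read from pdftohtml
instead.

```shell
# Hypothetical helper: same link extraction as above, as one sed script.
# Assumes GNU sed (the T command is a GNU extension).
extract_hrefs() {
  sed -ne 's/href="\([^"]\+\)"/\n\1\n/g;T;s/\(^[^\n]*\n\|\(\n\)\)\([^\n]*\)\n[^\n]*/\2\3/gp'
}

# Sample input: one line with two links, one line with none.
printf '%s\n' \
  '<a href="http://a.example/">x</a> <a href="http://b.example/">y</a>' \
  '<p>no links here</p>' | extract_hrefs
```

That should print the two URLs, one per line, and skip the second
input line entirely.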
--
Jason
On 2015-05-06 11:19PM, Jason Woofenden wrote:
> Hi Hypsurus,
>
> I hope you're having fun coding. Don't let me detract from that.
> But if you just need to extract links from pdfs, you can do so with
> existing tools, eg:
>
> pdftohtml -stdout foo.pdf | sed -ne 's/\(^\|\n\)\n\([^\n]*\)\n[^\n]*/\1\2/gp; t; s/href="\([^"]\+\)"/\n\n\1\n/g; D'
>
> Sorry if that sed thing is more complex than it needs to be. I'm
> just learning the other sed commands besides s///.
>
> The extra complexity with the "\n"s is to handle multiple links on
> the same line.
>
> --
> Jason
>
Received on Thu May 07 2015 - 16:09:26 CEST