Hi Hypsurus,
I hope you're having fun coding. Don't let me detract from that.
But if you just need to extract links from PDFs, you can do so with
existing tools, e.g.:
pdftohtml -stdout foo.pdf | sed -ne 's/\(^\|\n\)\n\([^\n]*\)\n[^\n]*/\1\2/gp; t; s/href="\([^"]\+\)"/\n\n\1\n/g; D'
Sorry if that sed thing is more complex than it needs to be. I'm
just learning the other sed commands besides s///.
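The extra complexity with the "\n"s is to handle multiple links on
the same line: the second s/// wraps each URL in newlines, and the
first s/// then strips everything around them. Note that it relies on
GNU sed extensions such as \| and \+.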
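If the sed gymnastics are not the point, a simpler route is probably
to let grep split out the matches first. This is only a sketch: it
assumes a grep that supports -o (GNU and BSD grep do), and the function
name extract_hrefs is made up for illustration. It reads HTML on stdin
and prints one href value per line, even with several links on one line.

```shell
# Hypothetical helper: print the value of every href="..." attribute
# found on stdin, one per line.
extract_hrefs() {
    # grep -o prints each match on its own line, so several links on
    # the same input line come out separately.
    grep -o 'href="[^"][^"]*"' |
        # Strip the leading href=" and the trailing quote.
        sed -e 's/^href="//' -e 's/"$//'
}

# Example: two links on a single line of input.
printf '%s\n' '<a href="http://a.example/">A</a> <a href="http://b.example/x">B</a>' |
    extract_hrefs
```

With a real file it would be used as: pdftohtml -stdout foo.pdf | extract_hrefs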
--
Jason
Received on Thu May 07 2015 - 05:19:04 CEST