Re: [dev] suckless html to markdown (text)

From: Nick <suckless-dev_AT_njw.me.uk>
Date: Sun, 6 Jan 2019 22:22:00 +0000

Quoth Alexander Krotov:
> > Ideally, with sed/awk, or better in C.
>
> "Parsing" HTML with sed is simply wrong.

This is a good point that I should have mentioned. I spent years
using sed and awk to extract things from HTML, writing crawlers and
suchlike, for personal projects. It can work, of course, but tends
to be very obfuscated and fragile. I haven't needed to do any such
crawling for a while now (and often the data is easier to access as
JSON, an unexpected side-effect of the horrors of JavaScript
overuse), but if I needed to I'd likely look into using something
like Go's HTML parsing these days. I'd rather have something
slightly slower that's more robust and reusable, really. awk is a
good fit for line-based parsing, and sed is good for stream
transformation, but neither works well for parsing machine-generated
mountains of HTML of the sort that dominates the web today.
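
To make the contrast concrete, here is a minimal sketch of pulling
links out of HTML with a real parser instead of line-oriented
tools. It uses Python's standard-library html.parser rather than the
Go parser mentioned above, only so that it runs without external
dependencies. The names LinkExtractor and extract_links are made up
for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, however the markup is laid out."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lowercased tag and attribute names, with
        # attribute values already unquoted, so casing, quoting style and
        # line breaks inside the tag do not matter here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href of every <a> tag in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A tag that is split across several lines, or that uses upper-case
names and single quotes, is the kind of input that tends to defeat a
per-line sed or awk pattern, while the parser treats it like any
other tag.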
