Re: [hackers] A better mailing list web archiver for ... ?

From: Thomas Oltmann <>
Date: Fri, 12 Aug 2022 14:57:19 +0200

On Fri, Aug 12, 2022 at 7:35 AM Storkman <> wrote:
> On Wed, Aug 10, 2022 at 09:29:43PM +0200, Thomas Oltmann wrote:
> > Hi all!
> >
> > I think we can all agree that the current web archive over at
> > isn't all that great;
> > Author names get mangled, the navigation is terrible, some messages
> > are duplicated, some missing.
> >
> > That's why I've started looking into #3 of the 'Project Ideas' page
> > ( -- "Write a decent mailing list
> > Web archive system".
> > I see lots of potential to build something better than hypermail:
> >
> > - We could take text encodings more seriously.
> > hypermail just copies the 'charset' notice over into the HTML
> > file, which doesn't work when listing multiple messages.
> >
> > - We could use maildir instead of the really brittle mbox format for mailboxes.
> > This might also help avoid message dropping/duplication, but I'm not
> > sure about that.
> >
> > - We could try a different navigation scheme. Perhaps flat threads
> > instead of a hierarchy?
> > I don't really know how people here feel about this, but it's
> > mentioned on the 'Project Ideas' page
> > and I'm in favour of it. Navigating message trees is really confusing.
> >
> > - Bonus: We can ignore CGI, uuencode, HTML mail and all that cruft.
> >
> > Is there currently any interest in such a project here?
> >
> > So far, I've gone ahead and implemented a sort of proof-of-concept (at
> >
> > Of course I can't guarantee that this will go anywhere, as I only have
> > limited time and patience myself, but I can give it a try.
> >
> > Cheers,
> > Thomas Oltmann
> >
> Hi!
> When you list all these features, it sounds like everything a mailing list
> archive front-end does just replicates things our mail clients already
> do better, and without going through a web browser.
> So I thought, why not just serve the maildir files as-is, with monthly
> and yearly tarballs, and perhaps metadata files so you don't need to
> download everything just to make sure you've got an entire thread?
> But then, that would require additional instrumentation and would make e.g.
> referencing mailing list threads in commit messages slightly less convenient.

There's some overlap in functionality with mail clients, yes, but the
big difference IMO is that a mail archiver aggregates the mail traffic
and turns it into proper *documents* that can easily be _viewed_,
_distributed_ *and* _referenced_ by anyone.
It doesn't matter what format these documents are actually in -
HTML, plain text, PDF, whatever.
For example, if a newbie asks "Help, I can't apply the dwm-alpha patch"
you just want to be able to give him a link to the last time this was
answered.
Similarly, when you write a blog entry referencing recent discussions on
the mailing list, having some link or document that you can put in your
references is great.

But aside from that, additionally distributing a tarball might be a
really good idea for long-term archival.
tar and RFC 822 have been around for roughly 40 years and will likely
stay for another 100.

> In any case, I messed with the code a bit, running it on my own archive
> maildir. I've constructed a very crude threaded view[1], and came up with a
> few fixes in the process.
> Patch 2 is a rewrite of collapse_ws(), because I found it really hard to
> figure out what exactly it does and how. Your mileage may vary, but I
> think the original code would overflow the buffer backwards when given
> an empty input.
> For patch 3, I've found some e-mails in the wild that used a lowercase
> encoding in encoded-words, and the RFC says it's okay.
> Patch 4 might not be correct, because I'm not sure how decode_qprintable()
> can ever return without error when parsing an encoded-word in a header.
> It seems that it would just find the last "=" in "?=", set length to -2,
> and return NULL. Maybe I'm just not getting it. It did manage to process
> a few dozen more e-mails in my test runs, though.
> Hopefully I did this correctly and you can cherry-pick these commits
> to your taste.
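
To make the empty-input concern about collapse_ws() concrete, here is a
minimal sketch of a whitespace collapser that only ever moves its write
index forward, so it cannot step before the start of the buffer. This is
not the code from the patch; the name collapse_ws_sketch and the exact
behaviour (trim both ends, squeeze each run to one space) are assumptions
made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for collapse_ws(): squeeze every run of
 * whitespace into a single space and trim both ends, in place.
 * Returns the length of the resulting string. */
size_t
collapse_ws_sketch(char *s)
{
	size_t r, w = 0;
	int pending = 0; /* whitespace seen after at least one output byte */

	for (r = 0; s[r] != '\0'; r++) {
		if (isspace((unsigned char)s[r])) {
			/* Leading whitespace (w == 0) is dropped entirely. */
			pending = (w > 0);
		} else {
			if (pending)
				s[w++] = ' ';
			pending = 0;
			s[w++] = s[r];
		}
	}
	/* Trailing whitespace was never written; w <= r always holds. */
	s[w] = '\0';
	return w;
}
```

With an empty string the loop body never runs, the function writes the
terminator at index 0 and returns 0.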

Thanks a bunch. I'll take a closer look at your patches when I find some time.
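
For reference, the two encoded-word points above (a lowercase encoding
letter, and the "=" in the closing "?=" being mistaken for an escape) can
be sketched as follows. This is an illustration under assumptions, not
the project's decode_qprintable(): the function name decode_q_word, its
signature, and the choice to locate the "?=" terminator before decoding
are all made up for this example. It handles only the "Q" encoding and
does no charset conversion.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
hexval(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = tolower(c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Hypothetical helper: decode one encoded-word like
 * "=?utf-8?q?caf=C3=A9?=" into out as raw bytes, NUL-terminated.
 * Returns the number of bytes written (excluding the NUL), or -1 on
 * malformed input or if out is too small. */
long
decode_q_word(const char *word, char *out, size_t outsz)
{
	const char *p, *end;
	size_t n = 0;

	if (outsz == 0 || strncmp(word, "=?", 2) != 0)
		return -1;
	p = strchr(word + 2, '?'); /* end of the charset field */
	if (p == NULL || p[1] == '\0' || p[2] != '?')
		return -1;
	if (tolower((unsigned char)p[1]) != 'q') /* accept "Q" and "q" */
		return -1;
	p += 3; /* start of the encoded text */
	end = strstr(p, "?="); /* find the terminator before decoding */
	if (end == NULL)
		return -1;
	for (; p < end; p++) {
		int c = (unsigned char)*p;

		if (c == '_') {
			c = ' ';
		} else if (c == '=') {
			int hi, lo;

			if (end - p < 3) /* escape would run into "?=" */
				return -1;
			hi = hexval((unsigned char)p[1]);
			lo = hexval((unsigned char)p[2]);
			if (hi < 0 || lo < 0)
				return -1;
			c = hi * 16 + lo;
			p += 2;
		}
		if (n + 1 >= outsz) /* keep room for the NUL */
			return -1;
		out[n++] = (char)c;
	}
	out[n] = '\0';
	return (long)n;
}
```

Because the decoding loop stops at the terminator, the "=" in "?=" is
never interpreted as the start of a hex escape.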

> -- Storkman
> [1]:
Received on Fri Aug 12 2022 - 14:57:19 CEST