Re: [dev] freetype2/fc pain from Laslo Hunhold on 2018-09-26 (dev mail list archive)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Wed, 26 Sep 2018 01:06:52 +0200

On Tue, 25 Sep 2018 21:25:12 +0000
sylvain.bertrand_AT_gmail.com wrote:

Dear Sylvain,

> A Unicode string has 4 canonical normalizations. But only one (NFD)
> seems to be future proof regarding what features will be supported by
> font files (opentype(microsoft tm)/open font format).

this is true, as only the full decomposition is even remotely
"mendable". All the other normalization forms are a nightmare.

The upside is that these normalizations are only really relevant for
text operations such as comparisons. The way codepoints are laid out
does not immediately say anything about the drawn size of a glyph,
which has to be determined by other means.

st, for that matter, does not and in a way should not mess with
normalization, and Unicode fortunately defines glyph boundaries
relatively simply. I have had a very minimalistic library for this
idling on my hard drive for almost 3 years now (Roberto and I hacked on
it in Budapest after slcon2 in 2015).
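
To illustrate why counting glyphs does not need a normalizer, here is a
minimal sketch in C. It is not the library mentioned above. It assumes a
heavily simplified boundary rule: a code point starts a new glyph unless
it lies in the Combining Diacritical Marks block (U+0300..U+036F). The
real Unicode segmentation rules cover many more cases, and the names
utf8_decode() and count_glyphs() are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from s (at most n bytes). Returns the number
 * of bytes consumed, or 0 if n is 0. Malformed or truncated input
 * yields U+FFFD and consumes one byte. Simplified: overlong forms and
 * encoded surrogates are not rejected.
 */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t len, i;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = s[0] & 0x1F;
		len = 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = s[0] & 0x0F;
		len = 3;
	} else if ((s[0] & 0xF8) == 0xF0) {
		*cp = s[0] & 0x07;
		len = 4;
	} else {
		*cp = 0xFFFD;
		return 1;
	}
	if (len > n) {
		*cp = 0xFFFD;
		return 1;
	}
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80) {
			*cp = 0xFFFD;
			return 1;
		}
		*cp = (*cp << 6) | (s[i] & 0x3F);
	}
	return len;
}

/*
 * Count glyphs in the n bytes at str under the simplified rule:
 * every code point outside U+0300..U+036F starts a new glyph.
 */
static size_t
count_glyphs(const char *str, size_t n)
{
	const unsigned char *s = (const unsigned char *)str;
	size_t off = 0, used, count = 0;
	uint32_t cp;

	while ((used = utf8_decode(s + off, n - off, &cp)) > 0) {
		if (!(cp >= 0x0300 && cp <= 0x036F))
			count++;
		off += used;
	}
	return count;
}
```

Under this rule the precomposed 'é' (bytes C3 A9) and the decomposed
'e' + U+0301 (bytes 65 CC 81) both count as one glyph, without any
normalization step.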

> Ofc, this is the one canonical normalizations which hard depends on
> harfbuzz shaping in freetype. For instance the glyph 'é' won't be
> anymore 1 glyph (a "pre-combined" glyph) in the font file but will be
> the combined rendering of 'e' + 'combining accent' glyphs which only
> harfbuzz understands and not freetype alone. Font designers are pushed
> to avoid making "pre-combined" glyphs: pre-combined glyphs are not
> allowed in unicode anymore (actually, it has been the case for quite
> some time). And that's the simple case of combined glyphs...

This is the true issue, yes, and this whole concept is way ahead of the
technological ecosystem. Only now have people begun to respect
"codepoints" and not just the code units themselves. However, what
Unicode preaches is that a glyph is free to be composed of arbitrarily
many codepoints, complicating the whole matter a lot.

> Additionally, xml smile/svg vector rendering was introduced in the
> otf/ttf font format with animated color emojis: A futur "clean" pure
> xml font format is lurking on the horizon (open type 2?).

We should ignore this nonsense.

> The unicode canonical normalization also affects input: the
> application won't receive anymore 1 unicode code point for a
> "pre-combined" symbol 'é', but 2 unicode code points 'e' + 'combining
> accent'.

This is not a problem with st, but a general "issue" in text
processing. I heavily researched this a while back and if we e.g. went
100% with that in sbase, we would always have to have a normalizer
running in the background. I wrote a simple parser in awk(1) that takes
the Unicode-data and turns it into a LUT for NFD-processing, but it all
complicates things a lot.
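
As a rough idea of what such a LUT could look like in C (this is not
the awk(1) output mentioned above), the sketch below hard-codes four
hand-picked canonical decompositions and looks them up by binary search.
The struct layout and the name decompose() are assumptions. A real NFD
implementation would additionally need the full generated table,
recursive decomposition, Hangul handling and canonical reordering of
combining marks, all of which are left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* One table entry: code point and its two-element decomposition. */
struct decomp {
	uint32_t cp;
	uint32_t to[2];
};

/* Hand-picked excerpt, sorted by cp so it can be binary-searched. */
static const struct decomp decomp_lut[] = {
	{ 0x00C5, { 0x0041, 0x030A } }, /* A with ring above */
	{ 0x00E8, { 0x0065, 0x0300 } }, /* e with grave */
	{ 0x00E9, { 0x0065, 0x0301 } }, /* e with acute */
	{ 0x00FC, { 0x0075, 0x0308 } }, /* u with diaeresis */
};

/*
 * Write the decomposition of cp to out and return its length (1 or 2).
 * Code points without a table entry are passed through unchanged.
 */
static size_t
decompose(uint32_t cp, uint32_t out[2])
{
	size_t lo = 0, hi = sizeof(decomp_lut) / sizeof(decomp_lut[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (decomp_lut[mid].cp == cp) {
			out[0] = decomp_lut[mid].to[0];
			out[1] = decomp_lut[mid].to[1];
			return 2;
		}
		if (decomp_lut[mid].cp < cp)
			lo = mid + 1;
		else
			hi = mid;
	}
	out[0] = cp;
	return 1;
}
```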

I understand why they did it like this, but this is UTF-16 all over
again: since surrogate pairs are rare in common input data, people made
the mistake of assuming that a code point always fits into a single
16-bit unit, while code points beyond the Basic Multilingual Plane are
actually encoded as two 16-bit units.
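
For reference, the UTF-16 pitfall fits in a few lines of C. The sketch
below decodes one code point from a sequence of 16-bit units and
combines a high/low surrogate pair into a single code point above
U+FFFF. The function name utf16_decode() and the choice to map unpaired
surrogates to U+FFFD are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from u (at most n code units). Returns the
 * number of units consumed (1 or 2), or 0 if n is 0. Unpaired
 * surrogates yield U+FFFD and consume one unit.
 */
static size_t
utf16_decode(const uint16_t *u, size_t n, uint32_t *cp)
{
	if (n == 0)
		return 0;

	if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {
		/* high surrogate: needs a low surrogate right after it */
		if (n >= 2 && u[1] >= 0xDC00 && u[1] <= 0xDFFF) {
			*cp = 0x10000 +
			      (((uint32_t)(u[0] - 0xD800) << 10) |
			       (uint32_t)(u[1] - 0xDC00));
			return 2;
		}
		*cp = 0xFFFD;
		return 1;
	}
	if (u[0] >= 0xDC00 && u[0] <= 0xDFFF) {
		/* low surrogate without a preceding high surrogate */
		*cp = 0xFFFD;
		return 1;
	}
	*cp = u[0];
	return 1;
}
```

Code that treats every 16-bit unit as a code point works for U+00E9,
but splits the pair D83D DE00 (U+1F600) into two meaningless halves.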

> st is surrounded.

I wouldn't overdramatize this. The terminal emulation backend couldn't
care less about Unicode and all that and the robustness of UTF-8 allows
us to just carry on. The real problem is to "judge" how much space the
given data is going to take in the drawing step, not to forget about
the huge problem with the font drawing library.

As you already mentioned, having all this NFD-combining-mess definitely
complicates the process of font-drawing compared to just having these
characters already "ready" for use.

> The suckless future proof solution: it is over, st goes 7bits ascii
> only with its own bitmap fonts... non english-only terminal users
> will just trash it.
>
> ... or a suckless future proof unicode/font stack will have to be
> coded:
> - unicode normalizer (NFD) (like ICU)

ICU is a dead end, as it loads localized data on the fly. The
normalizer, if implemented, would only use the "global" tables.
Such a normalizer would not be necessary for st though and we would
only need a tool to count glyphs, which I've already done.

> - a full xml smile/svg vector renderer (like librsvg/expat for
> the svg part)

No, forget about SVG fonts. Nobody sane would think about implementing
this while keeping simplicity and security in mind.

> - a ttf/otf -> xml svg translator (in freetype).

There's no need to translate to SVG. TTF/OTF is actually quite a
convenient vector format, and if one were to develop a font-rendering
library, one would want to split the task into three steps:

   1) Parsing TTF/OTF files
   2) Assembling vector drawing instructions (hardest part)
   3) Rasterization (watch out for patents here)
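
To give an impression of step 1, here is a small sketch in C that reads
the table directory at the start of a TTF/OTF file and looks up a table
by its four-byte tag. It only relies on the documented header layout
(a 32-bit version, a 16-bit table count, then 16-byte records of tag,
checksum, offset and length, all big-endian). The function name
sfnt_find_table() is made up, and font collections (.ttc), the Apple
'true' version tag and checksum verification are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read big-endian integers from a byte buffer. */
static uint16_t
be16(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t
be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Look up the table with the given tag in the font data buf[0..len).
 * On success, store its offset and length and return 0. Return -1 if
 * the header is malformed, the table is missing, or the table would
 * lie outside the buffer.
 */
static int
sfnt_find_table(const unsigned char *buf, size_t len, const char tag[4],
                uint32_t *off, uint32_t *tlen)
{
	const unsigned char *rec;
	uint32_t version;
	uint16_t i, ntables;

	if (len < 12)
		return -1;
	version = be32(buf);
	/* 0x00010000: TrueType outlines, 'OTTO': CFF outlines */
	if (version != 0x00010000 && version != 0x4F54544F)
		return -1;
	ntables = be16(buf + 4);
	if ((len - 12) / 16 < ntables)
		return -1;

	for (i = 0; i < ntables; i++) {
		rec = buf + 12 + (size_t)i * 16;
		if (memcmp(rec, tag, 4) != 0)
			continue;
		*off = be32(rec + 8);
		*tlen = be32(rec + 12);
		if (*off > len || *tlen > len - *off)
			return -1;
		return 0;
	}
	return -1;
}
```

Every offset and length read from the file is checked against the
buffer size before use, which is the kind of discipline a font parser
needs given that it processes untrusted input.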

> ... or st becomes like surf: an app which is a thin suckless wrapper
> around a huge pile of ... You know what: st would be better off being
> a thin wrapper around libvte then, because it would be even thinner.

We shouldn't throw the baby out with the bathwater, in my opinion.
There is lots of pent up frustration out there about
freetype/fontconfig and there are relatively simple solutions that
could be a starting point for a solid homegrown solution.

I hope it does not sound like NIH syndrome, but the madness needs to
stop and freetype/fontconfig is a horrible security hole. The only
thing you really need for a font-database is a list of fonts in
descending order of priority (i.e. a fallback-array). The API for such
a library, let's call it sfl (suckless font library), would be very
simple:

   struct sfl { ... };

   sfl_init(struct sfl *s, char **files, size_t nfiles);
   sfl_draw(...);
   sfl_free(struct sfl *s);

Some functionality, like getting the "length" of the drawn string,
can be realized by e.g. passing NULL for the drawing surface in
sfl_draw(), no matter how we implement it in detail.
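
To make the fallback-array and the NULL-surface idea a bit more
concrete, here is a purely hypothetical sketch in C. It is not the sfl
API, whose details are left open above. Each "font" is reduced to a
file name, one covered code point range and a fixed advance width, and
"drawing" just records which font index was used. All names
(struct font_stub, pick_font(), draw_stub()) are invented for this
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a loaded font. */
struct font_stub {
	const char *file;     /* where the font came from */
	uint32_t first, last; /* covered code point range, inclusive */
	unsigned advance;     /* fixed advance width in pixels */
};

/* Return the first font in the fallback array covering cp, or NULL. */
static const struct font_stub *
pick_font(const struct font_stub *fonts, size_t nfonts, uint32_t cp)
{
	size_t i;

	for (i = 0; i < nfonts; i++)
		if (cp >= fonts[i].first && cp <= fonts[i].last)
			return &fonts[i];
	return NULL;
}

/*
 * Return the summed advance width of the given code points. If
 * surface is non-NULL, it must hold at least ncps bytes and receives
 * the index of the font used for each covered code point (a
 * placeholder for real rasterization). With surface == NULL the call
 * only measures. Code points no font covers are skipped.
 */
static unsigned
draw_stub(unsigned char *surface, const struct font_stub *fonts,
          size_t nfonts, const uint32_t *cps, size_t ncps)
{
	const struct font_stub *f;
	unsigned width = 0;
	size_t i, drawn = 0;

	for (i = 0; i < ncps; i++) {
		if (!(f = pick_font(fonts, nfonts, cps[i])))
			continue;
		if (surface)
			surface[drawn] = (unsigned char)(f - fonts);
		drawn++;
		width += f->advance;
	}
	return width;
}
```

This mirrors the snprintf(NULL, 0, ...) convention: the same code path
serves both measuring and drawing, so the two can never disagree.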

But this is just theory. I haven't had time to study the TTF/OTF
formats, but I am sure that we should not just give up on this topic.
It just doesn't sound right to recommend that people use UTF-8 while
disregarding 25 years of this development and non-English languages.

With best regards

Laslo

-- 
Laslo Hunhold <dev_AT_frign.de>

Received on Wed Sep 26 2018 - 01:06:52 CEST
