On Thu, 27 Sep 2018 19:40:06 +0000
sylvain.bertrand_AT_gmail.com wrote:
Dear Sylvain,
> I did dive a bit deeper into the latest Unicode, and it's even worse
> than what I thought.
> To deal with real Unicode input/output and to split it into "extended
> grapheme clusters" (a Unicode "char"), you need a finite state
> machine (I guess that's what Laslo was referring to). And it's the
> same for the "line returns" handling.
It depends on how you implement it. The way I did it was to offer a
function

int bound(uint32_t a, uint32_t b)

which returns 1 if there is a grapheme cluster boundary between a and
b, and 0 if there is not. In a stream-based setting you would have the
following layers, starting with the raw byte input:
1) UTF-8-decoding (into uint32_t codepoints)
2) Grapheme cluster detection using bound() based on the uint32_t
codepoints
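
To make the layering concrete, here is a rough sketch (illustrative
only, not the actual code of my library); the bound() stub below only
knows about the combining diacritical marks block and merely stands in
for the real LUT-driven rules of UAX #29:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* stand-in for the real LUT-driven predicate: treat only the
 * combining diacritical marks block as cluster-extending */
static int
bound(uint32_t a, uint32_t b)
{
	(void)a;
	return !(b >= 0x0300 && b <= 0x036F);
}

/* layer 1: decode one UTF-8 sequence at s, store the code point
 * in *cp, return the bytes consumed (validation omitted) */
static size_t
utf8_decode(const unsigned char *s, uint32_t *cp)
{
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = ((uint32_t)(s[0] & 0x0F) << 12) |
		      ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		return 3;
	}
	*cp = ((uint32_t)(s[0] & 0x07) << 18) |
	      ((uint32_t)(s[1] & 0x3F) << 12) |
	      ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
	return 4;
}

/* layer 2: count grapheme clusters in a non-empty string */
int
main(void)
{
	const unsigned char *s =
	    (const unsigned char *)"e\xCC\x81" "a"; /* "é" + "a" */
	uint32_t prev, cur;
	size_t off, n;

	off = utf8_decode(s, &prev);
	for (n = 1; s[off] != '\0'; prev = cur) {
		off += utf8_decode(s + off, &cur);
		if (bound(prev, cur))
			n++; /* a new "drawn character" begins */
	}
	printf("%zu clusters\n", n); /* prints "2 clusters" */
	return 0;
}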
The function bound() just operates on relatively small LUTs and is
pretty efficient. If we implement a font drawing library in some way,
we will have to think about how to do this special handling right.
Extended grapheme clusters fortunately stand on their own and make a
good "atom" to base font rendering on.
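
To illustrate the LUT idea (a sketch with made-up excerpts, not the
tables of my actual code): the relevant Unicode properties boil down
to sorted ranges of code points, so a property test inside bound()
reduces to a binary search:

#include <stddef.h>
#include <stdint.h>

struct range { uint32_t lo, hi; };

/* illustrative excerpt; real tables are generated from the
 * Unicode data files (e.g. GraphemeBreakProperty.txt) */
const struct range extend[] = {
	{ 0x0300, 0x036F }, /* combining diacritical marks */
	{ 0x1AB0, 0x1AFF }, /* combining marks extended */
	{ 0x20D0, 0x20FF }, /* combining marks for symbols */
};

int
in_table(const struct range *t, size_t len, uint32_t cp)
{
	size_t lo = 0, hi = len;

	while (lo < hi) { /* binary search over sorted ranges */
		size_t mid = lo + (hi - lo) / 2;

		if (cp < t[mid].lo)
			hi = mid;
		else if (cp > t[mid].hi)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

A real bound() consults several such tables and combines the results
according to the grapheme cluster boundary rules.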
No matter how we draw it in the "raster" at the end, it would already
be a big step for st to have an "idea" of what the raw input really
"means" in the drawn state.
> Additionally, Unicode NFC normalization (the one chosen for the web)
> is kind of useless: since they have forbidden new pre-combined glyphs
> for a long time, you end up implementing the NFD stuff anyway (that
> move was obviously malicious).
Yes, NFD is the only "sane" choice.
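
To make the difference concrete (a small illustration of mine): NFC
represents "é" as the single code point U+00E9, while NFD decomposes
it into U+0065 plus the combining U+0301; the two byte sequences
differ even though they render identically:

#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *nfc = "\xC3\xA9";  /* U+00E9, pre-combined "é" */
	const char *nfd = "e\xCC\x81"; /* U+0065 + U+0301 */

	/* renders the same, but a naive byte comparison differs */
	printf("equal: %d\n", strcmp(nfc, nfd) == 0);
	return 0;
}

This is also why any meaningful string comparison has to normalize
both sides first.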
> So, the real culprits are actually written languages: they suck.
> Namely, you cannot write suckless code for tons of written languages,
> and on top of that, since the handling of simple written languages is
> generalized together with some of the most complex ones, handling
> those simple written languages properly will use the same
> complex/generalized definitions and mechanisms.
It's the complexity of the real world. We should not deny it. It is
actually a monstrous task the Unicode Consortium has undertaken, and I
respect them for that, even though many of their solutions seem too
complicated.
They also should not bend to the emoji crowd so easily. Unicode is the
"standard" trying to encompass human language writing systems. I don't
really want to think about what people five generations ahead might
think about the poop emoji.
> On the rendering side, those complex mechanisms allow font designers
> to spare a good chunk of work: the work required for pre-combined
> glyphs. Expect fonts to contain fewer and fewer pre-combined glyphs
> with unique Unicode code points mapping to them, even for simple
> written languages. And expect lighter font files.
This is an interesting point.
> It means there is no real middle ground (a good middle ground on the
> web would be basic XHTML without JavaScript).
JavaScript has its purposes if applied lightly and always as an
afterthought (i.e. the page works 100% without JavaScript).
> And where does st stand in all that?
> Do like the Linux line discipline drivers? Namely, handle only
> UTF-8-encoded Unicode code points (no extended grapheme clusters),
> and actually work on ASCII?
As I said earlier, the terminal emulation itself is unaffected, because
it is more or less "blind" to the higher levels of Unicode and even
UTF-8. The control sequences are ASCII, and the code as it stands works
and does not need to be changed.
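
The reason this blindness is safe at the byte level (my illustration,
not st's actual parser): every byte of a multi-byte UTF-8 sequence has
the high bit set, so the escape-sequence machinery, which only looks
for bytes below 0x80, can never be confused by UTF-8 payload:

#include <stdio.h>

int
main(void)
{
	/* ESC [ 1 m (SGR bold), followed by UTF-8-encoded "é" */
	const unsigned char buf[] = { 0x1B, '[', '1', 'm', 0xC3, 0xA9 };
	size_t i;

	for (i = 0; i < sizeof(buf); i++)
		printf("%02X: %s\n", buf[i], buf[i] < 0x80 ?
		       "control/ASCII range" : "UTF-8 multi-byte range");
	return 0;
}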
What it all comes down to is the rendering part, and this is an area
where applications have a big say, of course. Only a tiny fraction of
applications really "respect" extended grapheme clusters; most still
assume, at best, that code point == grapheme cluster, sbase/ubase
included.
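
A quick way to see where that assumption breaks (a standalone sketch
of mine, not taken from sbase/ubase): "e" followed by U+0301 is two
code points but a single grapheme cluster, so any tool equating the
two overcounts the drawn width:

#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* "e" + U+0301 COMBINING ACUTE ACCENT, i.e. one drawn "é" */
	const char *s = "e\xCC\x81";
	size_t i, cps = 0;

	/* count code points: every byte that is not a continuation */
	for (i = 0; i < strlen(s); i++)
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			cps++;

	printf("bytes: %zu, code points: %zu, grapheme clusters: 1\n",
	       strlen(s), cps); /* prints 3, 2, 1 */
	return 0;
}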
This is not meant as a bash against anyone, but really just down to the
fact that all this processing on the higher layers is a question of
efficiency, especially when e.g. the UNIX system tools are used with
plain ASCII data 99% of the time, which does not require all the UTF-8
processing.
> For suckless, as a consistent whole, it means:
> - It becomes an ASCII-only framework (Anselm seems to like this), and
> will be kind of useless for any text-interacting application going
> beyond ASCII (i.e. no more mutt with non-ASCII e-mail, no more lynx
> with non-ASCII web pages...). A zero-i18n framework. In the case of
> Wayland st: its own ASCII bitmap fonts and its own font renderer.
I would not favor such a solution, but this is just my opinion.
> - suckless gets its own Unicode handling code (a
> libicu/freetype+harfbuzz look-alike implementation).
This is the other extreme. If I find the time, I'll spend more of it on
the library I've been working on, which is more or less optimized for
stream processing, something that, in my opinion, many of the other
Unicode libraries are lacking.
I've not yet dared to touch NFD or normalization and string comparison
in general, but for simple stream-based operations, and to get a grasp
of a stream and where the boundaries of extended grapheme clusters lie,
you only need to know the current and the previous code point (by the
definition of bound()) to tell when a "drawn character" is finished.
Still, even there we would need limits, as Unicode sets no bound on the
size of an extended grapheme cluster. But this is a "problem" of the
implementing application itself and not of the library, which I strive
to keep free of memory allocations altogether.
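
As a sketch of what shifting that "problem" to the application could
look like (hypothetical names, not my library's API): the application
accumulates code points into a fixed buffer and flushes either at a
boundary reported by bound() or when its own cap is reached, so the
library never has to allocate:

#include <stddef.h>
#include <stdint.h>

#define CLUSTER_MAX 16 /* cap chosen by the application, not the library */

int bound(uint32_t a, uint32_t b); /* boundary predicate as above */

struct accum {
	uint32_t cp[CLUSTER_MAX];
	size_t len;
};

/* feed one code point; returns 1 when the buffered cluster is
 * complete and should be drawn before the buffer restarts */
int
accum_push(struct accum *a, uint32_t cp)
{
	if (a->len > 0 &&
	    (bound(a->cp[a->len - 1], cp) || a->len == CLUSTER_MAX)) {
		/* caller draws a->cp[0 .. a->len-1], then we restart */
		a->len = 0;
		a->cp[a->len++] = cp;
		return 1;
	}
	a->cp[a->len++] = cp;
	return 0;
}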
With best regards
Laslo
--
Laslo Hunhold <dev_AT_frign.de>