Re: [wmii] moving liblitz on from Denis Grelich on 2006-05-24 (wmii mail list archive)

From: Denis Grelich <denisg_AT_ueberl33t.info>
Date: Wed, 24 May 2006 14:25:57 +0200

On Wed, 24 May 2006 08:48:58 +0200
"Anselm R. Garbe" <garbeam_AT_wmii.de> wrote:

> On Wed, May 24, 2006 at 01:59:22AM +0200, Denis Grelich wrote:

> > Markup of Font and Style
>
> No, markup is the wrong direction, one better uses a different
> representation for glyph rendering than marking up the data with
> junk.
Of course the internal data strips the markup from the source. I rather
think of something like:

"RED{Sendmail} could allow a IT{remote attacker} to execute
BOLD{arbitrary code} as RED{BOLD{IT{root}}}, caused by a signal
BLUE{race vulnerability}."
with RED and the {} and stuff being invisible through the use of
invisible and ignorable characters from the private/tagging plane. How
the markup is represented internally is another issue.
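
A minimal sketch of how algorithms could skip such markup, assuming the
markup is encoded with code points from the Unicode Tags block
(U+E0000..U+E007F) and the text is already decoded into an array of code
points. The names is_markup, tag and strip_markup are made up for this
illustration and are not part of any existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: markup uses code points from the Unicode Tags block
 * (U+E0000..U+E007F); U+E0020..U+E007E mirror printable ASCII. */
static int is_markup(uint32_t cp)
{
    return cp >= 0xE0000 && cp <= 0xE007F;
}

/* Hypothetical helper: map one printable ASCII character to the
 * corresponding tag character. c must be in the range 0x20..0x7E. */
static uint32_t tag(char c)
{
    return 0xE0000u + (unsigned char)c;
}

/* Copy src to dst without markup code points and return the number of
 * code points kept. dst must have room for n code points. */
static size_t strip_markup(const uint32_t *src, size_t n, uint32_t *dst)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_markup(src[i]))
            dst[kept++] = src[i];
    return kept;
}
```

An algorithm that should ignore the markup only needs the is_markup test;
strip_markup is the variant that removes it for good.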

> Actually I'm not sure what you mean with gapped arrays,
> but for me something like the following comes to mind, to
> represent text ready for drawing in a terminal-capable
> text-widget:
>
> typedef struct {
>     Rune rune;
>     Font font;
>     Color fg, bg;
>     unsigned int w, h;
>     Bool invert; /* for selection */
>     /* updated by rendering */
>     unsigned int x, y;
> } Glyph;

Rune would have to be a char*-like type, as a glyph can consist of an
arbitrary number of combining characters. With such a structure
for /each/ letter you would use up about a dozen times as much memory!
Just imagine a full-screen text editor with a small font, and then the
user inserts a character. You are going to move several megabytes on
each typed-in character oO
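
A rough way to check the memory argument. The field types below are
stand-ins chosen for illustration, since the thread does not say what
Rune, Font, Color and Bool really are; the helper names are made up too.

```c
#include <stddef.h>

/* Assumed stand-in types; the real ones are not given in the thread. */
typedef char *Rune;            /* base character plus combining characters */
typedef void *Font;            /* opaque font handle */
typedef unsigned long Color;   /* pixel value */
typedef int Bool;

typedef struct {
    Rune rune;
    Font font;
    Color fg, bg;
    unsigned int w, h;
    Bool invert;
    unsigned int x, y;
} Glyph;

/* Bytes needed for a rows x cols grid of Glyph cells. */
static size_t glyph_grid_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(Glyph);
}

/* Bytes needed for the same grid at one byte per character. */
static size_t plain_text_bytes(size_t rows, size_t cols)
{
    return rows * cols;
}
```

With these stand-in types a Glyph takes roughly 56 bytes on a typical
64-bit Linux system, so a 200 by 300 grid would need about 3.4 MB, not
counting the strings the Rune pointers refer to.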

> A text widget is simply a 2-dimensional array of such glyphs:
>
> Glyph widget_data[ROWS][COLS];
>
> Now you have some text as input, e.g.:
>
> char *text =
>     "Sendmail could allow a remote attacker to execute arbitrary code "
>     "as root, caused by a signal race vulnerability.";
>
> All you need is a prepare function, which transforms each glyph
> of this text into above widget_data, e.g. using following steps:

How do you handle operations on this text now? Do those operations work
on »text« or on »widget_data«? If they work on »text« (and if »text«
is not a gapped array), then you a) very likely have to reallocate it or
move large parts of it and b) have to run it through the filter again!
And if you want to operate on the Glyph array, manipulating the text
becomes unnecessarily complex and awfully slow.

Apart from this, I don't see any need for a two-dimensional array
holding the text. The array is unnecessary, as the line-breaking
algorithm can already take care of a characters-per-line limit.
You would also have to calculate some sort of average character width
to get numbers for COLS that are useful at least most of the time
(which /will/ fail badly with some fonts!). Font and style can be
managed easily and efficiently in a parallel structure (also a sort of
gapped array).
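
One possible shape for such a parallel structure, sketched as a sorted
array of style runs. StyleRun, its fields and style_at are hypothetical
names; this simplified version is a plain array and leaves out the gap.

```c
#include <stddef.h>

/* Hypothetical style record; the thread does not define one. */
typedef struct {
    size_t start;   /* offset of the first character of the run */
    size_t len;     /* number of characters the run covers */
    int font_id;    /* index into some font table (assumed) */
    int color_id;   /* index into some color table (assumed) */
} StyleRun;

/* Return the run covering text offset pos, or NULL if none does.
 * Runs are assumed to be sorted by start and non-overlapping, so a
 * binary search is enough. */
static const StyleRun *style_at(const StyleRun *runs, size_t n, size_t pos)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pos < runs[mid].start)
            hi = mid;
        else if (pos >= runs[mid].start + runs[mid].len)
            lo = mid + 1;
        else
            return &runs[mid];
    }
    return NULL;
}
```

The text itself stays free of style data; only the runs after an edit
position need their offsets adjusted.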

Working with an array of variable-width lines (pointers into the
gapped array structure), on the other hand, is a very natural way to
work with text.
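
A small sketch of such a line array, storing offsets instead of raw
pointers so that the entries survive a reallocation of the text. The
function name index_lines is made up, and the hard limit of max_cols
bytes per line is only a stand-in for a real line-breaking algorithm.

```c
#include <stddef.h>

/* Record the offset at which each line starts in text[0..len).
 * A line ends after '\n' or after max_cols bytes, whichever comes
 * first; max_cols == 0 disables the column limit. Counting bytes
 * instead of characters is a simplification.
 * Returns the number of lines found, at most max_lines. */
static size_t index_lines(const char *text, size_t len, size_t max_cols,
                          size_t *starts, size_t max_lines)
{
    size_t n = 0, col = 0;
    if (len > 0 && max_lines > 0)
        starts[n++] = 0;
    for (size_t i = 0; i < len && n < max_lines; i++) {
        col++;
        if ((text[i] == '\n' || col == max_cols) && i + 1 < len) {
            starts[n++] = i + 1;
            col = 0;
        }
    }
    return n;
}
```

Each line may have a different length, and line k simply runs from
starts[k] to starts[k + 1] (or to the end of the text for the last line).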

> 1. Use iconv to convert the given text from current locale into
> UTF-8.
> 2. Run a filter on the text which converts each character or
> words (for syntax highlighting) into the text_widget structure
> (2 remembers the glyph offsets where changes have been done)
> 3. Render the text_widget structure in the changed range
> 4. Map this stuff
>
> The only complicated thing is the filter. A simple filter won't
> do colorization/different font metrics/style, it would simply
> use a font for each glyph (and if one really wants fancy font
> stuff, one should write such a filter later).

Effectively just stripping the markup. Or leaving it in place, as the
markup does not get in the way for /any/ algorithm, as all algorithms
just ignore it automagically, when using private use/tagging
characters! That's the beauty of this technique.

> The rendering algorithm needs to arrange the glyphs on rendering,
> e.g. bigger glyphs (in non-fixed way) will increase the
> difference to the previous row. But each row might have
> a non-fixed amount of horizontal glyphs (if the font is fixed,
> it will behave like a terminal, all glyph boxes will have the
> same geometry).

Same thing, but much more natural with an array of lines (= pointers
into the text structure). But as you can't draw any Unicode character
on its own, since characters might interact typographically (at least
in non-fixed fonts), you either have to copy a lot with the
two-dimensional Glyph array or live with a very crappy drawing
routine.

> For mouse interaction you first seek the correct line using the
> pointer's y-position, then you seek the correct glyph using the
> pointer's x-position. This is linear behavior. (With non-fixed
> boxes one cannot do it faster).

Same thing for the array of lines.
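
The two linear seeks could look roughly like this for an array of lines.
The Line record and hit_test are hypothetical; the sketch assumes every
line knows its vertical extent and the width of each of its glyphs.

```c
#include <stddef.h>

/* Hypothetical layout record for one line; not taken from the thread. */
typedef struct {
    unsigned int y, h;            /* top edge and height of the line */
    size_t nglyphs;               /* number of glyphs in the line */
    const unsigned int *advance;  /* width of each glyph, left to right */
} Line;

/* Find the line and glyph under pixel (px, py). Both searches are
 * linear. Returns 1 and fills *line and *glyph on a hit, 0 otherwise. */
static int hit_test(const Line *lines, size_t nlines,
                    unsigned int px, unsigned int py,
                    size_t *line, size_t *glyph)
{
    for (size_t i = 0; i < nlines; i++) {
        if (py < lines[i].y || py >= lines[i].y + lines[i].h)
            continue;
        unsigned int x = 0;
        for (size_t j = 0; j < lines[i].nglyphs; j++) {
            if (px < x + lines[i].advance[j]) {
                *line = i;
                *glyph = j;
                return 1;
            }
            x += lines[i].advance[j];
        }
        return 0;   /* right of the last glyph on this line */
    }
    return 0;       /* above or below all lines */
}
```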

So, as a summary: the two-dimensional glyph structure yields no
benefit, but has a huge cost in performance and code complexity and most
likely in rendering quality.

> > Outsourcing/Distributing Functionality
>
> For performance reasons above filter functions should not be
> outsourced, they should be in a sane library. Maybe as
> extensions.

Hm, okay, if there's no better way.

> Well, I don't think you should bother with all these problems,
> because I don't see any need to bother. Sorting should use the
> order defined by UTF-8, nothing else. Comparison should use the
> numeric values of runes defined by UTF-8. Regular expressions should
> use the numeric values of runes defined by UTF-8. Breaking algorithms
> should use " \t\n" as breaking runes.
>
> If UTF-8 is broken in this regard (I doubt it, because all
> major sorting issues should be defined in the correct order in
> UTF-8, and umlauts or special chars in languages which have a
> Latin base alphabet simply appear after z, nothing wrong with
> this), then I won't care about it. Do it the most simple way it
> can be done. Don't overcomplicate things, until there is no need
> to do so.

Firstly, UTF-8 defines nothing, only the representation of Unicode code
points as an 8-bit stream. Unicode's consecutive code point values
actually give no hint at all for comparison! You /can/ use them for
internals, like binary trees or stuff (but remember, sorting on utf-8
byte values is going to fail very miserably, and you would need to
recalculate the code point values nonetheless). On the user side, it
would more often look like a random order than anything else. So, no
chance here. There are default sorting definitions from the Unicode
Consortium, but they are no less complicated than a locale-dependent
implementation and are only meant for use when LANG=C is defined or so.
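
To illustrate the difference between a raw numeric order and a
user-facing one, here is a small sketch built on the standard C functions
strcmp, strcoll and qsort. The names cmp_bytes, cmp_locale and sort_words
are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: raw byte order via strcmp. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort comparator: order defined by the current LC_COLLATE locale. */
static int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Sort words for display. If use_locale is nonzero, the caller is
 * expected to have called setlocale(LC_COLLATE, "") beforehand. */
static void sort_words(const char **words, size_t n, int use_locale)
{
    qsort(words, n, sizeof *words, use_locale ? cmp_locale : cmp_bytes);
}
```

Sorting "apple", "Zebra", "ärger" and "zoo" with use_locale set to 0 puts
"Zebra" first and "ärger" last, while a German locale would typically
give apple, ärger, Zebra, zoo. setlocale is declared in <locale.h>.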

And if we are going to use filters in any case, one could equally well
implement all that is needed for right and good text display.

Same thing with regular expressions. Code point values (and especially
UTF-8 byte values!) give about zero hints about character ranges and
classes. Just check Perl's regular expressions.

> Well, explain what you mean with gapped array.

Check the text.{h,c} in my widget implementation. It should explain
everything. A gapped array is an array with a gap of empty space in
it that can be moved to the position where text is inserted or
removed. This eliminates the need to reallocate the array or to move
large parts of it when the length of the array changes.
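
A minimal sketch of a gapped array for bytes, to make the idea concrete.
This is not the code from text.{h,c}; GapBuf and the gb_* functions are
names made up for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Layout: [text before gap][gap_start .. gap_end)[text after gap]. */
typedef struct {
    char *buf;
    size_t cap;        /* total allocated bytes */
    size_t gap_start;  /* first byte of the gap */
    size_t gap_end;    /* one past the last byte of the gap */
} GapBuf;

static size_t gb_len(const GapBuf *g)
{
    return g->cap - (g->gap_end - g->gap_start);
}

/* Start with an empty text; the whole buffer is gap. cap should be > 0. */
static int gb_init(GapBuf *g, size_t cap)
{
    g->buf = malloc(cap);
    g->cap = cap;
    g->gap_start = 0;
    g->gap_end = cap;
    return g->buf != NULL;
}

/* Move the gap so that it starts at text position pos (pos <= gb_len). */
static void gb_move_gap(GapBuf *g, size_t pos)
{
    if (pos < g->gap_start) {          /* shift text right, across the gap */
        size_t n = g->gap_start - pos;
        memmove(g->buf + g->gap_end - n, g->buf + pos, n);
        g->gap_start -= n;
        g->gap_end -= n;
    } else if (pos > g->gap_start) {   /* shift text left, across the gap */
        size_t n = pos - g->gap_start;
        memmove(g->buf + g->gap_start, g->buf + g->gap_end, n);
        g->gap_start += n;
        g->gap_end += n;
    }
}

/* Insert c at pos; the buffer grows only when the gap is used up. */
static int gb_insert(GapBuf *g, size_t pos, char c)
{
    if (g->gap_start == g->gap_end) {
        size_t ncap = g->cap ? g->cap * 2 : 16;
        size_t tail = g->cap - g->gap_end;
        char *nbuf = realloc(g->buf, ncap);
        if (!nbuf)
            return 0;
        memmove(nbuf + ncap - tail, nbuf + g->gap_end, tail);
        g->buf = nbuf;
        g->gap_end = ncap - tail;
        g->cap = ncap;
    }
    gb_move_gap(g, pos);
    g->buf[g->gap_start++] = c;
    return 1;
}

/* Delete the character before pos by widening the gap over it. */
static void gb_delete_before(GapBuf *g, size_t pos)
{
    if (pos == 0)
        return;
    gb_move_gap(g, pos);
    g->gap_start--;
}

/* Copy the text without the gap into out (at least gb_len + 1 bytes). */
static void gb_copy(const GapBuf *g, char *out)
{
    size_t tail = g->cap - g->gap_end;
    memcpy(out, g->buf, g->gap_start);
    memcpy(out + g->gap_start, g->buf + g->gap_end, tail);
    out[g->gap_start + tail] = '\0';
}
```

Typing at one spot costs a single byte store per character once the gap
is in place; only a jump to a distant position moves the text that lies
between the old and the new gap position.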

Greetings,
Denis


Received on Wed May 24 2006 - 14:44:04 UTC
