Re: [wmii] moving liblitz on from Anselm R. Garbe on 2006-05-24 (wmii mail list archive)

From: Anselm R. Garbe <garbeam_AT_wmii.de>
Date: Wed, 24 May 2006 17:16:01 +0200

On Wed, May 24, 2006 at 02:25:57PM +0200, Denis Grelich wrote:
> On Wed, 24 May 2006 08:48:58 +0200
> "Anselm R. Garbe" <garbeam_AT_wmii.de> wrote:
V
> > On Wed, May 24, 2006 at 01:59:22AM +0200, Denis Grelich wrote:
>
> > > Markup of Font and Style
> >
> > No, markup is the wrong direction, one better uses a different
> > representation for glyph rendering then marking up the data with
> > junk.
> Of course the internal data strips the markup from the source. I rather
> think of something like:
>
> "RED{Sendmail} could allow a IT{remote attacker} to execute
> BOLD{arbitrary code} as RED{BOLD{IT{root}}}, caused by a signal BLUE
> {race vulnerability}."
> with RED and the {} and stuff being invisible through the use of
> invisible and ignorable characters from the private/tagging plane. How
> the markup is represented internally is another issue.

That is crap in my eyes. This makes it necessary that you
parse/format your data stream all the time... My proposal is not
duplicating the data stream, the Rune in Glyph simply points to
the correct place within the data stream.

No need to duplicate any information.

> > typedef struct {
> > Rune rune;
> > Font font;
> > Color fg, bg;
> > unsigned int w, h;
> > Bool invert; /* for selection */
> > /* updated by rendering */
> > unsigned int x, y;
> > } Glyph;
>
> With Rune being a char*-like type, as a Glyph can consist of an
> infinite amount of combining characters. With such a structure
> for /each/ letter you would use up about a dozen times as much memory!

What? A rune will take up to 21bit at maximum (which means
32bit, same with an unsinged int). In average it will only need
2byte like wchar_t or Rune from P9 (though in P9 is possible to
encode a single glyph with two Runes, but that is rare).

> Just imagine a full-screen text editor with a small font, and then the
> user inserts a character. You are going to move several megabytes on
> each typed-in character oO

Yes, how do you think a terminal works? You only allocate the
Glyph structure for the current window height and font metric.
In the first step you use a fixed font (like terminals).

This costs some memory, yes, the more memory you use, the faster
your rendering. My rendering would be very fast and allow
everything, without reallocations. Why should one need to
reallocate? If your window is 120 cols x 70 rows, you just fill
all glyphs which don't represent a symbol with blanks (like in
your terminal). Your widget_data would be of size
Glyph[70][120], this is 70*120*(sizeof(Glyph)) = 445200 byte
(~450kb) - not much if you perform a top and checkout that xterm
uses 4MB.

> > A text widget is simply a 2-dimensional array of such glyphs:
> >
> > Glyph widget_data[ROWS][COLS];
> >
> > Now you have some text as input, e.g.:
> >
> > char *text =
> > "Sendmail could allow a remote attacker to execute arbitrary code
> > as root, caused by a signal race vulnerability.";
> >
> > All you need is a prepare function, which transforms each glyph
> > of this text into above widget_data, e.g. using following steps:
>
> How do you handle operations on this text now? Do those operations work
> on »text« or on »widget_data?« If they work on »text,« (and if »text«
> is no gapped array) then you a) very likely have to reallocate it or
> move large parts of it and b) have to drive it through the filter again!
> And if you want to operate on the Glyph-array, it makes it again
> unneccessarily complex and awfully unperformant to manipulate the text.

Sure any operation does not work on the text directly. Sure it
works on the widget_data, all other cases would be just totally
braindamaged. The widget_data can address any symbol in nearly
O(log(n) + some junk) complexity of the textlength.

> Apart from this, I don't see any need for a two-dimensional array
> holding the text. The array is unneccessary as the line-breaking
> algorithm already can take care of a characters-per-line-limitation.

Maybe it was unclear. The text is stored in some buffer, e.g.
char buf[0xffff]; The Rune's in the widget_data cells point to
the appropriate buffer index (actually you don't need a Rune at
all, you could simply use an index, but then you need to recheck
the UTF8-value on each access, thus I used a Rune (which might
duplicate the memory use, but is much faster for rendering,
because they only need to be updated if something near them
changed).

> You also would have to calculate some sort of average width of a
> character to at least get most of the time useful numbers for COLS
> (which /will/ fail with some fonts badly!) Font and style can be
> managed easily and efficient in a parallel structure (also a sort of
> gapped array.)

In first step you use a fixed font. In the second step you
can decide to get rid of the fixed-font limitations due a much
slower algorithm, which still works quite similiar (and which
can be used for fixed font rendering as well, I think in Plan 9
it is done that way, that's why 9term is so fucking slow).
But I always told, that I would NEVER do it. A fixed widget is
fine enough for our needs, and nobody will care if titles are
fixed or not. Also I think it is insane to support a richtext
widget.

> Working with an array of variable-width lines (pointers into the
> gapped array-structure) on the other hand are a very natural way to work
> with text.

I'm not sure I fully understand your gapped array structure. But
if so, it is pretty much the same I told you about all the time.

> Effectively just stripping the markup. Or leaving it in place, as the
> markup does not get in the way for /any/ algorithm, as all algorithms
> just ignore it automagically, when using private use/tagging
> characters! That's the beauty of this technique.

Uh? That is so insane, that you could use HTML or stick with
ANSI sequences. You alway need to parse the shit. That's why I
never would use such crap. Use a clean text without more
cluttering control chars than ' ', \t, and \n. And before you
parse the text again and again to ignore control sequences, you
do a semantic analyze on it (e.g. syntax highlighting). Because
this keeps the input data so small, like the real data. If you
use script(1) to record a terminal session, you will notice in
an average terminal, that most data which is handled is control
sequence data to colorize text portions. It is insane to repeat
these faults and to parse 60kb ANSI sequences to render a text
which consists of 800 characters.

> > The rendering alorithm needs to arrange the glyphs on rendering,
> > e.g. bigger glyphs (in non-fixed way) will increase the
> > difference to the previous row. But each row might have
> > a non-fixed amount of horizontal glyphs (if the font is fixed,
> > it will behave like a terminal, all glyph boxes will have the
> > same geometry).
>
> Same thing, but much more natural with an array of lines (= pointers
> into the text structure.) But as you can't draw any unicode character

Well, with an array of lines only, you need to caculate pointer
accesses depending on the complete line, which is very unprecise
and slow. You won't be able to select text in a fast way, for
each motion you simply need to check the font metric (which is
very expensive). With my way, you simply implement a function:

Glyph *glyph_of_pointer(int x, int y);

Which calculates the row (if fixed in O(1), otherwise with
O(log(n) + crap), same with col.

> on its own, as they might interact typographically (at least in
> non-fixed fonts), you have to either copy a lot with the
> two-dimensional Glyph array, or to live with a very crappy drawing
> routine.

You don't have to copy anything. You just keep a pointer to each
char, or better both to each char and Glyph, because your widget
structure don't consumes more memory than the window size.
To allow paging/scrolling the glyph structure is made 3xwindow
size at the best case. But this depends if scrolling/paging is
necessary.

> > For mouse interaction you first seek the correct line using the
> > pointer's y-position, then you seek the correct glyph using the
> > pointer's x-position. This is linear behavior. (With non-fixed
> > boxes one cannot do it faster).
>
> Same thing for the array of lines.

Well, for small widgets your way might be sufficient, but once
you need text selection it will get very slow and flaky.

> So, as a summary: the two-dimensional glyph structure wields no
> benefit, but a huge cost on performance and code complexity and most
> likely on rendering quality.

The opposite, except memory consumption, is the case. It is the
best performing variant.

> > Well, I don't think you should bother all these problems,
> > because I don't see any need to bother. Sorting should use the
> > order defined by UTF-8, nothing else. Comparision should use the
> > numeric values of runes defined by UTF-8. Regular expressions should
> > use the numeric values of runes defined by UTF-8. Breaking algorithms
> > should use " \t\n" as breaking runes.
> >
> > If UTF-8 is broken in this regard (I doubt it, because all
> > major sorting issues should be defined in the correct order in
> > UTF-8, and umlauts or special chars in languages which have a
> > latin base alphabeth simply appear after z, nothing wrong with
> > this), then I won't care about it. Do it the most simple way it
> > can be done. Don't overcomplicate things, until there is no need
> > to do so.
>
> Firstly, UTF-8 defines nothing, only the representation of Unicode code
> points as an 8-bit stream. Unicode's coninuous code points give
> actually no hint at all for comparison! You /can/ use it for
> internals, like binary trees or stuff (but remember, sorting on utf-8
> byte values is going to fail very miserably, and you would need to
> recalculate the code point values nonetheless). On the user side, it

I don't care. UTF-8 is 7bit ASCII compliant, if other alphabets
are not sorted within UTF-8 that is not our problem. Each value
is a number between 0 and 1.xM, so why do you care?

> would more often look like a random order than anything else. So, no
> chance here. There are default sorting definitions from the Unicode
> consortium, but it is not less complicated than a locale-dependend
> implementation and is only meant for use when LANG=C is defined or so.

It is totally irrelevant for a text widget. Such a widget does
not need to sort. And I don't plan to implement any sorting
algorithm which needs a specific mapping or reorder of UTF-8
glyphs.

> Same thing with regular expressions. Code point values (and especially
> UTF-8 byte values!) give about zero hints about character ranges and
> classes. Just check Perl's regular expressions.

I don't give a shit. Wether an expression matches or not. Regexp
implementations don't need anything else than using ranges
between start and end char. They should use the order defined by
UTF-8.

Regards,

-- 
 Anselm R. Garbe  ><><  www.ebrag.de  ><><  GPG key: 0D73F361

Received on Wed May 24 2006 - 17:16:01 UTC

This archive was generated by hypermail 2.2.0 : Sun Jul 13 2008 - 16:06:44 UTC