Re: [wmii] moving liblitz on from Denis Grelich on 2006-05-24 (wmii mail list archive)

From: Denis Grelich <denisg_AT_ueberl33t.info>
Date: Wed, 24 May 2006 18:40:00 +0200

On Wed, 24 May 2006 17:16:01 +0200
"Anselm R. Garbe" <garbeam_AT_wmii.de> wrote:

> On Wed, May 24, 2006 at 02:25:57PM +0200, Denis Grelich wrote:
> > On Wed, 24 May 2006 08:48:58 +0200
> > "Anselm R. Garbe" <garbeam_AT_wmii.de> wrote:
> V
> > > On Wed, May 24, 2006 at 01:59:22AM +0200, Denis Grelich wrote:
> >
> > > > Markup of Font and Style
> > >
> > > No, markup is the wrong direction, one better uses a different
> > > representation for glyph rendering then marking up the data with
> > > junk.
> > Of course the internal data strips the markup from the source. I
> > rather think of something like:
> >
> > "RED{Sendmail} could allow a IT{remote attacker} to execute
> > BOLD{arbitrary code} as RED{BOLD{IT{root}}}, caused by a signal BLUE
> > {race vulnerability}."
> > with RED and the {} and stuff being invisible through the use of
> > invisible and ignorable characters from the private/tagging plane.
> > How the markup is represented internally is another issue.
>
> That is crap in my eyes. This makes it necessary that you
> parse/format your data stream all the time... My proposal is not
> duplicating the data stream, the Rune in Glyph simply points to
> the correct place within the data stream.

Not at all. You parse it once on input and format it once on output.
While editing, manipulating, rendering, markup is not used and not
neccessary for styles and font.

> > > typedef struct {
> > > Rune rune;
> > > Font font;
> > > Color fg, bg;
> > > unsigned int w, h;
> > > Bool invert; /* for selection */
> > > /* updated by rendering */
> > > unsigned int x, y;
> > > } Glyph;
> >
> > With Rune being a char*-like type, as a Glyph can consist of an
> > infinite amount of combining characters. With such a structure
> > for /each/ letter you would use up about a dozen times as much
> > memory!
>
> What? A rune will take up to 21bit at maximum (which means
> 32bit, same with an unsinged int). In average it will only need
> 2byte like wchar_t or Rune from P9 (though in P9 is possible to
> encode a single glyph with two Runes, but that is rare).

A /glyph/ can take as many bytes as it wants. You can add a dozen
combining characters to a character. The letter ä for example would
consist of two Runes. On the other side, you save format information
for every character. This is crazy! It is enough to save it for blocks
of text, as it is the case with a parallel structure. It is very
lightweight, as it only had about seven entries for the example from
above.

> > Just imagine a full-screen text editor with a small font, and then
> > the user inserts a character. You are going to move several
> > megabytes on each typed-in character oO
>
> Yes, how do you think a terminal works? You only allocate the
> Glyph structure for the current window height and font metric.
> In the first step you use a fixed font (like terminals).

That's what the interface looks like to the outside. It does not define
anything on how it has to look on the inside.

> This costs some memory, yes, the more memory you use, the faster
> your rendering. My rendering would be very fast and allow
> everything, without reallocations. Why should one need to
> reallocate? If your window is 120 cols x 70 rows, you just fill
> all glyphs which don't represent a symbol with blanks (like in
> your terminal). Your widget_data would be of size
> Glyph[70][120], this is 70*120*(sizeof(Glyph)) = 445200 byte
> (~450kb) - not much if you perform a top and checkout that xterm
> uses 4MB.

Are you nuts? oO
Every time you /insert a character/ at the beginning of your text, you
move 4 MB of data! On every key press! With a gapped array, you would
move nothing at all, except from 250 bytes /once/ when starting to type.

> > > A text widget is simply a 2-dimensional array of such glyphs:
> > >
> > > Glyph widget_data[ROWS][COLS];
> > >
> > > Now you have some text as input, e.g.:
> > >
> > > char *text =
> > > "Sendmail could allow a remote attacker to execute arbitrary code
> > > as root, caused by a signal race vulnerability.";
> > >
> > > All you need is a prepare function, which transforms each glyph
> > > of this text into above widget_data, e.g. using following steps:
> >
> > How do you handle operations on this text now? Do those operations
> > work on »text« or on »widget_data?« If they work on »text,« (and if
> > »text« is no gapped array) then you a) very likely have to
> > reallocate it or move large parts of it and b) have to drive it
> > through the filter again! And if you want to operate on the
> > Glyph-array, it makes it again unneccessarily complex and awfully
> > unperformant to manipulate the text.
>
> Sure any operation does not work on the text directly. Sure it
> works on the widget_data, all other cases would be just totally
> braindamaged. The widget_data can address any symbol in nearly
> O(log(n) + some junk) complexity of the textlength.

It's no big deal with a gapped array. Random access can be made to be
O(1) with a similar large constant (with some buffering) as with the
Glyph array; worst case would be somwhat near O(log n). As already
stated several times, handling the array would mean updating everything
on insertion/deletion. This is just insane!

> > Apart from this, I don't see any need for a two-dimensional array
> > holding the text. The array is unneccessary as the line-breaking
> > algorithm already can take care of a characters-per-line-limitation.
>
> Maybe it was unclear. The text is stored in some buffer, e.g.
> char buf[0xffff]; The Rune's in the widget_data cells point to
> the appropriate buffer index (actually you don't need a Rune at
> all, you could simply use an index, but then you need to recheck
> the UTF8-value on each access, thus I used a Rune (which might
> duplicate the memory use, but is much faster for rendering,
> because they only need to be updated if something near them
> changed).

You need get_{next,previous}_grapheme_cluster in any case.

> > You also would have to calculate some sort of average width of a
> > character to at least get most of the time useful numbers for COLS
> > (which /will/ fail with some fonts badly!) Font and style can be
> > managed easily and efficient in a parallel structure (also a sort of
> > gapped array.)
>
> In first step you use a fixed font. In the second step you
> can decide to get rid of the fixed-font limitations due a much
> slower algorithm, which still works quite similiar (and which
> can be used for fixed font rendering as well, I think in Plan 9
> it is done that way, that's why 9term is so fucking slow).
> But I always told, that I would NEVER do it. A fixed widget is
> fine enough for our needs, and nobody will care if titles are
> fixed or not. Also I think it is insane to support a richtext
> widget.

With a line array you get a much faster, /WORKING/ widget with free
support for non-fixed fonts.
The overhead for richtext, so I estimate at (but I don't have much of a
clue, so to say) somwhere around 50–60 SLOC. One third for
configuration, one third for parsing and the parallel structure, one
third for rendering.

> > Working with an array of variable-width lines (pointers into the
> > gapped array-structure) on the other hand are a very natural way to
> > work with text.
>
> I'm not sure I fully understand your gapped array structure. But
> if so, it is pretty much the same I told you about all the time.

It is explained in for example
http://www.bluemug.com/research/text.pdf beginning from page 8.

> > Effectively just stripping the markup. Or leaving it in place, as
> > the markup does not get in the way for /any/ algorithm, as all
> > algorithms just ignore it automagically, when using private
> > use/tagging characters! That's the beauty of this technique.
>
> Uh? That is so insane, that you could use HTML or stick with
> ANSI sequences. You alway need to parse the shit. That's why I
> never would use such crap. Use a clean text without more
> cluttering control chars than ' ', \t, and \n. And before you
> parse the text again and again to ignore control sequences, you
> do a semantic analyze on it (e.g. syntax highlighting). Because
> this keeps the input data so small, like the real data. If you
> use script(1) to record a terminal session, you will notice in
> an average terminal, that most data which is handled is control
> sequence data to colorize text portions. It is insane to repeat
> these faults and to parse 60kb ANSI sequences to render a text
> which consists of 800 characters.

You don't have to parse anything if you don't want to. That is the
beauty of this technique. Those characters have no width, no glyph,
they are ignored by any unicode-aware algorithm. Firstly, we strip them
off when parsing. It's a matter of less than half-a-dozen SLOC. Then
they're gone. The information is preserved in a parallel structure.
Very fast and efficient. NO PARSING ANYMORE FROM HERE ON. When we
output the text, for example writing it to a file or to STDOUT, we
format the data again with them. We don't even need to reformat for
rendering.

> > > The rendering alorithm needs to arrange the glyphs on rendering,
> > > e.g. bigger glyphs (in non-fixed way) will increase the
> > > difference to the previous row. But each row might have
> > > a non-fixed amount of horizontal glyphs (if the font is fixed,
> > > it will behave like a terminal, all glyph boxes will have the
> > > same geometry).
> >
> > Same thing, but much more natural with an array of lines (= pointers
> > into the text structure.) But as you can't draw any unicode
> > character
>
> Well, with an array of lines only, you need to caculate pointer
> accesses depending on the complete line, which is very unprecise
> and slow. You won't be able to select text in a fast way, for
> each motion you simply need to check the font metric (which is
> very expensive). With my way, you simply implement a function:
>
> Glyph *glyph_of_pointer(int x, int y);
>
> Which calculates the row (if fixed in O(1), otherwise with
> O(log(n) + crap), same with col.

SAME thing with arrays of lines. Selecting the line is O(1) with fixed
height, O(log n) as every line can still hold y positions. For
selecting a particular character you have O(log n) for both fixed and
non-fixed width, or even faster if you do some sort of caching (which I
count unnecessary.)

> > on its own, as they might interact typographically (at least in
> > non-fixed fonts), you have to either copy a lot with the
> > two-dimensional Glyph array, or to live with a very crappy drawing
> > routine.
>
> You don't have to copy anything. You just keep a pointer to each
> char, or better both to each char and Glyph, because your widget
> structure don't consumes more memory than the window size.
> To allow paging/scrolling the glyph structure is made 3xwindow
> size at the best case. But this depends if scrolling/paging is
> necessary.

Yes, paging is neccessary in any case, with the gapped array too. The
reallocations and copying occur on text editing and insertion, not on
scrolling.

> > > For mouse interaction you first seek the correct line using the
> > > pointer's y-position, then you seek the correct glyph using the
> > > pointer's x-position. This is linear behavior. (With non-fixed
> > > boxes one cannot do it faster).
> >
> > Same thing for the array of lines.
>
> Well, for small widgets your way might be sufficient, but once
> you need text selection it will get very slow and flaky.

I don't see where it could be slow and flaky. It is about as fast as
your approach. And even if it /was/, it is better to have flaky text
selection than flaky text editing.

> > So, as a summary: the two-dimensional glyph structure wields no
> > benefit, but a huge cost on performance and code complexity and most
> > likely on rendering quality.
>
> The opposite, except memory consumption, is the case. It is the
> best performing variant.

As I stated several times, it's not. Maybe there are some
misunderstandings between us, but it is definitely not.

> > > Well, I don't think you should bother all these problems,
> > > because I don't see any need to bother. Sorting should use the
> > > order defined by UTF-8, nothing else. Comparision should use the
> > > numeric values of runes defined by UTF-8. Regular expressions
> > > should use the numeric values of runes defined by UTF-8. Breaking
> > > algorithms should use " \t\n" as breaking runes.
> > >
> > > If UTF-8 is broken in this regard (I doubt it, because all
> > > major sorting issues should be defined in the correct order in
> > > UTF-8, and umlauts or special chars in languages which have a
> > > latin base alphabeth simply appear after z, nothing wrong with
> > > this), then I won't care about it. Do it the most simple way it
> > > can be done. Don't overcomplicate things, until there is no need
> > > to do so.
> >
> > Firstly, UTF-8 defines nothing, only the representation of Unicode
> > code points as an 8-bit stream. Unicode's coninuous code points give
> > actually no hint at all for comparison! You /can/ use it for
> > internals, like binary trees or stuff (but remember, sorting on
> > utf-8 byte values is going to fail very miserably, and you would
> > need to recalculate the code point values nonetheless). On the user
> > side, it
>
> I don't care. UTF-8 is 7bit ASCII compliant, if other alphabets
> are not sorted within UTF-8 that is not our problem. Each value
> is a number between 0 and 1.xM, so why do you care?

I have enough of computers that behave like they were build in the
60ies. Text is such a basic thing in our world that it is ridiculous
that most programs still can't handle it properly.

> > would more often look like a random order than anything else. So, no
> > chance here. There are default sorting definitions from the Unicode
> > consortium, but it is not less complicated than a locale-dependend
> > implementation and is only meant for use when LANG=C is defined or
> > so.
>
> It is totally irrelevant for a text widget. Such a widget does
> not need to sort. And I don't plan to implement any sorting
> algorithm which needs a specific mapping or reorder of UTF-8
> glyphs.

Yes, it is irrelevant for a text widget, but not for a list widget, for
example.

> > Same thing with regular expressions. Code point values (and
> > especially UTF-8 byte values!) give about zero hints about
> > character ranges and classes. Just check Perl's regular expressions.
>
> I don't give a shit. Wether an expression matches or not. Regexp
> implementations don't need anything else than using ranges
> between start and end char. They should use the order defined by
> UTF-8.

Sorry, but this is totally ridiculous.

Greetings,
Denis

application/pgp-signature attachment: stored

Received on Wed May 24 2006 - 18:40:14 UTC

This archive was generated by hypermail 2.2.0 : Sun Jul 13 2008 - 16:06:45 UTC