Re: [wmii] moving liblitz on from Denis Grelich on 2006-05-25 (wmii mail list archive)

From: Denis Grelich <denisg_AT_ueberl33t.info>
Date: Thu, 25 May 2006 02:06:55 +0200

On Wed, 24 May 2006 19:16:42 +0200
"Anselm R. Garbe" <garbeam_AT_wmii.de> wrote:
> On Wed, May 24, 2006 at 06:40:00PM +0200, Denis Grelich wrote:
> > > That is crap in my eyes. This makes it necessary that you
> > > parse/format your data stream all the time... My proposal is not
> > > duplicating the data stream, the Rune in Glyph simply points to
> > > the correct place within the data stream.
> >
> > Not at all. You parse it once on input and format it once on output.
> > While editing, manipulating, rendering, markup is not used and not
> > neccessary for styles and font.
>
> Which means I parse everything on every keypress or what? A
> syntax highlighting algorithm needs to be updated on keypress
> (it might effect a complete C file to enter /*...
>

Of course not! I hope you don't think that I'm that stupid. It looks
rather like that:

  File: "RED{Some red text} BLUE{Some blue text}"

  READ File
  PARSE File
  We now get two structures, Text and Style
    Text: "Some red text Some blue text"
    Style: ((style=RED, start=0, len=12),
            (style=BLUE, start=13, len=13))

From here on, there is no parsing anymore. All metadata has been
stripped and put into the style structure (Which had not been
neccessary as any subsequent operation would automagically ignore the
metadata). Any update on style is done in the style structure.

  DO SOME COLOUR EDITING:
    Text: "Some red text Some blue text"
    Style: ((style=RED, start=0, len=12),
            (style=BLUE, start=13,len=13),
            (style=GREEN, start=9, len=4))

  FORMAT File2
  WRITE File2

  File2 is now: "RED{Some red GREEN{text}} BLUE{Some blue text}}"

I hope it is now clear. Of course it is just an example. The style
structure is no simple array but something more sophisticated and stuff,
but that's what it is basically. Working with the style structure is
quite easy and can be made very performant with some interesting
techniques.

What you do with syntax highlighting has nothing to do with this. Syntax
highlighting /uses/ the style metadata structure, but that's it. The
syntax highlighting structure has to parse the text itself.

> > A /glyph/ can take as many bytes as it wants. You can add a dozen
> > combining characters to a character. The letter ä for example would
> > consist of two Runes. On the other side, you save format information
> > for every character. This is crazy! It is enough to save it for
> > blocks of text, as it is the case with a parallel structure. It is
> > very lightweight, as it only had about seven entries for the
> > example from above.
>
> I agree that you can save it in blocks. But I never claimed I
> would save format information for every character, I'd only do
> that for every _visible_ character. About 10 years ago, I know
> someone in the BBS scene who write a neat ANSI editor. And it
> consumed not much memory to safe a bunch of info for each
> visible character. It is much less, than doing a screenshot.

Heh. And the style information of the invisible text is lost or what?
Or do you keep it in ANSI-madness and parse it on scrolling?

> > > This costs some memory, yes, the more memory you use, the faster
> > > your rendering. My rendering would be very fast and allow
> > > everything, without reallocations. Why should one need to
> > > reallocate? If your window is 120 cols x 70 rows, you just fill
> > > all glyphs which don't represent a symbol with blanks (like in
> > > your terminal). Your widget_data would be of size
> > > Glyph[70][120], this is 70*120*(sizeof(Glyph)) = 445200 byte
> > > (~450kb) - not much if you perform a top and checkout that xterm
> > > uses 4MB.
> >
> > Are you nuts? oO
> > Every time you /insert a character/ at the beginning of your text,
> > you move 4 MB of data! On every key press! With a gapped array, you
> > would move nothing at all, except from 250 bytes /once/ when
> > starting to type.
>
> That is a cheap operation with a sane MMU, because you would use
> memcpy with a size of < 128k for example - and you would never
> need to memcpy more stuff than simply the visible data - and
> only in the case of line break. If you want to do it with a
> list, no problem, you can also dynamically allocate/reallocate
> blocks or lines. I doubt this will be faster though (because a
> memcpy < 128kb is known to be faster nowadays than allocating
> small bunches of mem all the time) - though you shouldn't care
> about either case, your OS should know it better than any app
> programmer).

In any case, the gapped array is never worse than the Glyph array. You
have to move only small blocks on cursor movement (and no, not each
time you move the cursor, but each time you start editing after cursor
movement!), and you have to reallocate all the block if the gap is used
up. Subsequent changes in the file need no movements and no
reallocations at all (until the gap is full.)

> > It's no big deal with a gapped array. Random access can be made to
> > be O(1) with a similar large constant (with some buffering) as with
> > the Glyph array; worst case would be somwhat near O(log n). As
> > already stated several times, handling the array would mean
> > updating everything on insertion/deletion. This is just insane!
>
> No, only on linbreaks... in special cases.

And excuse me, line breaks are really not that special. They happen
actually quite often.

> > With a line array you get a much faster, /WORKING/ widget with free
> > support for non-fixed fonts.
> > The overhead for richtext, so I estimate at (but I don't have much
> > of a clue, so to say) somwhere around 50–60 SLOC. One third for
>
> I think richtext will be 5kSLOC at least.

Depends on the implementation. There are one million ways to implement
it, and with line arrays and the parallel structure for styles, I don't
see where those 5 kSLOC might be. It's not much more than loading the
fonts and colour values into some configuration structure, parsing the
text and building the style structure, reflect changes on the text in
it (several lines), and rendering does Xlib.

> > You don't have to parse anything if you don't want to. That is the
> > beauty of this technique. Those characters have no width, no glyph,
> > they are ignored by any unicode-aware algorithm. Firstly, we strip
> > them off when parsing. It's a matter of less than half-a-dozen
> > SLOC. Then they're gone. The information is preserved in a parallel
> > structure. Very fast and efficient. NO PARSING ANYMORE FROM HERE
> > ON. When we output the text, for example writing it to a file or to
> > STDOUT, we format the data again with them. We don't even need to
> > reformat for rendering.
>
> If you implement an editor with syntax highlighting you need to
> parse on each input.

We are not talking about that. Or are we talking at cross
purposes /that/ badly?

> > Yes, paging is neccessary in any case, with the gapped array too.
> > The reallocations and copying occur on text editing and insertion,
> > not on scrolling.
>
> Copying must occur, at least in form of XMapArea, which is
> nothing else than memcpy (with about 4MB with 1024x768_AT_24bit)...

Sure. But not unnecessary copying ;-)

> > I don't see where it could be slow and flaky. It is about as fast as
> > your approach. And even if it /was/, it is better to have flaky text
> > selection than flaky text editing.
>
> No, flaky text selection is no option.

Neither is flaky editing.

> > > I don't care. UTF-8 is 7bit ASCII compliant, if other alphabets
> > > are not sorted within UTF-8 that is not our problem. Each value
> > > is a number between 0 and 1.xM, so why do you care?
> >
> > I have enough of computers that behave like they were build in the
> > 60ies. Text is such a basic thing in our world that it is ridiculous
> > that most programs still can't handle it properly.
>
> Yes, but this has nothing todo with orders. If UTF8 does not
> order the alphabets it supports correctly, then UTF8 is broken.
> No one should need to implement sorting for all Unicode
> languages... otherwise Unicode is broken.

Well, Unicode /can't/ order the alphabets »correctly.« There's no such
thing as »correctly!« The philosophy of Unicode is to encode scripts,
not languages. And as many, many scripts share subsets of languages,
you simply can't put them into the right order. In addition, Unicode
had to stay backwards compatible with legacy encodings, so it kept the
orderings of national standards intact. So just look at the latin
scripts: you have full ASCII, then comes the Latin-1 block, then come
blocks with loads and loads of additional variations of latin letters.
So you mingle ASCII- and Latin-1-non-letter-characters with the latin
alphabet, and you sort all those »a«'s with accents and stuff /after/
the normal »z!« Next, look at combining characters. If you would only
care about code point values, you would put things like an a-umlaut
hell knows where, as it is represented as:
        U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS
Furthermore, scripts like the indic ones or korean are /much/ more
complex than latin or cyrillic. They combine heavily, and they are even
displayed in a different order than they are stored! They characters by
themselves are even divided into several code points. (Yes, this
all /does/ make sense, but explaining this would blast the scope of
this discussion.)

> > > not need to sort. And I don't plan to implement any sorting
> > > algorithm which needs a specific mapping or reorder of UTF-8
> > > glyphs.
> >
> > Yes, it is irrelevant for a text widget, but not for a list widget,
> > for example.
>
> It is even irrelevant for a list widget, because the widget
> would call qsort and strcmp or whatever equivalents as interface
> funtions.

The C standard library string functions die when used with unicode.
Don't do that, it does not work. (As UTF-8 is backwards compatible with
ASCII, it works to some extent. But if you use some advanced features
of it, it breaks.) You /need/ special unicode-aware functions, either
by writing them yourself or using some other library like ICU (there is
libunicode or so for Linux, but it's rather a joke.)

> > > I don't give a shit. Wether an expression matches or not. Regexp
> > > implementations don't need anything else than using ranges
> > > between start and end char. They should use the order defined by
> > > UTF-8.
> >
> > Sorry, but this is totally ridiculous.
>
> No it is not, see libregexp9, it is UTF8-compliant for about 15
> years. And it works quite well. I also doubt that Unicode does
> not order the alphabets it supports correctly. If so, I won't
> care unless Latin is not ordered correctly.

I can't comment on that. Never used either Plan 9 nor some of its libs.
Of course, UTF-8-compliant is one thing. It is an important step if it
gives you sane results when feeded with UTF-8 instead of latin-1, as
feeding UTF-8 into a Unicode-unaware regexp engine could even make it
choke on latin characters. It does not mean that you can use the full
potential of Unicode and the UTF-8 encoding scheme!

application/pgp-signature attachment: stored

Received on Thu May 25 2006 - 02:07:08 UTC

This archive was generated by hypermail 2.2.0 : Sun Jul 13 2008 - 16:06:47 UTC