On Wed, May 24, 2006 at 01:59:22AM +0200, Denis Grelich wrote:
> Reasons for UTF-8
No question, agreed.
> Markup of Font and Style
No, markup is the wrong direction; one is better off using a
separate representation for glyph rendering than marking up the
data with junk. Actually I'm not sure what you mean by gapped
arrays, but for me something like the following comes to mind to
represent text ready for drawing in a terminal-capable
text widget:
typedef struct {
    Rune rune;
    Font font;
    Color fg, bg;
    unsigned int w, h;
    Bool invert; /* for selection */
    /* updated by rendering */
    unsigned int x, y;
} Glyph;
A text widget is simply a 2-dimensional array of such glyphs:
Glyph widget_data[ROWS][COLS];
Now you have some text as input, e.g.:
char *text =
    "Sendmail could allow a remote attacker to execute arbitrary code "
    "as root, caused by a signal race vulnerability.";
All you need is a prepare function, which transforms each glyph
of this text into the above widget_data, e.g. using the following
steps:
1. Use iconv to convert the given text from the current locale
into UTF-8.
2. Run a filter on the text which converts each character or
word (for syntax highlighting) into the widget_data structure
(this step remembers the glyph offsets where changes were made)
3. Render the widget_data structure in the changed range
4. Map this stuff
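
To make the steps above more concrete, here is a minimal sketch of
the simple filter in C. It assumes the input is already UTF-8, so
the iconv call of step 1 is left out, and it uses a reduced,
hypothetical Cell type in place of Glyph so that it compiles without
Font and Color definitions. The names Cell, decode and prepare are
made up for illustration; rendering and mapping (steps 3 and 4) are
not shown.

```c
#include <stddef.h>
#include <stdint.h>

enum { ROWS = 4, COLS = 16 };

typedef uint32_t Rune;

/* Reduced stand-in for Glyph (hypothetical): font and colors are
 * plain integers so the sketch needs no toolkit headers. */
typedef struct {
    Rune rune;
    int font;
    unsigned long fg, bg;
} Cell;

/* Decode one UTF-8 sequence from s into *r and return the number of
 * bytes consumed (at least 1). Malformed input yields U+FFFD and
 * consumes one byte; overlong forms are not rejected in this sketch. */
static size_t
decode(const char *s, Rune *r)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i, n;

    if (p[0] < 0x80) { *r = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *r = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *r = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *r = p[0] & 0x07; n = 4; }
    else { *r = 0xFFFD; return 1; }

    for (i = 1; i < n; i++) {
        /* A NUL or any other non-continuation byte stops decoding. */
        if ((p[i] & 0xC0) != 0x80) { *r = 0xFFFD; return 1; }
        *r = (*r << 6) | (p[i] & 0x3F);
    }
    return n;
}

/* The "simple filter": one font, one color pair, no highlighting.
 * Fills the grid row by row, starts a new row on '\n' or when a row
 * is full, and returns the number of cells stored. */
static size_t
prepare(Cell grid[ROWS][COLS], const char *utf8)
{
    size_t row = 0, col = 0, count = 0;
    Rune r;

    while (*utf8 && row < ROWS) {
        utf8 += decode(utf8, &r);
        if (r == '\n') { row++; col = 0; continue; }
        grid[row][col] = (Cell){ r, 0, 0x000000, 0xFFFFFF };
        count++;
        if (++col == COLS) { row++; col = 0; }
    }
    return count;
}
```

Calling prepare(grid, "a\xC3\xA4\nz") stores three cells: 'a' and
U+00E4 in the first row, 'z' at the start of the second.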
The only complicated thing is the filter. A simple filter won't
do colorization/different font metrics/style, it would simply
use the same font for every glyph (and if one really wants fancy
font stuff, one should write such a filter later).
The rendering algorithm needs to arrange the glyphs on rendering,
e.g. bigger glyphs (with non-fixed fonts) will increase the
distance to the previous row. Each row might also hold a
variable number of glyphs (if the font is fixed, it will
behave like a terminal, all glyph boxes will have the same
geometry).
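
A rough sketch of such an arrangement step for one row, assuming the
filter has already filled in w and h for every glyph. The Box type
and layout_row are hypothetical names that carry only the geometry
fields of Glyph; the row height is simply taken as the tallest glyph
in that row.

```c
#include <stddef.h>

/* Geometry part of a glyph (hypothetical reduced type). */
typedef struct {
    unsigned int w, h; /* box size, set by the filter */
    unsigned int x, y; /* position, updated by layout */
} Box;

/* Place n boxes left to right starting at vertical offset y.
 * Returns the y offset at which the next row starts, which is
 * y plus the height of the tallest box in this row. */
static unsigned int
layout_row(Box *row, size_t n, unsigned int y)
{
    unsigned int x = 0, height = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        row[i].x = x;
        row[i].y = y;
        x += row[i].w;
        if (row[i].h > height)
            height = row[i].h;
    }
    return y + height;
}
```

With a fixed font every box has the same w and h, so this degrades
to the plain terminal grid.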
For mouse interaction you first seek the correct line using the
pointer's y-position, then you seek the correct glyph using the
pointer's x-position. This is linear behavior. (With non-fixed
boxes one cannot do it faster).
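
The lookup could be sketched like this, assuming each row knows its
vertical extent and the boxes of its glyphs. Row, Box and hit are
hypothetical names for illustration only.

```c
#include <stddef.h>

/* Horizontal extent of one glyph box (hypothetical reduced type). */
typedef struct {
    unsigned int x, w;
} Box;

/* One laid-out row: vertical extent plus its glyph boxes. */
typedef struct {
    unsigned int y, h;
    const Box *boxes;
    size_t n;
} Row;

/* Find the glyph under the pointer at (px, py). Scans the rows by y,
 * then the boxes of the matching row by x. Returns 1 and stores the
 * indices in *row and *col on a hit, 0 otherwise. */
static int
hit(const Row *rows, size_t nrows, unsigned int px, unsigned int py,
    size_t *row, size_t *col)
{
    size_t i, j;

    for (i = 0; i < nrows; i++) {
        if (py < rows[i].y || py >= rows[i].y + rows[i].h)
            continue;
        for (j = 0; j < rows[i].n; j++) {
            const Box *b = &rows[i].boxes[j];
            if (px >= b->x && px < b->x + b->w) {
                *row = i;
                *col = j;
                return 1;
            }
        }
        return 0; /* right row, but pointer is past its last glyph */
    }
    return 0; /* pointer is below the last row */
}
```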
> Outsourcing/Distributing Functionality
For performance reasons the above filter functions should not be
outsourced; they should live in a sane library. Maybe as
extensions.
> Both the UNIX and Plan9 philosophies are all about distinct, specialized tools
> that do one thing, and do this one thing really good. Therefore we got the
> concept of Pipes, fifos and the “everything is a file” idea in Unix, and the
> escalation of this, 9p.
> As rendering, editing and actually all working with text are enormously
> complicated subjects (which is not a fact just inherent to Unicode text, but to
> any text in any encoding!), trying to do something that does all of the
> aspects of working with texts well results in chaos and in the end, everything
> is murky. Consider especially the fact that most operations on text strongly
> depend on locale! See following examples:
> • Sorting: alphabets in different languages/countries are different. A
> German for example may sort like a, b, c, d, a Spaniard rather sorts like
> a, b, c, c͏h, d or something like that; also, think of punctuation and
> stuff. The French would sort words with accents totally differently than
> other people would do!
> • Transliteration: needed for sorting or entering text of a script other
> than mine. How do you sort string from different scripts? Wouldn't you
> want to sort “Gorbachev” right before or after “Горбачев?” (“Gorbachev”
> written in Cyrillic letters.)
> • Comparison/search: different characters might compare as equal or not in
> different countries/languages. One might want to ignore smaller or larger
> differences between characters (for example o vs ô). Another especially
> interesting field here is upper/lowercase mapping. Comparison in text
> is much more than bit-for-bit comparison, especially with Unicode!
> • Regular expressions: how do you specify a range, if you can't make any
> assumptions about continuous code point ranges?
> • Line- and word-breaking algorithms: In some languages, not only
> word-breaking, but also line breaking needs a dictionary! (For example
> Chinese or Korean, where spaces normally are not used!)
> • This intersects also with finding word- and sentence boundaries: needed
> when implementing double- or triple-click selection of text! (Goes hand in
> hand with /plumbing!/)
> And there are surely many, many more. To put all this into the functionality
> of one widget could drive you mad. A much more beautiful solution would be to
> out-source this functionality in stand-alone applications like it is done with
> spell checking through i/a/spell today (here again, when the internal format
> is UTF-8, no penalty is paid here, which is especially important when one
> quite frequently communicates with such functionality apps.)
> Also, compare this to a similar approach that acme takes with external
> commands for search&replace and others.
> Nonetheless, this still is a little bit tricky. A line-breaking algorithm has
> to communicate with the rendering system all the time. I don't know what
> technologies would be needed to be applied here. Maybe 9P is powerful enough
> for that, or maybe it's not. This really is a topic worth a long discussion!
Well, I don't think you should bother with all these problems,
because I don't see any need to. Sorting should use the
order defined by UTF-8, nothing else. Comparison should use the
numeric values of runes defined by UTF-8. Regular expressions
should use the numeric values of runes defined by UTF-8.
Breaking algorithms should use " \t\n" as breaking runes.
If UTF-8 is broken in this regard (I doubt it, because all
major sorting issues should be defined in the correct order in
UTF-8, and umlauts or special chars in languages which have a
Latin base alphabet simply appear after z, nothing wrong with
this), then I won't care about it. Do it the simplest way it
can be done. Don't overcomplicate things unless there is a need
to do so.
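
As a minimal sketch of how little code this needs in C: for valid
UTF-8, comparing strings byte by byte gives the same order as
comparing them rune by rune, so strcmp is enough for sorting and
comparison, and strcspn finds the next breaking rune. The helper
names cmp_utf8 and word_len are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of char pointers. strcmp compares
 * bytes as unsigned char, and for valid UTF-8 that byte order is
 * the same as the order of the rune values. */
static int
cmp_utf8(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Length in bytes of the first word of s, where a word ends at a
 * space, tab, newline, or the end of the string. */
static size_t
word_len(const char *s)
{
    return strcspn(s, " \t\n");
}
```

Sorting "\xC3\xA4pfel" (äpfel), "zebra" and "apple" with
qsort(words, 3, sizeof words[0], cmp_utf8) puts "apple" first and
"äpfel" last, i.e. the umlaut simply lands after z.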
> Why the gapped array wins
> Many operations on unicode text may translate one/some character/s to a
> smaller or larger amount of characters, effectively growing or shrinking the
> text size. The gapped array copes with that without reallocating. That's
> another reason why I propose to use a gapped text structure as the default
> text structure of liblitz or other libraries and programs.
Well, explain what you mean by gapped array.
-- Anselm R. Garbe ><>< www.ebrag.de ><>< GPG key: 0D73F361
Received on Wed May 24 2006 - 08:48:59 UTC