While reading up on the subject, I came up with several interesting ideas and
even more important design decisions. If some of the assertions in this
message are wrong, please correct me!
Reasons for UTF-8
Though UTF-8 is not as straightforward to implement as UTF-16 [1], most of the
time it offers the same, and in some situations even better, performance and
memory usage.
Then, UTF-8 is the accepted standard for inter-process, client-server and
whatever other communication, and it is the encoding scheme employed on all
Unices that use Unicode. Thus, using anything else internally would mean a
penalty whenever UTF-8 is used as the encoding scheme for communicating with
the outside world and other applications. This is a very strong argument, as
you will see later in this mail.
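To give an impression of what the extra implementation effort amounts to, a
decoder for a single UTF-8 sequence fits in a few lines of C. (This is only a
rough sketch of my own, not liblitz or Plan 9 libutf code, the names are
invented, and it does not reject overlong forms or surrogates.)

    #include <stddef.h>

    typedef unsigned int Rune;

    /* Decode one UTF-8 sequence starting at s into *r and return the
     * number of bytes consumed, or 0 on malformed input.  Safe on
     * NUL-terminated strings, since '\0' never looks like a
     * continuation byte. */
    size_t
    utf8decode(const unsigned char *s, Rune *r)
    {
        if (s[0] < 0x80) {                          /* 0xxxxxxx: ASCII */
            *r = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {         /* 110xxxxx 10xxxxxx */
            if ((s[1] & 0xC0) != 0x80)
                return 0;
            *r = (s[0] & 0x1F) << 6 | (s[1] & 0x3F);
            return 2;
        } else if ((s[0] & 0xF0) == 0xE0) {         /* 1110xxxx + 2 bytes */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
                return 0;
            *r = (s[0] & 0x0F) << 12 | (s[1] & 0x3F) << 6 | (s[2] & 0x3F);
            return 3;
        } else if ((s[0] & 0xF8) == 0xF0) {         /* 11110xxx + 3 bytes */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
                || (s[3] & 0xC0) != 0x80)
                return 0;
            *r = (s[0] & 0x07) << 18 | (s[1] & 0x3F) << 12
               | (s[2] & 0x3F) << 6 | (s[3] & 0x3F);
            return 4;
        }
        return 0;   /* stray continuation byte or invalid lead byte */
    }

Note how the ASCII case falls out for free, which is where a good deal of the
performance argument comes from.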
For the normalization form, I'm not sure. Using Normalization Form “C”
(meaning umlauts, accents and the like are combined into one code point
whenever possible [2], and compatibility characters, which come from other
encodings and are the same as other Unicode characters but have a different
font or styling, are not sanitized [3]) is promoted by the W3C [4].
Nonetheless, this is actually a legacy normalization form. The form propagated
by the Unicode Consortium is rather Form “D,” where combined code points are
split apart as far as possible, or even better Form “KD,” where compatibility
compositions are decomposed too, together with some application-dependent
markup of font and style, which is actually something that we want!
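To make the difference concrete in bytes (these are simply the UTF-8 encodings
of the two forms, nothing specific to any particular library):

    /* "ä" in Form C: the single code point U+00E4, two UTF-8 bytes */
    const char nfc[] = "\xC3\xA4";

    /* "ä" in Form D: U+0061 'a' followed by U+0308 COMBINING DIAERESIS,
     * three UTF-8 bytes */
    const char nfd[] = "a\xCC\x88";

    /* memcmp(nfc, nfd, ...) can never report these as equal, although
     * both render identically; whichever form is chosen internally has
     * to be applied consistently before any comparison. */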
Markup of Font and Style
As even the most primitive terminal emulators have facilities for markup and
style like coloured and blinking text, bold/italic fonts, text background
colours and other stuff, it would not be too bad if the default text widget
could do that as well. This would have the following implications:
• an obvious one: very simple code on the terminal emulator's side for
colours and bold fonts
• getting rid of compatibility compositions
• styled text in bars, titlebars, labels, text editors, menus and any other
GUIs for free!
The markup could be done via some easy-to-parse markup language using
characters from the Private Use Area of Unicode, or from the tag range of
characters created especially for this purpose, which are ignorable and
invisible to programs that don't know about that markup language.
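As a rough illustration of the idea (carrying style in tag characters is my
own sketch here, not an established convention), such a tag is just one more
four-byte UTF-8 sequence that an unaware program can skip or strip:

    /* Append the "tag" clone of an ASCII character (U+E0000 + c) to buf
     * as UTF-8 and return the number of bytes written.  E.g. 'b' becomes
     * U+E0062, which could mark the start of a bold run in a made-up
     * markup language. */
    static int
    puttag(unsigned char *buf, int c)
    {
        unsigned int r = 0xE0000 + (unsigned int)c;
        buf[0] = 0xF0 |  (r >> 18);
        buf[1] = 0x80 | ((r >> 12) & 0x3F);
        buf[2] = 0x80 | ((r >>  6) & 0x3F);
        buf[3] = 0x80 |  (r & 0x3F);
        return 4;
    }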
Outsourcing/Distributing Functionality
Both the UNIX and Plan9 philosophies are all about distinct, specialized tools
that do one thing and do that one thing really well. That is why we got the
concept of pipes, FIFOs and the “everything is a file” idea in Unix, and the
logical conclusion of all this, 9P.
As rendering, editing and in fact all work with text are enormously
complicated subjects (which is not a fact inherent just to Unicode text, but
to any text in any encoding!), trying to build something that does all aspects
of working with text well results in chaos, and in the end everything is
murky. Consider especially the fact that most operations on text strongly
depend on the locale! See the following examples:
• Sorting: alphabets in different languages/countries are different. A
German, for example, may sort a, b, c, d, while a Spaniard rather sorts
a, b, c, c͏h, d or something like that; also think of punctuation and the
like. The French would sort words with accents totally differently than
other people would! (See the small collation sketch after this list.)
• Transliteration: needed for sorting or for entering text in a script other
than one's own. How do you sort strings from different scripts? Wouldn't you
want to sort “Gorbachev” right before or after “Горбачев” (“Gorbachev”
written in Cyrillic letters)?
• Comparison/search: different characters might compare as equal or not in
different countries/languages. One might want to ignore smaller or larger
differences between characters (for example o vs ô). Another especially
interesting field here is upper/lowercase mapping. Comparison of text is
much more than bit-for-bit comparison, especially with Unicode!
• Regular expressions: how do you specify a range if you can't make any
assumptions about contiguous code point ranges?
• Line- and word-breaking algorithms: in some languages not only
word-breaking but also line-breaking needs a dictionary! (For example
Chinese or Japanese, where spaces normally are not used!)
• This intersects with finding word and sentence boundaries: needed when
implementing double- or triple-click selection of text! (Goes hand in hand
with /plumbing!/)
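To pick out just the sorting point from the list above: even plain C makes the
locale dependence visible. The German locale name below is only an example and
may not be installed everywhere:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *a = "\xC3\xA4pfel";   /* "äpfel" in UTF-8 */
        const char *b = "zebra";

        /* raw byte comparison: 0xC3 > 'z', so "äpfel" sorts after "zebra" */
        printf("bytewise: %d\n", strcmp(a, b));

        /* collation under German rules: "äpfel" sorts before "zebra" */
        setlocale(LC_COLLATE, "de_DE.UTF-8");
        printf("de_DE:    %d\n", strcoll(a, b));
        return 0;
    }

And that is only collation in a single locale; a dedicated helper application
could carry all of these tables and rules so the text widget doesn't have to.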
And there are surely many, many more. Putting all of this into the
functionality of one widget could drive you mad. A much more beautiful
solution would be to outsource this functionality into stand-alone
applications, like it is done today with spell checking through i/a/spell
(here again, when the internal format is UTF-8, no penalty is paid, which is
especially important when one communicates quite frequently with such helper
applications).
Also, compare this to a similar approach that acme takes with external
commands for search&replace and others.
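A minimal sketch of that pattern, with everything about it hypothetical (the
helper name, the temporary file, the buffer handling), just to show that it is
ordinary pipe plumbing and that the text crosses the boundary as plain UTF-8:

    #include <stdio.h>

    /* Run the current selection (already written to tmpfile) through an
     * external helper and read back the transformed text into out. */
    int
    filter_selection(const char *helper, const char *tmpfile,
                     char *out, size_t outlen)
    {
        char cmd[512];
        FILE *p;
        size_t n;

        snprintf(cmd, sizeof cmd, "%s < %s", helper, tmpfile);
        p = popen(cmd, "r");
        if (p == NULL)
            return -1;
        n = fread(out, 1, outlen - 1, p);
        out[n] = '\0';
        return pclose(p);
    }

Called, say, as filter_selection("uniksort", "/tmp/sel", buf, sizeof buf),
where "uniksort" stands for whatever locale-aware sorting or transliteration
helper ends up existing.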
Nonetheless, this still is a little bit tricky. A line-breaking algorithm has
to communicate with the rendering system all the time. I don't know what
technologies would have to be applied here. Maybe 9P is powerful enough for
that, or maybe it's not. This really is a topic worth a long discussion!
Addition: this mechanism is also quite future-proof. If standards change or
get extended, only the helper applications which manage those parts need to
be replaced.
Why the gapped array wins
Many operations on Unicode text may translate one or more characters into a
smaller or larger number of characters, effectively growing or shrinking the
text. A gapped array copes with that without reallocating most of the time.
That's another reason why I propose to use a gapped text structure as the
default text structure of liblitz and other libraries and programs.
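A minimal sketch of such a structure (all names invented for illustration;
this is not an existing liblitz API): the free space is kept as one gap at the
cursor, so local edits only move bytes when the cursor jumps, and only
reallocate when the gap runs out:

    #include <stdlib.h>
    #include <string.h>

    /* Layout: [ text before gap | gap | text after gap ] in one buffer. */
    typedef struct {
        char  *buf;     /* cap bytes of storage */
        size_t cap;     /* total capacity */
        size_t gap;     /* byte offset where the gap (cursor) starts */
        size_t gaplen;  /* current size of the gap */
    } GapBuf;

    /* Move the gap so that it starts at byte offset pos. */
    static void
    gapmove(GapBuf *g, size_t pos)
    {
        if (pos < g->gap)       /* cursor moved left: shift text rightwards */
            memmove(g->buf + pos + g->gaplen, g->buf + pos, g->gap - pos);
        else if (pos > g->gap)  /* cursor moved right: shift text leftwards */
            memmove(g->buf + g->gap, g->buf + g->gap + g->gaplen, pos - g->gap);
        g->gap = pos;
    }

    /* Insert n bytes (e.g. one UTF-8 sequence, or the expansion produced
     * by a normalization step) at pos; reallocates only when the gap is
     * smaller than n. */
    static int
    gapinsert(GapBuf *g, size_t pos, const char *s, size_t n)
    {
        if (g->gaplen < n) {
            size_t tail = g->cap - g->gap - g->gaplen;  /* bytes after gap */
            size_t newcap = g->cap + n + 4096;          /* ad-hoc growth policy */
            char *nb = realloc(g->buf, newcap);
            if (nb == NULL)
                return -1;
            /* keep the text after the gap at the very end of the buffer */
            memmove(nb + newcap - tail, nb + g->gap + g->gaplen, tail);
            g->gaplen += newcap - g->cap;
            g->cap = newcap;
            g->buf = nb;
        }
        gapmove(g, pos);
        memcpy(g->buf + g->gap, s, n);  /* the new bytes fill part of the gap */
        g->gap += n;
        g->gaplen -= n;
        return 0;
    }

Deletion is even cheaper: widening the gap over the deleted bytes is enough,
no bytes move at all.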
[1] Btw, IBM's ICU (which is used in both Java and LI18NUX) uses UTF-16
internally.
[2] Example:
the same character “ǟ” (which is, so I want to believe, used in
Vietnamese), encoded as
a U+0061 LATIN SMALL LETTER A
¨ U+0308 COMBINING DIAERESIS
¯ U+0304 COMBINING MACRON
in decomposed form, becomes
ǟ U+01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
in composed form.
Which is a bad thing, as we get a bunch of special cases for characters
that can also be created through the combining mechanism!
[3] Example:
the compatibility character ℎ
U+210E PLANCK CONSTANT
should rather be encoded as
<font markup> U+0068 LATIN SMALL LETTER H
On the one hand those compatibility characters are a hack, on the other
hand:
• the normalization routines would have to know about the markup language,
as otherwise they would drop too much information when they transform,
for example, Planck's constant into a simple h without markup
• many Unicode fonts already contain glyphs for those characters. If we
can't provide markup for /that special one/ character, we lose it,
though we could display it if we used some other normalization.
For example look at ℕ U+2115 DOUBLE-STRUCK CAPITAL N.
[4] http://www.w3.org/TR/charmod-norm/