Re: [dev] [libgrapheme] Some questions about libgrapheme

From: Laslo Hunhold <dev_AT_frign.de>
Date: Fri, 2 Sep 2022 11:10:31 +0200

On Thu, 01 Sep 2022 21:43:06 -0300
atrtarget_AT_cock.li wrote:

Dear atrtarget,

thanks for reaching out!

> libgrapheme looks really useful, but I still don't get some things
> from it. For example, if I need to get back one grapheme, how should
> I do it since there's no `grapheme_prev_character_break`?

This is difficult to achieve, given the Unicode standard pretty much
gave up on specifying it (see [0]). Going backwards would first require
you to know that you're in a "safe spot" and not in the middle of
nowhere. Going back to a safe spot, though, is unspecified.

C as a language also makes it a bit inelegant to go "backwards" in a
string. The only way I could think of is a function prototype of the
form

size_t grapheme_prev_character_break(const uint_least32_t *str,
                                     size_t strlen, size_t offset);

where the offset is the "starting" point you want to go back from,
returning the offset of the previous breakpoint.

A trivial heuristic would be to step backwards, one code point at a
time, until the breakpoint-detector (run forwards from that point)
stops _before_ the specified offset. To be on the completely safe side,
though, I imagine you would have to go back even further until you
"see" two breakpoints (i.e. including the breakpoint before the desired
previous breakpoint) to force self-synchronization.
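
As a rough sketch of the trivial heuristic only (this is not part of
libgrapheme and it does not do the extra two-breakpoint step): the
names toy_next_break and prev_break_sketch are made up for this mail,
and the toy detector merely treats U+0300..U+036F as cluster-extending.
It stands in for a real forward detector, which would have to take its
place in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy stand-in for a forward break detector (NOT libgrapheme's rules):
 * a cluster is one code point plus any directly following combining
 * marks in U+0300..U+036F. Returns the length of the first cluster in
 * str[0..len), or 0 if len is 0.
 */
static size_t
toy_next_break(const uint_least32_t *str, size_t len)
{
	size_t i;

	if (len == 0)
		return 0;
	for (i = 1; i < len && str[i] >= 0x300 && str[i] <= 0x36F; i++)
		;
	return i;
}

/*
 * Hypothetical backward search: move the starting point back one code
 * point at a time and run the forward detector from there, until it
 * reports a break strictly before offset. From that break, walk
 * forwards to the last break before offset and return it. Returns 0
 * (the start of the string) if no such break is found.
 */
size_t
prev_break_sketch(const uint_least32_t *str, size_t len, size_t offset)
{
	size_t start, pos, next;

	assert(offset <= len);

	for (start = offset; start > 0; ) {
		start--;
		pos = start + toy_next_break(str + start, len - start);
		if (pos >= offset)
			continue; /* detector ran up to or past offset */

		/* pos is a break before offset; find the last such one */
		for (;;) {
			next = pos + toy_next_break(str + pos, len - pos);
			if (next >= offset)
				return pos;
			pos = next;
		}
	}

	return 0;
}
```

With the toy rules, a break found from a mid-cluster start is always a
real one. With the real rules that does not hold in general, which is
exactly why the extra backtracking described above would be needed.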

The problem with this heuristic is that the algorithm can become very
inefficient, especially when you have long preceding segments. If n is
the offset-length and the segment is in fact of length n-1, every
single backstep re-runs the detector up to the offset, so the k-th
backstep scans about k code points and the worst-case runtime is
1 + 2 + ... + (n-1) = O(n^2).

> And to get the number of columns a character takes up, should I
> convert everything to wchar and use `wcswidth`? In my case that would
> be very inefficient :( Thanks for reading

Unicode explicitly warns against using the East_Asian_Width property
(which is what wcswidth uses behind the scenes) to determine the
column-size of a string (see [1]):

        [...] the guidelines on use of this property should be
        considered recommendations based on a particular legacy
        practice that may be overridden by implementations as necessary.

and

        Note: The East_Asian_Width property is not intended for use by
        modern terminal emulators without appropriate tailoring on a
        case-by-case basis. Such terminal emulators need a way to
        resolve the halfwidth/fullwidth dichotomy that is necessary for
        such environments, but the East_Asian_Width property does not
        provide an off-the-shelf solution for all situations. The
        growing repertoire of the Unicode Standard has long exceeded
        the bounds of East Asian legacy character encodings, and
        terminal emulations often need to be customized to support edge
        cases and for changes in typographical behavior over time.

So, in other words: EAW was added to the standard decades ago and now
they're stuck with it. They don't recommend it without tailoring.
What does the tailoring look like? Unspecified, because this is a
text-rendering thing and impossible to solve on a "logical" basis.

I have the goal that libgrapheme only offers interfaces that work as
intended and not hacks that "usually" work or "have always worked"
based on a misinterpretation. This half-assed approach has already led
to many problems before in the context of text-handling in software.

The proper way to solve the column-problem is to render each grapheme
cluster and see how wide the font-rendering-library renders it, since
the width depends on the font. I know that this isn't satisfactory, but
that's how it is.

I hope that this answers your questions.

With best regards

Laslo

[0]:https://unicode.org/reports/tr29/#Random_Access
[1]:https://www.unicode.org/reports/tr11/tr11-39.html#Scope