Re: [hackers] [libgrapheme] Refactor libgrapheme.7 || Laslo Hunhold

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Tue, 13 Oct 2020 14:45:42 +0200

Dear Laslo

Laslo Hunhold <dev_AT_frign.de> wrote:
> On Sat, 10 Oct 2020 22:47:01 +0200
> "Silvan Jegen" <s.jegen_AT_gmail.com> wrote:
>
> Dear Silvan,
>
> > I think libgrapheme is a very good idea! I have just one comment
> > below.
>
> thanks for taking your time to review this commit and give feedback. I
> really appreciate it!

Thank you for making the software world a better place! :)


> > > +The
> > > .Nm
> > > -is a C library for working with grapheme clusters. What are grapheme
> > > -clusters? In C, one usually uses 8-Bit unsigned integers (chars) to
> > > -store strings, and many people assume that one such char represents
> > > -one visible character in a printed output.
> > > +library provides functions to properly count characters
> > > +.Dq ( grapheme clusters )
> >
> > I feel like it should be made clear that from that point on, when the
> > man page mentions a "character" it refers to a grapheme cluster. The
> > reader can then either look up its definition or you could give a
> > short description together with it.
> >
> > That should make the uses of "character" further down less ambiguous
> > for the reader who is not familiar with the concept of a grapheme
> > cluster.
> >
> > Just my two cents!
>
> This is a good point. I worked this into the manpage in commit
> 706b4d4c. Is it clearer now or do you have further ideas to improve it?

It definitely looks (even) better to me now!

Below I have added a few more comments for the rendered man page.


> In many applications, it is necessary to count the number of user-per‐

I *think* it's better to leave out the first comma here but I could be
wrong about this.

> ceived characters, i.e. grapheme clusters, in a string. This is pretty

I would add an example of why you would want to know how many perceived
characters are in a UTF-8 string (to really drive home the point),
so something like:

In many applications, it is necessary to count the number of user-per‐
ceived characters, i.e. grapheme clusters, in a string. *In a text editor,
for example, you need to know how many grapheme clusters you have to draw
on the screen so you can work out whether they fit on one line.* This is
pretty ...


> simple with ASCII-strings, where you just count the number of bytes (as
> each byte is a code point and each code point is a grapheme cluster).
> With Unicode-strings, it is a common mistake to simply adapt the ASCII-
> approach and count the number of code points. This is wrong, as, for ex‐
> ample, the sequence “0x41 0x308 0x304”, while made up of 3 code points,
> is a single grapheme cluster and represents the user-perceived character
> ‘Ǟ’.
>
> The proper way to segment a string into user-perceived characters is to
> segment it into its grapheme clusters by applying the Unicode grapheme
> cluster breaking algorithm (UAX #29). It is based on a complex ruleset
> and lookup-tables and determines if a grapheme cluster ends or is contin‐
> ued between two code points. Libraries like ICU, which also offer this
> functionality, are often bloated, not correct, difficult to use or not
> statically linkable. The motivation behind libgrapheme is to make
> grapheme cluster handling suck less and abide the UNIX philosophy.

s/abide/abide by/ ? Not sure, we may need some native speaker input here.

Otherwise the page looks great to me!


Kind regards,

Silvan
Received on Tue Oct 13 2020 - 14:45:42 CEST
