[hackers] [libgrapheme] Consistently refer to "codepoints" as "codepoints", not "code points" || Laslo Hunhold from git_AT_suckless.org on 2021-12-18 (hackers mail list archive)

From: <git_AT_suckless.org>
Date: Sat, 18 Dec 2021 13:27:43 +0100 (CET)

commit 59952de9863572fbca88c3f9f1292709d381407b
Author: Laslo Hunhold <dev_AT_frign.de>
AuthorDate: Sat Dec 18 13:24:30 2021 +0100
Commit: Laslo Hunhold <dev_AT_frign.de>
CommitDate: Sat Dec 18 13:24:30 2021 +0100

    Consistently refer to "codepoints" as "codepoints", not "code points"

    Both are valid forms and Unicode prefers the latter, but maybe it's
    because I'm a German speaker (known for ridiculous compound words like
    "Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamter")
    and like compound words that I prefer the former.

    Signed-off-by: Laslo Hunhold <dev_AT_frign.de>

diff --git a/gen/util.c b/gen/util.c
index 7b684ba..3d5d26a 100644
--- a/gen/util.c
+++ b/gen/util.c
@@ -325,7 +325,7 @@ segment_test_callback(char *fname, char **field, size_t nfields, char *comment,
                                 return 1;
                         }
                 } else {
-                        /* add code point to cp-array */
+                        /* add codepoint to cp-array */
                         if ((t->cp = realloc(t->cp, ++t->cplen *
                                              sizeof(*t->cp))) == NULL) {
                                 fprintf(stderr, "segment_test_callback: realloc: %s.\n", strerror(errno));
diff --git a/man/grapheme_character_isbreak.3 b/man/grapheme_character_isbreak.3
index a900dc9..8d813ec 100644
--- a/man/grapheme_character_isbreak.3
+++ b/man/grapheme_character_isbreak.3
@@ -3,7 +3,7 @@
.Os suckless.org
.Sh NAME
.Nm grapheme_character_isbreak
-.Nd test for a grapheme cluster break between two code points
+.Nd test for a grapheme cluster break between two codepoints
.Sh SYNOPSIS
.In grapheme.h
.Ft size_t
@@ -13,7 +13,7 @@ The
.Fn grapheme_character_isbreak
function determines if there is a grapheme cluster break (see
.Xr libgrapheme 7 )
-between the two code points
+between the two codepoints
.Va cp1
and
.Va cp2 .
@@ -33,7 +33,7 @@ The
.Fn grapheme_character_isbreak
function returns
.Va true
-if there is a grapheme cluster break between the code points
+if there is a grapheme cluster break between the codepoints
.Va cp1
and
.Va cp2
diff --git a/man/grapheme_utf8_decode.3 b/man/grapheme_utf8_decode.3
index 6d90c32..69352a8 100644
--- a/man/grapheme_utf8_decode.3
+++ b/man/grapheme_utf8_decode.3
@@ -3,7 +3,7 @@
.Os suckless.org
.Sh NAME
.Nm grapheme_utf8_decode
-.Nd decode first code point in UTF-8-encoded string
+.Nd decode first codepoint in UTF-8-encoded string
.Sh SYNOPSIS
.In grapheme.h
.Ft size_t
@@ -11,20 +11,20 @@
.Sh DESCRIPTION
The
.Fn grapheme_utf8_decode
-function decodes the next code point in the UTF-8-encoded string
+function decodes the next codepoint in the UTF-8-encoded string
.Va str
of length
.Va len .
If the UTF-8-sequence is invalid (overlong encoding, unexpected byte,
string ends unexpectedly, empty string, etc.) the decoding is stopped
-at the last processed byte and the decoded code point set to
+at the last processed byte and the decoded codepoint set to
.Dv GRAPHEME_INVALID_CODE_POINT.
.Pp
If
.Va cp
is not
.Dv NULL
-the decoded code point is stored in the memory pointed to by
+the decoded codepoint is stored in the memory pointed to by
.Va cp .
.Pp
Given NUL has a unique 1 byte representation, it is safe to operate on
diff --git a/man/grapheme_utf8_encode.3 b/man/grapheme_utf8_encode.3
index a2f05c8..c56f2ca 100644
--- a/man/grapheme_utf8_encode.3
+++ b/man/grapheme_utf8_encode.3
@@ -3,7 +3,7 @@
.Os suckless.org
.Sh NAME
.Nm grapheme_utf8_encode
-.Nd encode code point into UTF-8 string
+.Nd encode codepoint into UTF-8 string
.Sh SYNOPSIS
.In grapheme.h
.Ft size_t
@@ -11,7 +11,7 @@
.Sh DESCRIPTION
The
.Fn grapheme_utf8_encode
-function encodes the code point
+function encodes the codepoint
.Va cp
into a UTF-8-string.
If
diff --git a/man/libgrapheme.7 b/man/libgrapheme.7
index 4071602..dc3e83e 100644
--- a/man/libgrapheme.7
+++ b/man/libgrapheme.7
@@ -29,25 +29,25 @@ making up a written language). ASCII for instance, which comprises the
range 0 to 127, assigns the number 65 (0x41) to the abstract character
.Sq A .
This number is called a
-.Dq code point ,
-and all code points of an encoding make up its so-called
+.Dq codepoint ,
+and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
-first 128 code points are identical to ASCII's. The additional code
+first 128 codepoints are identical to ASCII's. The additional code
points are needed as Unicode's goal is to express all writing systems
of the world. To give an example, the abstract character
.Sq \[u00C4]
-is not expressable in ASCII, given no ASCII code point has been assigned
-to it. It can be expressed in Unicode, though, with the code point 196
+is not expressable in ASCII, given no ASCII codepoint has been assigned
+to it. It can be expressed in Unicode, though, with the codepoint 196
(0xC4).
.Pp
One may assume that this process is straightfoward, but as more and
-more code points were assigned to abstract characters, the Unicode
+more codepoints were assigned to abstract characters, the Unicode
Consortium (that defines the Unicode standard) was facing a problem:
Many (mostly non-European) languages have such a large amount of
abstract characters that it would exhaust the available Unicode code
-space if one tried to assign a code point to each abstract character. The
+space if one tried to assign a codepoint to each abstract character. The
solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
@@ -63,9 +63,9 @@ of the
.Dq base character
.Sq A .
.Pp
-The Unicode Consortium adapted this idea by assigning code points to
-modifications. For example, the code point 0x308 represents adding an
-umlaut and 0x304 represents adding a macron, and thus, the code point
+The Unicode Consortium adapted this idea by assigning codepoints to
+modifications. For example, the codepoint 0x308 represents adding an
+umlaut and 0x304 represents adding a macron, and thus, the codepoint
sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
@@ -73,15 +73,15 @@ namely the base character
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
-As a side-note, the single code point 0x1DE was also assigned to
+As a side-note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example for the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
-Expressing a single abstract character with multiple code points solved
+Expressing a single abstract character with multiple codepoints solved
the code space exhaustion-problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.). A sequence
-(which can also have the length 1) of code points that belong together
+(which can also have the length 1) of codepoints that belong together
this way and represents an abstract character is called a
.Dq grapheme cluster .
.Pp
@@ -89,12 +89,12 @@ In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string. A good
example for this is a terminal text editor, which needs to properly align
characters on a grid. This is pretty simple with ASCII-strings, where you
-just count the number of bytes (as each byte is a code point and each
-code point is a grapheme cluster). With Unicode-strings, it is a common
+just count the number of bytes (as each byte is a codepoint and each
+codepoint is a grapheme cluster). With Unicode-strings, it is a common
mistake to simply adapt the ASCII-approach and count the number of code
points. This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
-while made up of 3 code points, is a single grapheme cluster and
+while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
.Pp
@@ -102,7 +102,7 @@ The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29). It is based on a complex
ruleset and lookup-tables and determines if a grapheme cluster ends or
-is continued between two code points. Libraries like ICU, which also
+is continued between two codepoints. Libraries like ICU, which also
offer this functionality, are often bloated, not correct, difficult to
use or not statically linkable. The motivation behind
.Nm
diff --git a/src/character.c b/src/character.c
index 2ee1a72..06aa8d3 100644
--- a/src/character.c
+++ b/src/character.c
@@ -201,14 +201,14 @@ grapheme_character_nextbreak(const char *str)
          * the null byte for the reasons given above.
          */

-        /* get first code point */
+        /* get first codepoint */
         len += grapheme_utf8_decode(str, (size_t)-1, &cp0);
         if (cp0 == GRAPHEME_INVALID_CODE_POINT) {
                 return len;
         }

         while (cp0 != 0) {
-                /* get next code point */
+                /* get next codepoint */
                 ret = grapheme_utf8_decode(str + len, (size_t)-1, &cp1);

                 if (cp1 == GRAPHEME_INVALID_CODE_POINT ||
diff --git a/src/utf8.c b/src/utf8.c
index a74b8c1..8be67c9 100644
--- a/src/utf8.c
+++ b/src/utf8.c
@@ -10,8 +10,8 @@
static const struct {
         uint8_t lower; /* lower bound of sequence first byte */
         uint8_t upper; /* upper bound of sequence first byte */
-        uint_least32_t mincp; /* smallest non-overlong encoded code point */
-        uint_least32_t maxcp; /* largest encodable code point */
+        uint_least32_t mincp; /* smallest non-overlong encoded codepoint */
+        uint_least32_t maxcp; /* largest encodable codepoint */
         /*
          * implicit: table-offset represents the number of following
          * bytes of the form 10xxxxxx (6 bits capacity each)
@@ -129,7 +129,7 @@ grapheme_utf8_decode(const char *s, size_t n, uint_least32_t *cp)
                         return 1 + (i - 1);
                 }
                 /*
-                 * shift code point by 6 bits and add the 6 stored bits
+                 * shift codepoint by 6 bits and add the 6 stored bits
                  * in s[i] to it using the bitmask 0x3F (00111111)
                  */
                 *cp = (*cp << 6) | (((const unsigned char *)s)[i] & 0x3F);
@@ -139,7 +139,7 @@ grapheme_utf8_decode(const char *s, size_t n, uint_least32_t *cp)
             BETWEEN(*cp, UINT32_C(0xD800), UINT32_C(0xDFFF)) ||
             *cp > UINT32_C(0x10FFFF)) {
                 /*
-                 * code point is overlong encoded in the sequence, is a
+                 * codepoint is overlong encoded in the sequence, is a
                  * high or low UTF-16 surrogate half (0xD800..0xDFFF) or
                  * not representable in UTF-16 (>0x10FFFF) (RFC-3629
                  * specifies the latter two conditions)
@@ -158,7 +158,7 @@ grapheme_utf8_encode(uint_least32_t cp, char *s, size_t n)
         if (BETWEEN(cp, UINT32_C(0xD800), UINT32_C(0xDFFF)) ||
             cp > UINT32_C(0x10FFFF)) {
                 /*
-                 * code point is a high or low UTF-16 surrogate half
+                 * codepoint is a high or low UTF-16 surrogate half
                  * (0xD800..0xDFFF) or not representable in UTF-16
                  * (>0x10FFFF), which RFC-3629 deems invalid for UTF-8.
                  */
diff --git a/test/utf8-decode.c b/test/utf8-decode.c
index 537694b..1de282c 100644
--- a/test/utf8-decode.c
+++ b/test/utf8-decode.c
@@ -11,7 +11,7 @@ static const struct {
         char *arr; /* UTF-8 byte sequence */
         size_t len; /* length of UTF-8 byte sequence */
         size_t exp_len; /* expected length returned */
-        uint_least32_t exp_cp; /* expected code point returned */
+        uint_least32_t exp_cp; /* expected codepoint returned */
} dec_test[] = {
         {
                 /* empty sequence
diff --git a/test/utf8-encode.c b/test/utf8-encode.c
index dc9090b..6dd5637 100644
--- a/test/utf8-encode.c
+++ b/test/utf8-encode.c
@@ -8,42 +8,42 @@
#include "util.h"

static const struct {
-        uint_least32_t cp; /* input code point */
+        uint_least32_t cp; /* input codepoint */
         char *exp_arr; /* expected UTF-8 byte sequence */
         size_t exp_len; /* expected length of UTF-8 sequence */
} enc_test[] = {
         {
-                /* invalid code point (UTF-16 surrogate half) */
+                /* invalid codepoint (UTF-16 surrogate half) */
                 .cp = UINT32_C(0xD800),
                 .exp_arr = (char *)(unsigned char[]){ 0xEF, 0xBF, 0xBD },
                 .exp_len = 3,
         },
         {
-                /* invalid code point (UTF-16-unrepresentable) */
+                /* invalid codepoint (UTF-16-unrepresentable) */
                 .cp = UINT32_C(0x110000),
                 .exp_arr = (char *)(unsigned char[]){ 0xEF, 0xBF, 0xBD },
                 .exp_len = 3,
         },
         {
-                /* code point encoded to a 1-byte sequence */
+                /* codepoint encoded to a 1-byte sequence */
                 .cp = 0x01,
                 .exp_arr = (char *)(unsigned char[]){ 0x01 },
                 .exp_len = 1,
         },
         {
-                /* code point encoded to a 2-byte sequence */
+                /* codepoint encoded to a 2-byte sequence */
                 .cp = 0xFF,
                 .exp_arr = (char *)(unsigned char[]){ 0xC3, 0xBF },
                 .exp_len = 2,
         },
         {
-                /* code point encoded to a 3-byte sequence */
+                /* codepoint encoded to a 3-byte sequence */
                 .cp = 0xFFF,
                 .exp_arr = (char *)(unsigned char[]){ 0xE0, 0xBF, 0xBF },
                 .exp_len = 3,
         },
         {
-                /* code point encoded to a 4-byte sequence */
+                /* codepoint encoded to a 4-byte sequence */
                 .cp = UINT32_C(0xFFFFF),
                 .exp_arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0xBF, 0xBF },
                 .exp_len = 4,
Received on Sat Dec 18 2021 - 13:27:43 CET
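
For readers following the enc_test vectors in the patch, here is a minimal standalone sketch of how a codepoint might be encoded to UTF-8. It is not libgrapheme's implementation: the names sketch_utf8_encode, sketch_utf8_seqlen and SKETCH_REPLACEMENT are hypothetical, and substituting U+FFFD for invalid codepoints is an assumption inferred from the expected bytes EF BF BD in the test table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* assumption: invalid input is replaced by U+FFFD (bytes EF BF BD) */
#define SKETCH_REPLACEMENT UINT32_C(0xFFFD)

/*
 * Hypothetical helper: encode cp into out (at least 4 bytes large)
 * and return the number of bytes written.
 */
size_t
sketch_utf8_encode(uint_least32_t cp, unsigned char *out)
{
	assert(out != NULL);

	/* surrogate halves and values above 0x10FFFF are invalid (RFC 3629) */
	if ((cp >= UINT32_C(0xD800) && cp <= UINT32_C(0xDFFF)) ||
	    cp > UINT32_C(0x10FFFF)) {
		cp = SKETCH_REPLACEMENT;
	}

	if (cp < UINT32_C(0x80)) {
		/* 0xxxxxxx */
		out[0] = (unsigned char)cp;
		return 1;
	} else if (cp < UINT32_C(0x800)) {
		/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < UINT32_C(0x10000)) {
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}

	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Hypothetical helper: total UTF-8 byte length of a codepoint sequence. */
size_t
sketch_utf8_seqlen(const uint_least32_t *cp, size_t ncp)
{
	unsigned char buf[4];
	size_t i, len = 0;

	for (i = 0; i < ncp; i++) {
		len += sketch_utf8_encode(cp[i], buf);
	}

	return len;
}
```

Applied to the sequence 0x41 0x308 0x304 from libgrapheme.7, sketch_utf8_seqlen counts 5 bytes for 3 codepoints that form a single grapheme cluster, so neither the byte count nor the codepoint count gives the number of user-perceived characters.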
