[hackers] [libgrapheme] Improve a small edge-case in lg_utf8_decode() || Laslo Hunhold
commit faeaa564686873e4720a0c1ef9879f58347d754e
Author: Laslo Hunhold <dev_AT_frign.de>
AuthorDate: Sat Dec 18 01:04:37 2021 +0100
Commit: Laslo Hunhold <dev_AT_frign.de>
CommitDate: Sat Dec 18 01:23:32 2021 +0100
Improve a small edge-case in lg_utf8_decode()
Okay, this case is really crazy but possible: Before this change,
when we encountered e.g. a 0xF0 (which indicates a 4-byte-UTF-8
sequence and implies 3 subsequent continuation bytes) but have a
string-length of e.g. 2, we would automatically return 4 (> 2) no matter
how the following bytes look like to indicate that we need a larger
buffer.
However, it's actually necessary to check the subsequent bytes until
the buffer-end as we might have a case like
0xF0 0x80 0x00
where 0xF0 is followed by a single continuation byte but then the
continuation stops and we have a NUL-byte. It's more expected to
return 2 in such a situation because we obtain more information about
the string by inspecting the continuation bytes instead of throwing
our hands up so early.
Also add this to the test-cases of the decoder to prevent any
regressions.
Signed-off-by: Laslo Hunhold <dev_AT_frign.de>
diff --git a/src/utf8.c b/src/utf8.c
index c04fc0b..efd6068 100644
--- a/src/utf8.c
+++ b/src/utf8.c
_AT_@ -84,11 +84,29 @@ lg_utf8_decode(const char *s, size_t n, uint_least32_t *cp)
}
if (1 + off > n) {
/*
- * input is not long enough, set cp as invalid and
- * return number of bytes needed
+ * input is not long enough, set cp as invalid
*/
*cp = LG_INVALID_CODE_POINT;
- return 1 + off;
+
+ /*
+ * count the following continuation bytes, but nothing
+ * else in case we have a "rogue" case where e.g. such a
+ * sequence starter occurs right before a NUL-byte.
+ */
+ for (i = 0; 1 + i < n; i++) {
+ if(!BETWEEN(((const unsigned char *)s)[1 + i],
+ 0x80, 0xBF)) {
+ break;
+ }
+ }
+
+ /*
+ * if the continuation bytes do not continue until
+ * the end, return the incomplete sequence length.
+ * Otherwise return the number of bytes we actually
+ * expected, which is larger than n.
+ */
+ return ((1 + i) < n) ? (1 + i) : (1 + off);
}
/*
diff --git a/test/utf8-decode.c b/test/utf8-decode.c
index 0749688..d98314c 100644
--- a/test/utf8-decode.c
+++ b/test/utf8-decode.c
_AT_@ -113,6 +113,16 @@ static const struct {
.exp_len = 1,
.exp_cp = LG_INVALID_CODE_POINT,
},
+ {
+ /* invalid 3-byte sequence (short string, second byte malformed)
+ * [ 11100000 01111111 ] ->
+ * INVALID
+ */
+ .arr = (char *)(unsigned char[]){ 0xE0, 0x7F },
+ .len = 2,
+ .exp_len = 1,
+ .exp_cp = LG_INVALID_CODE_POINT,
+ },
{
/* invalid 3-byte sequence (third byte missing)
* [ 11100000 10111111 ] ->
_AT_@ -183,6 +193,27 @@ static const struct {
.exp_len = 1,
.exp_cp = LG_INVALID_CODE_POINT,
},
+ {
+ /* invalid 4-byte sequence (short string 1, second byte malformed)
+ * [ 11110011 011111111 ] ->
+ * INVALID
+ */
+ .arr = (char *)(unsigned char[]){ 0xF3, 0x7F },
+ .len = 2,
+ .exp_len = 1,
+ .exp_cp = LG_INVALID_CODE_POINT,
+ },
+ {
+ /* invalid 4-byte sequence (short string 2, second byte malformed)
+ * [ 11110011 011111111 10111111 ] ->
+ * INVALID
+ */
+ .arr = (char *)(unsigned char[]){ 0xF3, 0x7F, 0xBF },
+ .len = 3,
+ .exp_len = 1,
+ .exp_cp = LG_INVALID_CODE_POINT,
+ },
+
{
/* invalid 4-byte sequence (third byte missing)
* [ 11110011 10111111 ] ->
_AT_@ -203,6 +234,16 @@ static const struct {
.exp_len = 2,
.exp_cp = LG_INVALID_CODE_POINT,
},
+ {
+ /* invalid 4-byte sequence (short string, third byte malformed)
+ * [ 11110011 10111111 01111111 ] ->
+ * INVALID
+ */
+ .arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0x7F },
+ .len = 3,
+ .exp_len = 2,
+ .exp_cp = LG_INVALID_CODE_POINT,
+ },
{
/* invalid 4-byte sequence (fourth byte missing)
* [ 11110011 10111111 10111111 ] ->
Received on Sat Dec 18 2021 - 01:23:56 CET
This archive was generated by hypermail 2.3.0
: Sat Dec 18 2021 - 01:24:34 CET