I debugged dwm by adding the following to drw.c:
static void
log_msg(const char *fmt, ...)
{
	char buf[4096];
	va_list args;

	va_start(args, fmt);
	vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);
	fprintf(logfile, "%s\n", buf);
}
and a logging call inside utf8decodebyte:
static long
utf8decodebyte(const char c, size_t *i)
{
	for (*i = 0; *i < (UTF_SIZ + 1); ++(*i))
		if (((unsigned char)c & utfmask[*i]) == utfbyte[*i]) {
			log_msg("*i = %zu, for '%c' returning '%c'",
			        *i, c, (unsigned char)c & ~utfmask[*i]);
			return (unsigned char)c & ~utfmask[*i];
		}
	return 0;
}
and drw_text:
	utf8str = text;
	nextfont = NULL;
	while (*text) {
		log_msg("*text == '0x%X' == '%c'",
		        *text, *text);
		utf8charlen = utf8decode(text, &utf8codepoint, UTF_SIZ);
		for (curfont = drw->fonts; curfont; curfont = curfont->next) {
			charexists = charexists || XftCharExists(drw->dpy, curfont->xfont, utf8codepoint);
I got the following output for "thisátest.odt":
// á
*text == '0xFFFFFFE1' == '<E1>'
*i = 3, for '<E1>' returning '^A'
*i = 1, for 't' returning 't'
*text == '0x74' == 't'
*i = 1, for 't' returning 't'
and the following from "thisátestњ.odt":
// á
*text == '0xFFFFFFC3' == '<C3>'
*i = 2, for '<C3>' returning '^C'
*i = 0, for '<A1>' returning '!'
[...]
// њ
*text == '0xFFFFFFD1' == '<D1>'
*i = 2, for '<D1>' returning '^Q'
*i = 0, for '<9A>' returning '^Z'
From this it seems that dwm receives the correct UTF-8 representations of "á"
(0xC3 0xA1) and "њ" (0xD1 0x9A) for "thisátestњ.odt", but for "thisátest.odt"
it receives the ISO 8859-1 representation of "á" as the single byte 0xE1
(no wonder, given that it is passed a STRING instead of UTF8_STRING or
COMPOUND_TEXT), followed by the next ASCII character, 0x74 ("t"). dwm still
interprets those two bytes as a UTF-8 sequence, even though together they form
invalid UTF-8. That invalid UTF-8 is then passed on to libfreetype (or whatever
does the rendering), which simply stops output at that point.
Received on Mon Jul 10 2023 - 07:54:14 CEST