I debugged dwm by adding the following to drw.c:
static void
log_msg(const char *fmt, ...)
{
	char buf[4096];
	va_list args;

	va_start(args, fmt);
	vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);
	fprintf(logfile, "%s\n", buf);
}
and a logging call inside utf8decodebyte:
static long
utf8decodebyte(const char c, size_t *i)
{
	for (*i = 0; *i < (UTF_SIZ + 1); ++(*i))
		if (((unsigned char)c & utfmask[*i]) == utfbyte[*i]) {
			log_msg("*i = %zu, for '%c' returning '%c'",
			        *i, c, (unsigned char)c & ~utfmask[*i]);
			return (unsigned char)c & ~utfmask[*i];
		}
	return 0;
}
and drw_text:
	utf8str = text;
	nextfont = NULL;
	while (*text) {
		log_msg("*text == '0x%X' == '%c'",
		        *text, *text);
		utf8charlen = utf8decode(text, &utf8codepoint, UTF_SIZ);
		for (curfont = drw->fonts; curfont; curfont = curfont->next) {
			charexists = charexists || XftCharExists(drw->dpy, curfont->xfont, utf8codepoint);
I got the following output for "thisátest.odt":
// á
*text == '0xFFFFFFE1' == '<E1>'
*i = 3, for '<E1>' returning '^A'
*i = 1, for 't' returning 't'
*text == '0x74' == 't'
*i = 1, for 't' returning 't'
and the following from "thisátestњ.odt":
// á
*text == '0xFFFFFFC3' == '<C3>'
*i = 2, for '<C3>' returning '^C'
*i = 0, for '<A1>' returning '!'
[...]
// њ
*text == '0xFFFFFFD1' == '<D1>'
*i = 2, for '<D1>' returning '^Q'
*i = 0, for '<9A>' returning '^Z'
From this it seems that dwm receives the correct UTF-8 representations of "á"
(0xC3 0xA1) and "њ" (0xD1 0x9A) for "thisátestњ.odt", but for "thisátest.odt"
it receives the ISO 8859-1 representation of "á" as the single byte 0xE1
(no wonder, given that it is passed a STRING instead of UTF8_STRING or
COMPOUND_TEXT), followed by the next ASCII character, 0x74 ("t"). dwm still
interprets those two bytes as a UTF-8 sequence, even though together they form
invalid UTF-8. That invalid UTF-8 is then passed on to libfreetype (or whatever
does the rendering), which simply stops output at that point.
Received on Mon Jul 10 2023 - 07:54:14 CEST