
"Unicode" means "UCS-2" #230

Open
icculus opened this issue Jun 23, 2022 · 8 comments
@icculus
Collaborator

icculus commented Jun 23, 2022

It appears that when we say "Unicode" in SDL_ttf, we mean UCS-2 encoding (each character is 16 bits).

UCS-2 covers the Basic Multilingual Plane, which handles an enormous amount of human language, but it does not cover the entirety of Unicode...and while probably no one cares about, I don't know, Klingon, the limitation also means it can't do emoji glyphs, which people care about a lot.

The doesn't-break-ABI solution here is to say the "UNICODE" functions take UTF-16 encoding, which is an extension of UCS-2: most characters are the same, but there are some magic extension bits that make some codepoints take a sequence of two 16-bit values, which gets you access to values > 0xFFFF. This is what win32 ended up doing, around WinXP, so all their Unicode functions stayed the same but could handle the higher values when they showed up in a string. UTF-16 is kind of the worst of all worlds: variable size like UTF-8 but wastes bits like UCS-4...but it gets the job done in a backwards-compatible way.
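For reference, the surrogate-pair mechanism looks like this. This is a minimal sketch, not SDL_ttf code: the `utf16_next` helper name is made up for illustration, plain `stdint.h` types are used instead of SDL's, and it assumes a well-formed input (a high surrogate is always followed by a low surrogate).

```c
#include <stdint.h>

/* Hypothetical helper: decode one codepoint from a UTF-16 sequence and
   advance the cursor. A lone BMP unit is returned as-is (identical to
   UCS-2); a high surrogate (0xD800-0xDBFF) combines with the following
   low surrogate (0xDC00-0xDFFF) into a codepoint above 0xFFFF. */
static uint32_t utf16_next(const uint16_t **src)
{
    uint16_t hi = *(*src)++;
    if (hi >= 0xD800 && hi <= 0xDBFF) {
        uint16_t lo = *(*src)++;  /* assumes a valid low surrogate follows */
        return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    }
    return hi;
}
```

For example, U+1F600 (the grinning-face emoji) is encoded as the pair 0xD83D 0xDE00, which the helper reassembles into 0x1F600.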

If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.

Otherwise, probably look for STR_UNICODE in the source code and see where it gets used, and clean out UCS-2isms.

(If we do nothing, these higher codepoint values are available to apps if they encode their strings in UTF-8, since that can already represent those values.)

@1bsyl
Contributor

1bsyl commented Jun 23, 2022

STR_UNICODE marks the case where the user is passing a UNICODE input parameter (so UCS-2), which gets converted to the internal string format (UTF-8) by calling UCS2_to_UTF8().

So the "doesn't-break-ABI" fix would be to replace UCS2_to_UTF8() with a "UTF16_to_UTF8()" function ...
(and likewise UCS2_to_UTF8_len())

https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2772

/* Gets the number of bytes needed to convert a UCS2 string to UTF-8 */
static size_t UCS2_to_UTF8_len(const Uint16 *text)
{
    size_t bytes = 1;
    while (*text) {
        Uint16 ch = *text++;
        if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2804

/* Convert a UCS-2 string to a UTF-8 string */
static void UCS2_to_UTF8(const Uint16 *src, Uint8 *dst)
{
    SDL_bool swapped = TTF_byteswapped;

    while (*src) {
        Uint16 ch = *src++;
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }
        if (ch <= 0x7F) {
            *dst++ = (Uint8) ch;
        } else if (ch <= 0x7FF) {
            *dst++ = 0xC0 | (Uint8) ((ch >> 6) & 0x1F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        } else {
            *dst++ = 0xE0 | (Uint8) ((ch >> 12) & 0x0F);
            *dst++ = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        }
    }
    *dst = '\0';
}
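To give an idea of what the length half of that replacement might look like, here is a rough sketch of a "UTF16_to_UTF8_len()", assuming the same shape as UCS2_to_UTF8_len() above. It uses plain `stdint.h` types rather than SDL's, and (like the current function) omits the BOM/byte-swapping handling; the only real change is that a surrogate pair consumes two 16-bit units and counts as 4 UTF-8 bytes. This is illustration, not the actual patch.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: bytes needed to convert a UTF-16 string to UTF-8. Identical to
   the UCS-2 version except that a high surrogate followed by a low
   surrogate is one codepoint >= 0x10000, which takes 4 UTF-8 bytes. */
static size_t UTF16_to_UTF8_len(const uint16_t *text)
{
    size_t bytes = 1;  /* room for the terminating '\0' */
    while (*text) {
        uint16_t ch = *text++;
        if (ch >= 0xD800 && ch <= 0xDBFF &&
            *text >= 0xDC00 && *text <= 0xDFFF) {
            ++text;      /* consume the low surrogate */
            bytes += 4;  /* codepoints above 0xFFFF take 4 bytes */
        } else if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;  /* BMP codepoints (and unpaired surrogates) */
        }
    }
    return bytes;
}
```

The conversion function would grow a matching branch that emits the 4-byte UTF-8 sequence for the combined codepoint.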

The ABI-breaking version is probably much the same: use UTF32_to_UTF8 functions and change the API prototypes to take Uint32 * instead of Uint16 *.

@1bsyl
Contributor

1bsyl commented Jun 24, 2022

@slouken
Shouldn't

        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }

be also present in UCS2_to_UTF8_len() ?

1bsyl added a commit that referenced this issue Jun 24, 2022
@slouken
Collaborator

slouken commented Jun 27, 2022

Yep!

@smcv
Contributor

smcv commented Jun 29, 2022

If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.

This seems like a bad reason to break ABI. Environments that care about preserving ABI (like the Steam Runtime) would have to continue to ship the old SONAME in parallel with the new one forever, except the old SONAME would no longer be receiving bug fixes, which seems bad...

If you want UCS-4 support, I'd suggest having a new family of functions like TTF_RenderUCS4_Solid() which convert UCS-4 to UTF-8.

Expanding the UNICODE functions to re-interpret their parameter as UTF-16 instead of UCS-2 also seems a reasonable route to take. This is a compatible change, because "surrogates" (the escape characters used to encode non-BMP codepoints in UTF-16) are technically not considered to be valid UCS-2 anyway.

There are basically three strategies for dealing with Unicode:

  • standardize on UTF-8 and convert everything else to that (GTK, Harfbuzz, modern Linux in general, macOS, Rust)
  • standardize on UTF-16 (or historically UCS-2) and convert everything else to that (Qt, Windows, Java)
  • standardize on UCS-4 and convert everything else to that (rarely done)

SDL_ttf already converts its inputs to UTF-8 and works with UTF-8 internally, so it's basically already using the GTK/Linux/macOS/Rust strategy - which happens to be the one I prefer, because UTF-8 is fully backwards-compatible with ASCII, encodes "mostly-ASCII" text efficiently, is endian-neutral, and is overwhelmingly popular on the web.

UTF-16 combines the disadvantages of UCS-4 with the disadvantages of UTF-8, and I suspect nobody would be using it if Windows and Java hadn't needed an exit strategy from UCS-2.

UCS-4 is superficially appealing because each codepoint is fixed-byte-width, but things like combining characters and emoji modifiers mean that a codepoint isn't the same as a glyph, so counting codepoints is usually not actually the right thing to do.
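The "codepoint isn't a glyph" point is easy to demonstrate. The helper below is made up for illustration (it is not SDL_ttf API): it counts codepoints in a UTF-8 string by counting non-continuation bytes.

```c
#include <stddef.h>

/* Illustrative helper: count Unicode codepoints in a UTF-8 string by
   counting the bytes that are not continuation bytes (10xxxxxx). */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* "é" written as U+0065 'e' plus U+0301 combining acute accent is the
   3-byte UTF-8 string "e\xCC\x81": 2 codepoints, but a single glyph
   on screen - so a fixed-width-codepoint encoding still doesn't give
   you "one array element per visible character". */
```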

@slouken
Collaborator

slouken commented Jun 29, 2022

There are already UCS4 versions of the API functions, and the next ABI break will remove everything but UTF-8 support, since that's what most people are already using and it's trivial to convert from UCS*/UTF* to that.

I'm not opposed to upgrading UCS2 to UTF-16, but otherwise I don't think we'll make any changes here.

@1bsyl
Contributor

1bsyl commented Jun 30, 2022

I don't think we have a UCS4 version of the API! (And I don't think we should add one.)
We've got: UTF8, UNICODE (so UCS2), and TEXT (Latin-1).

@1bsyl
Contributor

1bsyl commented Jun 30, 2022

If someone has a good knowledge of UCS-2/UTF-16, it sounds like a 10-to-20-line patch, just modifying the two functions above?

@smcv
Contributor

smcv commented Jun 30, 2022

the next ABI break will remove everything but UTF-8 support

If that's the case, then perhaps just mark the non-UTF-8 APIs as deprecated and don't otherwise change them? It doesn't seem particularly useful to add UCS-4 or UTF-16 support if it's just going to be removed again.
