
"Unicode" means "UCS-2" #230

Open
icculus opened this issue Jun 23, 2022 · 8 comments
@icculus
Collaborator

icculus commented Jun 23, 2022

It appears that when we say "Unicode" in SDL_ttf, we mean UCS-2 encoding (each character is 16 bits).

UCS-2 covers the Basic Multilingual Plane, which handles an enormous amount of human language, but it does not cover the entirety of Unicode...and while probably no one cares about, I don't know, Klingon, the limitation also means it can't do emoji glyphs, which people care about a lot.

The doesn't-break-ABI solution here is to say the "UNICODE" functions take UTF-16 encoding, which is an extension of UCS-2: most characters are the same, but there are some magic extension bits that make some codepoints take a sequence of two 16-bit values, which gets you access to values > 0xFFFF. This is what win32 ended up doing, around WinXP, so all their Unicode functions stayed the same but could handle the higher values when they showed up in a string. UTF-16 is kind of the worst of all worlds: variable size like UTF-8 but wastes bits like UCS-4...but it gets the job done in a backwards-compatible way.
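For reference, the surrogate-pair mechanism looks like this. This is a minimal sketch, not SDL_ttf code: the `utf16_next` helper name is made up for illustration, plain `stdint.h` types are used instead of SDL's, and it assumes a well-formed input (a high surrogate is always followed by a low surrogate).

```c
#include <stdint.h>

/* Hypothetical helper: decode one codepoint from a UTF-16 sequence and
   advance the cursor. A lone BMP unit is returned as-is (identical to
   UCS-2); a high surrogate (0xD800-0xDBFF) combines with the following
   low surrogate (0xDC00-0xDFFF) into a codepoint above 0xFFFF. */
static uint32_t utf16_next(const uint16_t **src)
{
    uint16_t hi = *(*src)++;
    if (hi >= 0xD800 && hi <= 0xDBFF) {
        uint16_t lo = *(*src)++;  /* assumes a valid low surrogate follows */
        return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    }
    return hi;
}
```

For example, U+1F600 (the grinning-face emoji) is encoded as the pair 0xD83D 0xDE00, which the helper reassembles into 0x1F600.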

If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.

Otherwise, probably look for STR_UNICODE in the source code and see where it gets used, and clean out UCS-2isms.

(If we do nothing, these higher codepoint values are available to apps if they encode their strings in UTF-8, since that can already represent those values.)

@1bsyl
Contributor

1bsyl commented Jun 23, 2022

STR_UNICODE marks the case where the user is passing a UNICODE input parameter (so UCS-2), which gets converted to the internal string format (UTF-8) by calling UCS2_to_UTF8().

So the "doesn't-break-ABI" fix would be to replace UCS2_to_UTF8() with a "UTF16_to_UTF8()" function ...
(and likewise UCS2_to_UTF8_len())

https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2772

/* Gets the number of bytes needed to convert a UCS2 string to UTF-8 */
static size_t UCS2_to_UTF8_len(const Uint16 *text)
{
    size_t bytes = 1;
    while (*text) {
        Uint16 ch = *text++;
        if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2804

/* Convert a UCS-2 string to a UTF-8 string */
static void UCS2_to_UTF8(const Uint16 *src, Uint8 *dst)
{
    SDL_bool swapped = TTF_byteswapped;

    while (*src) {
        Uint16 ch = *src++;
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }
        if (ch <= 0x7F) {
            *dst++ = (Uint8) ch;
        } else if (ch <= 0x7FF) {
            *dst++ = 0xC0 | (Uint8) ((ch >> 6) & 0x1F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        } else {
            *dst++ = 0xE0 | (Uint8) ((ch >> 12) & 0x0F);
            *dst++ = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        }
    }
    *dst = '\0';
}
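To give an idea of what the length half of that replacement might look like, here is a rough sketch of a "UTF16_to_UTF8_len()", assuming the same shape as UCS2_to_UTF8_len() above. It uses plain `stdint.h` types rather than SDL's, and (like the current function) omits the BOM/byte-swapping handling; the only real change is that a surrogate pair consumes two 16-bit units and counts as 4 UTF-8 bytes. This is illustration, not the actual patch.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: bytes needed to convert a UTF-16 string to UTF-8. Identical to
   the UCS-2 version except that a high surrogate followed by a low
   surrogate is one codepoint >= 0x10000, which takes 4 UTF-8 bytes. */
static size_t UTF16_to_UTF8_len(const uint16_t *text)
{
    size_t bytes = 1;  /* room for the terminating '\0' */
    while (*text) {
        uint16_t ch = *text++;
        if (ch >= 0xD800 && ch <= 0xDBFF &&
            *text >= 0xDC00 && *text <= 0xDFFF) {
            ++text;      /* consume the low surrogate */
            bytes += 4;  /* codepoints above 0xFFFF take 4 bytes */
        } else if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;  /* BMP codepoints (and unpaired surrogates) */
        }
    }
    return bytes;
}
```

The conversion function would grow a matching branch that emits the 4-byte UTF-8 sequence for the combined codepoint.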

The ABI-breaking version is probably much the same: use UTF32_to_UTF8 functions and change the API prototypes to take Uint32 * instead of Uint16 *.

@1bsyl
Contributor

1bsyl commented Jun 24, 2022

@slouken
Shouldn't

        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }

be also present in UCS2_to_UTF8_len() ?

1bsyl added a commit that referenced this issue Jun 24, 2022
@slouken
Collaborator

slouken commented Jun 27, 2022

Yep!

@smcv
Contributor

smcv commented Jun 29, 2022

If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.

This seems like a bad reason to break ABI. Environments that care about preserving ABI (like the Steam Runtime) would have to continue to ship the old SONAME in parallel with the new one forever, except the old SONAME would no longer be receiving bug fixes, which seems bad...

If you want UCS-4 support, I'd suggest having a new family of functions like TTF_RenderUCS4_Solid() which convert UCS-4 to UTF-8.

Expanding the UNICODE functions to re-interpret their parameter as UTF-16 instead of UCS-2 also seems a reasonable route to take. This is a compatible change, because "surrogates" (the escape characters used to encode non-BMP codepoints in UTF-16) are technically not considered to be valid UCS-2 anyway.

There are basically three strategies for dealing with Unicode:

  • standardize on UTF-8 and convert everything else to that (GTK, Harfbuzz, modern Linux in general, macOS, Rust)
  • standardize on UTF-16 (or historically UCS-2) and convert everything else to that (Qt, Windows, Java)
  • standardize on UCS-4 and convert everything else to that (rarely done)

SDL_ttf already converts its inputs to UTF-8 and works with UTF-8 internally, so it's basically already using the GTK/Linux/macOS/Rust strategy - which happens to be the one I prefer, because UTF-8 is fully backwards-compatible with ASCII, encodes "mostly-ASCII" text efficiently, is endian-neutral, and is overwhelmingly popular on the web.

UTF-16 combines the disadvantages of UCS-4 with the disadvantages of UTF-8, and I suspect nobody would be using it if Windows and Java hadn't needed an exit strategy from UCS-2.

UCS-4 is superficially appealing because each codepoint is fixed-byte-width, but things like combining characters and emoji modifiers mean that a codepoint isn't the same as a glyph, so counting codepoints is usually not actually the right thing to do.
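The "codepoint isn't a glyph" point is easy to demonstrate. The helper below is made up for illustration (it is not SDL_ttf API): it counts codepoints in a UTF-8 string by counting non-continuation bytes.

```c
#include <stddef.h>

/* Illustrative helper: count Unicode codepoints in a UTF-8 string by
   counting the bytes that are not continuation bytes (10xxxxxx). */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* "é" written as U+0065 'e' plus U+0301 combining acute accent is the
   3-byte UTF-8 string "e\xCC\x81": 2 codepoints, but a single glyph
   on screen - so a fixed-width-codepoint encoding still doesn't give
   you "one array element per visible character". */
```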

@slouken
Collaborator

slouken commented Jun 29, 2022

There are already UCS4 versions of the API functions, and the next ABI break will remove everything but UTF-8 support, since that's what most people are already using and it's trivial to convert from UCS*/UTF* to that.

I'm not opposed to upgrading UCS2 to UTF-16, but otherwise I don't think we'll make any changes here.

@1bsyl
Contributor

1bsyl commented Jun 30, 2022

I don't think we have a UCS4 version of the API! (And I don't think we should add one.)
We've got: UTF8, UNICODE (so UCS2), and TEXT (Latin-1).

@1bsyl
Contributor

1bsyl commented Jun 30, 2022

If someone has a good knowledge of UCS-2/UTF-16, it sounds like a 10-to-20-line patch, just modifying the two functions above?

@smcv
Contributor

smcv commented Jun 30, 2022

the next ABI break will remove everything but UTF-8 support

If that's the case, then perhaps just mark the non-UTF-8 APIs as deprecated and don't otherwise change them? It doesn't seem particularly useful to add UCS-4 or UTF-16 support if it's just going to be removed again.
