"Unicode" means "UCS-2" #230
Comments
STR_UNICODE tells the library that the caller is passing a "UNICODE" input parameter (that is, UCS-2), which is then converted to the internal string format (e.g. UTF-8) by calling UCS2_to_UTF8(). So the "doesn't-break-ABI" fix would be to replace UCS2_to_UTF8() with a "UTF16_to_UTF8()" function ...

https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2772

/* Gets the number of bytes needed to convert a UCS2 string to UTF-8 */
static size_t UCS2_to_UTF8_len(const Uint16 *text)
{
    size_t bytes = 1;
    while (*text) {
        Uint16 ch = *text++;
        if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2804

/* Convert a UCS-2 string to a UTF-8 string */
static void UCS2_to_UTF8(const Uint16 *src, Uint8 *dst)
{
    SDL_bool swapped = TTF_byteswapped;
    while (*src) {
        Uint16 ch = *src++;
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }
        if (ch <= 0x7F) {
            *dst++ = (Uint8) ch;
        } else if (ch <= 0x7FF) {
            *dst++ = 0xC0 | (Uint8) ((ch >> 6) & 0x1F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        } else {
            *dst++ = 0xE0 | (Uint8) ((ch >> 12) & 0x0F);
            *dst++ = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        }
    }
    *dst = '\0';
}

The ABI-breaking option is probably much the same: use UTF32_to_UTF8 functions and change the API prototype to …
@slouken should this block

    if (ch == UNICODE_BOM_NATIVE) {
        swapped = SDL_FALSE;
        continue;
    }
    if (ch == UNICODE_BOM_SWAPPED) {
        swapped = SDL_TRUE;
        continue;
    }
    if (swapped) {
        ch = SDL_Swap16(ch);
    }

also be present in …
Yep!
This seems like a bad reason to break ABI. Environments that care about preserving ABI (like the Steam Runtime) would have to continue to ship the old SONAME in parallel with the new one forever, except the old SONAME would no longer be receiving bug fixes, which seems bad...

If you want UCS-4 support, I'd suggest having a new family of functions like … Expanding the …

There are basically three strategies for dealing with Unicode: …

SDL_ttf already converts its inputs to UTF-8 and works with UTF-8 internally, so it's basically already using the GTK/Linux/macOS/Rust strategy - which happens to be the one I prefer, because UTF-8 is fully backwards-compatible with ASCII, encodes "mostly-ASCII" text efficiently, is endian-neutral, and is overwhelmingly popular on the web. UTF-16 combines the disadvantages of UCS-4 with the disadvantages of UTF-8, and I suspect nobody would be using it if Windows and Java hadn't needed an exit strategy from UCS-2. UCS-4 is superficially appealing because each codepoint is fixed-byte-width, but things like combining characters and emoji modifiers mean that a codepoint isn't the same as a glyph, so counting codepoints is usually not actually the right thing to do.
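To make the "a codepoint isn't the same as a glyph" point concrete, here is a minimal sketch that counts code points in a UTF-8 string. The function name `utf8_codepoints` is made up for this illustration and is not part of SDL_ttf; it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count Unicode code points in a NUL-terminated
   UTF-8 string by counting every byte that is not a continuation byte
   (bit pattern 10xxxxxx). Assumes the input is valid UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

A thumbs-up with a skin-tone modifier (U+1F44D U+1F3FD, the bytes `F0 9F 91 8D F0 9F 8F BD`) counts as two code points here, and "e" followed by a combining acute accent (U+0301) also counts as two, yet a text renderer would normally draw each as a single glyph.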
There are already UCS4 versions of the API functions, and the next ABI break will remove everything but UTF-8 support, since that's what most people are already using and it's trivial to convert from UCS*/UTF* to that. I'm not opposed to upgrading UCS2 to UTF-16, but otherwise I don't think we'll make any changes here.
I don't think we have a UCS4 version of the API! (And I don't think we should add one.)
If someone has good knowledge of UCS-2/UTF-16, it sounds like a 10-20 line patch to the two previous functions?
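For a sense of scale, a rough sketch of what UTF-16-aware versions of those two functions could look like is below. This is not the actual SDL_ttf patch: the names `utf16_next`, `utf16_to_utf8_len` and `utf16_to_utf8` are hypothetical, plain `<stdint.h>` types stand in for `Uint16`/`Uint8`, the BOM and byte-swap handling from `UCS2_to_UTF8()` is left out for brevity, and replacing unpaired surrogates with U+FFFD is an assumption about the desired behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one code point from a NUL-terminated UTF-16
   string (native byte order) and advance *src past it. Unpaired
   surrogates are replaced with U+FFFD. */
static uint32_t utf16_next(const uint16_t **src)
{
    uint32_t ch = *(*src)++;
    if (ch >= 0xD800 && ch <= 0xDBFF) {         /* high surrogate */
        uint32_t lo = **src;
        if (lo >= 0xDC00 && lo <= 0xDFFF) {     /* followed by a low surrogate */
            (*src)++;
            return 0x10000 + ((ch - 0xD800) << 10) + (lo - 0xDC00);
        }
        return 0xFFFD;
    }
    if (ch >= 0xDC00 && ch <= 0xDFFF) {
        return 0xFFFD;                          /* stray low surrogate */
    }
    return ch;
}

/* Bytes needed for the UTF-8 form, including the terminating NUL. */
static size_t utf16_to_utf8_len(const uint16_t *text)
{
    size_t bytes = 1;
    while (*text) {
        uint32_t ch = utf16_next(&text);
        bytes += (ch <= 0x7F) ? 1 : (ch <= 0x7FF) ? 2 : (ch <= 0xFFFF) ? 3 : 4;
    }
    return bytes;
}

/* Convert UTF-16 to UTF-8; dst must hold utf16_to_utf8_len(src) bytes. */
static void utf16_to_utf8(const uint16_t *src, uint8_t *dst)
{
    while (*src) {
        uint32_t ch = utf16_next(&src);
        if (ch <= 0x7F) {
            *dst++ = (uint8_t) ch;
        } else if (ch <= 0x7FF) {
            *dst++ = (uint8_t) (0xC0 | (ch >> 6));
            *dst++ = (uint8_t) (0x80 | (ch & 0x3F));
        } else if (ch <= 0xFFFF) {
            *dst++ = (uint8_t) (0xE0 | (ch >> 12));
            *dst++ = (uint8_t) (0x80 | ((ch >> 6) & 0x3F));
            *dst++ = (uint8_t) (0x80 | (ch & 0x3F));
        } else {
            *dst++ = (uint8_t) (0xF0 | (ch >> 18));
            *dst++ = (uint8_t) (0x80 | ((ch >> 12) & 0x3F));
            *dst++ = (uint8_t) (0x80 | ((ch >> 6) & 0x3F));
            *dst++ = (uint8_t) (0x80 | (ch & 0x3F));
        }
    }
    *dst = '\0';
}
```

The only structural changes relative to the UCS-2 code are the surrogate-pair step and the extra four-byte UTF-8 branch, which is roughly consistent with the "10-20 lines" estimate.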
If that's the case, then perhaps just mark the non-UTF-8 APIs as deprecated and don't otherwise change them? It doesn't seem particularly useful to add UCS-4 or UTF-16 support if it's just going to be removed again.
It appears that when we say "Unicode" in SDL_ttf, we mean UCS-2 encoding (each char is 16 bits).
This covers the Basic Multilingual Plane, which covers an enormous amount of human language, but it does not cover the entirety of Unicode...and while probably no one cares about, I don't know, Klingon, the limitation means it can't do emoji glyphs, which people care about a lot.
The doesn't-break-ABI solution here is to say the "UNICODE" functions take UTF-16 encoding, which is an extension of UCS-2...most characters are the same, but there are some magic extension bits that make some codepoints take a sequence of two 16-bit values, which gets you access to values > 0xFFFF. This is what win32 ended up doing, in WinXP or so, so all their Unicode functions didn't change but could handle the higher values when they show up in a string. UTF-16 is kind of the worst of all worlds: variable size like UTF-8 but wastes bits like UCS-4...but it gets the job done in a backwards compatible way.
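The "magic extension bits" are the surrogate ranges: a codepoint above 0xFFFF has 0x10000 subtracted, and the remaining 20 bits are split across a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF). A small sketch of that arithmetic, using a hypothetical `utf16_encode` helper that is not part of SDL_ttf:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-16 into out[0..1].
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   not encodable (a surrogate value or beyond U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return 0;
    }
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t) cp;                   /* same value as in UCS-2 */
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* top 10 bits */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
    return 2;
}
```

For example, U+1F600 (a grinning-face emoji) becomes the pair 0xD83D 0xDE00, while anything in the Basic Multilingual Plane, such as U+20AC, stays a single unit identical to its UCS-2 value.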
If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.
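With 32-bit input there are no surrogate pairs to decode, so the conversion to the internal UTF-8 form is just the encoder with a four-byte case. The sketch below is an illustration only: `ucs4_to_utf8` is a hypothetical name, `<stdint.h>` types stand in for `Uint32`/`Uint8`, and substituting U+FFFD for invalid values is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a NUL-terminated UCS-4 string to UTF-8.
   If dst is NULL nothing is written. Returns the number of bytes
   required, including the terminating NUL. */
static size_t ucs4_to_utf8(const uint32_t *src, uint8_t *dst)
{
    size_t n = 0;
    for (; *src; ++src) {
        uint32_t ch = *src;
        uint8_t buf[4];
        size_t len, i;

        if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF)) {
            ch = 0xFFFD;    /* not a valid scalar value: substitute U+FFFD */
        }
        if (ch <= 0x7F) {
            buf[0] = (uint8_t) ch;
            len = 1;
        } else if (ch <= 0x7FF) {
            buf[0] = (uint8_t) (0xC0 | (ch >> 6));
            buf[1] = (uint8_t) (0x80 | (ch & 0x3F));
            len = 2;
        } else if (ch <= 0xFFFF) {
            buf[0] = (uint8_t) (0xE0 | (ch >> 12));
            buf[1] = (uint8_t) (0x80 | ((ch >> 6) & 0x3F));
            buf[2] = (uint8_t) (0x80 | (ch & 0x3F));
            len = 3;
        } else {
            buf[0] = (uint8_t) (0xF0 | (ch >> 18));
            buf[1] = (uint8_t) (0x80 | ((ch >> 12) & 0x3F));
            buf[2] = (uint8_t) (0x80 | ((ch >> 6) & 0x3F));
            buf[3] = (uint8_t) (0x80 | (ch & 0x3F));
            len = 4;
        }
        for (i = 0; dst && i < len; ++i) {
            dst[n + i] = buf[i];
        }
        n += len;
    }
    if (dst) {
        dst[n] = '\0';
    }
    return n + 1;
}
```

Calling it once with a NULL destination gives the buffer size, and a second call fills the buffer, which folds the separate length and convert functions into one.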
Otherwise, probably look for STR_UNICODE in the source code and see where it gets used, and clean out UCS-2isms.
(If we do nothing, these higher codepoint values are available to apps if they encode their strings in UTF-8, since that can already represent those values.)