feat: update talks to include video link and description
madcampos committed Sep 24, 2024
1 parent 7a9f9f9 commit f7ba482
Showing 2 changed files with 39 additions and 0 deletions.
1 change: 1 addition & 0 deletions src/content/talks/tojs-forms.md
@@ -8,6 +8,7 @@ eventUrl: https://guild.host/events/torontojs-online-techtalk-z9gqim
isOnline: true
date: 2024-08-29
slides: https://1drv.ms/p/s!AivyfQGK_lAiysNtXGfNfJvYfM8edA?e=aFtBFR
video: https://www.youtube.com/watch?v=1DtWgmMAI60
code: https://github.com/madcampos/dnh
techStack:
- HTML
38 changes: 38 additions & 0 deletions src/content/talks/tojs-unicode.md
@@ -15,3 +15,41 @@ techStack:
- JavaScript
- Unicode strings
---
What happens when you are faced with some of the most mind-bending things in computer history? Does the abyss stare back at you?

## Understanding Unicode

In ancient times, there was no standard for how computers represented text. Then came ASCII, standardized in the 1960s by an American standards committee, which gave computers a common way to represent text. It was good, but it only covered unaccented English letters, digits, and basic punctuation, so every other script was excluded.

Then came the idea to represent each and every script made by humans (including things like [Linear A](https://en.wikipedia.org/wiki/Linear_A) and [Linear B](https://en.wikipedia.org/wiki/Linear_B)). The problem then became the size those "characters" would take. Even the simplest text would explode in size, as we would need more bits to represent everything.

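The ingenious idea was to give every character a number, called a _code point_, and then store those numbers with a _variable length_ encoding such as UTF-8 or UTF-16, so that common characters stay small. Code points are the building blocks Unicode uses to represent a _grapheme_. For now let's think of graphemes as "characters".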
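
To make the _variable length_ idea concrete, here is a minimal sketch that counts how many bytes a few characters take in UTF-8. It assumes a runtime with the standard `TextEncoder` API (modern browsers and Node.js); the sample characters are only illustrative.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// "a", "é", "€" and "😀", written as escapes to avoid copy/paste issues.
const samples = ['a', '\u00E9', '\u20AC', '\u{1F600}'];

// Number of UTF-8 bytes needed for each sample.
const byteLengths = samples.map((sample) => encoder.encode(sample).length);

console.log(byteLengths); // [1, 2, 3, 4]
```

Plain English text stays at one byte per character, while less common characters pay for the extra space only when they are used.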

## Surrogate pairs

UTF-16, one of the Unicode encodings, handles the _variable length_ with a thing called "surrogate pairs". Instead of every code point being a single 16-bit "unit", code points above U+FFFF are split into two units: the first one (the high surrogate) says "hey, here comes a more complex character that needs more space", and the second one (the low surrogate) completes it.
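
As an illustration, the sketch below inspects the two 16-bit units behind a single emoji and decodes them by hand using the UTF-16 formula. The variable names are made up for this example.

```javascript
// "😀" is U+1F600, which does not fit in a single 16-bit unit.
const smiley = '\u{1F600}';

// The two UTF-16 code units that make up the surrogate pair.
const highSurrogate = smiley.charCodeAt(0); // 0xD83D
const lowSurrogate = smiley.charCodeAt(1); // 0xDE00

// The code point as JavaScript decodes it.
const codePoint = smiley.codePointAt(0); // 0x1F600

// Manual decoding of the pair, following the UTF-16 definition.
const decoded = (highSurrogate - 0xD800) * 0x400 + (lowSurrogate - 0xDC00) + 0x10000;

console.log(smiley.length); // 2, because length counts 16-bit units
console.log(decoded === codePoint); // true
```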

## Graphemes

A grapheme can be very philosophical, as it is a "unit of text". It usually means a "character", but it can also be a character _with an extra marking_ like "á", "ô", or "ñ".
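
A small sketch of why this matters in practice: the same visible "á" can be stored either as one code point or as a base letter followed by a combining mark. The variable names are only for this example.

```javascript
// "á" as a single precomposed code point (U+00E1).
const precomposed = '\u00E1';

// "a" followed by U+0301 COMBINING ACUTE ACCENT; it renders the same way.
const combined = 'a\u0301';

console.log(precomposed.length); // 1
console.log(combined.length); // 2
console.log(precomposed === combined); // false, even though both look like "á"
```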

The most interesting part is that graphemes are language dependent: what is a single character in one language may be a combination of characters in another.

## Strings in JavaScript

Strings in JavaScript are sequences of 16-bit UTF-16 code units, not 8-bit "chunks". `length`, indexing, and methods like `charCodeAt` all count those 16-bit units, so a character outside the Basic Multilingual Plane takes up two positions.

That means that when we use methods like [`split`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split) with an empty separator, a surrogate pair gets broken into its two halves.

To avoid this, we can use a very clever trick and _spread_ or _destructure_ the string. Both use the string iterator, which walks code points instead of code units, so surrogate pairs stay together.
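
The difference can be sketched as follows. The sample strings are arbitrary, and the last lines show a caveat: iterating by code point still splits graphemes that are built from several code points.

```javascript
// "hi" followed by "😀" (U+1F600).
const text = 'hi\u{1F600}';

// split('') works on 16-bit units, so the emoji becomes two lone surrogates.
const byCodeUnit = text.split('');

// Spreading uses the string iterator, which yields whole code points.
const byCodePoint = [...text];

// Destructuring relies on the same iterator.
const [first, second, third] = text;

console.log(byCodeUnit.length); // 4
console.log(byCodePoint.length); // 3
console.log(third === '\u{1F600}'); // true

// Caveat: a flag is two code points (regional indicators C + A),
// so spreading still breaks it apart.
const flag = '\u{1F1E8}\u{1F1E6}';
const flagParts = [...flag];

console.log(flagParts.length); // 2
```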

Other tools to deal with strings include:
- [`String.prototype.normalize`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) for converting equivalent sequences of code points into one consistent form.
- [RegExp's Unicode Character Classes](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape) for filtering out specific groups of characters in a more manageable way.
- [`Intl.Segmenter`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) to split strings by grapheme, word, or sentence.
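
Here is a short sketch of the three tools in action. It assumes a runtime where `Intl.Segmenter` is available (recent browsers and Node.js); the sample strings and the `'en'` locale are arbitrary choices for this example.

```javascript
// normalize: "a" + combining accent becomes the precomposed "á" in NFC.
const composed = 'a\u0301'.normalize('NFC');
console.log(composed === '\u00E1'); // true

// normalize + Unicode character class: decompose with NFD, then drop
// every combining mark (\p{M}) to strip accents from "São Tomé".
const stripped = 'S\u00E3o Tom\u00E9'.normalize('NFD').replace(/\p{M}/gu, '');
console.log(stripped); // "Sao Tome"

// Unicode character class on its own: pick out runs of letters.
const letterRuns = 'Hello, world! 123'.match(/\p{L}+/gu);
console.log(letterRuns); // ["Hello", "world"]

// Intl.Segmenter by grapheme: "é" (two code points) and a flag
// (two code points) each come back as a single segment.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(
  graphemeSegmenter.segment('e\u0301\u{1F1E8}\u{1F1E6}'),
  (part) => part.segment
);
console.log(graphemes.length); // 2

// Intl.Segmenter by word: isWordLike filters out spaces and punctuation.
const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });
const words = Array.from(wordSegmenter.segment('Hello, world!'))
  .filter((part) => part.isWordLike)
  .map((part) => part.segment);
console.log(words); // ["Hello", "world"]
```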

## Old Web Text

A while back I wrote a small application that takes any text and maps it to different characters so it looks _visually_ different.

This is done by using the tools mentioned above, so a string of text gets normalized, cleaned, and then mapped to something else.
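
As a rough illustration of that normalize, clean, and map pipeline, here is a minimal sketch. It is not the application's actual code: `toFullwidth` is a hypothetical helper, and mapping to the Unicode fullwidth forms is just one possible target, chosen for this example.

```javascript
// Hypothetical helper, not taken from the application itself.
function toFullwidth(input) {
  const cleaned = input
    .normalize('NFD') // split letters from their accents
    .replace(/\p{M}/gu, '') // drop the combining marks
    .replace(/[^\x20-\x7E]/gu, ''); // keep printable ASCII only

  return [...cleaned]
    .map((char) => {
      // The fullwidth counterpart of a space is U+3000.
      if (char === ' ') return '\u3000';

      // ASCII 0x21..0x7E maps to U+FF01..U+FF5E at a fixed offset.
      return String.fromCodePoint(char.codePointAt(0) + 0xFEE0);
    })
    .join('');
}

const fancy = toFullwidth('Ol\u00E1!');
console.log(fancy); // "Ｏｌａ！"
```

Normalizing first means accented input such as "Olá" still maps cleanly, because the accent is removed before the lookup instead of being dropped together with its letter.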
