We are in Middle-earth, writing about the hobbits' adventures on our website. And, to improve our SEO, we want to generate links between the realms and cities of Middle-earth. But for some reason, the page for Númenor is causing us trouble. After investigating a bit further, we discover that there are two different versions of "Númenor", even though they look identical.

What do you mean? They are exactly the same word! For us, yes, but not for the computer. Since each letter is represented internally by numbers, it is easy to see how we might end up with one combination that represents ú, and another that represents u plus a combining accent. But how do we know which one is which? Doesn’t JavaScript check? Warn us? Normalize it?

And no, it doesn’t. JavaScript is not the kind of language that warns us, because it always assumes we know what we're doing. The good news is that it does provide tools to deal with this mess. But before we get there, let’s understand the magic behind letters, the attempts to organize every written language, and how we normalize these differences. We are going on an adventure!

An Alphabet to Rule Them All

The basic idea is this: every character we type needs to be represented internally by a number (not to be confused with sequences of bytes; that’s a topic for another post). Now, let's think about how many characters exist in the world. Every language that is represented in writing has its own particularities. And each character, to be represented on a computer, relies on a table that maps it to a number.

The problem was that, in the past, no one agreed on which table to use. So each system invented its own. We have ASCII, which is quite famous and represents the basic English characters. We have ISO-8859-1, which adds the accented characters used in Western European languages (like mine, Portuguese, with words such as café or programação), among many others. And if you were to read the word "programação" on a system that assumed a different table from mine, you might see something like "programa��o".
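To make that broken output less mysterious, here is a minimal sketch of one way it can happen: the word is stored with one byte per character, as ISO-8859-1 would do, and then read back as if it were UTF-8. The variable names are only illustrative, and the byte conversion is a shortcut that works because every character in this word sits below U+0100.

```javascript
// "programação", written with precomposed ç (U+00E7) and ã (U+00E3).
const word = "programa\u00E7\u00E3o";

// Encode it the way ISO-8859-1 would: one byte per character.
// This shortcut is only valid because every character here is below U+0100.
const latin1Bytes = Uint8Array.from(word, (ch) => ch.charCodeAt(0));

// Read the same bytes back, wrongly assuming they are UTF-8.
const misread = new TextDecoder("utf-8").decode(latin1Bytes);

// The two accented letters are invalid as UTF-8, so each one
// is replaced by U+FFFD, the replacement character.
console.log(misread); // "programa��o"
```

The bytes never changed. Only the table used to interpret them did.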

One does not simply mix character encodings — Of course this one had to be here (Image from Make a Meme)

That’s where Unicode comes in, with the idea of representing all human text in a uniform way. We can think of Unicode as a large dictionary, where each symbol is represented by a unique number called a code point. The symbol A has its own code point, just like ú, é, ç, or ã. Even an emoji like ✨ has a code point.
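We can peek into that dictionary from JavaScript with codePointAt(). The sketch below prints a few symbols in the usual U+XXXX notation. The helper name toCodePoint is made up for this example and is not a built-in.

```javascript
// Format the first code point of a string as U+XXXX.
// `toCodePoint` is a hypothetical helper, not part of the language.
function toCodePoint(char) {
  const hex = char.codePointAt(0).toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0");
}

console.log(toCodePoint("A"));      // "U+0041"
console.log(toCodePoint("\u00FA")); // ú -> "U+00FA"
console.log(toCodePoint("\u00E7")); // ç -> "U+00E7"
console.log(toCodePoint("\u2728")); // ✨ -> "U+2728"
```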

Númenor Is Always Númenor… Right?

Not always. Even though each symbol has its own code point, there can be more than one way to represent the same symbol. It sounds crazy, and it kind of is. Just think about how many writing systems exist, each with its own complexities, keyboards, and historical baggage. Unicode needs to be flexible enough to support all of them.

English/Japanese keyboard — Unicode, ganbatte! (Photo by Muhammad Nadhif Fajriananda on Unsplash)

In our case, some languages, keyboards, or older systems that were absorbed by Unicode represented the character ú in a precomposed form, where the symbol is already "ready", while others used a decomposed (or combining) form, where the character u is combined with a combining accent (U+0301). Unicode does not impose one choice over the other, so it allows both.
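To see the two spellings side by side, we can list the code points of each one. This is only a sketch: listCodePoints is a hypothetical helper, and the strings use \u escapes so that we know exactly which form we are typing.

```javascript
// List every code point of a string in U+XXXX notation.
// `listCodePoints` is a hypothetical helper for this post.
const listCodePoints = (text) =>
  Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );

const precomposed = "N\u00FAmenor";  // ú as a single code point
const decomposed = "Nu\u0301menor";  // u followed by the combining acute accent

console.log(listCodePoints(precomposed).join(" "));
// U+004E U+00FA U+006D U+0065 U+006E U+006F U+0072

console.log(listCodePoints(decomposed).join(" "));
// U+004E U+0075 U+0301 U+006D U+0065 U+006E U+006F U+0072
```

Seven code points in one case, eight in the other, and the same word on the screen.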

And here comes the big question: how do we know whether a precomposed character is equivalent to a combining one? For that, Unicode defines official rules called canonical equivalence, which determine when two texts, even if represented differently, are the same text.

Forging a Single Form

Now we reach the crucial part. We already understand what went wrong on our website, so it’s time to understand how to fix it. After all, people need to see the wonders of Númenor. As mentioned before, Unicode provides rules to determine whether text x is equal to text y. To apply them, four normalization forms were defined, each one bringing equivalent texts into a single representation.

NFD: Canonical Decomposition -> NFD is the starting point. It breaks characters down into their basic parts. Any character that can be decomposed will be transformed into a base letter plus combining marks. In this case, our ú becomes u + U+0301. This form is useful for text analysis, sorting, or removing accents, but it is less common for storage and display, and comparing an NFD string with one that was not normalized will still fail.
NFC: Canonical Composition -> NFC starts exactly like NFD, by decomposing characters. The difference is that, afterwards, it tries to recompose the text using official precomposed characters whenever they exist. That way, u + U+0301 becomes ú again. This makes the text more predictable for comparison and storage, which is why NFC is the most commonly used form on the web, in APIs, and in databases.
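The two canonical forms can be sketched with the built-in normalize() method. The accent-stripping step at the end is one common approach, not the only one: it assumes that dropping every combining mark (the \p{M} class) is acceptable for your text, which is not true for every language.

```javascript
const realm = "N\u00FAmenor"; // starts with a precomposed ú

const nfd = realm.normalize("NFD"); // ú -> u + U+0301
const nfc = nfd.normalize("NFC");   // u + U+0301 -> ú again

console.log(realm.length, nfd.length, nfc.length); // 7 8 7
console.log(nfc === realm); // true

// Removing accents: decompose, then drop the combining marks.
// \p{M} matches any Unicode "Mark" character and needs the `u` flag.
const unaccented = nfd.replace(/\p{M}/gu, "");
console.log(unaccented); // "Numenor"
```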

At this point, you might wonder: what about the other two forms? Haven’t we covered all the possibilities already? Well, Unicode also defines compatibility equivalence. And it goes a bit beyond linguistic equivalence: it covers characters that stand for the same thing but differ in appearance or behavior.

They're the same picture — “It’s the same character” - Unicode, probably

For example, the ligature ﬁ (U+FB01) is compatible with the two letters f and i. They are not exactly the same characters and, depending on the context, the distinction may matter. Still, there are situations where they can be treated as equivalent, such as search.

NFKD: Compatibility Decomposition -> NFKD applies both canonical and compatibility decomposition, and then stops. It breaks everything down: equivalent and compatible characters alike, without any attempt to recompose them afterwards. With NFKD, ﬁ becomes f + i and ú becomes u + U+0301, fully decomposed.
NFKC: Compatibility Composition -> NFKC starts exactly like NFKD, decomposing both equivalent and compatible characters. The difference is that it then recomposes the result using canonical composition only. With NFKC, ﬁ still becomes f + i, because no canonical rule puts the ligature back together, while u + U+0301 is recomposed into ú.
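Here is a small sketch that runs all four forms over the same input. The word is an assumption chosen for illustration: "ﬁancé" typed with the ﬁ ligature (U+FB01), which only has a compatibility decomposition, and a precomposed é, which has a canonical one.

```javascript
// "ﬁancé": ligature ﬁ (U+FB01) + a + n + c + precomposed é (U+00E9)
const word = "\uFB01anc\u00E9";

for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  console.log(form, word.normalize(form).length);
}
// NFC 5, NFD 6, NFKC 6, NFKD 7

// Canonical forms keep the ligature; compatibility forms split it.
console.log(word.normalize("NFC") === word);               // true
console.log(word.normalize("NFD") === "\uFB01ance\u0301"); // true
console.log(word.normalize("NFKC") === "fianc\u00E9");     // true
console.log(word.normalize("NFKD") === "fiance\u0301");    // true
```

Note that the compatibility forms lose information: once ﬁ has become f + i, there is no way to know that it was ever a ligature.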

JavaScript Doesn’t Understand Magic, Only Numbers

JavaScript, as usual, accepts everything without complaining. And although it is famous for performing many implicit conversions without our consent (such as the not-at-all-confusing type coercion), when it comes to normalizing characters, it assumes that this decision belongs to the application’s logic. And the application doesn’t always make it.

const composition = "N\u00FAmenor"
const decomposition = "Nu\u0301menor"

console.log(composition === decomposition) // false
console.log(composition.length === decomposition.length) // false

Fortunately, JavaScript provides a safe way to handle this. The String object defines the normalize() method, which allows us to choose the normalization form according to our needs:

const compositionNFC = composition.normalize("NFC")
const decompositionNFC = decomposition.normalize() // defaults to NFC

console.log(compositionNFC === decompositionNFC) // true
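For our website, one possible approach is to normalize names at the boundaries, both when a page is registered and when it is looked up, so the two spellings can never drift apart. This is only a sketch: realmPages, addRealmPage, findRealmPage, and the URL are hypothetical names invented for this example.

```javascript
// Hypothetical registry of realm pages, keyed by the NFC form of the name.
const realmPages = new Map();

function addRealmPage(name, url) {
  realmPages.set(name.normalize("NFC"), url);
}

function findRealmPage(name) {
  return realmPages.get(name.normalize("NFC"));
}

// The page was registered from decomposed input...
addRealmPage("Nu\u0301menor", "/realms/numenor");

// ...and is still found when the link uses the precomposed spelling.
console.log(findRealmPage("N\u00FAmenor"));   // "/realms/numenor"
console.log(findRealmPage("Nu\u0301menor"));  // "/realms/numenor"
console.log(realmPages.size);                 // 1
```

Choosing NFC here is a judgment call. Any form works as long as every entry point uses the same one.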

Why Should We Care?

Knowledge is power! But more than that, this is a niche topic that becomes very relevant when security is involved. One important issue related to the difference between how we see a word and how a computer interprets it is spoofing. Consider the following:

gandalf  // g -> LATIN SMALL LETTER G (U+0067)
ɡandalf  // ɡ -> LATIN SMALL LETTER SCRIPT G (U+0261)

Depending on the font, these words can look exactly the same. For the computer, they are not. And this goes far beyond normalization, since normalization only deals with equivalent characters. In the case of gandalf, the difference is not in normalization; it is in Unicode itself. Unicode does not treat these characters as equivalent, and it shouldn’t. They are, in fact, different letters.
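We can confirm that no normalization form helps here. The sketch below also includes a deliberately crude check, hasNonAscii, which is a hypothetical helper: it flags any character outside ASCII, so it would flag Númenor too, and real spoofing defenses need far more than this.

```javascript
const real = "gandalf";
const lookalike = "\u0261andalf"; // starts with U+0261, LATIN SMALL LETTER SCRIPT G

// Compare the two names under every normalization form.
const forms = ["NFC", "NFD", "NFKC", "NFKD"];
const matches = forms.map(
  (form) => lookalike.normalize(form) === real.normalize(form)
);
console.log(matches.includes(true)); // false: no form makes them equal

// Crude, hypothetical check: does the text contain anything outside ASCII?
const hasNonAscii = (text) => /[^\x00-\x7F]/.test(text);

console.log(hasNonAscii(real));      // false
console.log(hasNonAscii(lookalike)); // true
```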

And So It Was Written

Just like in Middle-earth, names matter. And how they are written matters too. On our website, Númenor never stopped being Númenor for those who read it. But for the computer, everything depends on how that name was written.

Unicode exists to represent all human writing, with all its complexity. JavaScript, on the other hand, handles strings in a literal way. Understanding this difference and normalizing text consciously is an essential part of building reliable systems.

From My Reading List 🗞️

Here are a few things that caught my attention this month:

When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings - a similar topic to this post, with more in-depth explanations (but less Tolkien)
Kim7s Knowledge Hub - my mentor's website, for everything Kubernetes related
Rethinking “Pixel Perfect” Web Design - wholeheartedly agree with everything here
Why ‘boring’ VS Code keeps winning - as someone who has tried multiple code editors and keeps coming back to VS Code, yes!
Get Bored! - and speaking of boring: let's do it (my therapist also recommends it)

And the road goes ever on. See you in the next post 😁
