
The BSD Cafe Journal: Your Daily Brew of BSD & Open Source News


Why “caffè” may not be “caffè”

[Image: A sample of various Unicode scripts]

Every time I think I finally “got” Unicode, this rabbit hole kicks me in the back. 😆 Still, IMHO it is important to recognise that when you move data and files between operating systems and programs, you’re better off knowing some of the pitfalls. So I’m sharing something I experienced when I transferred a file to my FreeBSD play-around notebook. Let’s assume a little story…

It’s late afternoon, and you and some friends are sitting together playing around with BSD. A friend using another operating system collects coffee orders in a little text file so that no one is forgotten when he heads to the barista on the other side of the street. He sends the file to you, so at the next meeting you already know your friends’ preferences. You take a look at who wants a caffè:

armin@freebsd:/tmp $ cat orders2.txt

Mauro: cappuccino
Armin: caffè doppio
Anna: caffè shakerato
Stefano: caffè
Franz: latte macchiato
Francesca: cappuccino
Carla: latte macchiato

So you do a quick grep, only to be very surprised!

armin@freebsd:/tmp $ grep -i caffè orders2.txt
armin@freebsd:/tmp $

Wait, WAT? Why is there no output? There is more than one line with caffè in the file! Well, you have just met one of the many aspects of Unicode. This time it’s called “normalization”. 😎

Many characters can be represented by more than one form. Take the innocent “è” from the example above. Unicode has a precomposed character called LATIN SMALL LETTER E WITH GRAVE. But you could also use a regular LATIN SMALL LETTER E and combine it with the character COMBINING GRAVE ACCENT. Both render identically and “look” the same, but the underlying code points aren’t.
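You can watch this happen in any Python 3 shell; the unicodedata module from the standard library does the normalization. A minimal sketch (the variable names are mine):

import unicodedata

composed = "caff\u00e8"     # LATIN SMALL LETTER E WITH GRAVE
decomposed = "caffe\u0300"  # LATIN SMALL LETTER E + COMBINING GRAVE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True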

Let’s see a line with the word “caffè” as hex dump using the first approach (LATIN SMALL LETTER A WITH GRAVE):

\u0063\u0061\u0066\u0066\u00E8\u000A
c a f f è (LF)

Now let’s do the same for the same line using the second approach:

\u0063\u0061\u0066\u0066\u0065\u0300\u000A
c a f f è (LF)

And there you have it: the latter is a byte longer, and the two lines do not match even though both are encoded as UTF-8 and the rendered character looks the same!
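A quick check in Python confirms the byte count: UTF-8 encodes the precomposed U+00E8 in two bytes, while the decomposed pair U+0065 U+0300 takes one plus two bytes.

print(len("caff\u00e8".encode("utf-8")))   # 6 bytes
print(len("caffe\u0300".encode("utf-8")))  # 7 bytes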

So obviously just using UTF-8 is not enough, and you might encounter files using the second approach. Just to make matters more complicated, there are actually four forms of Unicode normalization out there. 😆

  • NFD: canonical decomposition
  • NFC: canonical decomposition, followed by canonical composition
  • NFKD: compatibility decomposition
  • NFKC: compatibility decomposition, followed by canonical composition

For the sake of brevity (and your nerves) we’ll only deal with the first two here; I refer you to the Wikipedia article on Unicode equivalence for the rest.
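Just to give you a taste of the K variants anyway: compatibility decomposition additionally folds “look-alike” characters such as ligatures. A small Python sketch:

import unicodedata

s = "\ufb03"  # LATIN SMALL LIGATURE FFI
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, s))

NFC and NFD leave the ligature alone; NFKC and NFKD replace it with the three plain letters “ffi”.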

Normal Form C (NFC) is the most widely used normal form and is also the form the W3C recommends for HTML, XML, and JavaScript. Technically speaking, text encoded in Latin-1 (or Windows codepage 1252), for example, maps to Normal Form C, since an “à” or the umlaut “Ö” is a single character there and is not built from combining characters. Windows and the .NET framework also store Unicode strings in Normal Form C. This does not mean that NFD can be ignored: the Mac OS X file system, for example, stores file names in a variant of NFD, as the Unicode standard was not yet finalized when OS X was designed. When two applications share Unicode data but normalize it differently, errors and data loss can result.
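A common defensive habit at such boundaries is to normalize before you compare. A minimal sketch in Python (the helper name nfc_equal is my own invention):

import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    # Bring both strings to NFC, then compare.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc_equal("caff\u00e8", "caffe\u0300"))  # True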

So how do we get from one form to another on one of the BSD operating systems (or on Linux)? Well, the Unicode Consortium provides a toolset called ICU, the International Components for Unicode. The documentation lives at https://unicode-org.github.io/icu/ and you can install it on FreeBSD using the command

pkg install icu

After the installation completes, you have a new command-line tool called uconv (not to be confused with iconv, which serves a similar purpose). Using uconv you can transcode the normal forms into each other, as well as do a lot of other encoding stuff (this tool is a rabbit hole in itself 😎).

Similar to iconv, you can specify a “from” and a “to” encoding for the input. But you can also specify so-called “transliterations” that will be applied to the input. In its simplest form such a transliteration is written as SOURCE-TARGET, where the source “any” stands for any input character. This is how I produced the hex dump above, using the transliteration 'any-hex':

armin@freebsd:/tmp$ echo caffè | uconv -x 'any-hex'
\u0063\u0061\u0066\u0066\u00E8\u000A
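If you don’t have ICU at hand, you can fake this particular transliteration in a few lines of Python; the helper below is my own sketch and only matches uconv’s output for characters in the Basic Multilingual Plane:

def any_hex(s: str) -> str:
    # Mimic uconv's 'any-hex' output for BMP characters only.
    return "".join(f"\\u{ord(c):04X}" for c in s)

print(any_hex("caff\u00e8\n"))  # \u0063\u0061\u0066\u0066\u00E8\u000A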

Instead of hex codes you can also output the Unicode code point names to see the difference between the two forms:

armin@freebsd:/tmp$ echo Caffè | uconv -f utf-8 -t utf-8 -x 'any-nfd' | uconv -f utf-8 -x 'any-name'
\N{LATIN CAPITAL LETTER C}\N{LATIN SMALL LETTER A}\N{LATIN SMALL LETTER F}\N{LATIN SMALL LETTER F}\N{LATIN SMALL LETTER E}\N{COMBINING GRAVE ACCENT}\N{<control-000A>}

Now let’s try this for the NFC form:

armin@freebsd:/tmp$ echo Caffè | uconv -f utf-8 -t utf-8 -x 'any-nfc' | uconv -f utf-8 -x 'any-name'
\N{LATIN CAPITAL LETTER C}\N{LATIN SMALL LETTER A}\N{LATIN SMALL LETTER F}\N{LATIN SMALL LETTER F}\N{LATIN SMALL LETTER E WITH GRAVE}\N{<control-000A>}

You can also convert from one normal form to another by using a transliteration like 'any-nfd' to convert the input to Normal Form D (D for decomposed, e.g. LATIN SMALL LETTER E + COMBINING GRAVE ACCENT) or 'any-nfc' for Normal Form C.
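And with that, the coffee story can have a happy ending: normalize before you search. A little Python sketch (assuming the script file itself is saved in NFC, so the literal “caffè” is composed):

import unicodedata

with open("orders2.txt", encoding="utf-8") as f:
    text = unicodedata.normalize("NFC", f.read())

for line in text.splitlines():
    if "caffè" in line:
        print(line)

Now Armin, Anna, and Stefano all show up, no matter which normal form their lines were typed in.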

If you want to learn about building your own transliterations, there’s a tutorial at https://unicode-org.github.io/icu/userguide/transforms/general/rules.html that shows the enormous capabilities of uconv.

Using the 'name' transliteration you can easily discern the various Sigmas here (I’m using sed to split the output into multiple lines):

armin@freebsd:/tmp $ echo '∑𝛴Σ' | uconv -x 'any-name' | sed -e 's/\\N/\n/g'
{N-ARY SUMMATION}
{MATHEMATICAL ITALIC CAPITAL SIGMA}
{GREEK CAPITAL LETTER SIGMA}
{<control-000A>}

If you want to get the Unicode character from its name, there are several ways depending on the programming language you prefer. Here is an example in Python that prints the German umlaut “Ö”:

python -c 'import unicodedata; print(unicodedata.lookup(u"LATIN CAPITAL LETTER O WITH DIAERESIS"))'
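The reverse direction works as well: unicodedata.name() maps a character back to its Unicode name, much like the 'any-name' transliteration above.

import unicodedata

for c in "\u00e8\u0065\u0300":
    print(f"U+{ord(c):04X}", unicodedata.name(c))

This prints U+00E8 LATIN SMALL LETTER E WITH GRAVE, then U+0065 LATIN SMALL LETTER E and U+0300 COMBINING GRAVE ACCENT.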

The uconv utility is a mighty tool, and every modern programming language (see the Python example above) has libraries and modules for handling Unicode data. The world gets connected, but not in ASCII. 😎

One comment
Tomoaki AOKI

Of course, normalization of Unicode is a headache.
But for CJK writers, the larger headache is Han unification.

Multiple characters that have different meanings (the large CJK character sets are ideographic, while alphabets are phonetic symbols) are mapped to a single code point just because “it looks almost the same” to non-CJK standardization members.

And in Japan, some characters with exactly the same meaning and pronunciation have a variety of glyphs (called 異体字 [itai-ji]), especially in people’s names and names of places.


