
OmniFont: Every Unicode character in a single font

Peter Richards Founder

First of all, why don’t we have lots of fonts that cover all the Unicode characters? Well, there are a few reasons.

  1. Technical Limitations: The TrueType font format uses 16-bit glyph indexes, so a single font file only has room for about 65k glyphs. Unicode has a little over 1 million possible code points, but only about 150k of them are assigned. That is still more than twice as many as a single font file can fit.
  2. Practicality: Even if a font file could fit all ~150k characters, it would be roughly 100 MB or more in size. Having a user download 100 MB of font data when they visit your website, only to use a small percentage of it, would not be very practical.
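
The arithmetic behind the first point can be checked in a few lines. This is only a back-of-the-envelope sketch: the 150,000 figure is the rough estimate from the text, not an exact count, and the constant and function names are made up for illustration.

```javascript
// Glyph indexes in a TrueType/OpenType font are 16-bit values,
// so one font file tops out at 65,535 glyphs.
const MAX_GLYPHS_PER_FONT = 0xFFFF;

// The Unicode code space is 17 planes of 65,536 code points each.
const UNICODE_CODE_SPACE = 17 * 0x10000; // 1,114,112

// Rough number of assigned characters, taken from the text above (an estimate).
const ASSIGNED_APPROX = 150000;

// Smallest number of font files needed if every character gets its own glyph.
function minFontFilesNeeded(characterCount, maxPerFile = MAX_GLYPHS_PER_FONT) {
  return Math.ceil(characterCount / maxPerFile);
}

console.log(UNICODE_CODE_SPACE);
console.log(minFontFilesNeeded(ASSIGNED_APPROX));
```

Real fonts usually need more glyphs than characters (ligatures, alternate forms), so this is a lower bound rather than a realistic file count.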

But despite those hurdles, it would still be useful to have all the characters covered by a single font. Maybe your website has user-generated content that could contain a character your website’s normal fonts won’t render. So how can we make it work?

Subsetting

Browsers support a technique called font subsetting. Say a font has Latin and Cyrillic characters in it. Your users who read a Latin-script language probably won’t need the Cyrillic characters, and vice versa. So we can split the font file into two smaller files, one for Latin and one for Cyrillic. By default, when we put these two font definitions in our code, the browser will still download both files, but we can tell the browser to download a file only if it actually needs a character from it by using the unicode-range descriptor. For example, ASCII (characters 0-127) would be marked as unicode-range: U+00-7F;. This tells the browser to download this font file only if it needs to render a character in this range, and otherwise not to bother.
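
As a rough illustration, the two font definitions could be generated like this. The family name, file paths, and the buildFontFaceRule helper are hypothetical, and the ranges are simplified to one Unicode block each. The output is ordinary @font-face CSS.

```javascript
// Build the text of one @font-face rule that is limited to a unicode-range.
function buildFontFaceRule(family, url, unicodeRange) {
  return [
    "@font-face {",
    `  font-family: "${family}";`,
    `  src: url("${url}") format("woff2");`,
    `  unicode-range: ${unicodeRange};`,
    "}",
  ].join("\n");
}

// Hypothetical family name and file paths, one rule per subset file.
const latinCyrillicCss = [
  buildFontFaceRule("Example Sans", "/fonts/example-latin.woff2", "U+0000-00FF"),
  buildFontFaceRule("Example Sans", "/fonts/example-cyrillic.woff2", "U+0400-04FF"),
].join("\n\n");

console.log(latinCyrillicCss);
```

Both rules share one font-family name, so the page’s CSS refers to a single font while the browser picks the file per character.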

Ok, great. So we can just subset each language script and that should work, right?

Well, it’s not an ideal solution. Languages don’t really fit neatly into unicode ranges. Here is an example of what an Arabic subset would look like:

unicode-range: U+0600-06FF, U+0750-077F, U+0870-088E, 
U+0890-0891, U+0898-08E1, U+08E3-08FF,
U+200C-200E, U+2010-2011, U+204F, U+2E41, 
U+FB50-FDFF, U+FE70-FE74, U+FE76-FEFC, 
U+102E0-102FB, U+10E60-10E7E, U+10EFD-10EFF, 
U+1EE00-1EE03, U+1EE05-1EE1F, U+1EE21-1EE22, 
U+1EE24, U+1EE27, U+1EE29-1EE32, U+1EE34-1EE37, 
U+1EE39, U+1EE3B, U+1EE42, U+1EE47, U+1EE49, 
U+1EE4B, U+1EE4D-1EE4F, U+1EE51-1EE52, U+1EE54, 
U+1EE57, U+1EE59, U+1EE5B, U+1EE5D, U+1EE5F, 
U+1EE61-1EE62, U+1EE64, U+1EE67-1EE6A, 
U+1EE6C-1EE72, U+1EE74-1EE77, U+1EE79-1EE7C, 
U+1EE7E, U+1EE80-1EE89, U+1EE8B-1EE9B, U+1EEA1-1EEA3, 
U+1EEA5-1EEA9, U+1EEAB-1EEBB, U+1EEF0-1EEF1;

Languages that use Chinese characters are often subset into 100 or more smaller files, as there are nearly 100k Chinese characters in Unicode, the majority of which aren’t commonly used. For example, including all the Noto Sans (a Google font family with large coverage of Unicode) subsets from Google Fonts will add nearly 1 MB of CSS to your website. To put that in context, this page is roughly 100 KB in size, so the CSS alone would be about 10 times the size of the page. Having to transmit all of this codified language information to the browser on each request, just to cover the occasional missing character, is just not practical. Can we maybe ignore the language specifics? Or at least defer them until they’re needed?

Instead, we can subset not by language but purely by number. Unicode ranges are denoted in hexadecimal. Two hexadecimal digits (e.g. FF) can represent 256 values, so we can split our font into 256-character chunks. File 0 would have the unicode range U+00?? (or U+0000-00FF), and file 1 would have U+01?? (or U+0100-01FF). Now we don’t need to transmit all that language information; we only need to transmit the number of groups, and we can easily calculate the unicode ranges client-side. So when the browser runs the code, it’s just setting up “If you need a character from U+0000-U+00FF, then grab file 0 from here. U+0100-U+01FF, get file 1, etc.”
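
A minimal sketch of that client-side calculation might look like the following. The function names and the URL pattern are assumptions made for illustration; the article does not show OmniFont’s actual code.

```javascript
// Number of code points per font file, as described above.
const GROUP_SIZE = 256;

// Which group (file number) a code point belongs to.
function groupIndex(codePoint) {
  return Math.floor(codePoint / GROUP_SIZE);
}

// The unicode-range string for a group, e.g. group 1 -> "U+0100-01FF".
function groupUnicodeRange(group) {
  const start = group * GROUP_SIZE;
  const end = start + GROUP_SIZE - 1;
  const hex = (n) => n.toString(16).toUpperCase().padStart(4, "0");
  return `U+${hex(start)}-${hex(end)}`;
}

// Hypothetical URL scheme: one file per group, named by its number.
function groupFileUrl(group) {
  return `/fonts/group-${group}.woff2`;
}

console.log(groupIndex(0x00E4), groupUnicodeRange(0), groupFileUrl(0));
console.log(groupIndex(0x0308), groupUnicodeRange(3), groupFileUrl(3));
```

Because every range is derived from the group number alone, the only data the page has to ship is how many groups exist (or which ones have files).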

Characters != Glyphs

But there’s a slight problem with our solution so far. A Unicode character (or codepoint) is not the same as a visual glyph. In English they mostly line up one-to-one, but quite a lot of languages use combining characters, where multiple characters combine into a single visual character. An example would be ä. This can be represented either as ä - 00E4 - LATIN SMALL LETTER A WITH DIAERESIS, which is 1 character and 1 glyph, OR as a - 0061 - LATIN SMALL LETTER A followed by ◌̈ - 0308 - COMBINING DIAERESIS. Visually they are identical, but behind the scenes they’re a bit different. Currently our font will handle the first case fine: 00E4 is in group 0. But the second case combines a character from group 0 with one from group 3. The browser will happily display characters from different font files next to each other, but it won’t compose a single glyph from components in separate files. So our second case will display a missing glyph icon (unless the user has a system font that can display it).
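
The two spellings of ä can be inspected with the standard String.prototype.normalize method. This short sketch (helper names are made up) shows which 256-character groups each form touches.

```javascript
// ä as one precomposed code point (U+00E4).
const precomposed = "\u00E4";
// Canonical decomposition: "a" (U+0061) followed by COMBINING DIAERESIS (U+0308).
const decomposed = precomposed.normalize("NFD");

// List the code points in a string.
function codePointsOf(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// 256-character group of a code point (same split as described earlier).
function chunkOf(codePoint) {
  return codePoint >> 8;
}

const precomposedGroups = codePointsOf(precomposed).map(chunkOf); // one entry: group 0
const decomposedGroups = codePointsOf(decomposed).map(chunkOf);   // two entries: groups 0 and 3

console.log(precomposedGroups, decomposedGroups);
```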

Here’s where sidestepping that language information has caught up to us…

Trojan Horse

What if we use a Trojan horse approach? That ◌̈ - COMBINING DIAERESIS is not used in Chinese, or Thai, or a whole slew of other scripts; it’s only used in Latin, Cyrillic, and Greek. So what if we include it in files like group 0, which covers basic ASCII/Latin, so that when the browser requests that file, it gets both components in a single file should it need them? Sadly, it’s not quite that easy. Because we originally said that group 0 would have the unicode range U+00??, the browser will assume the file contains only that range. We could change the unicode range to U+00??, U+0308, but now we’re quickly creeping back to what we were trying to avoid from the start: having to transmit lots of language information on every request. This also has the downside that if the browser needs ◌̈ - U+0308, it will download ALL subsets which have U+0308 in their definition.
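
The downside in that last sentence can be shown with a toy model. This is not a real browser’s font-matching algorithm, only a simplified stand-in for the behavior described above, and the subset list and file names are invented.

```javascript
// Each subset lists inclusive [start, end] ranges, like a unicode-range descriptor.
// Hypothetical declarations where U+0308 was added to two extra subsets.
const declaredSubsets = [
  { file: "group-0", ranges: [[0x0000, 0x00FF], [0x0308, 0x0308]] },
  { file: "group-3", ranges: [[0x0300, 0x03FF]] },
  { file: "group-4", ranges: [[0x0400, 0x04FF], [0x0308, 0x0308]] },
];

// Files whose declared ranges contain at least one code point of the text.
function filesTriggeredBy(text, subsets) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  return subsets
    .filter((subset) =>
      codePoints.some((cp) =>
        subset.ranges.some(([lo, hi]) => cp >= lo && cp <= hi)))
    .map((subset) => subset.file);
}

console.log(filesTriggeredBy("a", declaredSubsets));
console.log(filesTriggeredBy("\u0308", declaredSubsets));
```

In this model a plain "a" matches one file, while a lone U+0308 matches every subset that lists it, including the Cyrillic one the page may never otherwise need.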

But we can trick the browser one more time. We still include ◌̈ - U+0308 in our group 0 font file, so now the unavoidable language information is in the font file. Rather than sending it all to the browser each time, we lie to the browser with an overly simplistic view (e.g. group 0 contains only U+0000-U+00FF 😉) because this is easy to generate. If the browser needs to render from group 0, it downloads that file, which has the 256 characters we said it would, PLUS whatever we’ve added (in this example ◌̈ - 0308 - COMBINING DIAERESIS), but the browser ignores that extra character because it’s not part of U+00??.

So we set up a callback function that tells the browser “Every time you download one of these font files, tell me about it, I want to do something”. The browser downloads the group 0 file and then tells us it downloaded group 0. At that point we go back to the definitions we set up originally (the one that linked U+00?? to font file 0) and remove group 0 from them. Then we put it right back, exactly the same, EXCEPT without the unicode-range descriptor. We only needed unicode-range so that the browser would download just the files it actually needs. Since it has just downloaded this file, we know it needs a character from it. So after we remove unicode-range and put up the new definition, without the conditions, the browser sees this new definition and says:

Oh, there’s a new font file here that I need to download. Wait a second…I actually already have that font file! No need to download it again. But wait…I don’t know what’s inside this font file because there’s no unicode-range listed. No problem, I’ll just parse it and take a look.

And when the browser parses it, it will find our 256 characters AND it will find our additional ◌̈ - 0308 character.
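
One way to sketch this swap is with the browser’s CSS Font Loading API (the FontFace constructor, its loaded promise, and document.fonts). The article does not say which mechanism OmniFont actually uses, so treat this as an assumption. The function names, family name, and URL are hypothetical, and the font set and constructor are passed in as parameters so the logic is easy to follow outside a browser.

```javascript
// unicode-range string for a 256-character group, e.g. 0 -> "U+0000-00FF".
function rangeForGroup(group) {
  const hex = (n) => n.toString(16).toUpperCase().padStart(4, "0");
  return `U+${hex(group * 256)}-${hex(group * 256 + 255)}`;
}

// Step 2: once the file has been downloaded, drop the restricted definition
// and add the same file back with no unicodeRange, so every glyph in it is usable.
function replaceWithUnrestricted(fontSet, FontFaceCtor, loadedFace, family, url) {
  fontSet.delete(loadedFace);
  const unrestricted = new FontFaceCtor(family, `url(${url})`);
  fontSet.add(unrestricted);
  return unrestricted;
}

// Step 1: register the "simplified" definition that claims only 256 code points.
function registerGroup(fontSet, FontFaceCtor, family, group, url) {
  const face = new FontFaceCtor(family, `url(${url})`, {
    unicodeRange: rangeForGroup(group),
  });
  fontSet.add(face);
  // `loaded` settles only after the browser has actually fetched the file.
  face.loaded
    .then(() => replaceWithUnrestricted(fontSet, FontFaceCtor, face, family, url))
    .catch(() => { /* download failed: keep the restricted definition */ });
  return face;
}

// In a browser (hypothetical family name and URL):
// registerGroup(document.fonts, FontFace, "Example Fallback", 0, "/fonts/group-0.woff2");
```

The second definition points at the same URL, so the expectation described above is that the browser serves it from cache instead of downloading it again.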

This way we’ve been able to avoid transmitting all the complexity of Japanese, Arabic, Thai, Indic scripts, and hundreds of others to the browser. Of course all that complexity still exists, we can’t avoid it completely, but instead of sending it ALL to the browser on every request, it’s lying in wait spread across hundreds of font files. And should the browser ever need even one character from one of those files, only at that point do we need to say “Wait a second, there’s a bit more to it than that”.

So we use an initially oversimplified representation in order to keep the payload size small, and easy to generate. We use this to determine what the browser actually needs. Then we take away the simplistic representation, and instead make the browser parse the files we already know it needs.

So this allows us to have a JavaScript file that is ~600 bytes gzipped and can render all the assigned Unicode characters, plus many common combining glyphs too. OmniFont also has pretty good coverage of multi-character grapheme clusters. Each of these isn’t technically a single Unicode character, but rather a ligature of a multi-character sequence. For example, 👨‍👩‍👧‍👦 is actually 7 characters: 👨, 👩, 👧, and 👦, with a zero-width joiner between each pair. Many of these are handled correctly by OmniFont, though not all, and we’re continually fixing missing ones when we find them.
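
The family emoji makes a handy test case. The sketch below spells it out with escapes and counts its code points. It also shows which 256-character groups the sequence touches, which is only an illustration of why such sequences are tricky, not a description of how OmniFont packages them.

```javascript
// MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Iterating a string with Array.from walks code points, not UTF-16 units.
const familyCodePoints = Array.from(familyEmoji, (ch) => ch.codePointAt(0));

// 256-character groups the sequence draws from.
const familyGroups = new Set(familyCodePoints.map((cp) => cp >> 8));

console.log(familyCodePoints.length); // 7 code points
console.log(familyEmoji.length);      // 11 UTF-16 code units
console.log([...familyGroups].map((g) => g.toString(16))); // groups 0x1f4 and 0x20
```

The four people all sit in group 0x1F4, but the zero-width joiner (U+200D) lives in group 0x20, so the sequence spans two files under a purely numeric split.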

Pobody’s Nerfect

That is not to say that OmniFont will render every piece of text possible. There are things like Zalgo text, which stacks lots of combining characters to achieve a creepy effect. There are also the Unicode Private Use Areas, which are chunks designated for anything not officially part of Unicode, like musical notation or fonts for constructed languages. Since these codepoints have infinite possible uses, they’re not included in OmniFont.
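
For reference, the three Private Use Areas are fixed ranges, so a loader could skip them with a simple check. The helper name below is made up, and whether OmniFont filters them this way is an assumption.

```javascript
// The three Private Use Areas defined by Unicode (inclusive ranges).
const PRIVATE_USE_RANGES = [
  [0xE000, 0xF8FF],     // Private Use Area (Basic Multilingual Plane)
  [0xF0000, 0xFFFFD],   // Supplementary Private Use Area-A (plane 15)
  [0x100000, 0x10FFFD], // Supplementary Private Use Area-B (plane 16)
];

// True if the code point has no standard meaning and is left to private agreement.
function isPrivateUse(codePoint) {
  return PRIVATE_USE_RANGES.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

console.log(isPrivateUse(0xE000), isPrivateUse(0x0041));
```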

Some other downsides of this strategy are:

  1. Multiple Downloads: Not all scripts will neatly line up with the 256-character division of OmniFont. For example, Chinese will typically need to download a handful of files, as the commonly used characters are spread more widely across the Unicode range. This is why OmniFont is intended as a fallback font rather than a primary font. If you know you’re going to be displaying Chinese text, you’re better off using a Chinese-specific font and subset. However, OmniFont font files are quite small (median size is 41 KB) and served from a CDN, so loading should be quite fast.
  2. Style Mismatch: In order to cover all the Unicode characters, OmniFont pulls from over 150 different font files. We pick fonts that have:
    1. a free, open license, like the SIL Open Font License.
    2. a sans serif style similar to Noto Sans (the primarily used font family).
    3. the ability to cover larger ranges of characters. If a range of characters is currently using two fonts, then replacing those two fonts with a single font that can cover the whole range will improve style continuity and keep font spacing/position consistent too.
     However, there are a few places where sans serif fonts weren’t available (though only in the most rarely used parts of Unicode), so there are a few serif style characters. Still, most users will prefer to see a character of a slightly different style than no character at all.
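
To get a feel for the Multiple Downloads point, this small sketch counts how many 256-character groups a piece of text touches. The helper name is made up, and the count is only a proxy for downloads, since each group may need at most one file.

```javascript
// Sorted list of the 256-character groups needed to render a string.
function groupsNeededFor(text) {
  const groups = new Set();
  for (const ch of text) {
    groups.add(ch.codePointAt(0) >> 8);
  }
  return [...groups].sort((a, b) => a - b);
}

// English text stays inside group 0.
const englishGroups = groupsNeededFor("Hello, world");

// 你好世界 ("Hello, world" in Chinese): U+4F60, U+597D, U+4E16, U+754C.
const chineseGroups = groupsNeededFor("\u4F60\u597D\u4E16\u754C");

console.log(englishGroups.length); // 1
console.log(chineseGroups.length); // 4
```

Four common characters already land in four different groups, which is why a dedicated Chinese font with frequency-based subsets is the better primary choice.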

Conclusion

You can test out OmniFont here, and the code to use it is there too. It’s free to use.
