Does anybody know of a simple script-detection algorithm (or heuristic) for font switching?

This came up with one of my clients. Suppose you have a guest book on your web site, and seven visitors left you the following inspiring messages:

  1. すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについて平等である。
  2. 人人生而自由,在尊严和权利上一律平等。
  3. Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
  4. 人人生而自由,在尊嚴和權利上一律平等。
  5. Alle Menschen sind frei und gleich an Würde und Rechten geboren.
  6. Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.
  7. 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.

(It looks like your visitors all read the Universal Declaration of Human Rights, courtesy of the UDHR in Unicode project.)

Now suppose you are so touched that you want to lay out all seven messages in a PDF file and print it as a booklet.  You have a beautiful layout template and a set of complementary fonts: Latin script, Japanese, Korean, simplified Chinese, traditional Chinese, and Greek script.

Which font do you apply to each message?  More importantly, is there a simple heuristic by which software can make the choice?

Now, being clever humans, we can figure out which font to apply to which text.  It’s Japanese for #1 above, simplified Chinese for #2, Latin script for the Indonesian #3 and the German #5, traditional Chinese for #4, Greek script for #6, and Korean for #7. It seems like there ought to be some simple heuristic by which software can come to the same decision.

Thus, I’m looking for software that can accept a block of character text and choose which of a set of language-specific fonts provided by the formatting template is most appropriate for formatting that text. Note that we don’t really need to recognise the language per se, just which fonts provide adequate character coverage. Let’s accept that short texts or pathological cases will be harder to get right, and just try to get full paragraphs of conventional text correct.

Font fallback is one way to solve this problem. Put the font list in priority order, and test each character against the character coverage of each font on the list, in order. Once the system finds a font that’s capable of rendering the character, it uses that font.  Some environments (e.g. Microsoft’s IMLangFontLink2 interface) provide a mechanism for doing that. (See “How to display a string without those ugly boxes” by Raymond Chen, and “Font substitution and linking #2” by Michael Kaplan.)  But assume we can’t use IMLang. (My client, in fact, needs to make font switches upstream of a commercial PDF generation library with poor multilingual support.)

In any case, font fallback doesn’t really solve the problem of choosing which font to try first. It just gives you a better outcome when the first font was a bad choice.  And it might select a high-priority Japanese font for Latin-script text, or a simplified Chinese font for traditional Chinese text.
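To make the mechanism and its weakness concrete, here is a minimal Python sketch of per-character font fallback. The font names and the coverage ranges are made up for illustration; a real system would read coverage from each font's character map. The function name `fallback_runs` is likewise just an assumption of this sketch.

```python
# Hypothetical coverage tables: font name -> list of (first, last) code point ranges.
# These are illustrative only; real coverage would come from each font's cmap table.
COVERAGE = {
    "JapaneseFont": [(0x0020, 0x007E), (0x3000, 0x30FF), (0x4E00, 0x9FFF)],
    "LatinFont": [(0x0020, 0x024F)],
    "GreekFont": [(0x0020, 0x007E), (0x0370, 0x03FF)],
}


def covers(font, ch):
    """Return True if the (hypothetical) font has a glyph for ch."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COVERAGE[font])


def fallback_runs(text, priority):
    """Split text into (font, run) pairs.

    Each character gets the first font in `priority` that covers it,
    or None if no font on the list covers it. Adjacent characters
    that land on the same font are merged into one run.
    """
    runs = []
    for ch in text:
        font = next((f for f in priority if covers(f, ch)), None)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs


# With the Japanese font first, German text is set mostly in the Japanese
# font, and only the "ü" falls through to the Latin font.
print(fallback_runs("Würde", ["JapaneseFont", "LatinFont"]))
```

With the Japanese font at the head of the list, the word comes out as three runs, with only the "ü" in the Latin font. That is exactly the mixed-font result we want to avoid.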

Another approach is language recognition.  Use a library like the TextCat language guesser by Gertjan van Noord. It works by measuring the frequencies of “n-grams” (short sequences of adjacent characters) and suggesting likely languages. You could then add a mapping from each language to its preferred font in your formatting template. (If you want TextCat, it looks like SpamAssassin’s copy of TextCat may be more current than van Noord’s copy.)  There are some 20 other language guessers out there, too.
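For a feel of what an n-gram guesser does, here is a toy Python sketch that ranks character n-grams by frequency and compares rank orders. It is not TextCat's code, and I am only assuming it resembles what such guessers do internally. The function names, the profile size, and the one-sentence "training" samples are all assumptions of this sketch; a real guesser is trained on far more text per language.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (lengths 1..n_max), most frequent first."""
    counts = Counter()
    padded = " " + text.lower() + " "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then by the n-gram itself so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, training):
    """Return the key of `training` whose sample text has the closest n-gram profile."""
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical, far-too-small training data, taken from the guest book above.
TRAINING = {
    "de": "Alle Menschen sind frei und gleich an Würde und Rechten geboren.",
    "id": "Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.",
}
```

The language-to-font mapping would then be a plain dictionary lookup on the returned key. Note that even this toy version needs a stored profile per language, which is where the storage cost comes from.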

Both font fallback and language recognition seem a bit heavyweight: lots of code, lots of storage, and lots of computation to make what seems to be a simple decision.

Surely there’s a simpler and more direct approach: define character coverage sets for classes of font, and see how the character codes in the text overlap with those sets.  For instance, the Multilingual European Subset MES-1, documented in CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1, is probably a good coverage set for Latin script.  The kana and kanji parts of JIS Levels 1 and 2 might be a good coverage set for Japanese.
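Here is roughly what I have in mind, as a minimal Python sketch. The coverage sets below are coarse Unicode ranges that I made up as stand-ins; they are not MES-1 or the JIS repertoires. The function names and the tie-breaking rule (prefer the smaller, more specific set) are also assumptions of this sketch.

```python
# Hypothetical, deliberately coarse coverage sets as (first, last) code point ranges.
# They stand in for real repertoires such as MES-1 or JIS Levels 1 and 2.
COVERAGE_SETS = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Greek": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
    "Japanese": [(0x3040, 0x30FF), (0x4E00, 0x9FFF)],
    "Korean": [(0x1100, 0x11FF), (0xAC00, 0xD7A3)],
    "Chinese": [(0x4E00, 0x9FFF)],
}


def in_set(cp, ranges):
    """Return True if code point cp falls in any of the ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def set_size(name):
    """Number of code points in a coverage set; smaller means more specific."""
    return sum(hi - lo + 1 for lo, hi in COVERAGE_SETS[name])


def coverage_scores(text):
    """Fraction of the text's letters that each coverage set contains.

    Whitespace, digits, and punctuation are ignored, since every font
    class is expected to handle them.
    """
    letters = [ord(ch) for ch in text if ch.isalpha()]
    if not letters:
        return {}
    return {
        name: sum(in_set(cp, ranges) for cp in letters) / len(letters)
        for name, ranges in COVERAGE_SETS.items()
    }


def best_font_class(text):
    """Pick the best-covering set; on a tie, prefer the smaller set."""
    scores = coverage_scores(text)
    if not scores:
        return None
    return max(scores, key=lambda name: (scores[name], -set_size(name)))


print(best_font_class("모든 인간은 태어날 때부터 자유로우며"))
```

Coverage overlap alone separates Latin, Greek, Korean, and Japanese in the guest book messages. It cannot separate simplified from traditional Chinese, because both draw on the same range of ideographs.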

It might also be helpful to define sets of telltale characters, which strongly indicate one particular script. For instance, kana characters would be a strong indicator of Japanese as opposed to Chinese text. Similarly, there are probably sets of simplified and traditional Chinese character pairs which could be good telltales.
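A telltale test could be as small as the following Python sketch. The two character lists are a tiny hand-picked sample of simplified and traditional pairs, chosen for illustration only; a usable table would be much larger. The function name, the order of the checks, and the returned labels are assumptions of this sketch.

```python
# Hypothetical telltale tables: a few simplified forms and their traditional
# counterparts, in the same order. Illustrative only, nowhere near complete.
SIMPLIFIED_TELLTALES = set("严权们这说时会对发为国")
TRADITIONAL_TELLTALES = set("嚴權們這說時會對發為國")


def is_kana(ch):
    return "\u3040" <= ch <= "\u30ff"  # hiragana and katakana blocks


def is_hangul(ch):
    return "\uac00" <= ch <= "\ud7a3"  # precomposed hangul syllables


def is_han(ch):
    return "\u4e00" <= ch <= "\u9fff"  # main CJK ideograph block


def is_greek(ch):
    return "\u0370" <= ch <= "\u03ff"  # Greek and Coptic block


def detect_font_class(text):
    """Guess a font class from telltale characters, strongest indicators first."""
    if any(is_kana(ch) for ch in text):
        return "Japanese"  # kana rule out Chinese even when kanji dominate
    if any(is_hangul(ch) for ch in text):
        return "Korean"
    if any(is_han(ch) for ch in text):
        simplified = sum(ch in SIMPLIFIED_TELLTALES for ch in text)
        traditional = sum(ch in TRADITIONAL_TELLTALES for ch in text)
        # With no telltales either way, defaulting to simplified is arbitrary.
        if traditional > simplified:
            return "Traditional Chinese"
        return "Simplified Chinese"
    if any(is_greek(ch) for ch in text):
        return "Greek"
    return "Latin"


print(detect_font_class("人人生而自由,在尊嚴和權利上一律平等。"))
```

Checking kana before the Chinese telltales matters, because some of the listed simplified forms are also used in Japanese. A single stray character can flip these `any` tests, so a sturdier version would require a minimum count or proportion before committing to a script.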

Someone has done this as a guide for human readers. See the Wikipedia Language recognition chart (which, though, is at a language rather than script granularity).

Has someone done this as a software module, which is available for licensing?  I can’t find it.  It would seem apropos for the International Components for Unicode (ICU), but I don’t see it there.

Please post a comment or contact me if you know of such a module, or if I’ve completely missed some factor.  Thanks!