Archived Posts from this Category

Twanguages: a Language Census of Twitter

Posted on 30 Jul 2009 | Tagged as: language, meetings and conferences, multilingual, Unicode, web technology

What “twanguage” do you “tweet”?  Twitter, the buzzing conversation of brief web and SMS messages, exploded into wide use in 2009. But just how wide?  To how many countries has it spread?  And into which languages?  I’m aiming to find out.

I’ve started a project named “Twanguages”, a language census of a sample of Twitter’s global traffic. I’m curious: which are the top languages? Are #hashtags localised? How does language correlate with location?  And which Unicode character is the most rarely used?

I’ll be presenting our results at the 33rd Internationalization and Unicode Conference (IUC33), held in San Jose, California, on October 14-16, 2009. I have a place cleared for a Twanguages project page, and I’ll post interim results there as they become available (right now it’s only a placeholder). Stay tuned!

Continue Reading »

Unicode Doggerel (“I am the very model of a modern text encoding scheme”)

Posted on 10 Sep 2008 | Tagged as: culture, meetings and conferences, Unicode

This was fun!  On Tuesday night (9 Sept 2008), the 32nd Internationalization and Unicode Conference held an evening reception celebrating the 20th anniversary of Unicode. I wrote this song for it in a creative fury on Monday afternoon. Several other people contributed amusing tributes too, and the whole evening was very funny and enjoyable. My song appeared to be well received. I hope you enjoy it.

Unicode Doggerel

(Sung to the tune of “I am the Very Model of a Modern Major General”, by Gilbert and Sullivan.)

I am the very model of a modern text encoding scheme,
a million scalars, astral planes, and UTFs like six-&-teen,
and UAX and UTR, collation, bidi, properties,
I am the very model of a modern text encoding scheme.

Continue Reading »

Simple script-detection algorithm for font switching?

Posted on 26 Aug 2008 | Tagged as: i18n, language, multilingual, software engineering, Unicode

Does anybody know of a simple script-detection algorithm (or heuristic) for font switching?

This came up with one of my clients. Suppose you have a guest book on your web site, and seven visitors left you the following inspiring messages:

  1. すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについて平等である。
  2. 人人生而自由,在尊严和权利上一律平等。
  3. Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
  4. 人人生而自由,在尊嚴和權利上一律平等。
  5. Alle Menschen sind frei und gleich an Würde und Rechten geboren.
  6. Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.
  7. 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.

(It looks like your visitors all read the Universal Declaration of Human Rights courtesy of the UDHR in Unicode project).

Now suppose you are so touched that you want to lay out all seven messages in a PDF file, and print it out as a booklet.  You have a beautiful layout template, and various complementary fonts: Latin script, Japanese, Korean, Simplified Chinese, Traditional Chinese, and Greek script.

Which font do you apply to each message?  More importantly, is there a simple heuristic by which software can make the choice? (More after the jump.)
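
To make the question concrete, here is one rough starting point, sketched in Python under stated assumptions: classify each character by the prefix of its Unicode character name, then pick a font by priority (any kana suggests Japanese, any Hangul suggests Korean, and so on). The function names, the font keys, and the two tiny hint sets for telling Simplified from Traditional Chinese are all hypothetical, invented for illustration. This is a sketch, not a complete or authoritative algorithm.

```python
import unicodedata
from collections import Counter

# Tiny, illustrative hint sets (assumption: a real system would need a far
# larger table, or language metadata supplied along with the text).
SIMPLIFIED_HINTS = set("严权国们这")
TRADITIONAL_HINTS = set("嚴權國們這")


def script_of(ch):
    """Map one character to a coarse script label via its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if name.startswith("GREEK"):
        return "greek"
    if name.startswith("LATIN"):
        return "latin"
    return None  # punctuation, digits, spaces, and scripts not handled here


def guess_font_key(text):
    """Return a hypothetical font key for a whole message."""
    counts = Counter(s for s in map(script_of, text) if s)
    if counts["kana"]:
        return "japanese"  # kana appears only in Japanese text
    if counts["hangul"]:
        return "korean"
    if counts["han"]:
        chars = set(text)
        simplified = len(chars & SIMPLIFIED_HINTS)
        traditional = len(chars & TRADITIONAL_HINTS)
        if simplified > traditional:
            return "simplified-chinese"
        if traditional > simplified:
            return "traditional-chinese"
        return "chinese-undetermined"  # Han text with no usable hints
    if counts["greek"]:
        return "greek"
    return "latin"  # fallback for Latin script and anything unrecognised
```

On the seven guest-book messages above, this sketch picks the expected font for each, but it is easy to fool: Japanese text written entirely in kanji would be treated as Chinese, and Han text containing none of the hint characters stays undetermined.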

Continue Reading »

“Web 2.0 goes to Babel: Multilingual websites and user-supplied content” at IUC32

Posted on 31 May 2008 | Tagged as: CMS, i18n, Joomla, meetings and conferences, multilingual, Unicode

Oh right, I forgot to mention: I’ve been accepted to present at the 32nd Internationalization & Unicode Conference this September! I’m presenting on a topic I’ve been working on lately: multilingual websites. The title is: Web 2.0 goes to Babel: Multilingual websites and user-supplied content.

Continue Reading »
