Unicode

Archived Posts from this Category

Top Posts: StackOverflow “Django headache with simple non-ascii string”

Posted by on 31 May 2013 | Tagged as: Python, software engineering, Unicode

I post on various forums around the net, and a few of my posts there get some very gratifying kudos. I’ve been a diligent contributor to StackOverflow, the Q-and-A site for software developers. I’m in the top 15% of contributors overall, and one of the top 25 answerers of Unicode-related questions. Here’s my top-voted answer in StackOverflow so far.

The question, “Django headache with simple non-ascii string”, was asked by user Ezequiel in January 2010. In abbreviated form, it was:

Continue Reading »

I18n and Unicode conference, and tutorial on multilingual Drupal and Joomla web sites, complete

Posted by on 31 Oct 2012 | Tagged as: CMS, culture, digital preservation, drupal, i18n, Joomla, meetings and conferences, multilingual, Unicode

Another stimulating Internationalization and Unicode Conference (IUC36) finished up last week (October 22-24, 2012). As usual it was rich with interesting people, stimulating subjects, and inspiration. My tutorial, Building multilingual websites in Drupal 7 and Joomla! 2.5, was well-attended and seemed to go well. My final paper and slides are posted at the preceding link.

Continue Reading »

“Building multilingual websites in Drupal 7 and Joomla 2.5” (IUC36 tutorial)

Posted by on 30 Jun 2012 | Tagged as: drupal, Joomla, meetings and conferences, Unicode

I’m delighted to be asked, once again, to present a tutorial on Building multilingual websites in Drupal 7 and Joomla! 2.5, at the 36th Internationalization and Unicode Conference (IUC36), this October in Santa Clara, California, USA.

Continue Reading »

Choosing between UTF-8 and UTF-16: which has the better bytes-per-character ratio?

Posted by on 31 Dec 2010 | Tagged as: i18n, language, software engineering, Unicode

Software engineers sometimes are called on to specify which encoding a text file format should use.  These days, the top contenders for encoding are UTF-8 and UTF-16, both based on the Unicode Standard. One factor (amongst several, and perhaps not the most compelling) in choosing between them is storage efficiency: the number of bytes per character, or amount of storage per unit of text. If a given text takes a kilobyte of storage in UTF-8 and twice that in UTF-16, that’s a difference, which may be meaningful.
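To make "bytes per character" concrete, here is a minimal sketch of one way to measure it. The function name `bytes_per_character` is my own invention for illustration, and I am assuming that "character" means a Unicode code point and that the UTF-16 figure should exclude the byte order mark.

```python
def bytes_per_character(text):
    """Return (utf8_ratio, utf16_ratio): encoded bytes per code point.

    Assumptions: a "character" is one Unicode code point, and UTF-16
    is measured without a byte order mark (hence "utf-16-le").
    """
    if not text:
        raise ValueError("text must contain at least one character")
    n_chars = len(text)                        # code points in the string
    utf8_bytes = len(text.encode("utf-8"))     # 1 to 4 bytes per code point
    utf16_bytes = len(text.encode("utf-16-le"))  # 2 or 4 bytes per code point
    return utf8_bytes / n_chars, utf16_bytes / n_chars
```

Pure ASCII text comes out at 1 byte per character in UTF-8 and 2 in UTF-16, while text made up only of Han ideographs from the Basic Multilingual Plane comes out at 3 and 2. Real documents mix scripts, markup, and whitespace, so the ratio for an actual file format has to be measured on representative samples.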

I recently looked for quantitative data about space efficiency of UTF-8 and UTF-16, and couldn’t find very much. Engineering discussions about storage efficiency are better informed by quantitative data than by opinion and supposition. I want to give one morsel of quantitative data more visibility, and clarify this issue. Continue Reading »

Building Multilingual Websites in Drupal and Joomla, at IUC34

Posted by on 31 Oct 2010 | Tagged as: CMS, drupal, i18n, Joomla, meetings and conferences, multilingual, Unicode

Once again I was fortunate enough to be invited to present at this year’s Internationalization and Unicode Conference (IUC). I have posted the paper and slides for my tutorial, Building Multilingual Websites in Drupal and Joomla, over on jdlh.com.

This was my abstract, from the Unicode conference program for my talk: Continue Reading »

11 Django gotchas

Posted by on 31 Aug 2010 | Tagged as: Python, robobait, software engineering, Unicode, web technology

This post has been a long time in the making. A year ago, I started work on my Twanguages code. This was code to analyse a corpus of Twitter messages, and try to discern patterns about language use, geography, and character encoding.  I decided to use the Django web framework and the Python language for the Twanguages analysis code.  I know Python, but I was learning Django for the first time.

Django is really, really marvellous.  When I tried this expression, and got the Python array of records I was expecting,

q2 = TwUser.objects.annotate(ntweets=Count('twstatus')).filter(ntweets__gt=1)

I wrote in my log, “I think I just fell in love. Power and concision in a tool, awesome.”

But Django gave me fits.  It has its share of quirks to trap the unwary novice. Eventually I began writing notes about “Django gotchas” in my log.  Some of them are Django being difficult, or inadequate. Some are me being a clueless novice, and Django not rescuing me from my folly. But all of them were obstacles.  I share them in the hopes of helping another Django novice.

Here are my Django gotchas.  They are ranked from the most distressing to most benign. They apply to Django 1.1, the current version at the time. (As of August 2010, the current version is 1.2.1.) A couple of gotchas were addressed by Django 1.2, so I moved them down to a section of their own. The rest presumably still apply to Django 1.2, but I haven’t gone back to check.

  1. API fails unhelpfully. I wrote a simple query expression like:
    S2 = models.TwStatus.objects.get( key )

    I got a lot of weird errors, e.g. “ValueError: too many values to unpack” (where key is a string) and “TypeError: ‘long’ object is not iterable” (where key is a long). I had made a mistake, of course; the call to get() should have a keyword argument of “id__exact” or the like, not a positional argument. The correct spelling is this:

    S2 = models.TwStatus.objects.get( id__exact=key )

    The gotcha is that Django’s .get() isn’t written defensively. It isn’t very robust to programmer errors. Instead of checking parameters and giving clear error messages, it lets bad parameters through, only to have them fail obscurely deep in the framework. If defensive programming of the Django API would slow it down too much in production, I’d love to have a debug mode I could invoke during development. Continue Reading »
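As an illustration of the kind of development-time check described in gotcha 1, here is a minimal sketch of a wrapper that rejects a positional key before the call ever reaches the framework. The names `strict_get` and `FakeManager` are hypothetical; neither is part of Django, and `FakeManager` only stands in for a real manager such as `models.TwStatus.objects` so that the sketch runs on its own.

```python
def strict_get(manager, *args, **lookups):
    """Hypothetical helper: call manager.get() with keyword lookups only.

    Fails early, with a clear message, when a key is passed positionally.
    """
    if args:
        raise TypeError(
            "strict_get() accepts keyword lookups only, e.g. id__exact=key; "
            "got positional argument(s) %r" % (args,)
        )
    if not lookups:
        raise TypeError("strict_get() needs at least one keyword lookup")
    return manager.get(**lookups)


class FakeManager:
    """Stand-in for a Django manager, for demonstration only."""

    def get(self, **lookups):
        # A real manager would query the database; this echoes the lookups.
        return lookups
```

With this in place, `strict_get(FakeManager(), 12345)` raises a TypeError that names the mistake, while `strict_get(FakeManager(), id__exact=12345)` passes the lookup through unchanged.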

How about an IMLIG (Internationalisation, Multilingual, Localisation Interest Group) for Vancouver?

Posted by on 27 Jun 2010 | Tagged as: i18n, language, meetings and conferences, multilingual, Unicode, Vancouver, web technology

There is a lot of international, multilingual, and multicultural activity in Vancouver. Also, there’s a thriving tech scene. But there’s no place for the people in the intersection of those two circles — those interested in and working on the internationalisation, localisation, and multilingual aspects of technology projects — to get together and share ideas. I think there ought to be.

And I’ll even propose a name: IMLIG1604, the I18n L10n M11l I6t G3p (Internationalisation, Localisation, and Multilingual Interest Group) for North America’s 604 area code. If you can decipher the title, you’re in the club!

Continue Reading »

Twanguages: a Language Census of Twitter

Posted by on 30 Jul 2009 | Tagged as: language, meetings and conferences, multilingual, Unicode, web technology

What “twanguage” do you “tweet”?  Twitter, the buzzing conversation of brief web and SMS messages, exploded into wide use in 2009. But just how wide?  To how many countries has it spread?  And into which languages?  I’m aiming to find out.

I’ve started a project named “Twanguages”, a language census of a sample of Twitter’s global traffic. I’m curious: which are the top languages? Are #hashtags localised? How does language correlate with location?  And which Unicode character is the most rarely used?

I’ll be  presenting our results at the 33rd Internationalization and Unicode Conference (IUC33), held in San Jose, California, on October 14-16, 2009. I have a place cleared for a Twanguages project page, and I’ll post interim results there as they become available (right now it’s only a placeholder). Stay tuned!

Continue Reading »

Unicode Doggerel (“I am the very model of a modern text encoding scheme”)

Posted by on 10 Sep 2008 | Tagged as: culture, meetings and conferences, Unicode

This was fun!  On Tuesday night (9 Sept 2008), there was a tribute to the 20th anniversary of Unicode at the 32nd Internationalization and Unicode Conference.  I wrote this in a creative fury on Monday afternoon. The anniversary celebration, held at an evening reception, was very funny and enjoyable, and several other people contributed amusing tributes. My song appeared to be well-received. I hope you enjoy it.

Unicode Doggerel

(Sung to the tune of  “I am the Very Model of a Modern Major General”, by Gilbert and Sullivan.)

I am the very model of a modern text encoding scheme,
a million scalars, astral planes, and UTFs like six-&-teen,
and UAX and UTR, collation, bidi, properties,
I am the very model of a modern text encoding scheme.

Continue Reading »

Simple script-detection algorithm for font switching?

Posted by on 26 Aug 2008 | Tagged as: i18n, language, multilingual, software engineering, Unicode

Does anybody know of a simple script-detection algorithm (or heuristic) for font switching?

This came up with one of my clients. Suppose you have a guest book on your web site, and seven visitors left you the following inspiring messages:

  1. すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについて平等である。
  2. 人人生而自由,在尊严和权利上一律平等。
  3. Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
  4. 人人生而自由,在尊嚴和權利上一律平等。
  5. Alle Menschen sind frei und gleich an Würde und Rechten geboren.
  6. Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.
  7. 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.

(It looks like your visitors all read the Universal Declaration of Human Rights, courtesy of the UDHR in Unicode project.)

Now suppose you are so touched that you want to lay out all seven messages in a PDF file, and print it out as a booklet.  You have a beautiful layout template, and various complementary fonts: Latin script, Japanese, Korean, Simplified Chinese, Traditional Chinese, and Greek script.

Which font do you apply to each message?  More importantly, is there a simple heuristic by which software can make the choice? (More after the jump.)
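To show the flavour of heuristic in question, here is a rough sketch that classifies a message by the scripts of its characters, using the character names in Python's standard `unicodedata` module. The prefix table, the function names, and the priority order (kana first, then Hangul, then Han, then Greek versus Latin) are my own assumptions for illustration, not an established algorithm. It makes no attempt to tell Simplified from Traditional Chinese, because both use the same Han script; that choice needs more information than the script alone.

```python
import unicodedata

# Assumed mapping from Unicode character-name prefixes to script labels.
SCRIPT_PREFIXES = {
    "HIRAGANA": "kana",
    "KATAKANA": "kana",
    "HANGUL": "hangul",
    "CJK UNIFIED IDEOGRAPH": "han",
    "GREEK": "greek",
    "LATIN": "latin",
}


def count_scripts(text):
    """Count the characters of each known script in text."""
    counts = {}
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] = counts.get(script, 0) + 1
                break
    return counts


def guess_font_class(text):
    """Guess which font family a message needs (a heuristic, not a proof)."""
    counts = count_scripts(text)
    if counts.get("kana"):
        return "Japanese"   # kana mixed with Han suggests Japanese
    if counts.get("hangul"):
        return "Korean"
    if counts.get("han"):
        return "Chinese"    # Simplified vs Traditional is left undecided
    if counts.get("greek", 0) > counts.get("latin", 0):
        return "Greek"
    if counts.get("latin"):
        return "Latin"
    return "unknown"
```

A sketch like this will misjudge some inputs, for example a Japanese message written entirely in kanji, which it would label as Chinese.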

Continue Reading »
