Choosing between UTF-8 and UTF-16: which has the better bytes-per-character ratio?

Posted by on 31 Dec 2010 | Tagged as: i18n, language, software engineering, Unicode

Software engineers are sometimes called on to specify which encoding a text file format should use.  These days, the top contenders are UTF-8 and UTF-16, both based on the Unicode Standard. One factor (amongst several, and perhaps not the most compelling) in choosing between them is storage efficiency: the number of bytes per character, or amount of storage per unit of text. If a given text takes a kilobyte of storage in UTF-8 and twice that in UTF-16, that’s a twofold difference, which may be meaningful.
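
To make "bytes per character" concrete, here is a minimal sketch in Python 3. It is an illustration only, not the measurement method used in the post: the function name bytes_per_character and the two sample strings are assumptions, and "character" here means a Unicode code point.

```python
def bytes_per_character(text):
    """Return (UTF-8, UTF-16) storage cost in bytes per code point."""
    count = len(text)  # Python 3 strings are sequences of code points
    utf8_size = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no 2-byte byte-order mark is added.
    utf16_size = len(text.encode("utf-16-le"))
    return utf8_size / count, utf16_size / count


# Hypothetical sample texts, chosen only for illustration.
samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
}

for name, text in samples.items():
    utf8_ratio, utf16_ratio = bytes_per_character(text)
    print(f"{name}: UTF-8 {utf8_ratio:.2f}, UTF-16 {utf16_ratio:.2f} bytes per character")
```

Short strings like these only show the extremes; a fair comparison needs a realistic corpus, including any markup or ASCII punctuation mixed into the text.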

I recently looked for quantitative data about the space efficiency of UTF-8 and UTF-16, and couldn’t find very much. Engineering discussions about storage efficiency are better informed by quantitative data than by opinion and supposition. I want to give one morsel of quantitative data more visibility, and clarify this issue.

Continue Reading »

Transparent PNG images in PHP: imagesavealpha() versus imagecolortransparent()

Posted by on 30 Nov 2010 | Tagged as: robobait, web technology

Are you using PHP (or libGD) to generate PNG images?  Are you having problems getting your text anti-aliased, and also having your “transparent” colour recognised as transparent?  Well, I had that problem too.  libGD, the component which PHP uses to handle image operations, gives you a choice: you can have anti-aliased text, or a designated colour as transparent… but not both.  Here’s why, and what you can do about it.

Continue Reading »

Building Multilingual Websites in Drupal and Joomla, at IUC34

Posted by on 31 Oct 2010 | Tagged as: CMS, drupal, i18n, Joomla, meetings and conferences, multilingual, Unicode

Once again I was fortunate enough to be invited to present at this year’s Internationalization and Unicode Conference (IUC). I have posted the paper and slides for my tutorial, Building Multilingual Websites in Drupal and Joomla, over on jdlh.com.

This was my abstract, from the Unicode conference program for my talk:

Continue Reading »

Spencer Conrad Boise, 1924-2010: what a wonderful man

Posted by on 30 Sep 2010 | Tagged as: personal

[Photo: Uncle Spencer, getting ready to marry Jim (to Ducky), 1998]

Our dear uncle Spence, Spencer Conrad Boise, passed away earlier this month.

This is a personal blog post: a chance to say, for the record, a little of what this wonderful man meant to me, and to us. Those of you who are here for i18n or technology stuff, we’ll get back to that next time.

When our family gathered to pay our respects, it was one of those bittersweet occasions: wonderful to see everyone, terrible about the reason; a sad event, but then, when a good person leads a long and happy life and has a quick and peaceful death, a happier outcome than others we could imagine.

When I was a young child growing up near Cincinnati, Ohio, Spencer and his inseparable wife Marty, and their five children, lived right next door to us. It gave our families a close bond, which persists to this day. In fact, soon after I met Ducky and we took our first road trip together, it was to visit Spence and Marty, and some of their kids, in Santa Barbara.

Continue Reading »

11 Django gotchas

Posted by on 31 Aug 2010 | Tagged as: Python, robobait, software engineering, Unicode, web technology

This post has been a long time in the making. A year ago, I started work on my Twanguages code. This was code to analyse a corpus of Twitter messages, and try to discern patterns about language use, geography, and character encoding.  I decided to use the Django web framework and the Python language for the Twanguages analysis code.  I know Python, but I was learning Django for the first time.

Django is really, really marvellous.  When I tried this expression, and got back the QuerySet of records I was expecting,

q2 = TwUser.objects.annotate(ntweets=Count('twstatus')).filter(ntweets__gt=1)

I wrote in my log, “I think I just fell in love. Power and concision in a tool, awesome.”

But Django gave me fits.  It has its share of quirks to trap the unwary novice. Eventually I began writing notes about “Django gotchas” in my log.  Some of them are Django being difficult, or inadequate. Some are me being a clueless novice, and Django not rescuing me from my folly. But all of them were obstacles.  I share them in the hopes of helping another Django novice.

Here are my Django gotchas.  They are ranked from the most distressing to most benign. They apply to Django 1.1, the current version at the time. (As of August 2010, the current version is 1.2.1.) A couple of gotchas were addressed by Django 1.2, so I moved them down to a section of their own. The rest presumably still apply to Django 1.2, but I haven’t gone back to check.

  1. API fails unhelpfully. I wrote a simple query expression like:
    S2 = models.TwStatus.objects.get( key )

    I got a lot of weird errors, e.g. “ValueError: too many values to unpack” (where key is a string) and “TypeError: ‘long’ object is not iterable” (where key is a long). I had made a mistake, of course; the call to get() should have a keyword argument of “id__exact” or the like, not a positional argument. The correct spelling is this:

    S2 = models.TwStatus.objects.get( id__exact=key )

    The gotcha is that Django’s .get() isn’t written defensively. It isn’t very robust to programmer errors. Instead of checking parameters and giving clear error messages, it lets bad parameters through, only to have them fail obscurely deep in the framework. If defensive programming of the Django API would slow it down too much in production, I’d love to have a debug mode I could invoke during development.

Continue Reading »
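
The kind of defensive check wished for in gotcha 1 might look roughly like the following. This is only a sketch in plain Python 3, not Django code: checked_get() and FakeManager are hypothetical names invented for illustration, and FakeManager merely stands in for a real manager such as TwStatus.objects so that the example runs without Django installed.

```python
def checked_get(manager, *args, **lookups):
    """Hypothetical wrapper: reject bare values before calling manager.get()."""
    for arg in args:
        # A plain string or number is almost certainly a forgotten keyword.
        if isinstance(arg, (str, bytes, int, float)):
            raise TypeError(
                "get() takes keyword lookups such as id__exact=%r; "
                "got a bare positional value" % (arg,)
            )
    return manager.get(*args, **lookups)


class FakeManager:
    """Stand-in for a Django manager (an assumption, for illustration only)."""

    def get(self, *args, **lookups):
        # A real manager would query the database; this just echoes the lookups.
        return lookups


manager = FakeManager()
print(checked_get(manager, id__exact=42))

try:
    checked_get(manager, "42")  # the mistake described in gotcha 1
except TypeError as err:
    print("Rejected:", err)
```

The check only rejects plain strings and numbers, so other positional arguments are still passed through to get() unchanged.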

Mobile tech transforms travelling

Posted by on 31 Jul 2010 | Tagged as: travel

An unexpected insight from our recently completed Turkey vacation was the way mobile devices and wi-fi networks simplified and improved our travel experience. It was a qualitative change, affecting how we spent our time and our money, and what we packed.

The crucial change was that I carried a borrowed Android smartphone, in place of my 2005-vintage Treo 650. The phone spoke Turkey’s GSM frequencies and was unlocked, ready to accept a local GSM chip. It could use wi-fi and 3G networking, and had a built-in camera. Ducky carried her iPhone, which she didn’t have for last summer’s Botswana trip. Though locked out of Turkish voice and data service, it still could use wi-fi networks and take pictures. We were travelling in parts of Turkey with good mobile voice and data coverage, and where we could easily steer to hotels with wi-fi service.

Let us count the ways this transformed our travel experience.

Continue Reading »

How about an IMLIG (Internationalisation, Multilingual, Localisation Interest Group) for Vancouver?

Posted by on 27 Jun 2010 | Tagged as: i18n, language, meetings and conferences, multilingual, Unicode, Vancouver, web technology

There is a lot of international, multilingual, and multicultural activity in Vancouver. Also, there’s a thriving tech scene. But there’s no place for the people in the intersection of those two circles — those interested in and working on the internationalisation, localisation, and multilingual aspects of technology projects — to get together and share ideas. I think there ought to be.

And I’ll even propose a name: IMLIG1604, the I18n L10n M10l I6t G3p (Internationalisation, Localisation, and Multilingual Interest Group) for North America’s 604 area code. If you can decipher the title, you’re in the club!
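
For anyone who wants help deciphering, abbreviations like I18n follow a simple rule: first letter, then the count of letters in between, then last letter. A minimal Python 3 sketch of that rule follows; the function name numeronym is an assumption made for illustration.

```python
def numeronym(word):
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 4:
        return word  # too short for the abbreviation to save anything
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("Internationalisation"))  # I18n
print(numeronym("Localisation"))          # L10n
```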

Continue Reading »

Why the PostScript language is Turing-complete

Posted by on 30 Apr 2010 | Tagged as: software engineering

A couple of weeks ago on the XML-dev mailing list, there was a discussion comparing declarative and procedural computer languages. Someone wondered why the PostScript language, though used mostly for declarative purposes like describing pages, was still a Turing-complete programming language. That’s actually a topic I know something about, so I contributed the following answer. I’m posting it here, lightly edited, because I thought it might be of wider interest. —JDLH

A good place to go for a discussion of why it is Turing-complete, despite being intended to describe page appearance, is in the Introduction (Chapter 1) of the PostScript Language Reference Manual.

In particular, it says, “The extensive graphics capabilities of the PostScript language are embedded in the framework of a general-purpose programming language. The language includes a conventional set of data types, such as numbers, arrays, and strings; control primitives, such as conditionals, loops, and procedures; and some unusual features, such as dictionaries. These features enable application programmers to define higher-level operations that closely match the needs of the application and then to generate commands that invoke those higher-level operations. Such a description is more compact and easier to generate than one written entirely in terms of a fixed set of basic operations.”

Continue Reading »

Tools for setting classical music and opera scores free

Posted by on 31 Mar 2010 | Tagged as: culture

I’m an amateur opera and symphonic chorus singer. Most of the classical music and opera I perform is old. Not just pre-iPhone old, but usually well over a hundred years old. These works have outlived even the outrageously long copyright terms imposed on our culture by greedy commercial interests. They are clearly in the public domain; they have returned to the shared culture from which they grew.

But when I want to learn a new work, like Verdi’s opera Macbeth or Mozart’s Requiem, why do I find myself paying $24-$40 for a music score which probably cost $5 to print? Why does the book contain stern warnings not to photocopy the contents, even if it is little more than a facsimile of a previous edition, which itself is in the public domain?  It is because these music score products still cling to a pre-internet business model, based on selling “molecules” (the physical artifact of the book) for a price based on the value of the “bits” (the information or arrangement of notes we call the musical composition, plus the value of the editing, plus the value of the typesetting), and the costs of distributing and warehousing those molecules.

This shouldn’t be. The music itself — the bits, the abstract genius which is Beethoven’s or Mahler’s, not the later editorial changes, or the molecules on which the bits are printed — is in the public domain, so its cost is zero. Volunteers are willing to scan or transcribe old musical scores for free. So a digital file with a score ought to be accessible for the marginal cost of storage, duplication and delivery.  And in an era of cheap disks and high-speed internet, that marginal cost is zero.

Many classical music and opera scores are indeed available, free for the downloading. Below are links to some useful sites for the classical or opera musician to find them. But there’s more. In the digital world, scores should get better, too: more correct, easier to use, more customised. If a fraction of every chorus and orchestra pitched in to ratchet forward the quality of the free scores for music they perform, we could make a huge difference.

Continue Reading »

Birdwatching the 2010 Olympics police

Posted by on 28 Feb 2010 | Tagged as: Canada, meetings and conferences, Vancouver

[Photo: Memphrémagog Police shoulder flash]

As the cheers still resound outside my apartment, from the street party below, let me report on my own Olympic sport: police-spotting. It’s like bird watching, but for police agencies.

Some 118 different police agencies from across Canada came to the Vancouver area as part of the $900 million 2010 Olympics security effort.  The RCMP sent over 4000 officers from provinces across Canada; various municipal police departments sent some 1700 more.  (20% of Canada’s policing power was at the Olympics.) I figured it would be fun to say hello to a constable from every one of those agencies. I didn’t get to them all, but it was fun trying.

Continue Reading »
