Jim DeLaHunt, world-ready

Choosing between UTF-8 and UTF-16: which has the better bytes-per-character ratio?

Software engineers are sometimes called on to specify which encoding a text file format should use. These days, the top contenders are UTF-8 and UTF-16, both encoding forms defined by the Unicode Standard. One factor (amongst several, and perhaps not the most compelling) in choosing between them is storage efficiency: the number of bytes per character, or the amount of storage per unit of text. If a given text takes a kilobyte of storage in UTF-8 and twice that in UTF-16, that’s a twofold difference, and at scale it may be meaningful.
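To make "bytes per character" concrete, here is a minimal sketch in Python that encodes a few short strings both ways and counts the bytes. The helper name `encoded_sizes` and the sample strings are my own illustrations, not data from the measurements discussed in this post, and "character" here means a Unicode code point. The UTF-16 figure uses the `utf-16-le` codec so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return code point count and encoded byte counts for one string.

    Hypothetical helper for illustration only.
    """
    return {
        "chars": len(text),                       # Unicode code points
        "utf8": len(text.encode("utf-8")),        # bytes in UTF-8
        "utf16": len(text.encode("utf-16-le")),   # bytes in UTF-16, no BOM
    }


# Illustrative samples (assumptions, not measured corpus data).
samples = ["hello", "naïve", "日本語", "😀"]

for s in samples:
    sizes = encoded_sizes(s)
    ratio8 = sizes["utf8"] / sizes["chars"]
    ratio16 = sizes["utf16"] / sizes["chars"]
    print(f"{s!r}: UTF-8 {ratio8:.2f} bytes/char, UTF-16 {ratio16:.2f} bytes/char")
```

Even these toy strings show that neither encoding wins everywhere: ASCII text is half the size in UTF-8, while the Japanese sample is smaller in UTF-16. That is why data from real text matters more than intuition.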

I recently looked for quantitative data about space efficiency of UTF-8 and UTF-16, and couldn’t find very much. Engineering discussions about storage efficiency are better informed by quantitative data than by opinion and supposition. I want to give one morsel of quantitative data more visibility, and clarify this issue.

Culture, and software engineering, in British Columbia

December 2010
