Choosing between UTF-8 and UTF-16: which has the better bytes-per-character ratio?
Posted by Jim DeLaHunt on 31 Dec 2010 at 08:18 pm | Tagged as: i18n, language, software engineering, Unicode
Software engineers are sometimes called on to specify which encoding a text file format should use. These days, the top contenders are UTF-8 and UTF-16, both based on the Unicode Standard. One factor (amongst several, and perhaps not the most compelling) in choosing between them is storage efficiency: the number of bytes per character, or the amount of storage per unit of text. If a given text takes a kilobyte of storage in UTF-8 and twice that in UTF-16, that's a twofold difference, and it may well matter.
I recently looked for quantitative data about space efficiency of UTF-8 and UTF-16, and couldn’t find very much. Engineering discussions about storage efficiency are better informed by quantitative data than by opinion and supposition. I want to give one morsel of quantitative data more visibility, and clarify this issue.
An example of where storage efficiency arises, in a choice between UTF-8 and UTF-16, comes from the XML-dev mailing list, which has recently been discussing creating a simpler subset or alternative to XML 1.0. An obvious simplification was to limit such a language to Unicode-encoded text. This led to a choice between accepting just UTF-8, just UTF-16, or both. I was surprised by comments regarding UTF-8, saying it was so clearly less storage-efficient than UTF-16 (or non-Unicode encodings) that the excessive storage would torpedo adoption of the format.
That doesn’t match my intuition. But what do the data say?
UTF-8 is a variable-length encoding of Unicode scalar values as bytes (or octets). UTF-16 is a variable-length encoding of Unicode scalar values as 16-bit code units, each of which can be stored as two bytes. The formal UTF-8 and UTF-16 specifications are in Section 3.9 of the Unicode Standard (version 5.2.0 or 6.0.0). For our purposes we can summarise them as follows:
| Unicode scalar values | UTF-8 bytes per character | UTF-16 bytes per character |
|---|---|---|
| U+0000-U+007F | 1 | 2 |
| U+0080-U+07FF | 2 | 2 |
| U+0800-U+D7FF | 3 | 2 |
| U+D800-U+DFFF | n/a (see note below) | n/a (see note below) |
| U+E000-U+FFFF | 3 | 2 |
| U+10000-U+10FFFF | 4 | 4 |
Note about U+D800-U+DFFF: these are the high-surrogate and low-surrogate code points, used as UTF-16 code units. They aren't legal Unicode scalar values, so no well-formed coded character sequence will call for these values. This is all defined in excruciating technical detail in Chapter 3 of The Unicode Standard.
What’s important to see from this table is that the number of bytes per character depends on which characters you are trying to encode, so the overall bytes-per-character ratio of a text depends on the frequency distribution of the characters in it. It’s also worth noting that UTF-8 and UTF-16 take the same storage for some ranges of Unicode scalar values, never differ by more than one byte per character, and that each form has the advantage for some range.
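To make that concrete, here is a minimal Python 3 sketch (my own illustration, not from the Unicode Standard) that prints the encoded width of one sample character from each non-surrogate row of the table. I use "utf-16-le" so that no byte-order mark is counted:

```python
# Encoded width of one sample character from each row of the table above.
# "utf-16-le" avoids counting a byte-order mark in the totals.
for ch in ["A", "é", "€", "😀"]:  # U+0041, U+00E9, U+20AC, U+1F600
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```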
What’s needed, obviously, is to take texts in various languages, encode them in UTF-8 and UTF-16, and count the number of bytes required for each form. Divide each byte count by the number of Unicode scalar values, and you get a bytes-per-character ratio. It’s best if implementers do this for their own texts, since any sample text may have a different frequency distribution of characters than the text one actually handles.
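A minimal sketch of that measurement in Python 3 might look like this (the function name and sample string are mine):

```python
# Bytes-per-character ratios for a text: encode it both ways, divide by
# the number of Unicode scalar values. A Python 3 str is a sequence of
# code points, so len(text) counts scalar values directly. "utf-16-le"
# is used so a byte-order mark does not skew the ratio.
def bytes_per_character(text):
    chars = len(text)
    return (len(text.encode("utf-8")) / chars,
            len(text.encode("utf-16-le")) / chars)

u8, u16 = bytes_per_character("All human beings are born free and equal.")
print(f"UTF-8: {u8:.2f} bytes/char, UTF-16: {u16:.2f} bytes/char")
```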
But sample texts are better than none. And Mark Davis has done this experiment. He took the plain text of the Universal Declaration of Human Rights in various languages as his subjects. He reported his results in a September 2006 email “Unicode & space in programming & l10n” to the Unicode mailing list. His results deserve more visibility than they’ve received, I think.
Here are Mark Davis’s results, a bit refactored and summarised:
| Language | Characters | UTF-8 bytes | UTF-8 bytes/char | UTF-16 bytes | UTF-16 bytes/char |
|---|---|---|---|---|---|
| English | 10,785 | 10,785 | 1.00 | 21,570 | 2.00 |
| Indonesian | 12,656 | 12,656 | 1.00 | 25,312 | 2.00 |
| Dutch | 12,921 | 12,923 | 1.00 | 25,842 | 2.00 |
| Italian | 12,086 | 12,167 | 1.01 | 24,172 | 2.00 |
| German | 12,088 | 12,264 | 1.01 | 24,176 | 2.00 |
| Spanish | 12,116 | 12,324 | 1.02 | 24,232 | 2.00 |
| French | 12,047 | 12,410 | 1.03 | 24,094 | 2.00 |
| Portuguese | 11,444 | 11,840 | 1.03 | 22,888 | 2.00 |
| Arabic | 7,842 | 14,009 | 1.79 | 15,684 | 2.00 |
| Russian | 11,987 | 21,910 | 1.83 | 23,974 | 2.00 |
| Korean | 4,937 | 11,626 | 2.35 | 9,874 | 2.00 |
| Hindi | 11,685 | 30,213 | 2.59 | 23,370 | 2.00 |
| Chinese | 3,135 | 8,779 | 2.80 | 6,270 | 2.00 |
| Japanese | 4,570 | 12,810 | 2.80 | 9,140 | 2.00 |
Some observations from the above table:
- All of these languages use 2 bytes of storage per character in UTF-16. This is because text in modern languages almost never uses code points outside the Basic Multilingual Plane (that is, above U+FFFF). Choosing UTF-16 won’t disadvantage one (living, actively used) human language compared to the others.
- Some languages use barely more than 1 byte of storage per character in UTF-8. Basically, these are the languages written in the Latin script. Accented letters may require more than one byte in UTF-8, but the unaccented letters are so much more frequent that there is little impact. For these languages, UTF-8 takes about half the storage of UTF-16.
- Some languages take about the same amount of storage for UTF-8 and UTF-16. Arabic, Russian, and Korean are in this category. The storage needed doesn’t differ by more than 20% between the two formats.
- Some languages take significantly more than two bytes of storage per character in UTF-8. Hindi, Chinese, and Japanese are in this group. Even here, the difference is not large: UTF-8 uses at most 40% more bytes per character than UTF-16 does for these languages.
The conclusion I draw is that both UTF-8 and UTF-16 have roughly similar storage efficiency for actively-used living languages. Some languages will see fewer bytes per character in one format, some will see fewer in the other, but the difference isn’t large.
Mark Davis makes two other observations in his Unicode mailing list post which I think are worth noting:
1. Proportion of data. A huge amount of space is often taken up by structure or other information. For example, it’s stunning to see how little of the web is textual content. Toss a few images on a web page, for example, and any differences in text size are completely swamped. Even without images, HTML markup takes up a very large proportion of the web. Similarly, take a Word document or PowerPoint presentation, extract just the text and compare the sizes…
2. Character frequency. One can’t just compare the amount that a particular character will grow or shrink; you have to look at the frequency of usage of characters in the language. The frequency of space (a single byte in UTF-8) can play a significant role in average storage requirements, for example.
An engineer who is worrying about the storage efficiency of UTF-8 versus UTF-16 should ponder how much storage will be used for storing stuff other than text. The smaller the fraction of storage devoted to text, the less significant the question of storage efficiency for that fraction. And the point about character frequency highlights that texts in many languages may use a surprising number of Latin characters, which are stored efficiently in UTF-8.
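As a rough check of that last point, a sketch like this (the function name and sample string are mine) reports what fraction of a text’s characters are ASCII, and hence one byte each in UTF-8:

```python
# Fraction of a text's characters that are ASCII (one byte each in UTF-8).
def ascii_fraction(text):
    return sum(1 for ch in text if ord(ch) < 0x80) / len(text)

# Even non-Latin text often carries ASCII spaces, digits, and punctuation:
print(ascii_fraction("Пример текста на русском языке, 2010 год."))
```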
And, don’t forget: general-purpose compression algorithms like LZW do a perfectly fine job of compressing octet sequences containing UTF-8 or UTF-16. If storage efficiency of text is a real issue, maybe deciding whether or not to compress the text is more significant than whether you use UTF-8 or UTF-16 to encode that text.
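For instance, here is a sketch using zlib (DEFLATE) from the Python standard library, since Python doesn’t ship an LZW compressor; the sample string is mine, and the principle carries over:

```python
import zlib

# Compare raw and compressed sizes of the same text in both encodings.
text = "Это пример русского текста. " * 100
for encoding in ("utf-8", "utf-16-le"):
    raw = text.encode(encoding)
    packed = zlib.compress(raw, 9)  # 9 = maximum compression level
    print(f"{encoding}: {len(raw)} bytes raw, {len(packed)} compressed")
```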
It would be useful to gather more data. How do the net storage efficiency numbers change if the texts are left marbled with markup, instead of reduced to a plain text? Does the choice of Unicode normalisation form affect the efficiency much? How effective is compression on the encoded text? Also, since everyone’s texts are different, it would be great to have some code or a service which would take a set of texts and generate storage efficiency statistics like the above. I love it when engineering arguments can be ended by gathering quantitative data.
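On the normalisation question, a starting point might be a sketch like this one, which compares UTF-8 sizes under NFC and NFD using the standard-library unicodedata module (the sample text is mine):

```python
import unicodedata

# NFC keeps accented letters precomposed; NFD splits them into a base
# letter plus combining marks, which usually costs extra UTF-8 bytes.
text = "déjà vu, tête-à-tête, el niño"
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, text)
    print(f"{form}: {len(normalized)} chars, "
          f"{len(normalized.encode('utf-8'))} UTF-8 bytes")
```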
In the meantime, I think that Davis’ study gives strong evidence for the proposition that, when choosing a text encoding, the difference in storage efficiency between UTF-8 and UTF-16 isn’t that great, and probably shouldn’t be the governing factor in your choice.