A friend pointed me to an interesting blog post, Which Unicode character should represent the English apostrophe? (And why the Unicode committee is very wrong.) by Ted Clancy, 3 June 2015. The argument: “The Unicode committee is very clear that U+2019 (RIGHT SINGLE QUOTATION MARK) should represent the English apostrophe…. This is very, very wrong. The character you should use to represent the English apostrophe is U+02BC (MODIFIER LETTER APOSTROPHE). I’m here to tell you why….” [Emphasis in the original.]

I understand that there might be many people on this planet who actually don’t care about English language orthography concerning the apostrophe, contractions, and Unicode plain text representations thereof. Go ahead, skip this post and go on with your day. I am completely captivated by such questions. I started writing a quick reply, which grew to the point where it seemed better to host it on my blog than on Clancy’s comments page.

I think Clancy has a very interesting proposition! I think the Unicode Standard has an important role to play in facilitating textual communication, and I’m glad to see Clancy take it seriously. Many posts of the form “Unicode got X wrong” or “Unicode should do Y” don’t take facilitation of communication seriously. (I’m looking at you, taco emoji partisans.)

I think Clancy’s disagreement is not with the Unicode Technical Committee’s decisions about U+02BC MODIFIER LETTER APOSTROPHE versus U+2019 RIGHT SINGLE QUOTATION MARK. It is with the UTC’s understanding of English language orthography. Is the apostrophe used for contractions within a word (e.g. “we’ve”) a letter, or is it punctuation, or is it something else? UTR-8, section 4.6 Apostrophe Semantics Errata, clearly thinks of it as punctuation. Clancy clearly thinks of it as a letter. Resolve that disagreement, and the encoding choice follows.

Clancy seems to have an unstated premise that English language orthography lends itself to simple implementation of word processors and regular expressions. So, if the simple implementation doesn’t work, it must be the fault of the character encoding. It couldn’t possibly be a matter of English language orthography being bloody complicated and inconsistent. I disagree with this view. I think English language orthography is a mess. I despair of simple rules like “\w consists of characters with the Unicode letter property” giving intuitive results.
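To make the \w point concrete, here is a minimal sketch in Python. It assumes the standard library's re module, whose \w matches Unicode alphanumerics plus underscore for str patterns; other regex engines may draw the line differently. The names APOSTROPHES and words are mine, chosen for illustration only.

```python
import re
import unicodedata

# Three candidate code points for the apostrophe in "we've".
APOSTROPHES = {
    "U+0027": "\u0027",  # APOSTROPHE (the vertical keyboard key)
    "U+2019": "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "U+02BC": "\u02bc",  # MODIFIER LETTER APOSTROPHE
}

def words(text):
    """Split text into runs of \\w characters, as a naive tokenizer would."""
    return re.findall(r"\w+", text)

for label, ch in APOSTROPHES.items():
    # Print the code point, its Unicode general category, and the tokens found.
    print(label, unicodedata.category(ch), words(f"we{ch}ve"))
```

With U+0027 or U+2019, both of which are punctuation, the contraction splits into “we” and “ve”. With U+02BC, which has the general category Lm (modifier letter), it stays as one word. That is the behaviour Clancy wants, but it falls out of the character's category, not out of anything the regex knows about English.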

I think it is very hard to figure out how to render these two phrases: 'Tis the Season and 'This is the season', she said., without deep knowledge of the English meanings, or shortcuts like an oddball-cases dictionary. Clancy seems to want users to supply one Unicode character for one case, and another Unicode character for the other. But most users just want to type the vertical apostrophe key on their keyboard, and let the machine figure it out. Thus, I disagree that “We wouldn’t have these problems if apostrophes were represented by U+02BC.” That approach would just push the problem down to the software which tries to map the vertical apostrophe key on the keyboard to either U+02BC or U+2019.
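A toy illustration of why the mapping is hard, as a sketch only. The function naive_curl below is hypothetical, not any real word processor's algorithm. It applies the simplest rule I can think of: a straight apostrophe at the start of the text, after whitespace, or after an opening bracket becomes a left quote; any other becomes a right quote.

```python
LEFT = "\u2018"   # LEFT SINGLE QUOTATION MARK
RIGHT = "\u2019"  # RIGHT SINGLE QUOTATION MARK

def naive_curl(text):
    """Replace each straight apostrophe using position alone (hypothetical rule)."""
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
        elif i == 0 or text[i - 1].isspace() or text[i - 1] in "([{":
            out.append(LEFT)   # looks like an opening quote
        else:
            out.append(RIGHT)  # closing quote or mid-word apostrophe
    return "".join(out)

print(naive_curl("'This is the season', she said."))  # quotes come out right
print(naive_curl("we've"))                            # mid-word apostrophe is fine
print(naive_curl("'Tis the Season"))                  # wrong: gets a left quote
```

The quoted sentence and “we've” come out as expected, but “'Tis” gets a left quote where a word-leading apostrophe belongs. Position alone cannot tell the two cases apart; the software needs a dictionary of exceptions or some knowledge of English, whichever code point it ends up emitting.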

(I hope readers will forgive me having fun with this post’s title. The phrase ‘’tain’t right, says he: storm in apostrophe manages to incorporate that single right-leaning mark as a closing quotation mark, a mid-word contraction apostrophe, and a word-leading contraction apostrophe. I had to type that by means of arcane multi-key presses on my Mac keyboard.)

Clancy doesn’t talk about whether using U+02BC MODIFIER LETTER APOSTROPHE to represent the English language apostrophe would introduce complications of its own. But one of the repeating lessons of the Unicode Standard’s development is that the other choice usually does bring complications. Typographer Michael Everson submitted a thoughtful and detailed paper, On the apostrophe and quotation mark, with a note on Egyptian transliteration characters (ISO/IEC JTC1/SC2/WG2 N2043, 1999-07-24), to the standards committees at the time. He says the modifier letter represented by U+02BC is used in the International Phonetic Alphabet, Azerbaijani, Nenets, Nivkh, Chukot, Eskimo, Khanty, Koryak, Kurdish, and Selkup. One of the difficult tasks the standards bodies face is to measure the impact of a change on languages and communities of which the proponent might not be aware.

What would it take to change the Unicode Standard to what Clancy advocates? I’ve never been on the relevant standards committees, so I don’t know the details. But I do watch those committees with interest, from a distance. It seems to me that making a convincing case would require several arguments. Clancy would first need to assemble more evidence of why the status quo is a problem. That would involve, for example, getting support from those word-processing and regular-expression toolmakers that this encoding change would help their situation. Second, he would need to bolster his claims about English language orthography. Is the apostrophe really a letter, rather than punctuation, or some ambiguous and complex other category? Third, he would need to take seriously the other uses for his preferred character, U+02BC MODIFIER LETTER APOSTROPHE. Would his change cause problems for the other uses of that character? Fourth, he would need to address text input: how would software map from a user’s push of a simple apostrophe key on the keyboard to the right character? Or would the user be forced to understand this issue, and push the right alternate key combination in the right circumstance? How likely would users be to comply? Finally, Clancy would need to write a specific proposal and present it to the right committees: the Unicode Technical Committee, and his national representative on the International Organization for Standardization/International Electrotechnical Commission’s Joint Technical Committee 1 / SubCommittee 2 / Working Group 2.

If that sounds like a lot of work, that’s because it is! Spare a thought for all the work represented by all the proposals over the decades, which have made the Unicode Standard the beautiful, messy jewel of technology, culture, and politics that it has become today.

So while I don’t share Clancy’s apparent indignation, or even his conviction that the Unicode Technical Committee got it wrong, I am impressed by Clancy’s willingness to engage with the Unicode Standard at a fairly sophisticated level. We need more people doing this.

[Update: fixed typo in this post’s “slug” (URL). Also copy editing. Sorry.]

[Update 2: corrected location of closing quotation mark within post title, and within text.]