Unicode

Archived Posts from this Category

I adopted Unicode character U+5B57 「字」!

Posted by on 28 Feb 2022 | Tagged as: Japan, language, Unicode, web technology

The Gold Sponsor of U+5B57 「字」

One fun thing I did, late in 2021, was to donate a bit of money to the Unicode Consortium to sponsor U+5B57 「字」, my favourite of their more than 144,000 characters. It is a silly thing, but also a bit noble, and a bit useful, and a bit interesting if one peels back the cover and looks at the mechanisms to which it connects. In other words, it is the sort of thing I like to do.

Continue Reading »

StackOverflow 10K

Posted by on 31 Jan 2022 | Tagged as: i18n, Python, software engineering, technical support, Unicode, web technology

I have been active on StackOverflow for more than twelve years. StackOverflow is a phenomenally successful question and answer website, aimed at software developers seeking technical answers. Part of what makes StackOverflow successful is that it gamifies “reputation”: your reputation goes up when you write good answers, and ask good questions, and otherwise help. On 23 December 2021, my StackOverflow reputation rose past 10,000. This is a gratifying milestone.

I am user Jim DeLaHunt on StackOverflow. I apparently posted my first question there on 23 November 2009. I asked if anyone could point me to “an XML language for describing file attributes of a directory tree?” I did not get a good direct answer. I did get a reference to the XML-dev email list, which I follow to this day. My first answer was to my own question about the XML language. My first answer to someone else’s question was about three weeks later, and it was about detecting a character encoding.

Over twelve years, I have written 133 answers, most of which languish in obscurity. Three have earned particularly many upvotes (and, between them, over 40% of my reputation):

  1. “How to escape apostrophe (‘) in MySql?” This is a pretty simple answer. I suspect that it gets a lot of upvotes because many people ask this question. My answer also has the virtue that it quotes a specific clause in the official documentation to prove that the answer is correct. Not all StackOverflow answers cite reliable sources. This answer has earned 226 votes to date, bringing in over 22% of my total reputation.
  2. “Is there a way to pass optional parameters to a function?” This too is a simple answer to a frequently-asked question. I cited an official source in this answer also. This answer has earned 116 votes to date, bringing in over 11% of my total reputation.
  3. “What exactly is a “raw string regex” and how can you use it?” I think this is the best answer of the three. It finds a way to clarify a particularly murky area of the Python language, which often baffles people. I think it is easier to understand than the official documentation. This answer has earned 108 votes to date, bringing in over 10% of my total reputation. I think it was a vote on this question which put me over 10,000. I like that.
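For readers who have not run into the two Python topics in that list, here is a minimal sketch of optional parameters and raw strings. It is only an illustration and is not taken from the answers themselves; the function name `greet` and the sample strings are invented for the purpose.

```python
import re


def greet(name, greeting="Hello"):
    """`greeting` has a default value, so callers may leave it out."""
    return f"{greeting}, {name}!"


# In a raw string literal the Python parser leaves backslashes alone, so it is
# the regex engine that gets to interpret "\b" as a word boundary.
word_boundary = r"\bcat\b"
# In a normal literal, "\b" is turned into a single backspace character
# before the regex engine ever sees it.
backspace = "\bcat\b"

if __name__ == "__main__":
    print(greet("Jim"))                          # default greeting
    print(greet("Jim", greeting="Konnichiwa"))   # optional parameter supplied
    print(len(word_boundary), len(backspace))    # the two literals differ in length
    print(bool(re.search(word_boundary, "a cat sat")))
    print(bool(re.search(backspace, "a cat sat")))
```

The raw pattern finds the word "cat", while the non-raw one looks for literal backspace characters around it and finds nothing.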

StackOverflow turns the reputation score into a variety of rankings. They put me in the top 4% for reputation overall. This sounds very impressive, until you learn that I am ranked only 24,308th among all participants. Mind you, there are over 16 million participants. I imagine there is a long, inactive tail, compared to which my small activity looks great.

In a similar vein, StackOverflow ranks me among the top 5% in the topics of “Python” and “MySQL”; the top 10% in “Unicode”; and the top 20% in “Internationalization”, “UTF-8”, and “Django”. That reflects some combination of effort on my part, and flattery due to the long, inactive tail.

I put a lot of work, 8-10 years ago, into answering questions and building my reputation. Now I find that upvotes trickle in for my existing 133 answers. My reputation rises surprisingly steadily, even if I don’t contribute anything new, giving me a kind of StackOverflow pension. But I still get satisfaction from plugging away there every now and again, trying to find a good question and write a clear answer. Maybe, in less than 12 years from now, I might reach StackOverflow 20,000.

Top issues in Universal Acceptance of non-Latin email addresses and domain names (IUC45 session)

Posted by on 31 Oct 2021 | Tagged as: meetings and conferences, Unicode, Universal Acceptance

Two weeks ago was the Internationalization and Unicode Conference. This year is the 45th conference, or IUC45. I delivered a presentation: Top issues in Universal Acceptance of non-Latin email addresses and domain names. Here are my slides.

Continue Reading »

IUC45 talk: “Top issues in Universal Acceptance”

Posted by on 30 Jun 2021 | Tagged as: meetings and conferences, Unicode, Universal Acceptance

I’m delighted to be presenting, once again, at the 45th Internationalization and Unicode Conference (IUC45). The conference is the gathering of my “tribe”, people who are as enthusiastic about language, text, and software as I am. If you like this stuff, it’s the best place in the world to be for those three days. Or, given the pandemic, the conference might be partially or completely virtual, so that webcast is the best UDP session in the world. In either case, please register and join us there.

Continue Reading »

Earth, Moon, and abolishing leap seconds: the curious astronomy and politics of time() (IUC44 session)

Posted by on 30 Nov 2020 | Tagged as: culture, meetings and conferences, time, Unicode

Last month was the pandemic-distanced rendition of the Internationalization and Unicode Conference. This year is the 44th conference, or IUC44.  In addition to a tutorial (blogged about last month), I delivered a presentation: Earth, Moon, and abolishing leap seconds: the curious astronomy and politics of time(). Here are my slides, and a video of me talking through my slides.

Continue Reading »

Email addresses and domain names are NON-latin! Now what? (IUC44 tutorial)

Posted by on 31 Oct 2020 | Tagged as: meetings and conferences, Unicode, Universal Acceptance

Two weeks ago was the pandemic-distanced rendition of the Internationalization and Unicode Conference. This year is the 44th conference, or IUC44.  In addition to a presentation (to be blogged later), I delivered a tutorial: Email addresses and domain names are NON-latin! Now what? Here are my slides, and a video of me talking through my slides.

Continue Reading »

PostScript code converting UTF-8 to UTF-16

Posted by on 31 Aug 2020 | Tagged as: robobait, software engineering, Unicode

This is a little bit of code which was fun and nostalgic to write, even though the motivating project fell through. I wrote PostScript language functions to convert strings with UTF-8 contents, into strings with UTF-16 contents. This was intended to be part of a batch tool to convert PDF documents to PDF/A format, but that did not work out. However, the code works, and here it is.
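To give a feel for what such a conversion involves, here is a rough sketch of the same idea in Python rather than PostScript. It is not the code from the post. The function names are invented, the big-endian byte order is an assumption, and the decoder assumes well-formed input (it does not reject overlong forms or encoded surrogates).

```python
def utf8_to_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of Unicode code points.

    Sketch only: checks lead and trail byte patterns, but does not reject
    overlong encodings or surrogate code points.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: one-byte sequence
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:        # 110xxxxx: two-byte sequence
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:      # 11110xxx: four-byte sequence
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        trail_bytes = data[i + 1 : i + 1 + extra]
        if len(trail_bytes) != extra:
            raise ValueError(f"truncated sequence at offset {i}")
        for trail in trail_bytes:
            if trail >> 6 != 0b10:      # trail bytes must look like 10xxxxxx
                raise ValueError(f"invalid trail byte after offset {i}")
            cp = (cp << 6) | (trail & 0x3F)
        code_points.append(cp)
        i += 1 + extra
    return code_points


def code_points_to_utf16be(code_points) -> bytes:
    """Encode code points as big-endian UTF-16, without a byte order mark."""
    out = bytearray()
    for cp in code_points:
        if cp >= 0x10000:
            # Code points beyond the BMP become a high/low surrogate pair.
            offset = cp - 0x10000
            units = (0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            out += unit.to_bytes(2, "big")
    return bytes(out)


def utf8_to_utf16be(data: bytes) -> bytes:
    return code_points_to_utf16be(utf8_to_code_points(data))


if __name__ == "__main__":
    sample = "字 and 😀".encode("utf-8")
    print(utf8_to_utf16be(sample).hex(" "))
```

In everyday Python one would simply write `data.decode("utf-8").encode("utf-16-be")`; the long form above only spells out the bit manipulation that a language without built-in codecs has to do by hand.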

Continue Reading »

A settler’s guide to reading, typing, and spelling Vancouver’s new shibboleths

Posted by on 30 Jun 2018 | Tagged as: community, culture, Unicode, Vancouver

My home, Vancouver B.C., just announced new names for two public places: “šxʷƛ̓ənəq Xwtl’e7énḵ Square” and “šxʷƛ̓exən Xwtl’a7shn”. In contrast to just about every other name in this town, these names are not Scottish- or English-derived. Nor are they a Chinese phoneticisation of a Scottish-derived name. Instead, at long last our town asked the First Nations leaders, whose people have been here the longest by far, to contribute the names. I think it is awesome. It is a step towards reconciliation, tiny but real. I think these names will become Vancouver’s new shibboleths.

But names like these represent change, and change is unsettling. The characters are unfamiliar-looking! We don’t know how to pronounce them! There are rectangular boxes showing missing text! There is no ə key on our keyboards! Heh. We seem to have no problem expecting immigrants who grew up with Chinese or Ge’ez or Gujarati writing to learn how to write and pronounce “Granville”, but we are reluctant to step up when it’s our turn.

Never fear. I’m a software engineer specialising in internationalisation and Unicode. Let me explain how to read, type, and spell these names.  It’s really very interesting. Continue Reading »
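As a small taste of what is inside such a name, the sketch below lists the Unicode code points in the first word of the square's name. It is an illustration only. The word is written with escapes, on the assumption that they match the spelling quoted above, and the names `WORD` and `describe` are invented for the example.

```python
import unicodedata

# First word of the square's name, spelled with escapes so that the code
# points are unambiguous. Assumption: this matches the spelling quoted above.
WORD = "\u0161x\u02b7\u019b\u0313\u0259n\u0259q"


def describe(text):
    """Return (code point label, Unicode character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


if __name__ == "__main__":
    for code, name in describe(WORD):
        print(code, name)
    # The same visible text can be stored with different numbers of code
    # points, depending on the normalization form.
    nfc = unicodedata.normalize("NFC", WORD)
    nfd = unicodedata.normalize("NFD", WORD)
    print(len(nfc), len(nfd))
```

The combining comma above (U+0313) has no precomposed form with the letter before it, so a font has to position it correctly. That is one reason why text like this can show up as rectangular boxes or misplaced marks on systems with limited font support.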

Top Posts: Why Unicode has separate codepoints for “characters with identical glyphs”

Posted by on 31 May 2018 | Tagged as: i18n, multilingual, robobait, software engineering, Unicode

I post on various forums around the net. Sometimes I am able to tap into such inspiration that I want to add that essay to my portfolio. Such was the case here. The question: Why does Unicode have separate codepoints for characters with identical glyphs? My response begins: The short answer to this question is, “Unicode encodes characters, not glyphs”. But like many questions about Unicode, a related answer is “plain text may be plain, but it’s not simple”.… Continue Reading »
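A quick way to see "characters, not glyphs" in action is to look at three capital letters which most fonts draw identically. The sketch below is an illustration and not part of the original forum answer; the names `LOOKALIKES` and `identify` are invented for the example.

```python
import unicodedata

# Three characters that usually share one glyph shape:
# Latin A, Greek Alpha, Cyrillic A.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def identify(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in LOOKALIKES:
        # Each character keeps its own identity and its own lowercase form,
        # even though the capitals look the same on screen.
        print(ch, identify(ch), "lowercase:", ch.lower())
```

Each of the three belongs to a different script and lowercases to a different letter, which is behaviour that a single shared code point could not express.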

Email addresses and domain names are NON-latin! Now what? (IUC41 tutorial)

Posted by on 28 Feb 2018 | Tagged as: i18n, meetings and conferences, multilingual, Unicode, web technology

Last fall I attended the Internationalization and Unicode Conference. That year was the 41st conference, or IUC41. In addition to a presentation (described in a blog post last October), I delivered a tutorial: Email addresses and domain names are NON-latin! Now what? I should have blogged about my slides last October, but better late than never. Here are my slides. Continue Reading »
