I adopted Unicode character U+5B57 「字」!
Posted by Jim DeLaHunt on 28 Feb 2022 at 11:47 pm | Tagged as: Japan, language, Unicode, web technology
One fun thing I did, late in 2021, was to donate a bit of money to the Unicode Consortium to sponsor U+5B57 「字」, my favourite of their more than 144,000 characters. It is a silly thing, but also a bit noble, and a bit useful, and a bit interesting if one peels back the cover and looks at the mechanisms to which it connects. In other words, it is the sort of thing I like to do.
The Unicode Consortium is a US-based charity, a 501(c)3 nonprofit organisation. For over 30 years, it has worked to “make modern software and computing systems support the widest range of human languages.” How wide? “There are approximately 7,000 living human languages…. Less than 100 of these languages are well-supported on computers, mobile phones, and other devices, while all the rest risk being digitally disadvantaged.” This is how the Consortium describes how they use donations.
The 30 years of Unicode neatly overlap with my career path as a software engineer. I started roughly when they did, and my interest in the places where culture and technology intertwine is threaded through with Unicode-related contributions. I support them, and attend their conferences, partly because it is fascinating content, and partly because I want to live in a world where this kind of work thrives.
Nonprofit organisations live on donations. A major source of Unicode Consortium donations is membership fees paid by big names like Apple, Microsoft, Adobe, Google (Alphabet), Facebook, and other companies; UC Berkeley; and quasi-governmental bodies from India, Bangladesh, Oman, and more. In return for that, they get access, and a vote on technical decisions about the Unicode Standard. But individuals also become members of the Consortium, paying lower fees. This doesn’t buy a vote, but it does give a bit of access to the email chatter about Unicode Standard decisions. More importantly, it demonstrates that the Consortium’s work has a broad base of support. They can parlay that into other kinds of financing and support.
I have been proud to be an individual member of the Unicode Consortium for several years. In 2021, I donated a little more, and became a Lifetime Member of the Unicode Consortium (the 55th, apparently). If you appreciate being part of a world which has the Unicode Standard in it, then I would encourage you to become an individual member also. There is an easy-to-use form on the Unicode website, and it takes credit cards.
Another stream of donations to the Consortium in recent years has been from adopting characters. This is a bit of a gimmick, like “buying a brick” at a new hospital or museum. You make a donation, and get your name inscribed on a brick. The Unicode Consortium does not have bricks, but it does have those more than 144,000 characters. Donate a little, and you can be one of many “Bronze” sponsors of a particular character. Give a little more, and you can be one of a few “Silver” sponsors. Stretch your wallet (or divert a rounding error of your Big Corporation’s budget) and you can be the sole “Gold” sponsor. What do you get? Well, a badge, like mine above. You appear on a web page of adopted Unicode characters. You can claim a certain degree of bragging rights. But mostly, you get the satisfaction of helping humans communicate better on modern software and computing systems.
Let me introduce you to the character that I adopted: U+5B57 「字」. It is used in writing Japanese and Chinese. Its primary meaning is “character”. You might call it the character for “character”. I like to mangle this phrase in Japanese as 「字の字」 “ji no ji”.
I have a soft spot for this ji no ji. I use it to write my name “Jim” in Japanese: 「字夢」 “jimu”. This could be read as “dreams of characters”. It is apt, for someone who has worked with text and fonts for much of my career. I have also spent a lot of effort studying “kanji”, and working with “characters”. Plus, it is such a simple character that even my inadequate Japanese skills can write it!
The shape of ji no ji comes from the Han script, developed centuries ago in China. It is thought of as having two parts: radical 39 below, similar to the character for “child”, and a 3-stroke part above, similar to the character for “lid”. It is written with six strokes, as depicted in this marvellous animation (by Jim Rose, licensed from Kanji Café).
The exact shape of the character is a matter of typeface design. The depiction above uses the Kozuka Mincho Pro Light font from Adobe Systems. It is in the “mincho” style, a stylisation of brush calligraphy, adapted for typography. Other typefaces would have different stroke shapes, different balances between the thick and thin strokes, and so on. I chose the L (light) weight to emphasise the design’s width contrasts. There are other styles. The Japanese “Gothic” style has squared stroke ends and more uniform stroke width. A more traditional style puts the top-most stroke at an angle. The CHISE project page for 字 displays the character in many different typeface designs.
In Unicode terms, a character is an abstract concept. The specific shape drawn to represent the concept is a “glyph”. This diagram shows the shape of the glyph associated with U+5B57. I chose the Medium weight of Kozuka Mincho Pro to make the thin strokes wide enough to separate the sides. Modern computer systems render glyphs with the help of digital fonts. The part of the font which draws a glyph is a computer program in itself. There are precise instructions to draw curves and lines, to close shapes, to stroke and fill. The “machine code” for the Kozuka Mincho Pro font glyph programs is the Compact Font Format. The overall structure of the digital font uses the OpenType font format (and related ISO/IEC 14496-22 “Open Font Format”).
This U+5B57 character is commonly used in Japanese and Chinese. The Japanese word 「漢字」 “kanji” literally means “Han character”, that is, a character of Chinese origin, and the 「字」 “ji” in this word is spelled using U+5B57. The word 「文字」 “moji” is the translation of English “character”. The lovely Japanese term 「文字化け」, written “mojibake”, pronounced “moji-ba-keh”, literally means “character freakout”. It describes a once-common problem where a computer system would misinterpret some text and display it as wildly wrong characters.
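To make mojibake concrete, here is a minimal Python sketch of one way it can happen to ji no ji. It assumes the most familiar failure: text stored as UTF-8 but read back as Windows-1252 (Python's `cp1252` codec). The variable names are mine, purely for illustration.

```python
# Illustrative sketch: reproduce and repair mojibake for U+5B57.
original = "\u5b57"  # the character for "character"

# Correct storage: UTF-8 turns the character into three bytes, E5 AD 97.
utf8_bytes = original.encode("utf-8")

# The misinterpretation: read those UTF-8 bytes as Windows-1252 text.
# Each byte becomes its own, unrelated character.
garbled = utf8_bytes.decode("cp1252")

# The repair: undo the wrong decoding step, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))         # three characters: a-ring, soft hyphen, em dash
print(repaired == original)  # True
```

The repair only works while all three bytes survive. If a system along the way drops an invisible character such as the soft hyphen, the original cannot be recovered mechanically.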
In Japanese, the U+5B57 character has the pronunciation “ji” (known as the “on-yomi”), and more rarely “aza”, “umu”, or “masu” (known as the “kun-yomi”). Jim Breen’s WWWJDIC dictionary page on U+5B57 has more linguistic information on ji no ji.
The name “U+5B57” is a way the Unicode Standard refers to this character. The prefix “U+” is a Unicode convention. “5B57” is a hexadecimal form of the number 23,383. This number is the Unicode scalar value for this character. Fundamentally, each character in the Unicode Standard is referred to by a scalar number, which can appear in various forms. Hexadecimal happens to be compact and convenient.
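The relationship between the forms is easy to check. This is a small Python sketch using only built-in functions; the variable names are my own.

```python
# Illustrative sketch: the same scalar value in several forms.
scalar = int("5B57", 16)     # parse the hexadecimal digits
print(scalar)                # 23383

ch = chr(scalar)             # the character with that scalar value

# Format the scalar value back into the conventional "U+" notation,
# using at least four uppercase hexadecimal digits.
label = f"U+{ord(ch):04X}"
print(label)                 # U+5B57
```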
U+5B57 is one of the CJK Unified Ideographs, used for writing Japanese, multiple kinds of Chinese, Korean, and once upon a time, Vietnamese. As such, the Unicode Standard does not give it a descriptive textual name; its formal name is simply derived from its code point, CJK UNIFIED IDEOGRAPH-5B57.
The Unicode Standard provides a great deal of cross-reference information about U+5B57 in the Unihan Database. For instance, the Japanese JIS X 0208 standard lists this character as decimal number 15,226 (hexadecimal 3B7A). The Dai Kanwa Jiten dictionary lists it as number 6942.
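The JIS number can be cross-checked without opening the Unihan Database. The sketch below leans on an assumption: that Python's bundled `euc_jp` codec maps this character according to JIS X 0208. EUC-JP stores a JIS X 0208 character as its two JIS bytes, each with 0x80 added. The function name `jis_x_0208_code` is hypothetical, invented for this example.

```python
# Illustrative sketch: derive a JIS X 0208 code from the EUC-JP encoding.
def jis_x_0208_code(ch: str) -> int:
    """Return the 16-bit JIS X 0208 code for ch (hypothetical helper)."""
    euc = ch.encode("euc_jp")
    # Two-byte sequences starting with 0x8E are half-width katakana,
    # and longer or shorter sequences are not JIS X 0208 either.
    if len(euc) != 2 or euc[0] == 0x8E:
        raise ValueError("not a JIS X 0208 character in EUC-JP")
    high = euc[0] - 0x80
    low = euc[1] - 0x80
    return (high << 8) | low

code = jis_x_0208_code("\u5b57")
print(f"{code:04X}")  # 3B7A
print(code)           # 15226
```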
Adobe Systems has gone to great efforts to standardise the glyphs which appear in Chinese, Japanese, and Korean fonts. For Japanese fonts, they published the Adobe-Japan1 Character Collection specification, cataloguing each of several thousand glyphs which might appear in a Japanese font. Adobe-Japan1 lists the glyph for U+5B57 as number 2,248.
When writing Unicode characters to a computer file, or other kinds of storage, the Unicode Standard defines Unicode Transformation Formats to represent them. In the UTF-8 format, U+5B57 is stored as three 8-bit units: E5 AD 97 in hexadecimal. In the UTF-16BE format, it is stored as one 16-bit unit: 5B57. In the UTF-16LE format, it is stored as one 16-bit quantity, but “little end” first: 575B. (Inverted though it may seem, the LE form is more convenient on many widely used computer systems today.)
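These byte sequences can be reproduced in a few lines of Python. The first part uses the standard codecs. The second part is my own hand-written sketch of the three-byte UTF-8 bit layout, with a hypothetical function name, to show where E5 AD 97 comes from. It covers only scalar values from U+0800 to U+FFFF, excluding surrogates.

```python
# Illustrative sketch: U+5B57 in three Unicode Transformation Formats.
ch = "\u5b57"

utf8 = ch.encode("utf-8")
utf16be = ch.encode("utf-16-be")
utf16le = ch.encode("utf-16-le")

print(utf8.hex().upper())     # E5AD97
print(utf16be.hex().upper())  # 5B57
print(utf16le.hex().upper())  # 575B


def utf8_three_byte(scalar: int) -> bytes:
    """Pack a scalar value into the three-byte UTF-8 form (hypothetical helper)."""
    if not 0x0800 <= scalar <= 0xFFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("outside the three-byte UTF-8 range")
    return bytes([
        0xE0 | (scalar >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((scalar >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (scalar & 0x3F),         # 10xxxxxx: low 6 bits
    ])

print(utf8_three_byte(0x5B57) == utf8)  # True
```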
So, U+5B57 is a commonplace, unassuming character. But it has sentimental meaning for me. And it is a jumping-off point to understanding much about how text, characters, and glyphs are handled in modern computer systems. I am grateful to the Unicode Consortium for its work to make all this possible. I am glad to have taken the opportunity to support that work.