Jim DeLaHunt, world-ready

This is a little bit of code which was fun and nostalgic to write, even though the motivating project fell through. I wrote PostScript language functions to convert strings with UTF-8 contents into strings with UTF-16 contents. This was intended to be part of a batch tool to convert PDF documents to PDF/A format, but that did not work out. However, the code works, and here it is.

The primary procedure is cvs_utf8. It takes a string containing UTF-8 bytes, and returns a new string containing a byte order mark (U+FEFF) followed by the same text as UTF-16BE bytes.
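
As a minimal sketch of a call, assuming cvs_utf8 has already been defined by running the cvs_utf8.ps listing in this post: the letter ü (U+00FC) is the two UTF-8 bytes C3 BC, which can be written either with octal escapes in a literal string or as a hex string. Both calls should return the byte order mark followed by 00 FC.

```postscript
% Assumes cvs_utf8 and cvi2_uni are already defined.
(\303\274) cvs_utf8 <FEFF00FC> eq =    % literal string with octal escapes
<C3BC>     cvs_utf8 <FEFF00FC> eq =    % the same bytes as a hex string
```

Each line prints true when the result matches the expected UTF-16BE string.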

The intended scenario is a PostScript language script which combines this code with a caller-supplied fragment containing UTF-8 encoded PostScript language strings and calls to cvs_utf8. macOS and Linux programs generate UTF-8 text more easily than UTF-16, while PostScript interpreters have more use for UTF-16 strings than UTF-8. For instance, an interpreter which is distilling a PostScript page description to a PDF file can take UTF-16 strings as input to a /DOCINFO pdfmark call. (See the pdfmark reference for more on /DOCINFO.)
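
A caller-supplied fragment of that kind might look like the following sketch. It assumes cvs_utf8 is already defined earlier in the same job. The Title value is a made-up example, and the first line is the usual guard which makes pdfmark harmless on interpreters that do not implement it.

```postscript
% Assumes cvs_utf8 and cvi2_uni are already defined earlier in the job.
% If pdfmark is not defined, make it discard its operands.
/pdfmark where { pop } { userdict /pdfmark /cleartomark load put } ifelse

[ /Title  (R\303\251sum\303\251) cvs_utf8   % hypothetical title, UTF-8 bytes as octal escapes
  /Author (Jim DeLaHunt) cvs_utf8           % plain ASCII is also valid UTF-8
  /DOCINFO pdfmark
```

cvs_utf8 opens and closes its own mark internally, so it can be called between the opening [ and the pdfmark operator without disturbing the caller's mark.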

% cvs_utf8.ps
% by Jim DeLaHunt, 2020. Contributed to the public domain.
%
% Strings must be PostScript language string objects containing 
% the text to be put into the PDF document info, encoded in UTF-8,
% with those code units (bytes) represented as PostScript language 
% strings. Likely it will be easiest to generate literal strings, 
% enclosed in parentheses. PostScript language hex and ASCII85 strings
% are accepted also. 
% 
% If using PostScript literal string syntax with the shell "echo"
% command, then be careful of the following.
%  * The string is delimited by parentheses. They may contain balanced
%    pairs of open and closing parentheses.
%  * In shell commands, enclose literal strings in quotes to protect 
%    them from the shell. To the shell, parentheses mean to run a command.
%  * Any unbalanced parentheses in the string must be preceded by a 
%    backslash ('\'). 
%  * Precede all parentheses with backslash, and you won't need to worry
%    about them being balanced.
%  * Precede all backslashes in the string with a backslash.
%  * Non-printing characters (tab, newline) are permitted in the string
%    but are easy to mess up. Replace them with PostScript language 
%    escapes ('\t', '\n') or with octal number sequences ('\011', '\012')

% The following comments use "uni" to stand for a Unicode scalar value.
% This is an integer in the range 0..0x10FFFF which represents a Unicode
% character.

% cvi2_uni: convert a Unicode scalar value into the two or four integer
% byte values which make up the UTF-16BE code units for that scalar value.
% It preserves the top value of the stack for the caller to keep using.
/cvi2_uni {  % uni any cvi2_uni b0 b1 (b2 b3)? any
    exch                   % any uni
	dup 16#10000 ge {
		% uni is large enough to require encoding as surrogate pairs.
		% Per UTF-16 spec, subtract 0x10000 from scalar value, resulting
		% in a 20-bit integer in range 0..0xFFFFF. 
		16#10000 sub        % any uni'
		% Extract upper 10 bits 
		dup 16#400 idiv     % any uni' upper_10_bits
		16#D800 add         % any uni' leading_surrogate
		exch                % any leading_surrogate uni'
		cvi2_uni            % any b0 b1 uni'
		4 -1 roll exch      % b0 b1 any uni'
		16#400 mod          % b0 b1 any lower_10_bits
		16#DC00 add         % b0 b1 any trailing_surrogate
		exch                % b0 b1 trailing_surrogate any
		cvi2_uni            % b0 b1 b2 b3 any
	}{
		% uni is small enough to encode as a single code unit
		% Extract upper 8 bits
		dup 16#100 idiv		% any uni b0
		exch 16#100 mod     % any b0 b1
		3 -1 roll           % b0 b1 any
	} ifelse                % b0 b1 (b2 b3)? any 
} bind def

% cvs_utf8: convert string with UTF-8 encoded contents into a string 
% with UTF-16BE contents which Acrobat accepts in DOCINFO pdfmark calls.
% See PDF32000_2008 "7.9.2.2 Text String Type".
% This function assumes the UTF-8 string is valid and well-formed.
/cvs_utf8 {  % (utf-8 string) cvs_utf8 (BOM utf-16BE string)
	[ 16#FEFF 3 -1 roll		% mark uni (utf-8 string)
	% loop invariant: two top stack positions are always either
	% 1. the previous unicode scalar, and an integer value of a leading 
	%    byte of the next character of the string, or
	% 2. an integer k being turned into a scalar, and an integer value 
	%    of a continuation byte of a multi-code unit character.
	% Below those two positions, are a mark, and 0 or more integer byte 
	% values for earlier characters of the string.
	% N.B. we use base-2 PostScript radix integers, e.g. 2#11000000 to 
	% show the bit patterns of the UTF-8 code units.
	{						% [ b... uni_or_k c
		dup 2#01111111 le {
			% c is a 1-code unit character and is already a new
			% unicode scalar value.
			% uni_or_k is the old unicode scalar value. Convert it to bytes. 
			cvi2_uni    	% [ b... b0 b1... uni_new
		}{
			% c is either a continuation code unit or the first code
			% unit of a new multi-code-unit character
			
			dup 2#10111111 le {
				% c is a trailing (continuation) byte
				% uni_or_k is k, a scalar value in progress. Add to it.
								% [ b... k c
				% strip leading 1 bit off c
				2#10000000 sub  % [ b... k c'
				exch 			% [ b... c' k
				% shift k left by 6 bits, to make room for c'
				2#1000000 mul   % [ b... c' k-left-shifted
				add             % [ b... k'
			}{
				% c is the lead byte of a 2-, 3-, or 4-byte sequence.
				% uni_or_k is the old unicode scalar value. Convert it
				% to UTF-16. 
				cvi2_uni        % [ b... b0 b1... c
				
				% Figure out which prefix c has. 
				% The leading consecutive 1 bits, followed by a 0 bit,
				% are a prefix. Bits following the first 0 bit are the 
				% start of the scalar value, k. 
				
				% find largest prefix which is <= c				
				2#11000000		%  [ b... c prefix
				[2#11100000 2#11110000] {	% ... c prefix0 prefix1
					% if prefix1 <= c then reject prefix0 as < prefix1
					% if prefix1 > c then reject prefix1 as > c
					dup 3 index le {exch} if  % ... c prefix' rejected
					pop			% ... c prefix'
				} forall 		% ... c prefix % (largest prefix <= c)

				% subtract prefix, which strips leading 3-5 bits from c
				sub             % [ b... b0 b1... k
			} ifelse % dup 2#10111111 le
		} ifelse % dup 2#01111111 le 
							
	} forall % (utf8 string)
								% [ b... uni
	/any cvi2_uni pop	        % [ b... b0 b1 (b2 b3?)
	]                       	% [ b... ]
	dup length string       	% [ b... ] (      )
	0 1 2 index length 1 sub    % [ b... ] (      ) 0 1 len-1
	{							% [ b... ] (      ) i
		2 index 1 index get     % [ b... ] (      ) i b[i]
		2 index 3 1 roll put    % [ b... ] ( b[i] )
	} for 
	exch pop                    % (b...)
} bind def

The names cvs_utf8 and cvi2_uni are inspired by the PostScript language cvs and cvi operators, which convert various data types to strings and integers respectively. _utf8 alludes to “from UTF-8 string”. i2 alludes to two or four integers, to be used as byte values. _uni alludes to “from a Unicode scalar value (integer)”.
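
The arithmetic inside the two procedures can be traced by hand. The following sketch uses only built-in operators, so it does not depend on the listing above. The sample characters (U+20AC and U+1F600) are chosen here for illustration and are not from the original test data.

```postscript
% UTF-8 decoding, as cvs_utf8 does it: the bytes E2 82 AC become U+20AC.
16#E2 2#11100000 sub                      % strip 3-byte lead prefix: k = 2
2#1000000 mul 16#82 2#10000000 sub add    % shift left 6 bits, add 2: k = 130
2#1000000 mul 16#AC 2#10000000 sub add    % shift again, add 16#2C: k = 16#20AC
16 (......) cvrs print (\n) print         % prints 20AC

% UTF-16 encoding, as cvi2_uni does it: U+1F600 becomes a surrogate pair.
16#1F600 16#10000 sub                     % 20-bit offset, 16#F600
dup 16#400 idiv 16#D800 add               % upper 10 bits: leading surrogate 16#D83D
exch 16#400 mod 16#DC00 add               % lower 10 bits: trailing surrogate 16#DE00
exch 16 (....) cvrs print ( ) print       % prints D83D
16 (....) cvrs print (\n) print           % prints DE00
```

cvi2_uni then splits each 16-bit code unit into its high and low bytes, so U+1F600 ends up as the four bytes D8 3D DE 00.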

Here is some self-test code, to show you the conversion in use. (I excerpted this test code from another module, which explains the curious true {} if statement in which the test is wrapped, and the quit at the end of the test.)

% cvs_utf8_test.ps
% by Jim DeLaHunt, 2020. Contributed to the public domain.
%
true {
	% helper functions for test fixtures
	/eq_array {			% [array1] [array2] eq_array bool
		1 index length 1 index length eq
		{
			true		% [array1] [array2] eq_so_far
			0 1 3 index length 1 sub {
									% [array1] [array2] eq_so_far i
				3 index 1 index get % [array1] [array2] eq_so_far i a1[i]
				3 index 3 -1 roll get 	% [array1] [array2] eq_so_far a1[i] a2[i]
				eq and				% [array1] [array2] eq_so_far'
			} for
		}{
			false
		} ifelse
		3 1 roll pop pop	% eq_result
	} def
	/test_cvi2_uni {	% [ [[ref] [test]]... ] test_cvi2_uni -
		[ exch aload pop 		% ...[ ref test 
		{
			[ 1 index aload pop	% ...[ ref test [ uni any
			cvi2_uni			% ...[ ref test [ b0 b1 any -or- b0 b1 b2 b3 any
			] dup 3 index   	% ...[ ref test [ b0 b1 (b2 b3)? any ] [dup] ref
			eq_array			% ...[ ref test [ b0 b1 (b2 b3)? any ] bool
		} stopped {false} if
	} def
	/test_cvs_utf8 {	% [ [(ref) (test)]... ] test_cvs_utf8 -
		[ exch aload pop 		% ...[ (ref) (test) 
		{
			dup	cvs_utf8		% ...[ (ref) (test) <FEFF result>
			dup 3 index eq		% ...[ (ref) (test) <FEFF result> bool
		} stopped {false} if
	} def
	/report { 	% bool report - (printing progress or error indicator)
		{ 
			(.) print
		}{ 
			(\nTest failed! Stack, to first mark, is test:\n) print
			pstack
			(\n) print
		} ifelse
		cleartomark  % clear off fixture and all cruft
	} def

	%=====
	(cvi2_uni tests: ) print
	[ % test data pairs of: [[b0 b1 any] [uni any]]
		% or: [[b0 b1 b2 b3 any] [uni any]]
		[[16#00 16#00 []   ] [16#0000 []   ]]
		[[16#00 16#FF -1   ] [16#00FF -1   ]]
		[[16#01 16#00 ()   ] [16#0100 ()   ]]
		[[16#FF 16#FF (any)] [16#FFFF (any)]]
		[[16#D8 16#00 16#DC 16#00 123] [16#010000 123]]
		[[16#D8 16#00 16#DF 16#FF 1.2] [16#0103FF 1.2]]
		[[16#D8 16#01 16#DC 16#00 1.3] [16#010400 1.3]]
		[[16#DB 16#FF 16#DF 16#FF 321] [16#10FFFF 321]]
	] {
		test_cvi2_uni report
	} forall
	(\n) print

	%=====
	(cvs_utf8 tests: ) print
	[ % test data pairs of: [<utf16 string> <utf8 string>]
	    [<FEFF>   ()  ]
		[<FEFF0000> <00>]	[<FEFF007F> <7F>]
		[<FEFF0020> ( ) ] 	[<FEFF00FC> <C3BC> ]
		[<FEFF0080> <C280>]   [<FEFF00FF> <C3BF>]
		[<FEFF0100> <C480>]   [<FEFF07FF> <DFBF>]
		[<FEFF0800> <E0A080>] [<FEFFFFFF> <EFBFBF>]
		[<FEFFD800DC00> <F0908080>] [<FEFFDBFFDFFF> <F48FBFBF>]
	] {
		test_cvs_utf8 report
	} forall
	(\n) print
	
	quit  % skip the rest of the code below if testing conversion above
} if

When the tests run successfully, the output is like this:

% gs cvs_utf8_test.ps 
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
cvi2_uni tests: ........
cvs_utf8 tests: .............

Why was this fun and nostalgic? Thirty years ago, I started at the small software startup, Adobe Systems Inc., as a developer support engineer. My duty was to explain the PostScript language to software developers, and later to write the PostScript language part of Adobe’s PostScript printer driver for Windows. Writing in the stack-oriented, postfix language was a big part of my early development career. But then I moved on to Unicode and internationalisation work, and stopped writing in the language. I got to be very familiar with UTF-8 and UTF-16, during the years in which the Unicode standard was being created and the software industry gradually embraced it. This was my first substantial PostScript language code in many years. The beauty of this project was that it combined these two great roads of my software development journey.

The PostScript market is no longer the hotbed of innovation and profit which it once was. However, PostScript interpreters are still widely used, and PostScript language code still runs through various buried software workflows. I have even earned consulting income in the past few years as one of the remaining experts on the PostScript language. So maybe this code will be useful to someone.

I grant this code to the public domain, dedicated to the many software engineers in the PostScript language ecosystem, at Adobe and elsewhere, who helped me mature as a software engineer.
