How to extract URLs with Apache OpenOffice, from formatted text and HTML tables

Posted by Jim DeLaHunt on 31 Mar 2014 | Tagged as: robobait, software engineering

I use and value a good spreadsheet application the way chefs use and value good knives. I have countless occasions to do ad-hoc data processing and conversion, and I tend to turn to spreadsheets even more often than I turn to a good text editor. I know a lot of ways to get the job done with spreadsheets. But recently I learned a new trick. I’m delighted to share it with you here.

The situation: you have an HTML document, with a list of linked text. Imagine a list of projects, each with a link to a project URL (the names aren’t meaningful).

The task is to convert this list of formatted links into a table, with the project name in column A and the URL in column B. The trick is to use an OpenOffice macro, which exposes the URL (and other facets of the formatted text) as spreadsheet functions. Continue Reading »
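The full macro is behind the link. As a rough sketch of the same extraction done outside OpenOffice entirely, here is how one might pull (name, URL) pairs out of such a list with Python’s standard library; the sample HTML, project names, and URLs are invented for the illustration:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (link text, URL) pairs from anchor tags, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []      # completed (text, url) pairs
        self._href = None    # href of the <a> element we are inside, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

sample = """<ul>
  <li><a href="http://example.com/alpha">Project Alpha</a></li>
  <li><a href="http://example.com/beta">Project Beta</a></li>
</ul>"""

parser = LinkExtractor()
parser.feed(sample)
for name, url in parser.links:
    print(name + "\t" + url)    # name for column A, URL for column B

Each printed line is tab-separated, so the output pastes straight into a spreadsheet with names in column A and URLs in column B.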

Open Data Day 2014, and a dataset dataset for Vancouver

Posted by Jim DeLaHunt on 28 Feb 2014 | Tagged as: Vancouver, government, meetings and conferences, web technology

Again this year, I joined Vancouver open data enthusiasts in celebrating Open Data Day last Saturday. Despite limited time and schedule conflicts, I was able to make progress on an interesting project: a “dataset dataset” for the City of Vancouver’s Open Data Catalogue.

Continue Reading »

A good-practice list of i18n API functionality

Posted by Jim DeLaHunt on 30 Nov 2013 | Tagged as: culture, i18n, meetings and conferences, multilingual, software engineering, web technology

Think of the applications programming interface (API) for an application environment: an operating system, a markup language, a language’s standard library. What internationalisation (i18n) functionality would you expect to see in such an API? There are some obvious candidates: a text string substitution-from-resources capability like gettext(). A mechanism for formatting dates, numbers, and currencies in culturally appropriate ways. Data formats that can handle text in a variety of languages. Some way to determine what cultural conventions and language the user prefers. There is clearly a whole list one could make.
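To make two of those candidates concrete, here is a minimal sketch using only Python’s standard library; the 'myapp' message domain and its 'locale' catalogue directory are invented for the example, and fallback=True simply returns the original string when no catalogue is installed:

import datetime
import gettext
import locale

# gettext-style string substitution from message catalogues.
t = gettext.translation('myapp', localedir='locale',
                        languages=['fr'], fallback=True)
_ = t.gettext
print(_("Hello, world"))      # French if a catalogue exists, else English

# Culturally appropriate number and date formatting.
locale.setlocale(locale.LC_ALL, '')     # adopt the user's preferred locale
print(locale.format_string("%.2f", 1234567.891, grouping=True))
print(datetime.date.today().strftime('%x'))  # the locale's date notation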

Wouldn’t it be interesting, and useful, to have such a list?  Probably many organisations have made such lists in the past. Who has made such a list? Are they willing to share it with the internationalisation and localisation community? Is there value in developing a “good practices” statement with such a list?  And, most importantly, who would like to read such a list? How would it help them? In what way would such a list add value? Continue Reading »

Trip report: IUC37

Posted by Jim DeLaHunt on 31 Oct 2013 | Tagged as: Joomla, Unicode, drupal, meetings and conferences

Delightful! Last week I came home from the gathering of my tribe, the 37th Internationalization and Unicode Conference. My tutorial on Building multilingual websites in Drupal 7 and Joomla! 3 again went well. And I found inspiration, new knowledge, and old friends there.

Those of you looking for my slides and handouts will find them at the preceding link. You are welcome to share them, per their Creative Commons license. I’d appreciate credit when you share them, and I’d appreciate your feedback in this blog’s comments. Continue Reading »

“Building multilingual websites in Drupal 7 and Joomla 3” (IUC37 tutorial)

Posted by Jim DeLaHunt on 30 Sep 2013 | Tagged as: Joomla, Unicode, drupal, meetings and conferences

I can’t believe I didn’t announce this before now. I’m delighted to be asked, once again, to present a tutorial on Building multilingual websites in Drupal 7 and Joomla! 3, at the 37th Internationalization and Unicode Conference (IUC37), this October in Santa Clara, California, USA.

This is my abstract, from the Unicode conference program for my talk: Continue Reading »

I still do

Posted by Jim DeLaHunt on 22 Aug 2013 | Tagged as: marriage equality, personal

On this day back in 1998, my uncle Spencer Boise asked me, “Jim and Ducky, do you both recognize the rights and responsibilities inherent in the marriage contract?” and I replied, “I do. I have come here freely to take this woman to be my wife. I promise to love her, comfort her, honor her, and keep her, above all others.”
Continue Reading »

Top Posts: StackOverflow “How do I get SQLAlchemy to correctly insert a unicode ellipsis into a mySQL table?”

Posted by Jim DeLaHunt on 31 Jul 2013 | Tagged as: Unicode, robobait, software engineering

I post on various forums around the net, and a few of my posts there get some very gratifying kudos. I’ve been a diligent contributor to StackOverflow, the Q-and-A site for software developers. I’m in the top 15% of contributors overall, and one of the top 25 answerers of Unicode-related questions. Here’s my second best-voted answer in StackOverflow so far.

The question, How do I get SQLAlchemy to correctly insert a unicode ellipsis into a mySQL table?, was asked by user kvedananda in February 2012. In abbreviated form, it was:

Continue Reading »
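Without giving away the whole answer behind the link: the usual culprit in this class of problem is a MySQL connection that negotiates a narrow character set such as latin-1, which has no ellipsis character (U+2026). Here is a minimal sketch of the general remedy, declaring the connection character set explicitly; the credentials, database, and table are invented, and this is an illustration of the technique rather than my posted answer verbatim:

from sqlalchemy import create_engine, text

# '?charset=utf8' tells the MySQL driver to speak UTF-8 on the wire,
# so a character like U+2026 survives the round trip. The credentials,
# database name, and table are invented for the illustration.
engine = create_engine("mysql://user:password@localhost/testdb?charset=utf8")

with engine.begin() as conn:
    conn.execute(text("INSERT INTO notes (body) VALUES (:body)"),
                 {"body": "to be continued\u2026"})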

Top Posts: StackOverflow “Django headache with simple non-ascii string”

Posted by Jim DeLaHunt on 31 May 2013 | Tagged as: Python, Unicode, software engineering

I post on various forums around the net, and a few of my posts there get some very gratifying kudos. I’ve been a diligent contributor to StackOverflow, the Q-and-A site for software developers. I’m in the top 15% of contributors overall, and one of the top 25 answerers of Unicode-related questions. Here’s my top-voted answer in StackOverflow so far.

The question, Django headache with simple non-ascii string, was asked by user Ezequiel in January 2010. In abbreviated form, it was:

Continue Reading »
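Again, the full exchange is behind the link, but the heart of this class of problem, in the Python 2 of that era, is the difference between byte strings and unicode strings. A minimal sketch of the distinction (my illustration, with an invented example string, not the answer verbatim; Python 2 syntax):

# -*- coding: utf-8 -*-
# Python 2: a plain str literal holds bytes; a u'' literal holds characters.
# A non-ascii name in a plain str is really UTF-8 bytes, and anything that
# treats it as text (templates, len(), slicing) can misbehave.
byte_string = 'São Paulo'     # str: 10 UTF-8 bytes
text_string = u'São Paulo'    # unicode: 9 characters

print len(byte_string)    # 10 -- counting bytes
print len(text_string)    # 9  -- counting characters

# The usual fix: declare the source encoding (first line above) and use
# unicode literals for anything human-readable.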

Canada Post and USPS rate cards, 2013 rates

Posted by Jim DeLaHunt on 30 Apr 2013 | Tagged as: Canada, USA, robobait

Canada Post and the US Postal Service raised their postage rates again in January 2013. I was busy then, but I’ve grabbed a moment and updated my handy Canada Post and USPS postage rate quick reference card. The Canada Post rate increases were effective January 14, 2013, and the USPS increases were effective January 27.

My Canada Post and USPS Postage Rates project page,  http://jdlh.com/en/pr/postage_card.html, has links to download the latest charts as I update them.  The spreadsheet source file for the charts is also there. Both are licensed CC-BY-SA, so please feel free to re-use and modify them (as long as you attribute my work and share your product as freely).

Heads up: Canada Post has already received approval for first-class mail rate increases in 2014. The 2013 increases of both agencies came almost exactly one year after their 2012 increases, so I won’t be surprised if this becomes an annual event. The good news is that both Canada Post and USPS offer “perpetual” or “forever” stamps, which are worth first-class basic domestic postage, whatever the price may increase to.

Enjoy!

For OpenDataDay 2013, a language census of Vancouver’s datasets

Posted by Jim DeLaHunt on 28 Feb 2013 | Tagged as: Vancouver, culture, meetings and conferences, multilingual

OpenDataDay 2013 was celebrated last Saturday, February 23rd 2013, at over 100 hackathons and work days in 38 countries around the world. The City of Vancouver hosted a hackathon at Vancouver City Hall, and I joined in. My project was a language census of Vancouver’s open data datasets. Here’s what I set out to do.

Open Data is the idea that governments (and other bodies) publish data about their activity and holdings in machine-readable form, with loose terms of use, for citizens and other parties to use, and build upon, and add value to. Open Data Day rallies citizens and governments around the world “to write applications, liberate data, create visualizations and publish analyses using open public data to show support for and encourage the adoption of open data policies by the world’s local, regional and national governments”.  I’m proud that local Vancouver open data leader David Eaves was one of the founders of Open Data Day. The UK-based Open Knowledge Foundation is part of the organisational foundation for OpenDataDay, but much of the energy is from local groups and volunteers (for example, the OKF in Japan).

Vancouver’s Open Data Day was a full house of some 80 grassroots activists, with attendance throughout the day by city staff, including Linda, the caretaker of the Vancouver Open Data portal and the voice of @VanOpenData on Twitter. I missed the “Speed Data-ing” session in the morning, where participants could circulate among city providers of datasets to talk directly about what was available and what each side wanted. I’m told that the Honourable Tony Clement, the national minister now responsible for the Government of Canada’s Open Data portal data.gc.ca (but who also, in 2010, helped turn off the spigot of open data at its source by killing the long-form census), was also there. I saw Councilmember Andrea Reimer there for the afternoon working session, listening to the day-end wrap-ups and tweeting summaries of each project. I won’t try to describe all the projects. Take a look at the Vancouver Open Data Day 2013 wiki page, or the tweets tagged #vodhd13 (for Vancouver) and #OpenData (worldwide).

I gave myself two goals for the hackathon. First, provide expertise and increased visibility for internationalisation and multi-lingual issues among the participants. Second, work on a modest project which would move internationalisation of local data forward.

My vision is that apps based on Vancouver open data should be localised into all the languages in which Vancouver residents want them. Over 30% of the people in the Vancouver region speak a language other than English at home, says Stats Canada. That is over 700,000 of the 2.9 million people in the area. Now of course localising those apps and web sites is a task for the developer. My discipline, internationalisation (i18n), is a set of design and implementation techniques to make it cheaper and easier to localise an app or web site. At some point, an app or web site presents data sourced from an open data dataset. In order for the complete user experience to be localised, the dataset also needs to be localised. A challenge of enabling localisation of open data-sourced apps is to set up formats, social structures, and incentive structures which make it easier for datasets to get localised into the languages which matter to the end users.

To that end, I picked a modest project for the day. It was to make a language census of the City of Vancouver’s Open Data datasets. The link is to a project page I started on the Open Data Day wiki. I intended it to be a simple table describing the Vancouver datasets, but it ended up with a good deal of explanation in the front matter. I won’t repeat all that, but just give a couple of examples.

The 3-1-1 Contact Centre Interactions dataset (CSV format) has rows like (I’ve simplified):

Category1     , Category2     , Category3          , Mode    , 2012-11, 2012-12, 2013-1
CSG - Licenses, Animal Control, Dead Animals Pickup, Voice In,      22,      13,     13

While the Animal Control Inventory Deceased Animals dataset (CSV format) has rows like (again, simplified):

ID,  Date      ,CatOther   , Description              ,Sex,ACO            , Bag
7126,2013-02-23,SDC        , Tan/black medium hair cat,   ,Duty driver- JT, 13-00033
7127,2013-02-23,Dead Budgie,                          ,   ,Duty driver-JT , 13-00034
7128,2013-02-26,Cat        , Black and White          ,F  ,               , 13-00035

Note that most of the fields are simply data: dates, numbers, codes. These do not need to be localised. Some of the fields, like the Category fields in the 311 Interactions, are English-language phrases. But they are pulled from a controlled vocabulary, and so could be translated once into the target language, and would not usually need to be updated when new data is released. In contrast, a few fields in the Animal Control Inventory dataset, e.g. CatOther, Description, and ACO, seem to contain free text in English. Potentially, every new record in the dataset represents a new translation task.
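To make the distinction concrete, here is a small sketch of why it matters for localisation; the French renderings are my own invention, not city data:

# A controlled vocabulary can be translated once, up front; new records
# then localise for free. Free-text fields enjoy no such shortcut.
CATEGORY_FR = {
    "Animal Control": "Contrôle des animaux",
    "Dead Animals Pickup": "Ramassage d'animaux morts",
}

def localise(value, table=CATEGORY_FR):
    # Fall back to the source value when no translation exists yet.
    return table.get(value, value)

print(localise("Animal Control"))             # translated once, reused forever
print(localise("Tan/black medium hair cat"))  # free text falls through untranslated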

The purpose of the language census is to go through the datasets in the Vancouver Open Data catalogue, and the fields for each dataset, and simply identify which fields are data, which are controlled vocabulary, and which are free text.  It’s not a major exercise. It doesn’t involve programming. Yet I believe it’s an important building block towards the vision of localised apps driven by open data.

Incidentally, this exercise inspired me to propose another dataset for the Vancouver catalogue: a dataset listing the datasets. There are 130 datasets in the Vancouver Open Data catalogue, and more are on the way. The only listing of them is an HTML page intended for human consumption. It would be nice to have a machine-readable table in CSV or XML format, describing the names and URLs and formats of the datasets in some structured way.
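Such a table might look something like this; the rows are an invented sketch, with example.org standing in for the real catalogue URLs:

name                    , url                                , format, updated
Animal control inventory, http://example.org/animal_inventory, CSV   , 2013-02-26
3-1-1 contact centre    , http://example.org/311_interactions, CSV   , 2013-01-31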

I’m happy to report success at my first goal, too. Several participants stopped by to talk with me about language support and internationalisation. I’m hopeful that those conversations will help the non-English localisation of the apps, and of the city datasets, happen a little bit sooner.

If you would like to help in the language census, the project page is a wiki, and you are welcome to make constructive edits. See you there! Or, add a comment below.
