Open Data Day 2013 was celebrated last Saturday, February 23rd, 2013, at over 100 hackathons and work days in 38 countries around the world. The City of Vancouver hosted a hackathon at Vancouver City Hall, and I joined in. My project was a language census of Vancouver’s open data datasets. Here’s what I set out to do.

Open Data is the idea that governments (and other bodies) publish data about their activity and holdings in machine-readable form, with loose terms of use, for citizens and other parties to use, build upon, and add value to. Open Data Day rallies citizens and governments around the world “to write applications, liberate data, create visualizations and publish analyses using open public data to show support for and encourage the adoption of open data policies by the world’s local, regional and national governments”. I’m proud that local Vancouver open data leader David Eaves was one of the founders of Open Data Day. The UK-based Open Knowledge Foundation provides part of the organisational foundation for Open Data Day, but much of the energy comes from local groups and volunteers (for example, the OKF in Japan).

Vancouver’s Open Data Day was a full house of some 80 grassroots activists, with attendance throughout the day by city staff, including Linda, the caretaker of the Vancouver Open Data portal and the voice of @VanOpenData on Twitter. I missed the “Speed Data-ing” session in the morning, where participants could circulate among city providers of datasets to talk directly about what was available and what each side wanted. I’m told that federal minister the Honourable Tony Clement was also there; he is now responsible for the Government of Canada’s Open Data portal data.gc.ca, but in 2010 he also helped turn off the spigot of open data at its source by killing the long-form census. I saw Councillor Andrea Reimer there for the afternoon working session, and saw her listening to the day-end wrap-ups, tweeting a summary of each project. I won’t try to describe all the projects. Take a look at the Vancouver Open Data Day 2013 wiki page, or the tweets tagged #vodhd13 (for Vancouver) and #OpenData (worldwide).

I gave myself two goals for the hackathon. First, provide expertise and increased visibility for internationalisation and multi-lingual issues among the participants. Second, work on a modest project which would move internationalisation of local data forward.

My vision is that apps based on Vancouver open data should be localised into all the languages in which Vancouver residents want them. Over 30% of the people in the Vancouver region speak a language other than English at home, says Stats Canada. That is over 700,000 people of the 2.9 million in the area. Now of course localising those apps and web sites is a task for the developer. My discipline, internationalisation (i18n), is a set of design and implementation techniques to make it cheaper and easier to localise an app or web site. At some point, an app or web site presents data sourced from an open data dataset. For the complete user experience to be localised, the dataset also needs to be localised. A challenge of enabling localisation of open data-sourced apps is to set up formats, social structures, and incentive structures which make it easier for datasets to get localised into the languages that matter to the end users.

To that end, I picked a modest project for the day: a language census of the City of Vancouver’s Open Data datasets. The link is to a project page I started on the Open Data Day wiki. I intended it to be a simple table describing the Vancouver datasets, but it ended up with a good deal of explanation in the front matter. I won’t repeat all that here, but just give a couple of examples.

The 3-1-1 Contact Centre Interactions dataset (CSV format) has rows like (I’ve simplified):

Category1     , Category2     , Category3          , Mode    , 2012-11, 2012-12, 2013-1
CSG - Licenses, Animal Control, Dead Animals Pickup, Voice In,      22,      13,     13

Meanwhile, the Animal Control Inventory Deceased Animals dataset (CSV format) has rows like (again, simplified):

ID,  Date      ,CatOther   , Description              ,Sex,ACO            , Bag
7126,2013-02-23,SDC        , Tan/black medium hair cat,   ,Duty driver- JT, 13-00033
7127,2013-02-23,Dead Budgie,                          ,   ,Duty driver-JT , 13-00034
7128,2013-02-26,Cat        , Black and White          ,F  ,               , 13-00035

Note that most of the fields are simply data: dates, numbers, codes. These do not need to be localised. Some of the fields, like the Category fields in the 3-1-1 Interactions, are English-language phrases. But they are pulled from a controlled vocabulary, and so could be translated once into the target language, and would not usually need to be updated when new data is released. In contrast, a few fields in the Animal Control Inventory dataset, e.g. CatOther, Description, and ACO, seem to contain free text in English. Potentially, every new record in the dataset represents a new translation task.
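To see why that distinction matters, here is a minimal, hypothetical sketch (in Python) of how an app might localise rows from the 3-1-1 Interactions CSV into French. The file name, the field names I pick out, and the tiny translation table are all mine, just to illustrate the point: a controlled vocabulary only needs to be translated once, and anything not yet in the table simply falls back to English.

# A minimal, hypothetical sketch: localise the controlled-vocabulary fields of a
# 3-1-1 Interactions CSV using a one-time translation table. The free-text fields
# would need per-record translation, so they are left untouched here.
import csv

# Translated once (by staff or a volunteer community) and reused for every new row.
CATEGORY_FR = {
    "Animal Control": "Contrôle des animaux",
    "Dead Animals Pickup": "Ramassage d'animaux morts",
}

def localise_row(row, translations):
    # Dates and counts pass through unchanged; controlled-vocabulary phrases are
    # looked up, falling back to the English original if no translation exists yet.
    localised = dict(row)
    for field in ("Category2", "Category3"):
        value = row.get(field, "")
        localised[field] = translations.get(value, value)
    return localised

# "311_interactions.csv" is a stand-in name for a local copy of the dataset.
with open("311_interactions.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(localise_row(row, CATEGORY_FR))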

The purpose of the language census is to go through the datasets in the Vancouver Open Data catalogue, and the fields for each dataset, and simply identify which fields are data, which are controlled vocabulary, and which are free text.  It’s not a major exercise. It doesn’t involve programming. Yet I believe it’s an important building block towards the vision of localised apps driven by open data.
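To make that concrete, a census entry for the Animal Control Inventory dataset might look something like this; the classification labels are my own guesses from the sample rows above:

Field       , Classification
ID          , data (number)
Date        , data (date)
CatOther    , free text
Description , free text
Sex         , data (code)
ACO         , free text
Bag         , data (code)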

Incidentally, this exercise inspired me to propose another dataset for the Vancouver catalogue: a dataset listing the datasets. There are 130 datasets in the Vancouver Open Data catalogue, and more are on the way. The only listing of them is an HTML page intended for human consumption. It would be nice to have a machine-readable table in CSV or XML format, describing the names and URLs and formats of the datasets in some structured way.
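Something along these lines would do; the layout is just my guess at a useful shape, with the two datasets above as sample rows and placeholders where the real URLs would go:

Name                                      , Formats, URL
3-1-1 Contact Centre Interactions         , CSV    , <dataset page URL>
Animal Control Inventory Deceased Animals , CSV    , <dataset page URL>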

I’m happy to report success at my first goal, too. Several participants stopped by to talk with me about language support and internationalisation. I’m hopeful those conversations will help non-English localisation of the apps, and of the city’s datasets, happen a little bit sooner.

If you would like to help with the language census, the project page is a wiki, and you are welcome to make constructive edits. See you there! Or add a comment below.