Again this year, I joined Vancouver open data enthusiasts in celebrating Open Data Day last Saturday. Despite limited time and schedule conflicts, I was able to make progress on an interesting project: a “dataset dataset” for the City of Vancouver’s Open Data Catalogue.

Motivation

The City of Vancouver publishes about 173 datasets (as of February 2014). The catalogue of these datasets is published on a web page at http://data.vancouver.ca/datacatalogue/index.htm . The format of this catalogue is an HTML table. This is useful for humans, who can read it and click on the links. It is not very useful for machine reading, automated searching, or interoperability. It would be nice to have a dataset that contains the Open Data Catalogue itself, in some machine-readable format. In other words, a dataset containing a dataset catalogue, or “dataset dataset”.

“Dataset dataset” is a confusing, self-referential phrase, but don’t be thrown off by that. It is meaningful and useful. For instance, my Vancouver Open Data language census project examines the contents of each dataset and generates some metadata about each dataset. Having the list of datasets in a spreadsheet to work from would save a lot of cutting and pasting of dataset names. One person at Open Data Day proposed a search engine for open data datasets; a machine-readable data catalogue would be much more convenient for this purpose. Even something as simple as counting the number of datasets is faster and more reliable to do from a dataset than from a human-readable page.

Some city staff responsible for Vancouver Open Data were kind enough to attend Open Data Day. I buttonholed them and asked them in person for a machine-readable form of the Open Data Catalogue. Later, the topic got some attention on the Vancouver Data mailing list.

I also started writing down my ideas and proposals on a page, Vancouver Open Data Catalogue dataset, in the Open Data Day wiki. (You will notice similarities to this blog. I reused a lot of my words in both places.)

Data catalogue datasets elsewhere

Other jurisdictions have published their data catalogues in dataset form. The Province of British Columbia has a DataBC Data Catalogue Content in delimited-text form. The Government of Canada has a Data.gc.ca Portal Catalogue, as a compressed archive of JSON lines (213MB compressed, that’s big!).

It is an open question what the format of such a dataset should be. Different jurisdictions are using different formats. A strong contender is the Common Core Metadata Schema, published by Project Open Data. They advocate a JSON representation of a standard set of attributes to describe each dataset. It builds on the Data Catalog Vocabulary (DCAT) specification by the W3C, which in turn draws on the Dublin Core Metadata and other sources. Project Open Data is focussed on opening US Government data, which means that potentially many influential datasets will follow their specs. They also have a delicious array of tools to help with formatting data catalogues.

Project Open Data hosts a Data Catalog JSON Validator (which as of Feb 2014 didn’t work for me). They also link to David Caraway’s validator, which yields a lot of errors for a Vancouver Open Data Catalogue where several required fields have null values.

JSON Lint, the JSON Validator, can validate a data catalogue for JSON syntax correctness alone, without regard to the Common Core Metadata Schema. It also nicely formats and indents a JSON file.
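The same kind of syntax-only check can be done offline. Here is a minimal sketch using Python’s standard json module. The function name check_json_syntax is my own invention for illustration, not part of JSON Lint or of any Project Open Data tool, and like JSON Lint it knows nothing about the Common Core Metadata Schema.

```python
import json


def check_json_syntax(text):
    """Return (True, pretty_text) for valid JSON, else (False, error_message)."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        # lineno, colno and msg are attributes of json.JSONDecodeError
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    # Re-serialize with indentation, keeping non-ASCII characters readable
    return True, json.dumps(parsed, indent=4, ensure_ascii=False)


# A one-record catalogue fragment, in the spirit of the excerpt in this post
ok, result = check_json_syntax(
    '[{"title": "3-1-1 case location details", "description": null}]'
)
print(result)

# A deliberately broken fragment: the closing brace is missing
bad_ok, bad_result = check_json_syntax('[{"title": "missing brace"]')
print(bad_ok, bad_result)
```

A clean result here only means the file is well-formed JSON. Whether the required fields are present and filled in is a separate question.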

Hacking an informal first draft of a dataset dataset

I’m confident that the City of Vancouver will eventually get around to publishing their Open Data Catalogue in dataset form. They will be able to do it right. In particular, the Common Core Metadata Schema calls for several required fields, like contact point and modification date, which aren’t mentioned in the present HTML page catalogue. It’s better for the custodians of the data to author that.

But I certainly did want a first draft of a dataset, as a way of helping me populate the table for the Vancouver Open Data language census. And an HTML table is a scrapeable data format. So I hacked. It’s straightforward to paste the HTML table into an OpenOffice.org Calc spreadsheet. This preserves the table structure, but it yields cells which contain both link text and link target URLs. For a catalogue dataset, we want those two strings separately.
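For anyone who would rather script this step than use a spreadsheet, here is a rough sketch of pulling link text and link targets apart with Python’s standard html.parser module. This is not what I did on the day. The class name LinkCellParser and the sample markup are made up for illustration, and the real catalogue page’s table layout may well differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCellParser(HTMLParser):
    """Collect table rows; each cell keeps its text and its (text, href) links."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # the cell being read, or None
        self._link = None  # the <a> element being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = {"text": "", "links": []}
        elif tag == "a" and self._cell is not None:
            self._link = {"href": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data
        if self._link is not None:
            self._link["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._link is not None:
            if self._cell is not None:
                pair = (self._link["text"].strip(), self._link["href"])
                self._cell["links"].append(pair)
            self._link = None
        elif tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup, standing in for one row of the catalogue table
sample = """
<table>
  <tr><th>Dataset</th><th>CSV</th></tr>
  <tr>
    <td><a href="311caseLocationDetails.htm">3-1-1 case location details</a></td>
    <td><a href="311caseLocationDetails.htm#details">CSV</a></td>
  </tr>
</table>
"""

parser = LinkCellParser()
parser.feed(sample)

# Resolve relative link targets against the catalogue page's address
base = "http://data.vancouver.ca/datacatalogue/index.htm"
title, href = parser.rows[1][0]["links"][0]
landing_page = urljoin(base, href)
print(title, landing_page)
```

The sketch assumes tidy markup with explicit closing tags. A real page with unclosed cells or nested tables would need more care, or a more forgiving HTML library.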

I was able to find some fine OpenOffice.org macro code which delivers link target URLs and other useful parts of the spreadsheet cell to spreadsheet formulæ. I’ll save those details for another blog (if you are impatient, I also posted them on the Open Data Day wiki). These formulæ let me author a second spreadsheet, which assembled a new structure from the link text and targets of the HTML table.

The new structure conforms pretty well to the Common Core Metadata Schema, except that some required fields have null values. It is now posted on the Vancouver Open Data Catalogue dataset wiki page, where others might improve it. Here’s an excerpt:

[
    {
        "title": "3-1-1 case location details",
        "description": null,
        "keyword": null,
        "modified": null,
        "publisher": null,
        "contactPoint": null,
        "mbox": null,
        "identifier": null,
        "accessLevel": "public",
        "language": "en-CA",
        "landingPage": "http://data.vancouver.ca/datacatalogue/311caseLocationDetails.htm",
        "distribution": [
            {
                "accessURL": "http://data.vancouver.ca/datacatalogue/311caseLocationDetails.htm#details",
                "format": "text/csv"
            },
            {
                "accessURL": "http://data.vancouver.ca/datacatalogue/311caseLocationDetails.htm#details",
                "format": "application/vnd.ms-excel"
            }
        ]
    }
]
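To make the shape of these records concrete, here is a minimal Python sketch that builds one record like the excerpt above and lists which fields are still null. The names REQUIRED_FIELDS, make_record and missing_required are hypothetical. The list of required fields is my assumption based on the null-valued fields in the excerpt, not a quotation from the schema, and the "#details" suffix simply copies the pattern of this one excerpt.

```python
import json

# Assumption: the fields this post's excerpt treats as required
REQUIRED_FIELDS = [
    "title", "description", "keyword", "modified", "publisher",
    "contactPoint", "mbox", "identifier", "accessLevel",
]


def make_record(title, landing_page, formats):
    """Build one catalogue record; anything unknown stays None (JSON null)."""
    record = {field: None for field in REQUIRED_FIELDS}
    record.update({
        "title": title,
        "accessLevel": "public",
        "language": "en-CA",
        "landingPage": landing_page,
        # Assumption: every format is reached via the page's #details anchor
        "distribution": [
            {"accessURL": landing_page + "#details", "format": fmt}
            for fmt in formats
        ],
    })
    return record


def missing_required(record):
    """Return the required fields that are absent or still null."""
    return [field for field in REQUIRED_FIELDS if record.get(field) is None]


record = make_record(
    "3-1-1 case location details",
    "http://data.vancouver.ca/datacatalogue/311caseLocationDetails.htm",
    ["text/csv", "application/vnd.ms-excel"],
)
print(json.dumps([record], indent=4))
print("Still null:", missing_required(record))
```

The list of null fields is, in effect, a to-do list for the data’s custodians: the HTML page can’t supply those values.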

It’s straightforward to modify the same OpenOffice.org spreadsheet to generate a wikitext table for the Vancouver Open Data language census page.
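The same transformation could be scripted. Here is a small Python sketch that turns rows into a wikitext table. It assumes the wiki uses MediaWiki-style table markup, and the function name wikitext_table and the column headings are hypothetical, not the actual columns of the language census page.

```python
def wikitext_table(headers, rows):
    """Render a header row and data rows as a MediaWiki-style table."""
    lines = ['{| class="wikitable"']
    lines.append("! " + " !! ".join(headers))
    for row in rows:
        lines.append("|-")  # row separator
        lines.append("| " + " || ".join(row))
    lines.append("|}")
    return "\n".join(lines)


# One row: an external link in [URL text] form, plus a list of formats
table = wikitext_table(
    ["Dataset", "Formats"],
    [[
        "[http://data.vancouver.ca/datacatalogue/311caseLocationDetails.htm"
        " 3-1-1 case location details]",
        "CSV, XLS",
    ]],
)
print(table)
```

Cell text that contains a pipe character would need escaping before it could go through this function.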

80% of success is showing up

One-day project sessions like Open Data Day can be easy to blow off. This year, I had no specific project in mind beyond continuing last year’s project, which was more dutiful than inspiring. Because of other commitments, I only had a couple of hours available. The weather was nasty (by Vancouver standards). Would it really be worthwhile? But I showed up. And the result, although modest, is worthwhile. Now, it’s in a relevant wiki, where others can build on it.

Not to mention that, at the same Open Data Day, I got to meet the Digital Curator of Vancouver City Archives, which is a fascinating connection to make. And I got to hear from the author of some fantastic analyses of data which illuminate Vancouver real estate. And I could help others make connections.

As the famous philosopher Woody Allen said, “80% of success is showing up”.

[Update 2014-03-31: correct a spelling error.]