software engineering

Archived Posts from this Category

Ads Factory “GoogleX, GoogleY” means (lat, long) not (horizontal, vertical)

Posted by Jim DeLaHunt on 30 Jun 2011 | Tagged as: CMS, Joomla, robobait, software engineering, web technology

I want to pass along a tip about confusing field names used in the Ads Factory component for Joomla for geographic data.  I encountered this while customising this component for a client. At first I thought it was a bug, but now I think it’s just an odd naming convention.

Ads Factory, by Romanian developers The Factory, is a commercial component for Joomla 1.5 which lets you add classified ads to your Joomla site. (My client had me working with version 1.x on Joomla 1.5, but I see there is also a version 2.1 of Ads Factory which is Joomla 1.6 native.) There are quite a few places where Ads Factory includes geographic information: each user record can record a latitude and longitude for that user; each ad can record a latitude and longitude for the advertised merchandise; and there is a way to make a “radius search”, i.e. find all ads within a given distance of a user-specified location.

These latitude and longitude values are stored in database fields with name suffixes “X” and “Y”. The user’s latitude and longitude are stored in fields “GoogleX” and “GoogleY” of the Ads Factory user table. Similarly, but not completely consistently, the ad’s latitude and longitude are stored in fields “MapX” and “MapY” of the Ads Factory ads table. The confusion comes in understanding which field stores the latitude, and which stores the longitude.

Latitude is, of course, the signed number of degrees north of the equator of a point on the earth’s surface. It ranges from +90.0 (the North Pole) to 0.0 (the Equator) to -90.0 (the South Pole). Thus, it’s a vertical coordinate. Longitude is the signed number of degrees east of the 0° meridian (roughly Greenwich, England). It ranges from +180.0 to -180.0. My part of North America is 122-123° west of Greenwich, so we have longitudes of -123.0 to -122.0 or so. It’s a horizontal coordinate. This is a well-established convention in many mapping standards.

Tidy Cartesian mathematicians like me use the convention of (X,Y) coordinates, where X is the horizontal coordinate and Y is the vertical coordinate. This is a well-established convention in geometry and graphics (though there are some exceptions).

My first interpretation of Ads Factory field names like  “GoogleX” and “GoogleY” was to interpret them according to the Cartesian convention: X is horizontal, and so stores longitude, while Y is vertical, and so stores latitude. Thus (MapX, MapY) would be (longitude, latitude), the opposite of what one expects from mapping. Odd. I was surprised to find some parts of the code storing latitude in X (the horizontal coordinate!) and longitude in Y (the vertical!), which was surely a bug. I was horrified when it appeared that every part of this code had the same bug!

Then I understood the convention. Ads Factory’s developers appear to have used the (X, Y) convention to indicate just the order of the coordinates, but not their Cartesian meaning. (MapX, MapY) means (latitude, longitude), as is conventional in mapping. X is the vertical coordinate, Y is the horizontal coordinate, in the Ads Factory context. If you remember that X means “first”, not horizontal, and Y means “second”, not vertical, the Ads Factory field names are self-consistent, and the code uses them correctly.
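One way to keep this straight in custom code is to translate the field names once, at the boundary, and use explicit names everywhere else. Here is a minimal sketch in Python. (Ads Factory itself is PHP; the `LatLong` type, the `ad_location` function, and the sample row are hypothetical names of mine for illustration, not part of Ads Factory.)

```python
from typing import Mapping, NamedTuple


class LatLong(NamedTuple):
    latitude: float   # degrees north of the equator, -90.0 to +90.0
    longitude: float  # degrees east of Greenwich, -180.0 to +180.0


def ad_location(row: Mapping[str, float]) -> LatLong:
    """Read an ad row using the convention X = first = latitude, Y = second = longitude."""
    lat, lon = float(row["MapX"]), float(row["MapY"])
    # A latitude outside +/-90 is a strong hint that the two fields were swapped.
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"MapX={lat} is not a valid latitude; are the fields swapped?")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"MapY={lon} is not a valid longitude")
    return LatLong(lat, lon)


# Hypothetical row for a point in Vancouver, Canada.
print(ad_location({"MapX": 49.28, "MapY": -123.12}))
```

The range check only catches a swap when the longitude is larger than 90 degrees in magnitude, as it is for my part of North America; it is a sanity check, not a proof.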

I haven’t seen any Ads Factory documentation which explains this, so I hope this note will help some of you Ads Factory enhancers who are using these fields.

Postscript: what did my client ask me to do with Ads Factory for their site? Modify the radius search to search around the user’s latitude and longitude, instead of a location the user enters. Also, sort the keyword and category search results by distance from the user. Quite straightforward to do, though it requires customisations to the Ads Factory code that have to be redone every time one upgrades the Ads Factory component.
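For readers curious what such a customisation involves, the core of it can be sketched in a few lines of Python. This is not the Ads Factory code (which is PHP) nor necessarily how I did it for the client; it is an illustration using the haversine formula. The function names, the sample ads, and the mean Earth radius of 6371 km are my assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; an approximation


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points, by the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ads_near_user(user, ads, radius_km):
    """Return the ads within radius_km of the user, nearest first.

    Uses the Ads Factory field convention: GoogleX/MapX hold latitude,
    GoogleY/MapY hold longitude.
    """
    def dist(ad):
        return distance_km(user["GoogleX"], user["GoogleY"], ad["MapX"], ad["MapY"])

    return sorted((ad for ad in ads if dist(ad) <= radius_km), key=dist)


# Hypothetical data: a user in Vancouver, ads in Seattle and Victoria.
user = {"GoogleX": 49.28, "GoogleY": -123.12}
ads = [
    {"title": "Bicycle (Seattle)", "MapX": 47.61, "MapY": -122.33},
    {"title": "Kayak (Victoria)", "MapX": 48.43, "MapY": -123.37},
]
for ad in ads_near_user(user, ads, radius_km=250):
    print(ad["title"])
```

In a real component one would push the distance calculation into the database query rather than sorting in application code, but the arithmetic is the same.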

How to resolve EasyEclipse error ‘Eclipse… requires plug-in “system.bundle”‘

Posted by Jim DeLaHunt on 31 Mar 2011 | Tagged as: robobait, software engineering, web technology

I use the EasyEclipse distribution of Eclipse, the free (libre) software development environment. I just figured out how to fix an obscure error message:

Eclipse Web tools editors (2.0.1) requires plug-in "system.bundle"
Eclipse Data Tools (1.5.1) requires plug-in "system.bundle"

When I would start up EasyEclipse (version 1.3.1 for Mac OS X, with Python, C++, Java, PHP and more support added), it would tell me that I had some outdated components, and offer to update them for me.  But when I opened the menu item Help… Software Updates… Manage configuration, I would get the ominous error alert:

“The current configuration contains errors and this operation can have unpredictable results. Do you want to continue? [Cancel] [OK]”.

I wasn’t able to  find documentation about this problem specifically. (My purpose in writing this is to help others benefit from what I learned.)

Continue Reading »

Choosing between UTF-8 and UTF-16: which has the better bytes-per-character ratio?

Posted by Jim DeLaHunt on 31 Dec 2010 | Tagged as: Unicode, i18n, language, software engineering

Software engineers sometimes are called on to specify which encoding a text file format should use.  These days, the top contenders for encoding are UTF-8 and UTF-16, both based on the Unicode Standard. One factor (amongst several, and perhaps not the most compelling) in choosing between them is storage efficiency: the number of bytes per character, or amount of storage per unit of text. If a given text takes a kilobyte of storage in UTF-8 and twice that in UTF-16, that’s a difference, which may be meaningful.

I recently looked for quantitative data about space efficiency of UTF-8 and UTF-16, and couldn’t find very much. Engineering discussions about storage efficiency are better informed by quantitative data than by opinion and supposition. I want to give one morsel of quantitative data more visibility, and clarify this issue. Continue Reading »
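To make “bytes per character” concrete, here is a small Python sketch that measures it for any string. It is only a way to try the comparison on your own text, not the data set discussed in the full post; the function name and the sample strings are mine.

```python
def bytes_per_character(text: str) -> dict:
    """Average encoded bytes per Unicode code point, for UTF-8 and UTF-16."""
    if not text:
        raise ValueError("need at least one character")
    n_chars = len(text)  # Python counts code points, not bytes
    return {
        "utf-8": len(text.encode("utf-8")) / n_chars,
        # "utf-16-le" avoids counting the 2-byte byte order mark
        "utf-16": len(text.encode("utf-16-le")) / n_chars,
    }


samples = {
    "English": "All human beings are born free",
    "Greek": "άνθρωποι",
    "Japanese": "すべての人間は",
}
for language, text in samples.items():
    print(language, bytes_per_character(text))
```

Pure ASCII text comes out at 1 byte per character in UTF-8 and 2 in UTF-16, while kana and Han characters come out at 3 and 2 respectively. Real documents mix these with markup, spaces, and digits, which is why measured data matters.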

11 Django gotchas

Posted by Jim DeLaHunt on 31 Aug 2010 | Tagged as: Python, Unicode, robobait, software engineering, web technology

This post has been a long time in the making. A year ago, I started work on my Twanguages code. This was code to analyse a corpus of Twitter messages, and try to discern patterns about language use, geography, and character encoding.  I decided to use the Django web framework and the Python language for the Twanguages analysis code.  I know Python, but I was learning Django for the first time.

Django is really, really marvellous.  When I tried this expression, and got the Python array of records I was expecting,

q2 = TwUser.objects.annotate(ntweets=Count('twstatus')).filter(ntweets__gt=1)

I wrote in my log, “I think I just fell in love. Power and concision in a tool, awesome.”

But Django gave me fits.  It has its share of quirks to trap the unwary novice. Eventually I began writing notes about “Django gotchas” in my log.  Some of them are Django being difficult, or inadequate. Some are me being a clueless novice, and Django not rescuing me from my folly. But all of them were obstacles.  I share them in the hopes of helping another Django novice.

Here are my Django gotchas.  They are ranked from the most distressing to most benign. They apply to Django 1.1, the current version at the time. (As of August 2010, the current version is 1.2.1.) A couple of gotchas were addressed by Django 1.2, so I moved them down to a section of their own. The rest presumably still apply to Django 1.2, but I haven’t gone back to check.

  1. API fails unhelpfully. I wrote a simple query expression like:
    S2 = models.TwStatus.objects.get( key )

    I got a lot of weird errors, e.g. “ValueError: too many values to unpack” (where key is a string) and “TypeError: ‘long’ object is not iterable” (where key is a long). I had made a mistake, of course; the call to get() should have a keyword argument of “id__exact” or the like, not a positional argument. The correct spelling is this:

    S2 = models.TwStatus.objects.get( id__exact=key )

    The gotcha is that Django’s .get() isn’t written defensively. It isn’t very robust to programmer errors. Instead of checking parameters and giving clear error messages, it lets bad parameters through, only to have them fail obscurely deep in the framework. If defensive programming of the Django API would slow it down too much in production, I’d love to have a debug mode I could invoke during development. Continue Reading »

Why the PostScript language is Turing-complete

Posted by Jim DeLaHunt on 30 Apr 2010 | Tagged as: software engineering

A couple of weeks ago on the XML-dev mailing list, there was a discussion comparing declarative and procedural computer languages. Someone wondered why the PostScript language, though used mostly for declarative purposes like describing pages, was still a Turing-complete programming language. That’s actually a topic I know something about, so I contributed the following answer. I’m posting it here, lightly edited, because I thought it might be of wider interest. —JDLH

A good place to go for a discussion of why it is Turing-complete, despite being intended to describe page appearance, is in the Introduction (Chapter 1) of the PostScript Language Reference Manual.

In particular, it says, “The extensive graphics capabilities of the PostScript language are embedded in the framework of a general-purpose programming language. The language includes a conventional set of data types, such as numbers, arrays, and strings; control primitives, such as conditionals, loops, and procedures; and some unusual features, such as dictionaries. These features enable application programmers to define higher-level operations that closely match the needs of the application and then to generate commands that invoke those higher-level operations. Such a description is more compact and easier to generate than one written entirely in terms of a fixed set of basic operations.” Continue Reading »

How to make standalone Django documentation on Mac OS X 10.5 using MacPorts.

Posted by Jim DeLaHunt on 06 Aug 2009 | Tagged as: Python, robobait, software engineering, web technology

One of the many nice touches of the Django framework is that it provides tools and instructions to make a standalone Django documentation set from its distribution. (Django is an application framework for the Python language that helps with database access and web applications.) Standalone docs are great for people like me who work on a laptop and are sometimes off the net. But I’m using Mac OS X, I get my code through MacPorts, and Django’s instructions don’t quite cover this case. So I just figured it out. Here are the tricks I needed. Maybe they will help you.

Continue Reading »

Heads up for 1234567890 day!

Posted by Jim DeLaHunt on 12 Feb 2009 | Tagged as: Vancouver, software engineering, time

[Image: 1000000000 seconds since the POSIX epoch, as celebrated in Denmark in 2001]

During a high school class, my teacher interrupted his discussion of classical Greek history to say, “it’s twelve thirty-four on the fifth of June, 1978”. In other words, 12:34 5/6/78 (in the British notation). Alert people in the United States had already celebrated that moment on May 6th. If you missed that moment, you have another chance on Friday: 1234567890 day.

Humans love to find patterns, and dates have rich potential for that. For instance, I was walking through a train station on a business trip in Tokyo in February, 1990. I noticed that people were making an unusual fuss about the train tickets. 1990 was 平成2年, or “Heisei year 2”, in the calendar based on the Japanese era name. The date was printed on the train tickets as “H2-2-2”. The symmetry made them collectors’ items. (I wish I could lay my hands on a ticket from that day, to convince myself I didn’t invent this memory…)

I have a fondness for finding leaks in the software engineering abstractions that represent our messy real world. I wrote last year about POSIX time, and the limitations in its representation of modern calendars and time zones. So when a leaky abstraction turns up as a pretty pattern, it’s irresistible. And that’s what happens this Friday.
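If you want to check the moment for yourself, a few lines of Python will convert the POSIX count to a calendar date. This is a minimal sketch using only the standard library; the fixed UTC-8 offset for Pacific Standard Time is hard-coded here as an assumption, to keep the example independent of any time zone database.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # assumed fixed offset for Vancouver in February


def posix_to_datetime(seconds: int, tz: timezone = timezone.utc) -> datetime:
    """Convert a count of seconds since the POSIX epoch to a calendar date and time."""
    return datetime.fromtimestamp(seconds, tz=tz)


print(posix_to_datetime(1234567890).isoformat())       # 2009-02-13T23:31:30+00:00
print(posix_to_datetime(1234567890, PST).isoformat())  # 2009-02-13T15:31:30-08:00
```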

Continue Reading »

Simple script-detection algorithm for font switching?

Posted by Jim DeLaHunt on 26 Aug 2008 | Tagged as: Unicode, i18n, language, multilingual, software engineering

Does anybody know of a simple script-detection algorithm (or heuristic) for font switching?

This came up with one of my clients. Suppose you have a guest book on your web site, and seven visitors left you the following inspiring messages:

  1. すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについて平等である。
  2. 人人生而自由,在尊严和权利上一律平等。
  3. Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
  4. 人人生而自由,在尊嚴和權利上一律平等。
  5. Alle Menschen sind frei und gleich an Würde und Rechten geboren.
  6. ‘Ολοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.
  7. 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.

(It looks like your visitors all read the Universal Declaration of Human Rights courtesy of the UDHR in Unicode project).

Now suppose you are so touched that you want to lay out all seven messages in a PDF file, and print it out as a booklet.  You have a beautiful layout template, and various complementary fonts: Latin script, Japanese, Korean, Simplified Chinese, Traditional Chinese, and Greek script.

Which font do you apply to each message?  More importantly, is there a simple heuristic by which software can make the choice? (More after the jump.)
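To show the shape of the problem, here is one naive heuristic sketched in Python. It is not the answer from the rest of the post, just a baseline: it looks at Unicode character names, and treats kana as a sign of Japanese and Hangul as a sign of Korean. The function name and the labels it returns are my own. Note what it cannot do: Han characters alone do not tell Simplified from Traditional Chinese, and a Japanese message written entirely in kanji would be mislabelled.

```python
import unicodedata


def guess_font_group(text: str) -> str:
    """Guess which font family a message needs, from the scripts of its characters."""
    scripts = set()
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if name.startswith(("HIRAGANA", "KATAKANA")):
            scripts.add("kana")
        elif name.startswith("HANGUL"):
            scripts.add("hangul")
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            scripts.add("han")
        elif name.startswith("GREEK"):
            scripts.add("greek")
    # Order matters: Japanese text mixes kana with Han characters.
    if "kana" in scripts:
        return "Japanese"
    if "hangul" in scripts:
        return "Korean"
    if "han" in scripts:
        return "Chinese (Simplified or Traditional undetermined)"
    if "greek" in scripts:
        return "Greek"
    return "Latin"


messages = [
    "すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについて平等である。",
    "人人生而自由,在尊严和权利上一律平等。",
    "Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.",
    "Alle Menschen sind frei und gleich an Würde und Rechten geboren.",
    "모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.",
]
for message in messages:
    print(guess_font_group(message), "<-", message[:12])
```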

Continue Reading »

“Do all languages in the world use Western numerals (1, 2, 3 etc) to express numerical values?”

Posted by Jim DeLaHunt on 02 Apr 2008 | Tagged as: culture, i18n, language, software engineering

One of the answers I occasionally write at LinkedIn Answers seemed worth reposting here. The question was: “Do all languages in the world use Western numerals (1, 2, 3 etc) to express numerical values?”. My answer (slightly revised):

The simple answer to your question is, “No”. Or, “Yes”. It depends which exact question you are asking.

Is it the case that all languages in the world use only Western numerals (usually known as “Arabic” or “Hindu-Arabic numerals”, by the way) to express numerical values? No. Many languages use multiple number forms, depending on context. In the English language, for example, a numerical value could be expressed with words (“one”) in text, Hindu-Arabic numerals (“1”) in a technical context, or Roman numerals (“i”, “I”) in lists. Arabic, Hindi, Japanese, and Chinese all have native characters to express numerical values, which are used in some contexts.

Do all languages in the world use Western numerals sometimes, in some contexts, to express numerical values? Yes — mostly, probably. The qualifications are because I hate to make generalisations about human culture; it’s so diverse. And, note that languages without written forms probably don’t use Hindu-Arabic numerals at all.

Is it the case that Western numerals are — in all cultures, in all contexts — the idiomatic, preferred way to express numerical values? No. They aren’t even sufficient for all contexts in English (viz “one”, “i”).

Do all cultures which use Western numerals to express numerical values do so in the same way? No. In particular, the punctuation between the whole and the fractional part of a number, and the grouping of digits, differ by culture. North America uses “1,234,567.89”; many European cultures use “1.234.567,89”; I’ve seen Japanese texts that say “123,4567.89”. See the CLDR number format patterns, creating international number formats in Excel, and the user guide to ICU formatting numbers.
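The three formats above differ in only three parameters: the grouping separator, the decimal separator, and the group size. A toy Python function makes that visible. This is an illustration only, with a function name and parameters I made up; a real product should take its patterns from CLDR or ICU rather than hard-code them.

```python
def format_number(value, group_sep=",", decimal_sep=".", group_size=3, decimals=2):
    """Format a number with a configurable digit grouping and decimal separator."""
    whole, _, frac = f"{abs(value):.{decimals}f}".partition(".")
    groups = []
    while whole:
        groups.insert(0, whole[-group_size:])  # peel off digits from the right
        whole = whole[:-group_size]
    sign = "-" if value < 0 else ""
    result = sign + group_sep.join(groups)
    return result + decimal_sep + frac if frac else result


print(format_number(1234567.89))                                  # 1,234,567.89
print(format_number(1234567.89, group_sep=".", decimal_sep=","))  # 1.234.567,89
print(format_number(1234567.89, group_size=4))                    # 123,4567.89
```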

Let’s shift focus from expressing numbers in cultures to implementing numbers in software products.

If you were making priority decisions for a software product (that’s my background) to expand its market internationally, and that product expresses numerical values using Hindu-Arabic numerals in some contexts appropriate in North America, can you be confident that it’s the only system you’ll need to express numerical values? No. You of course need to look at the cultural requirements of each new market as you go. But I’m confident that over time, some market will require some system other than Hindu-Arabic numerals to express numerical values. So I’m confident that sooner or later, you will have to give that software product the ability to express numerical values in a variety of ways (i.e., to internationalise it).

Postscript: the questioner, LinkedIn product manager Minna King, was kind enough to mark this as the “Best Answer” of the six posted.

[Edited for clarity based on reader feedback.]

Times Change, or, We Live In Complex Times

Posted by Jim DeLaHunt on 24 Mar 2008 | Tagged as: history, software engineering, time

A few days ago, I had to set the clock on my GPS unit. My GPS unit! It talks to over 24 satellites, each of which has atomic clocks accurate to nanoseconds, yet it didn’t know that Daylight Saving Time began in March instead of April.

We software engineers use a variety of abstractions to represent points and durations in time. We synchronise to time servers which are accurate to the millisecond. Behind them are atomic clocks. Our abstractions can represent times centuries in the past and future. They are tidy, regular abstractions. But Daylight Saving Time is a reminder that the reality of time is messy and irregular. It is affected in small ways by astronomy, and in large ways by politics and human idiosyncrasy. That’s why I had to manually set the clock on my GPS unit.

One of the things I love about engineering is where technology meets human idiosyncrasy. The technology bends to support the idiosyncrasy, and the human idiosyncrasy bends to fit the technology. Time representation is one such case.

Many software systems have an abstraction that represents time in terms of the number of seconds since some special reference date and time. <time.h> is the POSIX manifestation of this, storing times as integer numbers of seconds and microseconds since a 1970 reference time. From the integers, the system computes human-friendly structures like years, months, dates, hours, and minutes (leap years are no big deal). There’s a mechanism for converting from Coordinated Universal Time (UTC) to local time in various time zones. This includes Daylight Saving Time. All very tidy.

But when you start to look hard at time zones, things start to get interesting. Most of us live in time zones which are whole numbers of hours before or after UTC. In Vancouver, Canada, local time is eight hours earlier than UTC during Pacific Standard Time, and seven hours earlier than UTC during Pacific Daylight Time. But the time in Newfoundland and Labrador, on Canada’s east coast, is 30 minutes rather than an hour offset from its neighbour to the west. Afghanistan, India, Iran, three zones in Australia, and Venezuela also have time zones 30 minutes out of phase with most of the world. Nepal and Chatham Island, New Zealand, are 45 minutes out of phase: when it’s 05:00h in Vancouver, it’s 01:45h the next day on Chatham Island. Time zones were regularised greatly in the 19th and 20th centuries. Before then, some time zones were arbitrary minutes and seconds out of sync with each other. (The TZ data set, tzdata2008b.tar.gz or its successor, is a fascinating read, with lots of historical time zone information and bibliographies.)
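The Vancouver and Chatham Island arithmetic is easy to check in Python. In this sketch I hard-code the offsets in force on 24 March 2008, the date of this post (Pacific Daylight Time at UTC-7, Chatham daylight time at UTC+13:45, Newfoundland daylight time at UTC-2:30). Those fixed offsets are my assumption for the illustration; production code should look them up in the TZ database instead.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as of 24 March 2008 (assumed; real code should use the TZ database).
PDT = timezone(timedelta(hours=-7), "PDT")
CHADT = timezone(timedelta(hours=13, minutes=45), "CHADT")
NDT = timezone(timedelta(hours=-2, minutes=-30), "NDT")

vancouver = datetime(2008, 3, 24, 5, 0, tzinfo=PDT)

# The same instant, expressed on clocks in two "out of phase" time zones.
print(vancouver.astimezone(CHADT).strftime("%Y-%m-%d %H:%M %Z"))  # 2008-03-25 01:45 CHADT
print(vancouver.astimezone(NDT).strftime("%Y-%m-%d %H:%M %Z"))    # 2008-03-24 09:30 NDT
```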

Daylight Saving Time has a richer and more controversial history than you may have known. David Prerau’s book, “Seize the Daylight: The Curious and Contentious Story of Daylight Saving Time”, gives three centuries of that history. It’s reasonably well known in North America that some jurisdictions observe daylight saving time and some don’t. What’s less appreciated is that the rules for daylight saving time vary over time. Many software systems, even when they provide for time zones and daylight saving time, do so through static data tables. They get caught out when, as in the Energy Policy Act of 2005 in the USA, or every year in Israel, somebody decides to change daylight saving rules. Most software systems don’t have a way of describing all the historical changes in daylight saving time; they do really well to store one historical rule for a time zone. This is what happened to my GPS unit. I’ll have to manually set its daylight saving time until the maker comes up with a firmware patch that corrects the time zone tables.

About those time zone labels, like “CST”: they are ambiguous! “CST” can mean “U.S./Canada Central Standard Time, Australian Central Standard Time, China Standard Time, or Cuba Summer Time”, points out Raymond Chen in “The Old New Thing”. Software systems generally don’t deal well with the ambiguity. Specifications for representing dates in plain text, such as the W3C’s profile of ISO 8601, “Date and Time Formats” or RFC 2822 - Internet Message Format, section 3.3. Date and Time Specification, are important because they offer a way to write dates unambiguously.

Another problem that crops up when time zone definitions vary, be it for daylight saving or other changes, is that it becomes more complicated to calculate time spans. If I want to calculate the number of seconds between April 1, 2005 08:00h and April 1, 2008 08:00h Vancouver local time, I’ll need to allow for the fact that the 2005 time is standard time but the 2008 time is daylight saving time.

I’ll also need to allow for leap seconds. Remember that abstraction that each day has 24 hours, each hour has 60 minutes, and each minute has 60 seconds? Well, usually that’s true. But some minutes in UTC are defined to be 61 seconds long, so some days are 86,401 (instead of 86,400) seconds long. These leap seconds get added by the International Earth Rotation and Reference Systems Service (IERS) in order to keep the sun rising on time. It turns out the earth actually takes a bit longer than 24 hours * 60 minutes * 60 seconds to turn one day, so without leap seconds the sun would rise later and later UTC. (You’ll be glad to know there will be no leap second on June 30, 2008. Way to turn, planet Earth!)

So to calculate that time span between 2005 and 2008 correctly, I’ll need to allow for the minute of December 31, 2005 23:59h UTC being 61 seconds long.
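Here is that calculation worked through in Python, as a sketch. I assume both moments are 08:00h Vancouver local time, and I hard-code the two UTC offsets and the single leap second rather than reading them from the TZ database and the IERS bulletins; the variable names are mine. Python’s datetime arithmetic, like POSIX time, knows nothing of leap seconds, so that correction has to be added by hand.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # Vancouver on April 1, 2005 (standard time)
PDT = timezone(timedelta(hours=-7), "PDT")  # Vancouver on April 1, 2008 (daylight time)

start = datetime(2005, 4, 1, 8, 0, tzinfo=PST)
end = datetime(2008, 4, 1, 8, 0, tzinfo=PDT)

# Naive answer: pretend the wall clocks never changed their offset.
wall_clock_seconds = int((end.replace(tzinfo=None) - start.replace(tzinfo=None)).total_seconds())

# Offset-aware answer: both moments are converted to UTC before subtracting.
utc_seconds = int((end - start).total_seconds())

# One leap second fell in this span, at the end of December 31, 2005.
LEAP_SECONDS_IN_SPAN = 1
elapsed_seconds = utc_seconds + LEAP_SECONDS_IN_SPAN

print(wall_clock_seconds)  # 94694400
print(utc_seconds)         # 94690800
print(elapsed_seconds)     # 94690801
```

The naive figure is off by 3,599 seconds: an hour too long because of the daylight saving shift, and a second too short because of the leap second.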

So seconds are small and finicky. Surely we can be confident about the date, right? Well, that’s complicated too.

Trivia question: on what date in 1917 did Russia’s “October Revolution” occur? Well, on November 7, of course! It was October 25 in the Julian calendar in use in Russia at the time (“old style”), but November 7 in the current Gregorian calendar (“new style”). They differ by 13 days this century. These calendar differences are widespread across the world over the last 1000 years. Until 1752, England’s civil year started on March 25, not January 1. Thus January 30 1649 (“new style”) was known at the time as January 30 1648 (“old style”). Wikipedia can tell you way more about these “Old Style” and “New Style” date complexities.
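The October Revolution conversion can be reproduced with a short Python sketch. It goes by way of the Julian Day Number, a continuous count of days that is independent of either calendar, using a widely published integer formula. The function name is mine, and the constant 1721425 is the difference between a Julian Day Number and Python’s proleptic Gregorian ordinal.

```python
from datetime import date


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian-calendar date to the equivalent Gregorian-calendar date."""
    # Julian calendar date -> Julian Day Number (integer formula, March-based year).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Julian Day Number -> Python's Gregorian ordinal (day 1 is 0001-01-01).
    return date.fromordinal(jdn - 1721425)


revolution = julian_to_gregorian(1917, 10, 25)
print(revolution.isoformat())  # 1917-11-07
```

This handles only the day-count difference between the two calendars. The March 25 start of the English civil year is a separate complication, about how the year was numbered, and is not addressed here.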

On many world maps, there is an “international date line” zig-zagging across the Pacific ocean. To the east of this line, local time is earlier than UTC; to the west, later than UTC. I hate to be the one to break it to you, but this line doesn’t actually exist. Time zones are the choice, essentially political, of human jurisdictions. That zig-zag on the map is the cartographer’s way of showing you which time zones are which side of UTC.

Being political, time zones can change dates. The Pacific island republic of Kiribati stretches from 172° E to 150° W longitude. Centered in the Gilbert Islands in a time zone 12 hours after UTC, upon independence it acquired islands to the east in time zones 11 and 10 hours before UTC. This meant that the ends of the country observed different dates, and only four days per work week overlapped. Effective January 1, 1995, Kiribati changed the Phoenix Island and Line Island time zones to 13 and 14 hours after UTC respectively. This changed their dates, so finally, the whole country was on the same date. But beware if you have to compute a time span between 1994 and 1995 for Kiribati, because their local time didn’t include December 31, 1994! Nevertheless, they didn’t (pace Wikipedia’s Geography of Kiribati article) move the International Date Line, just some time zones.

Usually the <time.h> abstraction of time, as a count of seconds and microseconds after a certain reference date and time, works plenty well for software engineering. There are certainly many sources of error in our time data (inaccurate clocks, clobbered time stamps) that are much greater than the limitations of this abstraction. But remember, time measurement is a human convention, and human conventions almost always have really interesting complexity, and variation over time.

Don’t let the tidiness of the abstraction blind you to the richness of the reality.
