Twanguages: a Language Census of Twitter
Posted by Jim DeLaHunt on 30 Jul 2009 at 11:42 pm | Tagged as: language, meetings and conferences, multilingual, Unicode, web technology
What “twanguage” do you “tweet”? Twitter, the buzzing conversation of brief web and SMS messages, exploded into wide use in 2009. But just how wide? To how many countries has it spread? And into which languages? I’m aiming to find out.
I’ve started a project named “Twanguages”, a language census of a sample of Twitter’s global traffic. I’m curious: which are the top languages? Are #hashtags localised? How does language correlate with location? And which Unicode character is the most rarely used?
I’ll be presenting my results at the 33rd Internationalization and Unicode Conference (IUC33), held in San Jose, California, on October 14-16, 2009. I have a place cleared for a Twanguages project page, and I’ll post interim results there as they become available (right now it’s only a placeholder). Stay tuned!
For the last month, I’ve been gathering a corpus of tweets. Using Twitter’s “spritzer” Streaming API, and the Data Mining Feed before that, I have a cron job that collects a sample about 10 times per hour. I’m getting about 200 tweets per hour at this stage. I estimate I have well over 100,000 tweets in the corpus so far. Twitter’s APIs make a lot more tweets available, and once my analysis code is ready to receive them I’ll enlarge my sample. I have a hunch, though, that I’ll learn a lot even with a narrow sample.
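As a rough sanity check on that estimate, here is a back-of-the-envelope sketch. It assumes the 200-tweets-per-hour rate held steady and that “the last month” means about 30 days; the function name is just for illustration, and the real rate surely varied.

```python
def estimate_corpus_size(tweets_per_hour: int, days: int) -> int:
    """Rough corpus size, assuming a constant collection rate (an assumption)."""
    hours = days * 24
    return tweets_per_hour * hours


# About 200 tweets per hour, over an assumed 30 days of collection.
estimate = estimate_corpus_size(tweets_per_hour=200, days=30)
print(estimate)
```

Under those assumptions the figure comes out comfortably above 100,000, which fits the estimate above.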
I have identified several language- and script-identification services, and I’ll invoke them on the tweet corpus and compare their results. Part of what makes the problem interesting is that each text is so short (at most 140 characters; even fewer after removing identifiers and URLs).
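To give a feel for the problem, here is a minimal sketch of script identification on a single tweet. It is not one of the services I mentioned, just a crude stand-in: it strips @mentions, #hashtags and URLs with a hypothetical regular expression, then guesses the script from the first word of each letter’s Unicode character name (e.g. “CYRILLIC”, “LATIN”). Real identification services are far more careful than this.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical pattern for tweet "noise": URLs, @mentions and #hashtags.
NOISE = re.compile(r"https?://\S+|[@#]\w+")


def strip_noise(text: str) -> str:
    """Remove URLs, @mentions and #hashtags, then collapse whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


def guess_script(text: str) -> str:
    """Guess the dominant script from Unicode character names.

    Heuristic only: the first word of a letter's Unicode name
    (LATIN, CYRILLIC, HIRAGANA, ...) is used as a proxy for its script.
    """
    counts = Counter()
    for ch in strip_noise(text):
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(strip_noise("@jim hello http://example.com/x #tag"))
print(guess_script("@jim Привет мир http://example.com #unicode"))
```

Notice how little text is left once the identifiers and the URL are gone. That is exactly what makes short tweets hard to classify.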
There are many wonderful directions to take the investigation, e.g. correlating the language of the tweet with the language of the web page linked to from the tweet, or with the languages of the user’s name and description. I expect hashtags will turn out to be only sometimes in the same language as the rest of the text. I’m also seeing interesting plain-text art (e.g. <http://twitter.com/sex> at the moment). I can’t predict now exactly which directions will be most productive, but I am confident that the results will be interesting.
During the talk, I also plan on a small but interesting digression into how Twitter applies Unicode-related technology (UTF-8, NCRs, counting characters) in its APIs.
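Why is counting characters worth a digression? Because “length” depends on what you count. The sketch below illustrates this for a single string: it decodes numeric character references (NCRs) with Python’s `html.unescape`, then reports the length in code points, in code points after NFC normalization, in UTF-8 bytes, and in UTF-16 code units. It makes no claim about which of these Twitter actually uses; the function name and the choice of measures are my own assumptions for illustration.

```python
import html
import unicodedata


def length_report(text: str) -> dict:
    """Report several plausible 'lengths' of a message (illustrative only)."""
    decoded = html.unescape(text)  # turns NCRs such as &#233; into characters
    nfc = unicodedata.normalize("NFC", decoded)
    return {
        "code_points": len(decoded),
        "nfc_code_points": len(nfc),
        "utf8_bytes": len(decoded.encode("utf-8")),
        "utf16_units": len(decoded.encode("utf-16-le")) // 2,
    }


# "café" written with an NCR, and with a combining acute accent.
print(length_report("caf&#233;"))
print(length_report("cafe\u0301"))
```

The two spellings of “café” look identical on screen, yet they differ in code points and bytes until normalized, so a 140-character limit can mean different things depending on the measure.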
I’ve created a Twitter ID, @twanguages, which you may follow for news on the project. Right now it’s not saying much. (Don’t be misled by user @twanguage; that appears to be occupied but inactive.)
And by the way: let’s start using #iuc33 as the official hashtag for the conference, and broaden adoption of #unicode!