So here was the challenge. For my travel website I wanted a script that automatically retrieves snippets of information about destinations, so that they can be displayed as a bit of additional info on the appropriate pages. The source of the information is obvious: Wikipedia. Although Wikitravel may seem a better choice for this project at first sight, Wikitravel’s purpose is mostly to provide travel tips and the like; for general information, Wikipedia is certainly the best choice.
Wikipedia’s text can be used freely under the GFDL license, and Wikipedia provides several ways to access it. For power users, complete database dumps are available, but for this small project that would be utter overkill. I have neither the space nor the need to download several gigabytes of data and run scripts on them, merely for article snippets of about a hundred destinations.
A more targeted approach is therefore advisable in this case. Luckily, a tool for page-targeted fetching is available as well: the Special:Export function. This method is always preferred over fetching the actual HTML pages, because parsing wikicode and converting it to HTML puts a strain on Wikipedia’s servers.
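To make the page-targeted approach concrete, here is a minimal Python sketch of building a Special:Export URL and fetching it. It is an illustration, not the PHP function described in this post: the names `export_url` and `fetch_export_xml`, the English-language host and the User-Agent string are all assumptions.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the English Wikipedia; other language editions use another host.
EXPORT_BASE = "https://en.wikipedia.org/wiki/Special:Export/"


def export_url(title: str) -> str:
    """Build the Special:Export URL for an article title (hypothetical helper)."""
    # Wikipedia article names use underscores where the title has spaces.
    page_name = title.strip().replace(" ", "_")
    return EXPORT_BASE + quote(page_name, safe="")


def fetch_export_xml(title: str, timeout: float = 10.0) -> str:
    """Download the export XML for one article (performs a network request)."""
    # The User-Agent value is a placeholder; use one that identifies your script.
    request = Request(export_url(title),
                      headers={"User-Agent": "snippet-fetcher-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `export_url("New York")` yields the export address for the New_York article; `fetch_export_xml` is only needed when the XML is actually downloaded.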
For my project, the Special:Export function would do nicely. It returns an XML document that contains the article text in wikicode inside the <text></text> element. Automatically identifying the information snippet (i.e. the first paragraph) is an interesting task, because a Wikipedia article in wikicode may contain many other elements before the actual text starts. These include template tags, information tables, images and definitions. So first, all of these elements have to be removed, which requires writing over a dozen regular expressions, some of them quite elaborate. After that, the first paragraph of information sits at the very start of the resulting text.
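A rough Python sketch of this clean-up step is shown here. It covers only four kinds of clutter (comments, templates, tables, images) rather than the dozen-plus expressions mentioned above, and the function names and patterns are my own assumptions, not the ones from the PHP script.

```python
import re
import xml.etree.ElementTree as ET


def extract_wikitext(export_xml: str) -> str:
    """Return the content of the first <text> element, ignoring XML namespaces."""
    root = ET.fromstring(export_xml)
    for elem in root.iter():
        # Tags look like "{namespace}text"; compare only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            return elem.text or ""
    return ""


def strip_leading_clutter(wikitext: str) -> str:
    """Remove comments, templates, tables and images (simplified patterns)."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.S)
    # Templates can be nested, so remove the innermost ones until none are left.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # Tables: {| ... |}
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.S)
    # Images, allowing one level of [[links]] inside the caption.
    text = re.sub(
        r"\[\[(?:Image|File):(?:[^\[\]]|\[\[[^\[\]]*\]\])*\]\]", "", text)
    return text.strip()
```

Removing innermost templates in a loop is a simple way around the fact that a single regular expression cannot match arbitrarily nested braces.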
The script then finds the first paragraph by locating a run of text of at least 200 characters that is followed by two line breaks. So, there we go: the long-sought information snippet has been identified. However, the story doesn’t end here. The article text is still in wikicode, which means it carries a lot of markup that doesn’t look very nice on a web page without further parsing. So all of Wikipedia’s markup has to be either removed or replaced by its equivalent in HTML or BBCode. When that’s all done, the information snippet can be saved locally and is ready for display on the webpage!
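As an illustration, the paragraph search and a small part of the markup conversion could look like this in Python. It is a sketch under assumptions: the helper names are made up, it treats a paragraph as a single line (as is usual in wikicode), and only bold, italics, links and references are handled.

```python
import re


def first_paragraph(text: str, min_len: int = 200):
    """Return the first line of at least min_len characters that is followed
    by two line breaks, or None if there is no such line."""
    match = re.search(r"([^\n]{%d,})\n\n" % min_len, text)
    return match.group(1) if match else None


def wikicode_to_html(text: str) -> str:
    """Convert or remove a few common wikicode constructs (not exhaustive)."""
    # Drop references: self-closing <ref ... /> first, then <ref>...</ref>.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    # '''bold''' must be handled before ''italic''.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    # [[Target|label]] becomes "label"; [[Target]] becomes "Target".
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    return text
```

The `min_len` parameter defaults to the 200 characters mentioned above; lowering it is mainly useful for trying the function on short test strings.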
Dealing with redirection and disambiguation pages
This fetching method looks for pages of information on countries, states/provinces and cities. Based on an input of place names, the script assumes that an article of that name exists and tries to fetch its URL (e.g. wikipedia.org/wiki/New_York). That works immediately in only some of the cases. Sometimes one article name is redirected to another page. On the Wikipedia website the redirection takes place immediately. In the XML feeds this is not the case: the only text you will find is “#REDIRECT [[Page Name]]”. The script has to recognize this as well, and then fetch and parse the correct Wikipedia page instead.
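Redirect handling might be sketched in Python as follows. The names `redirect_target` and `resolve`, the `fetch` callable (anything that returns the wikicode for a title) and the hop limit are assumptions made for the example.

```python
import re

# Matches "#REDIRECT [[Page Name]]" at the start of the text, in any letter case.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext: str):
    """Return the redirect target title, or None for a normal page."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def resolve(title: str, fetch, max_hops: int = 3):
    """Follow redirects and return (final_title, wikitext).

    fetch is any callable mapping a title to its wikicode; the hop limit
    guards against redirect loops.
    """
    for _ in range(max_hops + 1):
        text = fetch(title)
        target = redirect_target(text)
        if target is None:
            return title, text
        title = target
    raise RuntimeError("too many redirects")
```

Passing the fetcher in as a parameter keeps the redirect logic testable without touching the network, for example with a plain dictionary lookup.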
Then there are the disambiguation pages. These can be distinguished from article pages because their wikicode contains either {{disambig}} or {{geodis}}, followed or preceded by a list of possible articles. The next question is: which page is the correct one? One could try to decide this through semantic analysis, but that is quite an information retrieval (IR) challenge.
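A small Python sketch of spotting such a page and listing the articles it points to is given here. The function names are hypothetical, and accepting the longer spelling {{disambiguation}} next to the two template names mentioned above is my own addition.

```python
import re

# {{disambig}} and {{geodis}} come from the text above; accepting
# {{disambiguation}} as well is an assumption of this sketch.
DISAMBIG_RE = re.compile(
    r"\{\{\s*(?:disambig(?:uation)?|geodis)\s*(?:\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def is_disambiguation(wikitext: str) -> bool:
    """True if the page carries one of the disambiguation templates."""
    return DISAMBIG_RE.search(wikitext) is not None


def listed_articles(wikitext: str):
    """Return the first link target of every bullet line on the page."""
    return re.findall(r"^\*+\s*\[\[([^\]|]+)", wikitext, flags=re.MULTILINE)
```

The list of link targets is what any further selection step would have to choose from.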
Luckily, disambiguation within this project can be handled much more easily. Articles on cities are usually named “City”, “City, Province” or “City, Country”. Since the state and country information is already available in my destination data set, landing on a disambiguation page can be resolved by fetching and analyzing these alternative article names.
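In Python, the fallback over alternative article names could be sketched like this. The helper names, the order in which the alternatives are tried and the `fetch` callable (returning the wikicode for a title, or None if the page does not exist) are assumptions for illustration.

```python
import re

# Simplified check for the two disambiguation templates named in this post.
DISAMBIG_RE = re.compile(
    r"\{\{\s*(?:disambig|geodis)\s*(?:\|[^{}]*)?\}\}", re.IGNORECASE)


def candidate_titles(city, province=None, country=None):
    """List the article names to try: City; City, Province; City, Country."""
    titles = [city]
    if province:
        titles.append(f"{city}, {province}")
    if country:
        titles.append(f"{city}, {country}")
    return titles


def pick_article(city, province, country, fetch):
    """Return the first candidate title that exists and is a real article.

    fetch maps a title to its wikicode, or returns None for a missing page.
    """
    for title in candidate_titles(city, province, country):
        text = fetch(title)
        if text and not DISAMBIG_RE.search(text):
            return title
    return None
```

With a dictionary of pages standing in for the fetcher, `pick_article("Springfield", "Illinois", "United States", pages.get)` skips the disambiguation page and settles on the first alternative name that holds a real article.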
In conclusion
In this post I’ve described my method for automatically retrieving a relevant Wikipedia information snippet on destinations (cities, provinces, countries) from a set of names. I’ve built a generic PHP function which does this by itself; it simply has to be fed location names. The script now runs once a week on my web server as a cronjob and fetches Wikipedia information snippets to be displayed on new destination pages of the travel website (the script also refreshes the information on already existing pages, so that it stays in sync with the latest Wikipedia article changes). The success rate of this method is quite satisfactory: I estimate that an information snippet is identified correctly for about 95% of all locations. The script took me a good afternoon to program, but will continue to retrieve relevant information snippets, no matter how many new destination pages are added to my travel website.
Update. A live example of this PHP function can be found at http://www.fuzzytravel.com/sandbox/wikithis.php.
Update 2. I’ve made the function’s code available: wikithisphp.txt.