Extracting a relevant sentence

Posted February 13, 2007 by James
Categories: PHP, Web Development

Say you’ve built a PHP / MySQL FULLTEXT search and want to display a relevant sentence for each result, much like Google does. I’ve written a function that does just that, and thought I’d share it here.

The function takes a text string of multiple sentences and a search string as input. The text is broken up into sentences of a specified minimum character length (to avoid any sentences that merely consist of ‘Mr.’, ‘Hi!’, or ‘What?’ ;) ). The first sentence that matches the search string is returned. If no match is found, the first sentence of the text is returned instead.
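
The steps described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHP of the downloadable function, and it is not that function's actual code: the name `extract_sentence`, the default `min_length` of 20, and the choice to merge too-short fragments into the sentence that follows are all assumptions made for the example.

```python
import re


def extract_sentence(text, query, min_length=20):
    """Return the first sentence containing `query`, else the first sentence."""
    # Split after '.', '!' or '?' when followed by whitespace.
    fragments = re.split(r"(?<=[.!?])\s+", text.strip())

    # Merge fragments shorter than min_length into the fragment that follows,
    # so that 'Mr.' or 'Hi!' never stands alone as a sentence (assumed strategy).
    sentences, buffer = [], ""
    for fragment in fragments:
        buffer = f"{buffer} {fragment}".strip()
        if len(buffer) >= min_length:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        # Leftover short text: attach it to the last sentence, or keep it alone.
        if sentences:
            sentences[-1] = f"{sentences[-1]} {buffer}"
        else:
            sentences.append(buffer)

    # Case-insensitive substring match on the search string.
    needle = query.lower()
    for sentence in sentences:
        if needle in sentence.lower():
            return sentence
    return sentences[0] if sentences else ""
```

For example, with the text "Hi! Mr. Smith went to Paris last week. He liked the museums a lot.", searching for "museums" returns the second sentence, while a search string that does not occur falls back to the first one.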

Although it’s not perfect, this function generally does the job quite nicely. Download extract_sentencephp.txt here.

For more information on building your own search functionality, have a look at this MySQL FULLTEXT Searching tutorial, and of course at the official documentation.

Lightbox

Posted January 20, 2007 by James
Categories: Web 2.0, Web Design

Lightbox is a very useful JavaScript implementation that allows you to overlay images on the current page in Web 2.0 style. Implementing it is easy: you simply include three JavaScript files and one CSS file in the <head></head> section of the pages you want to use it on. After that, add a rel="lightbox" attribute to the links you want to use Lightbox on, and you’re done!

Initially I was worried about the combined size of these included files, which is around 70 kB. But if you deliver your JavaScript and CSS files gzipped to the requesting browsers, the total file size is only 17 kB, a number I find quite acceptable, especially considering the cool effects you get in return.

Geotagging the blogosphere

Posted January 14, 2007 by James
Categories: Information Retrieval

These visualizations are the result of identifying and pinpointing the locations mentioned in blog posts, a process known as geotagging or geocoding. In the first and fourth pictures, the color represents the popularity of a location (red is the highest frequency). In the second and third pictures, the size of the square represents the popularity of a location.

The original data set consisted of about 800 randomly chosen blogs with around 80,000 posts. Geotagging was done with ClearForest, Google Maps and World Gazetteer.

world.gif

europe-2.gif

united-states-2.gif

new-york.gif

The meaning of these location data is — unfortunately — relatively arbitrary.

RSS MIME type

Posted January 6, 2007 by James
Categories: Web Development

What’s the most correct MIME type for RSS feeds? For many developers this matter is a bit of a headache, as there are many differences with regard to application and browser support for application/rss+xml. For that reason, there seems to be a general preference for text/xml, as it’s the safest bet. So after reading Dave Winer’s advice on the matter, I decided to use the text/xml MIME type for my website’s RSS feeds as well.

My view on the most correct MIME type has changed recently, though. The cause is that, for whatever reason, Google started ranking my website’s RSS feeds higher than the actual pages, which is obviously not desirable, neither for the visitor nor for me.

My initial response was to block Googlebot from indexing any of the RSS feeds via the robots.txt file. A bit later, realizing that this ‘solution’ would also quite effectively erase all feeds from Google Blogsearch, I decided to switch to the application/rss+xml MIME type after all.

Luckily, the RSS Advisory Board agrees with my action.

Crazy Egg

Posted January 5, 2007 by James
Categories: Web Design

Crazy Egg is a wonderful free tool to visualize your visitors’ clicking behavior on your website, and in my opinion a must for every web designer.

crazyegg-fuzzytravel.jpg

Digitizing my CD collection

Posted January 5, 2007 by James
Categories: Uncategorized

With the recent acquisition of an external 250 GB hard disk, I’ve finally decided to take the plunge and digitize my entire audio CD collection. In this blog post I just wanted to share how I go about doing that.

The tool I use for ripping audio CDs is GoldWave (together with LAME). It comes as a fully functional, free ‘try-out’ version that includes access to the wonderful CD Reader tool, which can convert CD audio tracks to a digital file format of choice.

I convert my CD tracks to MP3 files (I think the choice for this format is quite obvious). In the settings I select a sampling rate of 44100 Hz and a bitrate of 160 kbps, in stereo. These settings lead to MP3 files of good quality that do not become too large.

One thing I really love about the CD Reader tool is the ‘Get Titles’ button. When clicking this, GoldWave will contact freedb.org and retrieve the correct album name and song titles, which — needless to say — saves you a lot of typing.

Going about it this way, converting any CD collection to MP3 is a relatively easy and quick process. An entire CD will be digitized in about 15-20 minutes and ends up at ~60 MB in file size.
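
As a quick sanity check on that file size, the arithmetic can be sketched in Python. This assumes a constant bitrate and an album length of about 50 minutes, which is my assumption and not a figure from my collection; the function name `mp3_size_mb` is made up for the example.

```python
def mp3_size_mb(bitrate_kbps, minutes):
    """Approximate size in MB (10^6 bytes) of a constant-bitrate MP3."""
    bits = bitrate_kbps * 1000 * minutes * 60  # kbps -> bits per second -> total bits
    return bits / 8 / 1_000_000                # bits -> bytes -> megabytes


print(mp3_size_mb(160, 50))  # 60.0
```

So at 160 kbps a 50-minute album comes to about 60 MB, while a completely full 74-minute disc would be closer to 89 MB.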

Fetching Wikipedia content

Posted December 21, 2006 by James
Categories: Information Retrieval, PHP

So here was the challenge. For my travel website I wanted to have a script that automatically retrieves snippets of information about destinations, so that they can be displayed as a bit of additional info on the appropriate pages. The source of the information is obvious: Wikipedia. Although Wikitravel may seem a better choice for this project at first sight, Wikitravel’s purpose is mostly to provide travel tips and the like; for general information, Wikipedia is certainly the best choice.

Wikipedia text can be used freely under the GFDL license, and Wikipedia provides several methods for using its information. For power users, complete database dumps are available. For this small project, however, that would be utter overkill. I do not have the space, nor do I feel the need, to download and run scripts on several gigabytes of data for article snippets of merely about a hundred destinations. ;)

A more targeted approach is therefore preferable in this particular case. Luckily, a tool for page-targeted fetching is available as well, namely the Special:Export function. This method is always preferred over fetching the actual HTML pages, due to the strain put on Wikipedia’s servers when parsing wikicode and converting it to HTML.

For my project, the Special:Export function does nicely. It returns an XML document that contains the article text in wikicode inside the <text></text> element. Automatically identifying the information snippet (i.e. the first paragraph) is an interesting task, as Wikipedia articles in wikicode may contain many elements before the actual text even starts. Some of these include template tags, information tables, images and definitions. So first, all of these elements have to be removed, which requires writing over a dozen (sometimes quite elaborate) regular expressions. After that, the first paragraph of information sits at the very start of the resulting text.
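
Pulling the wikicode out of such an export document might look roughly like this. It is a sketch in Python rather than the PHP of my script, and the function name `wikitext_from_export` and the sample XML are made up for illustration; a real export document carries more elements and its namespace version may differ, which is why the code only compares the local tag name.

```python
import xml.etree.ElementTree as ET


def wikitext_from_export(xml_string):
    """Return the contents of the first <text> element in an export document."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        # Namespaced tags look like '{namespace}text'; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "text":
            return element.text or ""
    return ""


# Hand-written, heavily trimmed sample; not actual Wikipedia output.
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Utrecht</title>
    <revision>
      <text>'''Utrecht''' is a city in the [[Netherlands]].</text>
    </revision>
  </page>
</mediawiki>"""

print(wikitext_from_export(sample))
```

The template, table and image clean-up mentioned above would then be applied to the returned string.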

The script then finds the first paragraph by locating a text of at least 200 characters that’s followed by two line breaks. So, there we go: the long-sought information snippet has been identified. However, the story doesn’t end here. Wikipedia’s article text is still in wikicode, which means that there’s a lot of markup applied that doesn’t look very nice on web pages without further parsing. So all of Wikipedia’s markup has to be either removed, or replaced by its equivalents in HTML or BBCode. When that’s all done, the information snippet can be saved locally, and is ready for display on the webpage! :D
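
Both steps can be sketched as follows, again in Python rather than PHP and only as an illustration. The function names are hypothetical, and the markup conversion handles just three common constructs (links, bold, italics), whereas the real script needs many more rules.

```python
import re


def first_paragraph(wikitext, min_length=200):
    """Return the first block of at least min_length characters that is
    followed by two line breaks, or '' if there is none."""
    blocks = wikitext.split("\n\n")
    # The last block is not followed by two line breaks, so it is skipped.
    for block in blocks[:-1]:
        if len(block.strip()) >= min_length:
            return block.strip()
    return ""


def wikicode_to_html(text):
    """Convert a small subset of wikicode to HTML (assumed subset)."""
    # [[Target|label]] and [[Target]] become their plain-text label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Handle '''bold''' before ''italic'', since the bold marker contains the italic one.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text
```

Dropping the links entirely is a design choice for the sketch; they could just as well be rewritten to point back at the Wikipedia article.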

Dealing with redirection and disambiguation pages

This fetching method is looking for pages of information on countries, states/provinces and cities. So, based on an input of place names, this script assumes the existence of an article name and tries to fetch that URL (e.g. wikipedia.org/wiki/New_York). That only works out immediately in a number of cases. Sometimes one article name will be redirected to another page. On the Wikipedia website redirection takes place immediately. In the XML feeds this is not the case: the only text you will find is “#REDIRECT [[Page Name]]”. The script had to recognize this as well, and then fetch and parse the correct Wikipedia page instead.
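
Recognizing and following such a redirect can be sketched like this (Python for illustration, not my PHP code). The names `redirect_target` and `resolve`, the `fetch` callable that returns the wikicode for a title, and the limit of three hops are all assumptions of the example.

```python
import re

# Matches '#REDIRECT [[Page Name]]', ignoring case, a '|label' and a '#section'.
REDIRECT = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None for a normal article."""
    match = REDIRECT.match(wikitext)
    return match.group(1).strip() if match else None


def resolve(title, fetch, max_hops=3):
    """Follow redirects until an actual article is reached."""
    for _ in range(max_hops):
        text = fetch(title)
        target = redirect_target(text)
        if target is None:
            return title, text
        title = target
    raise RuntimeError("too many redirects")


# Stand-in for real fetching: a dictionary of made-up page texts.
pages = {
    "NYC": "#REDIRECT [[New York City]]",
    "New York City": "'''New York''' is a city in the United States.",
}
print(resolve("NYC", pages.__getitem__)[0])  # New York City
```

The hop limit guards against two pages that redirect to each other.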

Then there are the disambiguation pages. These can be distinguished from the article pages because the wikicode will contain either {{disambig}} or {{geodis}}, followed or preceded by a list of possible articles. The next question is: which page is the correct one? One could go about this through semantic analysis, but that’s quite an IR challenge.

Luckily, disambiguation within this project can be handled a whole lot more easily. Articles on cities are usually named “City”, “City, Province” or “City, Country”. Since the state/province and country information is already available in my destination data set, disambiguation pages can be bypassed by fetching and analyzing these alternative article names.
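
A rough sketch of that fallback, in Python for illustration only: the function names, the order in which the alternatives are tried, and the `fetch` callable (which returns the wikicode for a title, or None if the page does not exist) are assumptions of the example, not a description of my PHP code.

```python
import re

# Disambiguation pages carry one of these two templates.
DISAMBIG = re.compile(r"\{\{\s*(disambig|geodis)\s*\}\}", re.IGNORECASE)


def is_disambiguation(wikitext):
    return DISAMBIG.search(wikitext) is not None


def candidate_titles(city, province=None, country=None):
    """Article names to try: 'City', 'City, Province', 'City, Country'."""
    candidates = [city]
    if province:
        candidates.append(f"{city}, {province}")
    if country:
        candidates.append(f"{city}, {country}")
    return candidates


def pick_article(city, fetch, province=None, country=None):
    """Return the first candidate title that is a real article, else None."""
    for title in candidate_titles(city, province, country):
        text = fetch(title)
        if text and not is_disambiguation(text):
            return title
    return None


# Stand-in for real fetching: a dictionary of made-up page texts.
pages = {
    "Springfield": "'''Springfield''' may refer to:\n* ...\n{{geodis}}",
    "Springfield, Illinois": "'''Springfield''' is the capital of Illinois.",
}
print(pick_article("Springfield", pages.get, "Illinois", "United States"))
```

Here the plain title turns out to be a disambiguation page, so the sketch moves on to “Springfield, Illinois”.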

In conclusion

In this post I’ve described my method for automatically retrieving a relevant Wikipedia information snippet on destinations (cities, provinces, countries) from a set of names. I’ve built a generic PHP function which does this by itself; it simply has to be fed location names. The script now runs once a week on my web server as a cronjob and fetches Wikipedia information snippets that will be displayed on new destination pages of the travel website (the script also refreshes the information on already existing pages, so that it stays in sync with the latest Wikipedia article changes). The success rate of this method is quite satisfactory: I estimate that for about 95% of all locations an information snippet is identified correctly. The script took me a good afternoon to program, but will continue to retrieve relevant information snippets, no matter how many new destination pages are added to my travel website.

Update. A live example of this PHP function can be found at http://www.fuzzytravel.com/sandbox/wikithis.php.

Update 2. I’ve made the function’s code available: wikithisphp.txt.