Wikipedia has grown from one of many interesting websites to being one of the most famous sites on the Internet. Millions of volunteer years have been invested over the years, and the pay off is what we have today – a wealth of factual data in one place.
When Wikis were a new concept, many predicted they would descend into chaos as they grew. In the case of Wikipedia the reverse is true. It seems to become increasingly well organised as the site develops. Rather than becoming more jumbled, the natural development of article conventions and the more planned use of standardised templates has created an increasingly neat and consistent structure.
This careful organisation of the prose leads to the interesting possibility of extracting more structured data from Wikipedia for alternative purposes, while staying true to the letter and spirit of the GFDL under which the material is licensed.
There’s the potential for a kind of semantic reverse engineering of article content. HTML pages could be scraped, and pages scoured for hints as to the meaning of each text fragment.
Applications could include loading articles about a variety of subjects into structured databases. Subjects for this treatment could include countries, people, chemical elements, diseases, you name it. These databases could then be searched by a variety of applications.
I’ve knocked up a simple page that gives a kind of quasi-dictionary definition when a word is entered. It looks at the first sentence of the Wikipedia article, which typically describes the article topic concisely.
I’ll show here how the basic page scrape works, which is actually very easy with PHP, its HTML reading abilities and the power of xpath.
- $html = @file_get_contents(“http://en.wikipedia.org/wiki/France”); will pull down the HTML content of the Wikipedia article on France.
- $dom = @DOMDocument::loadHTML($html); will read the HTML into a DOM for querying.
- $xpath = new domXPath($dom); will make a new xpath query.
- $results = $xpath->query(‘//div[@id=”bodyContent”]/p’); will find the first paragraph that is a direct child of the div with the id “bodyContent”. This is where the article always starts in a Wikipedia article page.
I then perform some more processing on the results including contingencies for if any of the steps fail. For instance to make the definitions snappier reading I strip any text in brackets, either round or square. There’s also some additional logic to pick the first topic in the list if the page lists multiple subjects (a “disambiguation” page). Predicting the Wikipedia URL for a given topic also involves a small amount of processing.Anyway, when you ask the page “what is France”, it will reply..
France, officially the French Republic, is a country whose metropolitan territory is located in Western Europe and that also comprises various overseas islands and territories located in other continents.
Can’t argue with that!
Edit, 1st March: By request, here is the source of the WhatIs application. It will work in any LAMP environment but the .sln file is for VS.PHP under Visual Studio 2005.