January 11, 2005

Lexical authorities in an encyclopedic corpus: a case study with Wikipedia

I recently conducted an experiment with my friend and colleague Roberto Bonato: we tried to analyze the internal citation structure of Wikipedia using some techniques borrowed from network analysis. We presented some results at the International Colloquium on ‘Word structure and lexical systems: models and applications’ (held at the University of Pavia, Italy, last December).

Network analysis is concerned with properties related to connectivity and distances in graphs, with diverse applications like citation indexing and information retrieval on the Web. HITS (Hyperlink-Induced Topic Search) is a network analysis algorithm that has been successfully used for ranking web pages related to a common topic according to their potential relevance. HITS is based on the notions of hub and authority: a good hub is a page that points to several good authorities; a good authority is a page that is pointed at by several good hubs. HITS relies exclusively on the hyperlinks among the pages to define the two mutually reinforcing measures of hub and authority. It can be proved that for each page these two weights converge to fixed points, which are the actual hub and authority values for the page. Authority is used to rank the pages returned by a given query (and thus potentially related to a given topic) in order of relevance.
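
To make the mutual reinforcement concrete, here is a minimal sketch of the hub/authority iteration in plain Python. It is an illustration only, not the code we ran: the function names `hits` and `normalize`, the fixed iteration count and the toy graph are all assumptions made for the example.

```python
from math import sqrt


def normalize(scores):
    """Scale the scores so that their squares sum to 1."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) dicts for a directed graph.

    `links` maps each page to the list of pages it points to.
    The iteration count is an arbitrary choice for this sketch.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: add up the new authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Toy graph (hypothetical): A and B both point to C and D, and C points to D.
toy_links = {"A": ["C", "D"], "B": ["C", "D"], "C": ["D"], "D": []}
hub, auth = hits(toy_links)
top_authority = max(auth, key=auth.get)
```

In this toy graph D collects links from every other page and ends up as the top authority, while A and B, which point at both C and D, are the strongest hubs.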

Wikipedia is a collaborative, open-content, online encyclopedia. Any user is free to independently create, edit and revise any entry. The process is governed by Wikipedia’s official neutral point of view (NPOV) policy, which requires that contributors work to avoid bias in writing articles and bans users guilty of vandalism. Contributors build upon each other’s changes, and flawed edits are repaired in an ongoing review process. To date, Wikipedia features more than 350,000 articles in English, and has just passed its millionth entry as a multilingual resource. Each Wikipedia entry is marked up with hyperlinks to other Wikipedia entries referred to in its definition.
The hyperlinked structure of Wikipedia and the ongoing, incremental editing process behind it make it an interesting and unexplored target domain for network analysis techniques. In particular, we explored how relevant HITS’s notion of authority is to this encyclopedic corpus.

We’ve developed a crawler that scans the English-language Wikipedia articles and records, for each entry, all the other Wikipedia articles linked from its definition. The result is a directed graph (roughly 350,000 nodes and more than 5 million links), which consists for the most part of one big loosely connected component. We then applied the HITS algorithm to that component, obtaining a hub and an authority weight for every entry.
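
As a rough illustration of the component-selection step, the sketch below takes a link table of the kind a crawler could produce and keeps only the largest component. It assumes that "loosely connected" can be read as weakly connected (link direction ignored); the function names `largest_weak_component` and `restrict` and the sample data are hypothetical and are not taken from our crawler.

```python
from collections import defaultdict, deque


def largest_weak_component(links):
    """Return the set of pages in the largest weakly connected component.

    `links` maps each page to the list of pages it points to.
    """
    # Undirected view of the graph: an edge counts in both directions.
    neighbours = defaultdict(set)
    for src, targets in links.items():
        neighbours[src]  # make sure pages without links still appear
        for dst in targets:
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    seen = set()
    best = set()
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects every page reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


def restrict(links, keep):
    """Drop every page, and every link, that falls outside `keep`."""
    return {
        src: [dst for dst in targets if dst in keep]
        for src, targets in links.items()
        if src in keep
    }


# Hypothetical crawl output: one three-page chain, one pair, one isolated page.
crawl = {"A": ["B"], "B": ["C"], "C": [], "X": ["Y"], "Z": []}
core = largest_weak_component(crawl)
core_links = restrict(crawl, core)
```

Here `core` contains A, B and C, and `core_links` is the subgraph that a ranking algorithm such as HITS would then be run on.
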
The first results seem meaningful in characterizing the notion of authority in this peculiar domain. The highest-ranked authorities seem to be, for the most part, lexical elements that denote particular and concrete entities rather than universal and abstract ones. More precisely, at the very top of the authority scale are concepts used to structure space and time: country names, city names and other geopolitical entities (such as United States and many European countries), historical periods and landmark events (World War II, 1960s). Television, scientific classification and animal are the three most authoritative common nouns.
Here is the list of the 300 'most authoritative' entries on Wikipedia.

This is a work in progress. We plan to design a set of experiments on sets of words related by specific linguistic relationships like meronymy, hyponymy or other domain-specific commonalities.

See also: our more recent work on Wikipedia analysis.

3 Comments:

Jakob Voss said...

Very interesting! I cited it in my paper on Wikipedia research. I tell you when there is a preprint if you like.

April 17, 2005 10:23 PM  
Francesco Bellomi said...

Thank you Jakob! I would really like to read your paper when it's done!

April 17, 2005 10:23 PM  
Paolo Massa said...

Hi, my friend! One less friend in the set "friend without a blog" ;-)

cool!

i will blogroll you soon (i'm moving into another domain the blog and i have some problem at the moment in doing it).

and i will also read the paper about wikipedia of course!!!
thanks!

April 17, 2005 10:24 PM  
