The structure of human knowledge - a cluster analysis of Wikipedia

Participants
Emil, project lead
Anders “Serdan” Kehlet, programming

Description
How do the various academic and intellectual fields relate to each other? That’s what we want to find out. The idea is to do a massive-scale cluster analysis...:

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with low distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify preprocessing and parameters until the result achieves the desired properties.

... on the articles on the English Wikipedia. Wikipedia is sufficiently large to act as a very good proxy for the entirety of human knowledge. It is also well suited for this because of the internal links, and the fact that one can download the entire content (no pictures) of Wikipedia to one’s hard drive. This makes it unnecessary to have a very fast internet connection. It does require a healthy amount of disk space, but nothing unworkable for a normal desktop.
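
To make the clustering idea concrete before getting into the data work, here is a minimal, self-contained Python sketch (not the project’s eventual method; the page names and in-link sets are invented) that clusters a handful of pages by how much their sets of incoming links overlap, using scikit-learn’s agglomerative clustering on a precomputed distance matrix:

# Minimal illustration: cluster pages by the overlap of their in-link sets.
# Page names and in-link sets below are invented for the example.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2 uses "metric="; older versions call it "affinity="

inlinks = {
    "Algebra":   {"Mathematics", "Group theory", "Ring theory"},
    "Topology":  {"Mathematics", "Geometry", "Group theory"},
    "Chemistry": {"Science", "Physics", "Biology"},
    "Ecology":   {"Science", "Biology", "Evolution"},
}
pages = list(inlinks)

def jaccard_distance(a, b):
    # 1 - |intersection| / |union|: pages sharing many in-links count as "close"
    return 1.0 - len(a & b) / len(a | b)

dist = np.array([[jaccard_distance(inlinks[p], inlinks[q]) for q in pages] for p in pages])

labels = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average").fit_predict(dist)
for page, label in zip(pages, labels):
    print(label, page)

On the real data, the distance function, the algorithm and the number of clusters are exactly the open choices the quoted passage warns about.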

Why?
Because it’s cool; fuck you if you disagree.
More seriously, it allows for very many interesting things. One can see how the various fields relate to each other. Do some fields relate to many other fields? Do some fields relate to only a few other fields? Is there some field so central that it relates to pretty much all other fields? Perhaps. We can actually find out quantitatively by cluster analysis.

General method and initial analysis
First, for each page, make a list of pages that link to that page. This itself should be interesting, as it makes it possible to see which pages are the most linked to.
Second, the method can be extended by degrees. Next, for each page, how many pages link to that page either directly or through one other page? One can continue this analysis to higher degrees. I’m not entirely sure how much computing power it is going to take, but I think a lot at degrees >1. I suspect that degree=2 is doable though, and will reveal some interesting information.
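
As a sketch of what those degree-1 and degree-2 counts could look like in code, assuming the link graph has already been extracted into a dict links_to that maps each page to the set of pages linking directly to it (all names here are illustrative, not our actual data structures):

def inlink_counts(links_to, max_degree=2):
    # For each page, count the pages that reach it in at most max_degree link steps.
    counts = {}
    for page in links_to:
        reached = set(links_to.get(page, set()))      # degree 1: direct in-links
        frontier = set(reached)
        for _ in range(max_degree - 1):               # expand one degree at a time
            frontier = {src
                        for mid in frontier
                        for src in links_to.get(mid, set())} - reached - {page}
            reached |= frontier
        counts[page] = len(reached)
    return counts

links_to = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(inlink_counts(links_to, max_degree=1))  # {'A': 2, 'B': 1, 'C': 0}
print(inlink_counts(links_to, max_degree=2))  # same here, since C already links to both A and B directly

The naive expansion visits every in-neighbour of every in-neighbour, which is why degrees >1 get expensive on the full graph.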

After this initial analysis, the team should read up on cluster analysis and look for a suitable algorithm to make a graphical illustration.

Project diary

14 August 2012

Downloaded and unpacked the file. It is huge. Began programming to extract the relevant data using regex. It seems like a good idea to make another file out of the standard one, so as to reduce its size by removing irrelevant data.
Decided we want to ignore redirect pages.
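
For reference, a rough sketch of the kind of regex pass this involves; the dump file name and the exact patterns below are assumptions, not our actual code. In the wikitext, links look like [[Target|label]] and redirect pages start with #REDIRECT:

import re

TITLE_RE    = re.compile(r"<title>(.*?)</title>")
LINK_RE     = re.compile(r"\[\[([^\]|#]+)")           # the target part of [[Target|label]]
REDIRECT_RE = re.compile(r"#REDIRECT", re.IGNORECASE)

def extract_links(dump_path="enwiki-pages-articles.xml"):
    # Yield (title, [link targets]) for every non-redirect page in the XML dump,
    # streaming line by line so the huge file never has to fit in memory.
    title, text_lines, in_text = None, [], False
    with open(dump_path, encoding="utf-8") as dump:
        for line in dump:
            m = TITLE_RE.search(line)
            if m:
                title = m.group(1)
            if "<text" in line:
                in_text, text_lines = True, []
            if in_text:
                text_lines.append(line)
            if "</text>" in line:
                in_text = False
                text = "".join(text_lines)
                if title and not REDIRECT_RE.search(text):
                    yield title, LINK_RE.findall(text)

for title, links in extract_links():
    print(title, len(links))

A proper XML parser would be more robust, but a line-oriented regex pass keeps memory use flat on a file this size.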

30 August 2012

We have extracted the relevant data (the page data) from the main database file. Next step is creating the dual database: a database with a list of all the pages and which other pages they link to, and a database with a list of all the pages and which other pages link to them.
Emil has also worked on some different clustering methods. See Dropbox.
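
One way the dual database could look in code, starting from the (title, outgoing links) pairs produced by the extraction step. The on-disk persistence via the standard-library shelve module is just one option and is shown as an assumption, not what we settled on:

import shelve
from collections import defaultdict

def build_link_indexes(pages):
    # pages: iterable of (title, [link targets]) as produced by the extraction step
    links_from = {}                 # page -> pages it links to
    links_to = defaultdict(set)     # page -> pages that link to it
    for title, targets in pages:
        links_from[title] = set(targets)
        for target in targets:
            links_to[target].add(title)
    return links_from, dict(links_to)

links_from, links_to = build_link_indexes([("A", ["B", "C"]), ("B", ["C"])])
print(links_to)   # B is linked to from A; C is linked to from A and B

with shelve.open("links_to.db") as db:   # simple key-value persistence to disk
    for page, sources in links_to.items():
        db[page] = sources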

5 February 2013

I finally found some other people and research with similar ideas. I was comparing different sites with respect to their PageRank, and it naturally occurred to me that I should have a look at the Wikipedia article on PageRank. There I discovered that PageRank is actually really similar to my proposed algorithms for doing the cluster analysis. I then googled it and found a paper that analyzed Wikipedia based on PageRank. Cite: Bellomi, Francesco, and Roberto Bonato. "Network analysis for Wikipedia." Proceedings of Wikimania. 2005.

Abstract
Network analysis is a quantitative methodology for studying properties related to connectivity and distances in graphs, with diverse applications like citation indexing and information retrieval on the Web. The hyperlinked structure of Wikipedia and the ongoing, incremental editing process behind it make it an interesting and unexplored target domain for network analysis techniques.

In this paper we apply two relevance metrics, HITS and PageRank, to the whole set of English Wikipedia entries, in order to gain some preliminary insights on the macro-structure of the organization of the corpus, and on some cultural biases related to specific topics.

I immediately saw that it was relevant.

We have developed a specialized web crawler in order to download the whole content of the English language Wikipedia. The crawler scans the "All pages" index section of Wikipedia, thus retrieving a complete list of the pages exposed on the web site; the "special pages" (that is, the pages not containing an entry) are then filtered out, by looking at specific patterns in the URLs, and then all the proper pages have been downloaded and stored locally, in order to perform the computation of the metrics. All the results discussed in this paper come from a "snapshot" of the Wikipedia corpus taken in a period of 34 hours, between the 16th and the 17th of April 2005.

Our method is far superior, but it is impressive that they succeeded in doing it that way. Must have super fast internet!

5 Conclusions and further work
This first tentative work revealed some interesting facts about the cultural biases underlying the overall structure of Wikipedia. Not surprisingly it seems clear that it is a resource strongly biassed towards Western culture and history. In this sense it will be interesting to apply the very same experiment to every local Wikipedia resource online to highlight local cultural, historical or political biasses. Furthermore, although superficially similar, PageRank and HITS algorithm seem to grasp whole different classes of concepts that deserve further analysis.

One further step in the analysis of these first raw results will be to compare and classify them by means of additional "semantic" information. As a first stab Wordnet categories of hyponimy and hypernimy could be used to cluster the results and display quantitative properties with respect to the two chosen metrics.

We believe that, in spite of the conceptual simplicity of the experience we lead on Wikipedia data, these first results show undoubtedly how network analysis tools can be successfully applied to provide some non trivial, quantitatively grounded insights into such a particular and interesting artifact of human knowledge.

It should be pretty easy to run the algorithm again, both to confirm the stats on the English Wikipedia and to apply it to other-language Wikis. One can also run cross-language correlations.
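
For reference, re-running the metric only needs the link graph and the textbook power-iteration PageRank; this is the standard algorithm, not necessarily the exact implementation from the paper:

def pagerank(links_from, damping=0.85, iterations=50):
    # links_from: dict mapping each page to the set of pages it links to
    pages = set(links_from) | {t for targets in links_from.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links_from.get(page, set())
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:                                     # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

ranks = pagerank({"A": {"B"}, "B": {"A", "C"}, "C": set()})
print(sorted(ranks.items(), key=lambda kv: -kv[1]))

Running the same function on the link index built from another language's dump is all the cross-language comparison needs on the PageRank side.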

I also took a look at papers citing this paper, since others might already have done some research in that direction. That led me to another paper: Pfeil, Ulrike, Panayiotis Zaphiris, and Chee Siang Ang. "Cultural differences in collaborative authoring of Wikipedia." Journal of Computer-Mediated Communication 12.1 (2006): 88-113.

Abstract
This article explores the relationship between national culture and computer-mediated communication (CMC) in Wikipedia. The articles on the topic "game" from the French, German, Japanese, and Dutch Wikipedia websites were studied using content analysis methods. Correlations were investigated between patterns of contributions and the four dimensions of cultural influences proposed by Hofstede (Power Distance, Collectivism versus Individualism, Femininity versus Masculinity, and Uncertainty Avoidance). The analysis revealed cultural differences in the style of contributions across the cultures investigated, some of which are correlated with the dimensions identified by Hofstede. These findings suggest that cultural differences that are observed in the physical world also exist in the virtual world.

And so on. Read for yourself!