Gender differences in humor abilities? - a data mining study


  • Emil Kirkegaard, project lead, idea, programming



Some people (notably Christopher Hitchens, writing here and video here) have claimed that there are gender differences in humor ability in humans, and have offered evolutionary reasons for why this might be so.


We want to test this using data mining from the world's best information source, Wikipedia!



We follow this general method:

  1. Find some lists of funny people

  2. Count males and females

  3. Do some stats

  4. Present stats in a useful way


Lists of funny people

There are a few candidates for (1), the obvious one being:


But we want to explore cross-cultural differences, if any, as well, so one can use lists like:


And many more such lists can be found via Google using the keyword “list of comedians”.


There are also lists of awards for doing funny stuff, like writing a funny book, for instance:


Counting males and females automatically

How might one do this? A simple method is to count the (English) gendered personal pronouns on the funny person's page, specifically instances of “he”, “his”, “him”, “she”, “her” and “hers” (as whole words, not as parts of other words). Since an article first mentions the person by name and then refers back to him or her with these pronouns, the counts can be used to determine the gender of the article's subject. Does this work? We picked a comedian at random, Bud Abbott, counted words, and the results are:

  • 6 matches for “ he “

  • 0 matches for “ she “

  • 22 matches for “ his “

  • 1 match for “ her “

  • 1 match for “ him “

  • 0 matches for “ hers “
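The counting step can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `count_pronouns` and the sample sentence are made up for the example, and the input is assumed to be the plain text of an article that has already been downloaded.

```python
import re
from collections import Counter

# The six pronouns counted in the Bud Abbott example above.
PRONOUNS = ("he", "his", "him", "she", "her", "hers")

# \b marks a word boundary, so "the" or "this" do not count as "he" or "his".
PATTERN = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)

def count_pronouns(text):
    """Return a dict with the number of whole-word matches per pronoun."""
    counts = Counter(match.lower() for match in PATTERN.findall(text))
    return {pronoun: counts[pronoun] for pronoun in PRONOUNS}

if __name__ == "__main__":
    # Made-up sample text, only to show the call.
    sample = "He said she took his hat. Her friend told him the hat was hers, and he agreed."
    print(count_pronouns(sample))
```

Matching on word boundaries rather than on surrounding spaces also catches pronouns at the start of a sentence or next to punctuation, which a search for “ he “ with spaces would miss.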


This result is very clearly in favor of the person in the article being male. There are various ways to deal with short articles that contain only a few gendered personal pronouns. One could simply go by the majority of gendered pronouns and, if there is no majority, ignore the page. This might introduce some measurement error, but the effect is probably not large. Another idea is to ignore pages with fewer than 5 gendered personal pronouns. A third idea is to use only the first gendered pronoun on the page. Presumably, it will usually refer to the subject of the article.
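The three decision rules just described could look roughly like this in Python. The function names (`by_majority`, `by_threshold`, `by_first`) are hypothetical, and returning `None` for an ignored page is an assumption of this sketch.

```python
import re

MALE = {"he", "his", "him"}
FEMALE = {"she", "her", "hers"}
PATTERN = re.compile(r"\b(he|his|him|she|her|hers)\b", re.IGNORECASE)

def pronouns(text):
    """List the gendered pronouns in the order they appear in the text."""
    return [word.lower() for word in PATTERN.findall(text)]

def by_majority(text):
    """Majority rule: None (page ignored) when the counts are tied."""
    found = pronouns(text)
    male = sum(word in MALE for word in found)
    female = len(found) - male
    if male == female:
        return None
    return "male" if male > female else "female"

def by_threshold(text, minimum=5):
    """Majority rule, but pages with fewer than `minimum` pronouns are ignored."""
    if len(pronouns(text)) < minimum:
        return None
    return by_majority(text)

def by_first(text):
    """Use only the first gendered pronoun on the page."""
    found = pronouns(text)
    if not found:
        return None
    return "male" if found[0] in MALE else "female"
```

Note that the rules can disagree on short pages: for the text "He met her." the majority rule finds a tie and ignores the page, while the first-pronoun rule answers "male".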

All of the above methods should be tried, so as to test which of them is best (by checking how well their results correlate with each other), also taking their computational cost into account.
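One simple way to compare the methods is the share of pages on which two of them give the same answer. The sketch below assumes each method's results are stored as a list of labels in the same page order, with `None` for ignored pages. Pairwise agreement is a choice made for this example (it is not necessarily the comparison the project used), and the labels in the usage part are invented.

```python
from itertools import combinations

def agreement(labels_a, labels_b):
    """Share of pages where both methods gave a label and the labels match."""
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)

def pairwise_agreement(results):
    """Agreement for every pair of methods in a {method name: labels} dict."""
    return {(x, y): agreement(results[x], results[y])
            for x, y in combinations(results, 2)}

if __name__ == "__main__":
    # Invented labels for four pages, only to show the call.
    results = {
        "majority": ["male", "female", None, "male"],
        "first": ["male", "male", "male", "male"],
    }
    print(pairwise_agreement(results))
```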


We note that this requires the comedian to have a page.



Stats are rather easy to do for this. We expect that a simple confidence interval for the share of women on each list will be sufficient.
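As an illustration of such an interval, here is a sketch of the Wilson score interval for a proportion. The text does not say which interval is meant, so Wilson is an assumption of this example; it behaves sensibly even when the share is close to 0 or 1, which may well happen here. The function name and the default z = 1.96 (roughly 95% confidence) are also choices made for the sketch.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width
```

It would be called as `wilson_interval(number_of_women, number_classified)` for each list, where both arguments are counts produced by the pronoun method.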



It might be possible to control for age in the lists of funny people. This would be done by also gathering the birth date/age of each funny person. Perhaps there has been a change in the gender ratio over the years. Especially interesting will be the numbers for people born after the 1970s, because of the effect, if any, of second-wave feminism.
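If birth years are gathered, the gender ratio can be tabulated per birth decade. The sketch below assumes the data is available as (birth year, gender) pairs; that format, the function name and the example values are all made up for illustration.

```python
from collections import defaultdict

def female_share_by_decade(people):
    """people: iterable of (birth_year, gender), gender being 'male' or 'female'."""
    tallies = defaultdict(lambda: {"male": 0, "female": 0})
    for birth_year, gender in people:
        decade = (birth_year // 10) * 10  # e.g. 1897 -> 1890
        tallies[decade][gender] += 1
    return {decade: t["female"] / (t["male"] + t["female"])
            for decade, t in sorted(tallies.items())}

if __name__ == "__main__":
    # Invented entries, only to show the call.
    people = [(1895, "male"), (1897, "male"), (1972, "female"), (1975, "male")]
    print(female_share_by_decade(people))
```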



As of 6 Feb. 2013.

Raw data here. Code here.