One thing that bothers me about current research on gender differences in personality is the reference group problem. Researchers ask questions that require participants to compare themselves with some implicit group. Gerhard Meisenberg pointed out the problem with this approach for cross-cultural research:

A potentially remediable source of low effect sizes and poor reproducibility in cross-cultural studies of personality is the reference group effect: the tendency of respondents to compare themselves with others in their social environment. For example, subjects rating themselves on a Likert-type question such as “I work hard to accomplish my goals” have no absolute scale on which to judge themselves. They will most likely choose their answer by comparing themselves to others of their age and social class. A forced-choice scenario-based question such as “Would you prefer to spend your vacation on a popular beach or in a log cabin in the woods?” is less likely to elicit the reference group effect.

Opinions about the importance of the reference group effect vary widely. Some researchers believe that cross-cultural research of personality is fatally flawed because of the reference group effect. For example, Wood et al. (2012, p. 1275) conclude: “Personality judgments cannot be used to compare different populations when the population participants have different reference groups (as in cross-cultural research).” Also Heine et al. (2002) see the reference group effect as a serious obstacle: “Cross-cultural comparisons using subjective Likert scales are compromised because of different reference groups.” (p. 903) Other research, however, found that the reference group effect was too small to be noticed (Mõttus et al., 2012a). Thus the task is to develop personality inventories for cross-cultural research that minimize the reference group effect. This could be attempted by replacing judgments answered in a Likert-type format with forced-choice scenario-based questions. Another strategy is the use of anchoring vignettes to control for reference group effects (Mõttus et al., 2012a). These efforts may produce more reliable assessment tools in the future. For now, however, extreme caution has to be exercised when using the existing compilations of country-level personality traits.

Instead of forced-choice questions, I’d like to try another idea: ratio-scale measurements of the frequency of behaviors, obtained via self-report. Take neuroticism and crying. Gender differences in neuroticism usually show up, but they are not that large: d from .20 to .60 or so (1, 2). Here’s a table of gender differences in a number of traits measured using the 16PF (source). The d’s from this paper are larger because they are estimated for latent variables rather than for observed scores (which are biased downwards due to measurement error).
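To give a feel for how much measurement error can shrink an observed d, here is a minimal sketch of the classical correction for attenuation. It is a simpler stand-in for the latent variable modeling used in that paper, not a reproduction of it. The function name `disattenuate_d` and the numbers are my own illustrative assumptions, not values from any of the studies mentioned.

```python
import math


def disattenuate_d(d_observed, reliability):
    """Correct an observed d for unreliability of the trait measure.

    Assumes classical test theory: random error inflates the within-group
    SD by a factor of 1 / sqrt(reliability), which shrinks the observed d.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return d_observed / math.sqrt(reliability)


# Hypothetical inputs: an observed d of .40 on a scale with reliability .80.
d_latent = disattenuate_d(0.40, 0.80)
print(round(d_latent, 2))
```

The lower the reliability of the scale, the larger the gap between the observed and the corrected d.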


I was able to find a meta-analysis of crying frequency, a ratio-scale behavioral indicator of neuroticism, which found d=.67 among adults. This value is possibly biased downwards because the researchers computed d (the standardized mean difference) from arithmetic means and standard deviations. Crying frequency is very skewed, so they should have either used medians and median absolute deviations or log-transformed the data first.
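Here is a rough sketch of the three options on case-level data: the usual d, d after a log transform, and a median/MAD version. The function names and the two small samples are made up for illustration (they are not data from the meta-analysis), and the pooling of the two MADs by simple averaging is just one reasonable choice I am assuming here.

```python
import math
import statistics


def cohens_d(a, b):
    """Usual standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def log_d(a, b):
    """Cohen's d after log(1 + x), which tolerates counts of zero."""
    return cohens_d([math.log1p(x) for x in a], [math.log1p(x) for x in b])


def mad(xs):
    """Median absolute deviation from the median."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])


def robust_d(a, b):
    """Median difference divided by the averaged MAD.

    The factor 1.4826 rescales the MAD so that it matches the SD for
    normally distributed data.
    """
    pooled_mad = (mad(a) + mad(b)) / 2
    if pooled_mad == 0:
        raise ValueError("MAD is zero (too many tied values)")
    return (statistics.median(a) - statistics.median(b)) / (1.4826 * pooled_mad)


# Hypothetical crying episodes per month; each group has one heavy outlier.
women = [0, 1, 1, 2, 2, 3, 4, 5, 8, 30]
men = [0, 0, 0, 1, 1, 1, 2, 2, 3, 15]

print(cohens_d(women, men), log_d(women, men), robust_d(women, men))
```

In this made-up example the outliers inflate the standard deviations, so both the log-transformed and the median-based estimates come out larger than the ordinary d. That illustrates the worry, but it does not show that the same would hold for the real data. Note also that with count data that has many ties, the MAD can be zero, in which case the median-based version is undefined.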

Since we cannot really hope to obtain the case-level data necessary for this re-analysis, we need to collect new data. To do that, we need to know what to ask about. So I want to find ~3-5 behavioral indicators for each of the Big Five traits, or for other relevant aspects of personality. Here are my initial ideas:


Neuroticism:

  1. Frequency of crying.


Conscientiousness:

  1. Number of times punished for violating the law.
  2. Number of items of dirty clothing lying on the floor right now.
  3. In the last week, how many times were you more than 10 minutes late for an appointment or work?
  4. In the last 12 months, how many times did you fail to pay for something on time because you forgot about it (rent, phone, TV, internet bill, etc.)?
  5. In the last 14 days, on how many days did you brush your teeth at least once?
  6. [same as above, but at least twice]


Agreeableness:

  1. In the last 12 months, number of times donated money [at least 100 DKK] to charities.
  2. In the last 14 days, number of times had a heated discussion with someone (heated = involved shouting).
  3. In the last 14 days, number of times paid for something for someone without expecting the favor to be returned.


Openness:

  1. Psychedelic drug use (LSD, psilocin, ketamine, salvia, DMT, etc.).
  2. In the last 7 days, number of things read about on Wikipedia out of curiosity (approximate number).


Extraversion:

  1. Frequency of going to social happenings.
  2. Number of friends.

As can be seen, I need some help in coming up with more behavioral manifestations of personality.

Some of these items will measure a mix of traits (e.g. number of friends is presumably some mix of agreeableness and extraversion, A+E), which is not a problem for our purposes.

After we have come up with a few behaviors to ask about for each trait, I will collect some data on these as part of other planned research on gender differences and opinions regarding genders and gender politics. The planned sample size is about N=500.