Clear Language, Clear Mind

July 29, 2017

Why the race and intelligence question is still not resolved

Filed under: Differential psychology/psychometrics,Metascience — Tags: , — Emil O. W. Kirkegaard @ 17:39

It could probably have been resolved decades ago, and definitely within the last 10 years with genomic data, yet it still isn't. Why? Essentially, because of bias in academia. It begins early with data access bias, then comes authors' own publication bias, and finally editorial and reviewing bias (all caused by the lack of political/belief diversity in academia). Here's a recent example of the first kind, data access bias. How can this bias exist? Because academia has been hoarding the data that the public funded it to gather, refusing to share it openly with anyone. Well, actually academics just threw away most of the data ('we accidentally all the data'), but of the data they didn't let perish, most is safeguarded for privacy reasons so that others can't use it to publish without the person who collected it ('steal his ideas'), and especially so that nefarious characters (a.k.a. political opponents) cannot get it.

We were applying for a large dataset that would probably be able to settle the race and intelligence question, or at least provide very strong evidence for or against genetics as a cause. But instead we got:

I asked Prof. X about this project, and even though he does recognise its relevance, I am afraid that he declined to provide the data. The reason is that country C is facing a very delicate political situation at the moment, and race/ethnicity/ancestry is one of the topics at the core of debates and etc. Everyone in the country seems to be a little bit cautious when it comes to looking at ethnicity, especially regarding this such as intelligence or violence.
I am truly sorry that we will not be able to help at this opportunity, but I do wish the best of luck with your research.

And onwards we go towards applying for the next dataset.

July 27, 2017

Large scale sex, race etc. discrimination studies: what do they show?

Filed under: Psychology,Sociology — Tags: , , , , , — Emil O. W. Kirkegaard @ 01:56

Given enough motivation, QRPs, biased reviewing, and time, one can build an entire literature of studies proving anything. There's plenty of all of these to prove left-wing ideological beliefs (and libertarian ones in economics). However, it is much harder to QRP large-N datasets into giving the preferred results. So, what do large-scale studies show about sex, race, etc. biases in hiring, grading, and so on? Here's an attempt at a collection. Ping me anywhere if you know of more.

Teaching accreditation exams reveal grading biases favor women in male-dominated disciplines in France

Discrimination against women is seen as one of the possible causes behind their underrepresentation in certain STEM (science, technology, engineering, and mathematics) subjects. We show that this is not the case for the competitive exams used to recruit almost all French secondary and postsecondary teachers and professors. Comparisons of oral non–gender-blind tests with written gender-blind tests for about 100,000 individuals observed in 11 different fields over the period 2006–2013 reveal a bias in favor of women that is strongly increasing with the extent of a field’s male-domination. This bias turns from 3 to 5 percentile ranks for men in literature and foreign languages to about 10 percentile ranks for women in math, physics, or philosophy. These findings have implications for the debate over what interventions are appropriate to increase the representation of women in fields in which they are currently underrepresented.

National hiring experiments reveal 2:1 faculty preference for women on STEM tenure track

National randomized experiments and validation studies were conducted on 873 tenure-track faculty (439 male, 434 female) from biology, engineering, economics, and psychology at 371 universities/colleges from 50 US states and the District of Columbia. In the main experiment, 363 faculty members evaluated narrative summaries describing hypothetical female and male applicants for tenure-track assistant professorships who shared the same lifestyle (e.g., single without children, married with children). Applicants’ profiles were systematically varied to disguise identically rated scholarship; profiles were counterbalanced by gender across faculty to enable between-faculty comparisons of hiring preferences for identically qualified women versus men. Results revealed a 2:1 preference for women by faculty of both genders across both math-intensive and non–math-intensive fields, with the single exception of male economists, who showed no gender preference. Results were replicated using weighted analyses to control for national sample characteristics. In follow-up experiments, 144 faculty evaluated competing applicants with differing lifestyles (e.g., divorced mother vs. married father), and 204 faculty compared same-gender candidates with children, but differing in whether they took 1-y-parental leaves in graduate school. Women preferred divorced mothers to married fathers; men preferred mothers who took leaves to mothers who did not. In two validation studies, 35 engineering faculty provided rankings using full curricula vitae instead of narratives, and 127 faculty rated one applicant rather than choosing from a mixed-gender group; the same preference for women was shown by faculty of both genders. These results suggest it is a propitious time for women launching careers in academic science. Messages to the contrary may discourage women from applying for STEM (science, technology, engineering, mathematics) tenure-track assistant professorships.

Going blind to see more clearly: unconscious bias in Australian Public Service shortlisting processes

In characteristic spin language:

This study assessed whether women and minorities are discriminated against in the early stages of the recruitment process for senior positions in the APS, while also testing the impact of implementing a ‘blind’ or de-identified approach to reviewing candidates. Over 2,100 public servants from 14 agencies participated in the trial. They completed an exercise in which they shortlisted applicants for a hypothetical senior role in their agency. Participants were randomly assigned to receive application materials for candidates in standard form or in de-identified form (with information about candidate gender, race and ethnicity removed). We found that the public servants engaged in positive (not negative) discrimination towards female and minority candidates:

• Participants were 2.9% more likely to shortlist female candidates and 3.2% less likely to shortlist male applicants when they were identifiable, compared with when they were de-identified.

• Minority males were 5.8% more likely to be shortlisted and minority females were 8.6% more likely to be shortlisted when identifiable compared to when applications were de-identified.

• The positive discrimination was strongest for Indigenous female candidates who were 22.2% more likely to be shortlisted when identifiable compared to when the applications were de-identified.

Interestingly, male reviewers displayed markedly more positive discrimination in favour of minority candidates than did female counterparts, and reviewers aged 40+ displayed much stronger affirmative action in favour of both women and minorities than did younger ones. Overall, the results indicate the need for caution when moving towards ’blind’ recruitment processes in the Australian Public Service, as de-identification may frustrate efforts aimed at promoting diversity.

Funnel plot for “Racial Bias in Mock Juror Decision-Making”

Meta-analysis of juror decision making studies finds that Whites show no own-group bias, but Blacks do.

June 27, 2016

Cognitive ability, surveys and measurement error: possible bias in GxE studies

Filed under: Genetics / behavioral genetics — Tags: , , — Emil O. W. Kirkegaard @ 13:38

I saw this paper at random:

The consequences of heavy alcohol use remain a serious public health problem. Consistent evidence has demonstrated that both genetic and social influences contribute to alcohol use. Research on gene-environment interaction (GxE) has also demonstrated that these social and genetic influences do not act independently. Instead, certain environmental contexts may limit or exacerbate an underlying genetic predisposition. However, much of the work on GxE and alcohol use has focused on adolescence and less is known about the important environmental contexts in young adulthood. Using data from the young adult wave of the Finnish Twin Study, FinnTwin12 (N = 3,402), we used biometric twin modeling to test whether education moderated genetic risk for alcohol use as assessed by drinking frequency and intoxication frequency. Education is important because it offers greater access to personal resources and helps determine one’s position in the broader stratification system. Results from the twin models show that education did not moderate genetic variance components and that genetic risk was constant across levels of education. Instead, education moderated environmental variance so that under conditions of low education, environmental influences explained more of the variation in alcohol use outcomes. The implications and limitations of these results are discussed.

It seems to me that such designs are biased towards finding interaction effects because:

  • Filling out surveys requires cognitive ability; smarter people will fill out surveys with fewer errors (e.g. will misunderstand fewer questions).
  • Making errors in filling out surveys results in measurement error.
  • Measurement error reduces heritability estimates.
  • Cognitive ability correlates with, and is causally related to, many things, but especially educational attainment.

Or, to put it more coherently: smarter people fill out surveys with fewer errors. Errors in filling out surveys result in smaller correlations with one's kin. Smaller kin correlations lead to lower estimates of heritability in behavioral genetic studies. So, heritability estimates will tend to be higher for smarter people. The same holds for any other correlate of cognitive ability, such as educational attainment.
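A minimal R sketch of this argument with made-up numbers (using a crude Falconer-type estimate rather than the biometric twin modeling of the paper): the true heritability is held constant, but survey error is made larger in the 'low education' group, and the estimated heritability is visibly attenuated there, mimicking a GxE moderation finding.

set.seed(1)
n = 50000                          # twin pairs per zygosity per group (made up)
h2 = 0.5; c2 = 0.2; e2 = 0.3       # true variance components, constant across groups

twin_cor = function(r_a, error_sd) {
  a1 = rnorm(n); a2 = r_a * a1 + sqrt(1 - r_a^2) * rnorm(n)  # additive genetic parts
  shared = rnorm(n)                                          # shared environment
  y1 = sqrt(h2) * a1 + sqrt(c2) * shared + sqrt(e2) * rnorm(n)
  y2 = sqrt(h2) * a2 + sqrt(c2) * shared + sqrt(e2) * rnorm(n)
  # observed scores = true scores + survey error
  cor(y1 + rnorm(n, sd = error_sd), y2 + rnorm(n, sd = error_sd))
}

falconer_h2 = function(error_sd) 2 * (twin_cor(1, error_sd) - twin_cor(0.5, error_sd))

falconer_h2(0.2)   # "high education": little survey error, estimate close to .50
falconer_h2(0.8)   # "low education": more survey error, estimate clearly lower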

June 16, 2016

Measurement error and behavioral genetics in criminology

I am watching Brian Boutwell’s (Twitter, RG) talk at a recent conference and this got me thinking.

What are we measuring?

As far as I know, there are typically two outcome variables used in criminological studies:

  1. Official records convictions.
  2. Self-reported criminal or anti-social behavior.

But exactly what trait are we trying to measure? It seems to me that we (or at least I!) are really interested in measuring something like the tendency to break laws in ways that are harmful to other people. Harmful is here used in a broad sense. Stealing something may not always cause someone harm, but it does (usually) deprive them unfairly of their property. Stealing is not always wrong, but it usually is. Let's call the construct we want to measure harmful criminal behavior.

Measurement error: two types

Before going on, it is necessary to distinguish between the two types of measurement error in classical test theory:

  1. Random measurement error.
  2. Systematic measurement error.

Random measurement error is by definition error in measurement that is not correlated with anything else at all (sampling error aside). Conceptually we can think of it as adding random noise to our measurements. A simple, everyday example would be a study examining the relationship between height and GPA for ground/elementary school students. Suppose we obtain access to a school and measure the height of all the students using a measuring tape, and then obtain their GPAs from the school administration. Random measurement error here would be if we used dice to pick random numbers and added these to or subtracted them from each student's height.
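A minimal R sketch of the dice example (all numbers made up): adding random noise to the measured heights shrinks their correlation with GPA, roughly by the square root of the reliability of the noisy measure.

set.seed(1)
n = 10000
height = rnorm(n, 150, 10)                    # cm; hypothetical pupils
gpa = 0.3 * (height - 150) / 10 + rnorm(n)    # some true height-GPA association
dice = sample(-6:6, n, replace = TRUE)        # "dice" noise added to the tape measurements

cor(height, gpa)          # correlation using true height
cor(height + dice, gpa)   # slightly smaller: random error only attenuates, never inflates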

Systematic measurement error (also called bias) is different. Suppose we are measuring the ability of persons to sneak past a guard post because we want to recruit a team of James Bond-type super spies. We conduct the experiment by having people try to sneak past a guard post. Because we have a lot of people to test, our experiment runs all day, beginning in the early morning and ending in the evening. Each individual gets three attempts to sneak past the guard post, and we measure their ability as the number of times they succeeded (so possible scores are 0-3). We assign their trials in order of their birthdays: people born early in the year take their trials in the early morning. Because it is easier to see when the sun is higher in the sky, the individuals who happen to be born very early or late in the year have an advantage: it is more difficult for the guards to spot them when it is darker. Someone who successfully sneaked past the guards three times in the evening is not necessarily at the same skill level as someone who did so three times around noon. There is a systematic error in the measurement of sneaking ability related to the time of testing, which is furthermore related to the persons' birthdays.

Problems with official records

Using official records as a measure of harmful criminal behavior has a big problem: they often include convictions for things that aren't wrong (e.g. drug use or sex work). Ideally, we don't care about convictions for things like smoking cannabis because, in a sense, this isn't a real crime: it's just the government that is evil. In the same way, homosexual sex or even oral sex is not a crime anymore, and was not a real crime back when it was illegal (overview of US 'sodomy' laws). There is a moral dimension to what one is trying to measure, unless one just wants to go with the construct of 'any behavior that the present-day state in this country happens to have criminalized'.

Furthermore, official records are based on court decisions (and pleas). Court decisions are in turn the result of the police taking up a case. If the police are biased — rightly or wrongly — in their decision about which cases to pursue, this will give rise to systematic measurement error.

Since the police do not have infinite resources, they will not pursue every case they know of. They probably won't even pursue every case they know of that they think they can win in court. There is thus an inherent randomness in which cases they pursue, i.e. random measurement error.

Worse, which cases the police pursue may depend on irrelevant things, like whether the police leadership has set a goal for the number of cases of a given type that must be pursued every year. This practice seems to be fairly common, and yet it results in serious distortions in the use of police resources. In Denmark, the police often have such goals for biking violations (say, biking on the sidewalk). The result is that in December (if the goal is set on a yearly basis), if they are not close to meeting it, the police leadership will divert resources away from more important crimes, say break-ins, to hand out fines to people breaking biking laws. They may also lower the bar for what counts as a violation.

Even worse, they may focus on targeting violations that are not wrong but are easy to pursue. One police officer gave the following account (anonymously, in order to prevent reprisals from the leadership!) in response to a parliamentary discussion of the topic:

“When we are told that we must write up 120 bikers [hand out fines to] in the next 14 days, then we don’t place ourselves in the pedestrian area while there are pedestrians, and when the bikers may cause problems. No, we take them in the morning when they bike through the empty pedestrian area on their way to work, because then we get to the 120 number more quickly. In other words, we do it for the numbers’ sake and not for the sake of traffic safety.”

This kind of police behavior induces both random and systematic measurement error in the official records. For instance, people who happen to bike to work and whose workplace is on the opposite side of a pedestrian area are more likely to receive such fines.

Measurement error, self-rating and the heritability of personality traits

While personality is probably not really that simple to summarize, most research on personality uses some variant of the big five/OCEAN model (use this test). Using such measures, it has generally been found that the heritability of the OCEAN traits is around 40%. Lots of room for environmental effects, surely. Unfortunately, most of the non-heritable variance falls in the 'everything else' category.

But these results are based on self-rated personality and are not even corrected for random measurement error, which is usually easy enough to do. Suppose we correct for random measurement error; then perhaps we get to 50% heritability. This is because (almost?) any kind of measurement error biases heritability downwards.
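As a minimal sketch of that back-of-the-envelope correction (the reliability of .80 is purely illustrative, not taken from any study):

h2_observed = 0.40    # typical self-report estimate for an OCEAN trait
reliability = 0.80    # assumed test-retest reliability, for illustration only
h2_observed / reliability   # = 0.50, the "perhaps we get to 50%" above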

What about self-rating bias? Surely there are some personality traits that cause people to systematically rate themselves differently from how other people rate them, i.e. systematic measurement error. Even for height, a very simple trait, using self-reported height deflated the heritability estimate by about 4 percentage points compared with clinical measurement (from 91% to 87%), and clinical measurement is not free of random measurement error either. Furthermore, human height varies somewhat within a given day, which is a kind of systematic measurement error.

So, are other-ratings of personality better? There is a large meta-analysis showing that they are: other-ratings have stronger correlations with actual criterion outcomes than self-ratings do:

[Figures: correlations of other-ratings vs. self-ratings with academic and job performance criteria]

This suggests considerable systematic measurement error in the self-ratings. The counter-hypothesis: others' ratings of one's personality, while not actually more accurate than self-ratings, causally influence the chosen outcomes, so that it merely appears that other-ratings are better. E.g. teachers/supervisors give higher grades/performance ratings to those they incorrectly judge to be more open-minded, due to some kind of halo effect. I don't know of any research on this question.

Still, what do we find if we analyze the heritability of personality using other-ratings and especially the combination of self- and other-ratings? We get this:

[Figure: heritability estimates based on combined self- and other-ratings]

A mean heritability of 81% for the OCEAN traits. As in the height study, there was evidence of heritable influence on systematic self-rating error (53% in this study; the height study found 36%, but with limited precision).

Conclusion: measurement error and criminology

Back to criminology. We have seen that:
  1. Official records have serious problems with measuring the right construct (harmful criminal behavior), probably suffer from lots of random measurement error, and probably some systematic measurement error as well.
  2. Self-ratings suffer from systematic measurement error.
  3. Measurement error biases estimates of heritability downwards.
We combine them and derive the conclusion: heritabilities of harmful criminal behavior are probably seriously underestimated.
Questions for future research:
  • Locate or do behavioral genetic studies of crime based on multiple methods and other-ratings. What do they show?
  • Find evidence to determine whether the higher validity of other ratings is due to their higher precision or due to causal halo effects.

March 2, 2016

Correcting for n-level discretization?

Filed under: Math/Statistics — Tags: , , , , , — Emil O. W. Kirkegaard @ 07:17

When one has a continuous variable and then cuts it into bins (discretization) and correlates it with some other variable, the observed correlation is biased downwards to some extent. The extent depends on the cut-off value(s) used and the number of bins. Normally, one would use polychoric-type methods (better called latent correlation estimation methods) to estimate what the correlation would have been.

However, for the purpose of meta-analysis, studies often report Pearson correlations with discretized variables (e.g. education measured on 4 levels). They also don't share their data, so one cannot calculate the latent correlations oneself. Often, the authors won't know what they are if one requests them. Thus, in general one needs to correct for the problem using what one can see. Hunter and Schmidt, in their meta-analysis book (Methods of Meta-Analysis), cover how to correct for dichotomization, i.e. the special case of discretization with 2 levels. However, they do not seem to give a general formula for the problem.

I searched around but could not find anything useful on the topic. Thus, I attempted an empirical solution. I generated some data, then cut it into 2-50 equal-range bins and correlated these binned versions with the original. This gives the numbers one can use to correct observed values. However, I don't understand why the worst bias is seen for 3 levels instead of 2. See the plot below:

[Figure: correlation with the continuous variable as a function of the number of levels]

The X axis has the number of levels (bins, created with cut) and the Y axis the correlation with the continuous version. The points are based on a very large sample, so this is not sampling error. Yet, we see that the bias is apparently worst for L=3, not L=2 as I was expecting.

The blue line is a LOESS smoothed fit, while the red line is a logistic-like fit done with nls of the form y = 1/(1 + a*exp(-levels*b)). The fitted parameter values were a = 0.644 and b = 0.275. The reason I wanted a function is that sometimes one has to average sources with different numbers of levels (both within and between studies), which means that one must sometimes correct for a non-integer number of levels. The value to correct for cannot then be found by simulation alone, since one obviously cannot chop data into e.g. 3.5 bins. Using the logistic function, one gets a value of 0.802.
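Here is a minimal R sketch of this simulation (my own reconstruction, not the original script); the exact fitted values will differ slightly from those above depending on the seed and sample size.

set.seed(1)
x = rnorm(1e6)                     # very large sample, so sampling error is negligible
levels = 2:50
r = sapply(levels, function(L) cor(x, as.numeric(cut(x, breaks = L))))

# logistic-like fit of the form y = 1/(1 + a*exp(-levels*b))
fit = nls(r ~ 1 / (1 + a * exp(-levels * b)), start = list(a = 0.6, b = 0.3))
coef(fit)                                          # compare with a = 0.644, b = 0.275
predict(fit, newdata = data.frame(levels = 3.5))   # correction value for 3.5 levels (~0.80)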

I still don’t understand why the most bias is seen with L=3, not L=2 as expected.

I will update my visualization about this once I have found a good solution.

September 17, 2015

Rationality and bias test results

I’ve always considered myself a very rational and fairly unbiased person. Being aware of the general tendency for people to overestimate themselves (see also visualization of the Dunning-Kruger effect), I of course place somewhat less confidence in my own estimates of these traits. So what better to do than take some actual tests? I have previously taken the short test of estimation ability found in What Intelligence Tests Miss and got 5/5 right. This is actually slight evidence of underconfidence, since I was supposed to give 80% confidence intervals, which means I should have had 1 error, not 0. Still, with 5 items the precision is too low to say with much certainty whether I'm actually underconfident, but it does show that I'm unlikely to be strongly overconfident. Underconfidence is expected for smarter people. A project of mine is to make a much longer confidence-interval test so as to give more precise estimates. It should be fairly easy to find a lot of numbers for things and have people give 80% confidence intervals for them: the depth of the deepest ocean, the height of the tallest mountain, the age of the oldest living organism, the age of the Earth/universe, dates of various historical events such as the end of WW2 or the beginning of the American Civil War, and so on.

However, I recently saw an article about a political bias test. I think I'm fairly unbiased; as a result, my beliefs don't really fit into any mainstream political theory. This is as expected, because the major political ideologies were invented before we understood much about anything, which makes it unlikely that they got everything right. More likely, they each get some things right and some things wrong.

Here’s my test results for political bias:

[Screenshots: political knowledge and political bias test results]

So in centiles: >= 99th for knowledge of American politics. This is higher than I expected (around 95th). Since I’m not a US citizen, presumably the test has some bias against me. For bias, the centile is <= 20th. This result did not surprise me. However, since there is a huge floor effect, this test needs more items to be more useful.

Next up, I looked at the website and saw that they had a number of other potentially useful tests. One is about common misconceptions. Now since I consider myself a scientific rationalist, I should do fairly well on this. Also because I have read somewhat extensively on the issue (Wikipedia list, Snopes and 50 myths of pop psych).

Unfortunately, they present the results in a verbose form. Pasting 8 images would be excessive. I will paste some of the relevant text:

1. Brier score

Your total performance across all quiz and confidence questions:
85.77%

This measures your overall ability on this test. This number above, known as a “Brier” score, is a combination of two data points:

How many answers you got correct
Whether you were confident at the right times. That means being more confident on questions you were more likely to be right about, and less confident on questions you were less likely to get right.

The higher this score is, the more answers you got correct AND the more you were appropriately confident at the right times and appropriately uncertain at the right times.

Your score is above average. Most people’s Brier’s scores fall in the range of 65-80%. About 5% of people got a score higher than yours. That’s a really good score!

2. Overall accuracy

Answers you got correct: 80%

Out of 30 questions, you got 24 correct. Great work B.S. detector! You performed above average. Looks like you’re pretty good at sorting fact from fiction. Most people correctly guess between 16 and 21 answers, a little better than chance.

Out of the common misconceptions we asked you about, you correctly said that 12/15 were actually B.S. That’s pretty good!

No centiles are provided, so it is not evident how this compares to others.

3. Total points

[Screenshot: total points score]

As for your points, this is another way of showing your ability to detect Fact vs. B.S. and your confidence accuracy. The larger the score, the better you are at doing both! Most people score between 120 and 200 points. Looks like you did very well, ending at 204 points.

4. Reliability of confidence intervals

Reliability of your confidence levels: 89.34%

Were you confident at the right times? To find out, we took a portion of your earlier Brier score to determine just how reliable your level of confidence was. It looks like your score is above average. About 10% of people got a score higher than yours.

This score measures the link between the size of your bet and the chance you got the correct answer. If you were appropriately confident at the right times, we’d expect you to bet a lot of points more often when you got the answer correct than when you didn’t. If you were appropriately uncertain at the right times, we’d expect you to typically bet only a few points when you got the answer wrong.

You can interpret this score as measuring the ability of your gut to distinguish between things that are very likely true, versus only somewhat likely true. Or in other words, this score tries to answer the question, “When you feel more confident in something, does that actually make it more likely to be true?”

5. Confidence and accuracy

[Screenshot: accuracy by number of points bet]

When you bet 1-3 points your confidence was accurate. You were a little confident in your answers and got the answer correct 69.23% of the time. Nice work!

When you bet 4-7 points you were underconfident. You were fairly confident in your answers, but you should have been even more confident because you got the answer correct 100% of the time!

When you bet 8-10 points your confidence was accurate. You were extremely confident in your answer and indeed got the answer correct 100% of the time. Great work!

So, again there is some evidence of underconfidence. E.g. for the questions on which I bet 0 points, I still had 60% accuracy, though it should have been 50%.

6. Overall confidence

Your confidence: very underconfident

You tended to be very underconfident in your answers overall. Let’s explore what that means.

In the chart above, your betting average has been translated into a new score called your “average confidence.” This represents roughly how confident you were in each of your answers.

People who typically bet close to 0 points would have an average confidence near 50% (i.e. they aren’t confident at all and don’t think they’ll do much better than chance).
People who typically bet about 5 points would have an average confidence near 75% (i.e. they’re fairly confident; they might’ve thought there was a ¼ chance of being wrong).
People who typically bet 10 points would have an average confidence near 100% (i.e. they are extremely confident; they thought there was almost no chance of being wrong).

The second bar is the average number of questions you got correct. You got 24 questions correct, or 80%.

If you are a highly accurate better, then the two bars above should be about equal. That is, if you got an 80% confidence score, then you should have gotten about 80% of the questions correct.

We said you were underconfident because on average you bet at a confidence level of 69.67% (i.e. you bet 3.93 on average), but in reality you did better than that, getting the answer right 80% of the time.

In general, the results were in line with my predictions: high ability + general overestimation + an imperfect correlation between self-rated and actual ability results in underconfidence. My earlier result indicated some underconfidence as well, and the longer test gave the same result. Apparently, I need to be more confident in myself. This is despite the fact that I scored 98 and 99 on the assertiveness facet of the OCEAN test in two test-taking sessions some months apart.
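A minimal R sketch of that argument with made-up numbers: everyone overestimates themselves a bit on average, but because self-estimates correlate only imperfectly with actual ability, people near the top of actual ability still end up underestimating themselves.

set.seed(1)
n = 1e5
ability = rnorm(n)                                           # actual ability (z-scores)
estimate = 0.3 + 0.6 * ability + rnorm(n, sd = sqrt(1 - 0.6^2))  # self-estimate: r = .6, +0.3 general overestimation

mean(estimate - ability)              # positive: people overestimate on average
top = ability > qnorm(0.98)           # people near the top of actual ability
mean(estimate[top] - ability[top])    # negative: the top group underestimates itself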

I did take their additional rationality test, but since it was just based on pop-psych Kahneman-style points, it doesn't seem very useful. It also uses typological thinking, classifying people into 16 types, which is clearly wrong-headed. It found my weakest side to be the planning fallacy, but this isn't actually the case, because I'm pretty good at getting papers and projects done on time.

Update 2018-11-10

David Pinsen (a trader with skin in the game when it comes to making predictions) sent me an additional test to try: http://confidence.success-equation.com/. It follows the same approach as the confidence test with the US states, i.e. a bunch of forced TRUE/FALSE questions plus a question about certainty. My results were therefore unsurprisingly similar to before (close to perfect calibration), but somewhat overconfident this time (by 7 percentage points). The test was too short for the number of confidence options; they should have combined some of them. Annoyingly, they didn't provide a Brier score, but I calculated it to be .20. It is surprisingly difficult to find guidelines for this metric, but in Superforecasting, a case story is given of a superforecaster (i.e. a top predictor) called Doug. He had an initial score of .22 and was in the top 5 that year. However, this was calculated with the multi-class version of the formula (see the FAQ), so his score was .11 on the scale I used. The average forecaster was at .37, or .185 on the scale I used. So it appears I did slightly worse than the average forecaster in the Good Judgment Project. This is probably pretty good considering that theirs is a self-selected group of people who spend a lot of time making forecasts based on any information they choose, while I didn't use any help here, just guessed based on whatever was in my head already. The test above provided a Brier score as well, but gave it as a percentage. This might be a reversed version (i.e. (1-Brier)*100), and if so, my new score would be 81.5% compared to 86.8% before.

Note: the 60% point is based on n = 4 of which I got 1 right.
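A minimal sketch of the Brier arithmetic referred to above, with toy numbers rather than the actual answers: the binary formula averages (p - outcome)^2, while the multi-class formula sums over both outcome categories, which for a binary question gives exactly twice the binary value (hence .22 vs .11).

p = c(0.9, 0.6, 0.7, 0.5)      # stated probabilities that "TRUE" is the right answer
outcome = c(1, 0, 1, 1)        # 1 = "TRUE" was indeed the right answer

brier_binary = mean((p - outcome)^2)
brier_multiclass = mean((p - outcome)^2 + ((1 - p) - (1 - outcome))^2)
c(brier_binary, brier_multiclass)   # the multi-class value is exactly twice the binary one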

December 24, 2014

Correlations and likert scales: What is the bias?

Filed under: Math/Statistics — Tags: , , , — Emil O. W. Kirkegaard @ 16:27

A person on ResearchGate asked the following question:

How can I correlate ordinal variables (attitude Likert scale) with continuous ratio data (years of experience)?
Currently, I am working on my dissertation which explores learning organisation characteristics at HEIs. One of the predictor demographic variables is the indication of the years of experience. Respondents were asked to fill in the gap the number of years. Should I categorise the responses instead? as for example:
1. from 1 to 4 years
2. from 4 to 10
and so on?
or is there a better choice/analysis I could apply?

My answer may also be of interest to others, so I post it here as well.

Normal practice is to treat likert scales as continuous variables even though they are not. As long as there are >=5 options, the bias from discreteness is not large.

I simulated the situation for you. I generated two variables with continuous random data from two normal distributions with a correlation of .50, N=1000. Then I created likert scales with varying numbers of levels from the second variable. Then I correlated all these variables with each other.

Correlations of continuous variable 1 with:

continuous2 0.5
likert10 0.482
likert7 0.472
likert5 0.469
likert4 0.432
likert3 0.442
likert2 0.395

So you see, introducing discreteness biases correlations towards zero, but not by much as long as the likert scale has >=5 levels. You can correct for the bias by multiplying by the correction factor if desired:

Correction factor:

continuous2 1
likert10 1.037
likert7 1.059
likert5 1.066
likert4 1.157
likert3 1.131
likert2 1.266

Psychologically, if your data does not make sense as an interval scale, i.e. if the difference between options 1-2 is not the same as between options 3-4, then you should use Spearman’s correlation instead of Pearson’s. However, it will rarely make much of a difference.

Here’s the R code.

#load library
library(MASS)
#simulate dataset of 2 variables with correlation of .50, N=1000
simul.data = mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical = TRUE)
simul.data = as.data.frame(simul.data); colnames(simul.data) = c("continuous1", "continuous2")
#divide the second variable into bins of equal width
simul.data["likert10"] = as.numeric(cut(unlist(simul.data[2]), breaks=10))
simul.data["likert7"] = as.numeric(cut(unlist(simul.data[2]), breaks=7))
simul.data["likert5"] = as.numeric(cut(unlist(simul.data[2]), breaks=5))
simul.data["likert4"] = as.numeric(cut(unlist(simul.data[2]), breaks=4))
simul.data["likert3"] = as.numeric(cut(unlist(simul.data[2]), breaks=3))
simul.data["likert2"] = as.numeric(cut(unlist(simul.data[2]), breaks=2))
#correlations
round(cor(simul.data), 3)
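A short follow-up sketch, continuing with the same simul.data object, computes the correction factors listed above (the target correlation divided by each attenuated one):

#correction factors: the target correlation (.50) divided by each observed correlation
observed = cor(simul.data)["continuous1", -1]
round(0.5 / observed, 3)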

Powered by WordPress