Correcting for n-level discretization?

When one has a continuous variable and then cuts it into bins (discretization) and correlates it with some other variable, the observed correlation is biased downwards to some extent. The extent depends on the cut-off value(s) used and the number of bins. Normally, one would use polychoric-type methods (better called latent correlation estimation methods) to estimate what the correlation would have been.

However, for the purpose of meta-analysis, studies often report Pearson correlations with discretized variables (e.g. education as measured by 4 levels). They also don’t share their data, so one cannot calculate the latent correlations oneself. Often, authors won’t know what they are if one requests them. Thus, in general one needs to correct for the problem using what one can see. Hunter and Schmidt, in their meta-analysis book (Methods of Meta-Analysis), cover how to correct for dichotomization, i.e. the special case of discretization with 2 levels. However, they do not seem to give a general formula for more than 2 levels.
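For reference, here is a minimal sketch of the 2-level case in Python. It uses the usual form of the dichotomization correction, which assumes the underlying variable is normal: the observed correlation is divided by the attenuation factor phi(c)/sqrt(p*(1-p)), where p is the proportion in one group and c is the corresponding normal cut point. The function names are my own, for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def dichotomization_attenuation(p):
    """Attenuation factor for splitting a normal variable into two groups
    with proportions p and 1 - p (hypothetical helper name)."""
    nd = NormalDist()            # standard normal
    c = nd.inv_cdf(p)            # z-score of the cut point
    return nd.pdf(c) / sqrt(p * (1 - p))

def correct_for_dichotomization(r_obs, p):
    """Divide the observed correlation by the attenuation factor."""
    return r_obs / dichotomization_attenuation(p)

# A median split (p = 0.5) attenuates correlations by a factor of about 0.80.
print(round(dichotomization_attenuation(0.5), 3))
print(round(correct_for_dichotomization(0.30, 0.5), 3))
```

The more uneven the split, the smaller the factor and the larger the correction.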

I searched around but could not find anything useful on the topic, so I attempted an empirical solution. I generated some data, then cut it into 2-50 equal-range bins, and correlated each binned version with the original. This gives the factor one can use to correct observed values. However, I don’t understand why the worst bias is seen for 3 levels instead of 2. See the plot below:

The X axis has the number of levels (bins, created with cut) and the Y axis the correlation with the continuous version. The points are based on a very large sample, so the pattern is not due to sampling error. Yet, we see that the bias is apparently worst for L=3, not L=2 as I was expecting.
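The simulation described above can be sketched roughly as follows. This is a Python approximation, not the original code: it assumes standard normal data and equal-width bins spanning the sample minimum to maximum, and the sample size, seed and function names are arbitrary choices of mine.

```python
import numpy as np

def equal_range_codes(x, n_levels):
    """Cut x into n_levels equal-width bins spanning [min(x), max(x)]
    and return integer bin codes 0 .. n_levels - 1."""
    edges = np.linspace(x.min(), x.max(), n_levels + 1)
    # Only the interior edges are needed to assign bins.
    return np.digitize(x, edges[1:-1])

def attenuation_by_levels(n=200_000, levels=range(2, 51), seed=1):
    """Correlation between a continuous variable and its binned version,
    for each number of levels (assumes standard normal data)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    return {k: float(np.corrcoef(x, equal_range_codes(x, k))[0, 1])
            for k in levels}

r_by_levels = attenuation_by_levels()
for k in (2, 3, 4, 5, 10, 50):
    print(k, round(r_by_levels[k], 3))
```

Note that equal-width bins place the cut points relative to the sample minimum and maximum, so with 3 bins on normal data the two outer bins hold only the tails. Rerunning the sketch with equal-frequency (quantile) bins might be one way to probe the L=3 result.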

The blue line is a LOESS smoothed fit, while the red line is a logistic-like fit I did with nls, of the form y = 1/(1 + a*exp(-levels*b)). The fitted parameter values were a=0.644 and b=0.275. The reason I wanted a function is that sometimes one has to average sources with different numbers of levels (both within and between studies). This means that one must sometimes correct for a non-integer number of levels. The correction value for that case cannot be found by simulation, since one obviously cannot chop data into e.g. 3.5 bins. For 3.5 levels, the logistic function gives a value of 0.802.
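As a small illustration, the fitted function can be evaluated for any number of levels and then used to correct an observed correlation by dividing by it. The sketch below uses the rounded parameter values reported above, so results may differ slightly in the third decimal from values computed with the unrounded fit. The function names and the example correlation of 0.25 are hypothetical.

```python
from math import exp

# Rounded parameters of the logistic-like fit reported in the text.
A = 0.644
B = 0.275

def attenuation_factor(levels):
    """Expected correlation between the continuous variable and its
    binned version, according to y = 1/(1 + a*exp(-levels*b))."""
    return 1.0 / (1.0 + A * exp(-levels * B))

def correct_correlation(r_obs, levels):
    """Divide an observed correlation by the fitted attenuation factor."""
    return r_obs / attenuation_factor(levels)

print(round(attenuation_factor(3.5), 2))         # about 0.80
print(round(correct_correlation(0.25, 3.5), 3))  # hypothetical observed r = 0.25
```

Because the factor is always below 1, the corrected correlation is always larger in magnitude than the observed one.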

I still don’t understand why the most bias is seen with L=3, not L=2 as expected.

I will update my visualization about this once I have found a good solution.
