Researcher degrees of freedom as sensitivity analysis

Researcher degrees of freedom refer to the choices researchers make when conducting a study. There are many choices to be made: where to collect data, which variables to include, and so on. However, a large subset of these choices concerns only how to analyze the data. Since I have now done hundreds of analyses rigorous enough to publish, I know exactly what this means. I will give some examples from a work in progress.

1. Which variables to use?

The dataset I began with contains 75 columns. Some of these are names and the like, but many of them are socioeconomic variables in a broad sense. Which should be used? I picked some by judgment call, guided by prior S factor studies, but I left out e.g. population density, mean age, and the percentages of the population who are below 16, of working age, and of old age. Should these have been included? Maybe.

2. What to do with City of London?

In the study, I examine the S factor among London boroughs. There are 32 boroughs plus the City of London. The CoL is fairly small, which can give rise to sampling error as well as effects related to its being a very peculiar administrative division.

Furthermore, many variables in the dataset lack data for CoL. So I was faced with the question of what to do with it. Some options: 1) exclude it, 2) use only the variables for which there is data for CoL, 3) use more variables than have data for CoL and impute the rest. I chose (1), but one might have gone with any of the three.
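The three options are easy to express in code. A minimal sketch with pandas, using an invented toy frame (the borough names are real, but the variables and numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: one variable ("income") is missing for City of London.
df = pd.DataFrame({
    "borough": ["City of London", "Camden", "Hackney"],
    "income": [np.nan, 30.0, 25.0],
    "crime_rate": [5.0, 3.0, 4.0],
})

# Option 1: exclude the case entirely.
opt1 = df[df["borough"] != "City of London"]

# Option 2: keep only the variables with complete data for CoL.
opt2 = df.dropna(axis=1)

# Option 3: keep all variables and impute the missing value
# (here a simple column mean; section 5 discusses fancier methods).
opt3 = df.fillna(df.drop(columns="borough").mean(numeric_only=True))
```

Each option yields a complete data matrix, but with a different trade-off between cases and variables.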

3. The extra crime data

I found another dataset with crime counts. I calculated per capita versions of these. The crime types exist at two levels of detail: broad and detailed. Which should be used? One could also have factor analyzed the data and used the general factor scores, or calculated a unit-weighted score (standardize all variables, then score cases as the average across variables). I used the detailed variables.
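The unit-weighted score mentioned above is simple to compute: z-score each column, then average across columns. A minimal sketch with invented numbers (rows would be boroughs, columns per capita crime rates):

```python
import numpy as np

def unit_weighted_score(X):
    """Standardize each column to z-scores, then average across columns."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return Z.mean(axis=1)

# Toy data: two perfectly correlated crime rates across three cases.
X = np.array([
    [2.0, 10.0],
    [4.0, 20.0],
    [6.0, 30.0],
])
scores = unit_weighted_score(X)
```

With perfectly correlated columns, the unit-weighted score equals the common z-score pattern of the inputs.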

4. The extra GCSE data

I found another dataset with GCSE measures. These exist for both genders together and for each gender alone. There are 9 different variables to choose from. Which should be used? The same options as before apply as well: factor scores or a unit-weighted average. I selected one for theoretical reasons (similarity to other scholastic variables, e.g. PISA) and because Jensen's method supported this choice.
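Jensen's method (the method of correlated vectors) amounts to correlating the indicators' factor loadings with their correlations with the candidate variable; a high positive coefficient supports the candidate as related to the general factor. A minimal sketch with invented numbers, not the actual loadings from the study:

```python
import numpy as np

# Hypothetical values: each indicator's loading on the S factor,
# and each indicator's correlation with a candidate GCSE measure.
loadings = np.array([0.9, 0.8, 0.7, 0.4])
r_with_candidate = np.array([0.85, 0.75, 0.65, 0.35])

# Jensen coefficient: the correlation between the two vectors.
jensen_r = np.corrcoef(loadings, r_with_candidate)[0, 1]
```

In this toy case the two vectors are perfectly linearly related, so the coefficient is 1.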

5. How to deal with missing data

Before factor analyzing the data, one faces the question of how to deal with missing data. Aside from CoL, a few other cases had some missing data. Some options: 1) exclude them, 2) impute with means, 3) impute with a best guess (various methods!). Which should be done? I used imputation by multiple regression; one could have used e.g. k nearest neighbors imputation instead.
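The regression-imputation idea can be sketched in a few lines: fit an OLS model for the incomplete column on the other columns using the complete cases, then predict the missing values. This is a single-column simplification (real implementations iterate over all incomplete columns), and the data are invented:

```python
import numpy as np

def regression_impute(X, col):
    """Fill NaNs in column `col` by regressing it on the other columns,
    fit on the complete cases only. Minimal single-column sketch."""
    X = X.copy()
    miss = np.isnan(X[:, col])
    other = [j for j in range(X.shape[1]) if j != col]
    # Design matrix with intercept, complete cases only.
    A = np.column_stack([np.ones((~miss).sum()), X[~miss][:, other]])
    beta, *_ = np.linalg.lstsq(A, X[~miss, col], rcond=None)
    # Predict the missing entries from the fitted coefficients.
    A_miss = np.column_stack([np.ones(miss.sum()), X[miss][:, other]])
    X[miss, col] = A_miss @ beta
    return X

# Toy data: column 1 is exactly 2x column 0; one value is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])
X_imp = regression_impute(X, col=1)
```

Since the relationship is exactly linear here, the imputed value is 6.0; with real data the prediction carries uncertainty that single imputation ignores.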

6. How to deal with highly correlated variables

Sometimes including variables that correlate very strongly or even perfectly can seriously throw off the factor analysis results because they color the general factor. If multiple factors are extracted, such variables will form their own factor. What should be done with them? 1) Nothing, 2) exclude based on a threshold value of maximum allowed intercorrelation. If (2), which value should be used? I used |.9|, but |.8| is about equally plausible.
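Option (2) can be implemented as a greedy pass over the correlation matrix: keep a variable only if its correlation with every already-kept variable is below the threshold. A minimal sketch with invented data (variable "b" is an exact rescaling of "a", so r = 1):

```python
import numpy as np

def drop_collinear(X, names, threshold=0.9):
    """Greedily drop any variable whose |r| with an already-kept
    variable exceeds the threshold. Keeps the earlier of each pair."""
    R = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(R[j, k]) <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Toy data: "b" duplicates "a"; "c" is nearly uncorrelated with both.
a = np.arange(50, dtype=float)
b = 2 * a
c = (-1.0) ** np.arange(50)
X = np.column_stack([a, b, c])
X2, kept = drop_collinear(X, ["a", "b", "c"], threshold=0.9)
```

Note that the result can depend on variable order: with |r(a, b)| = 1, whichever of the pair comes first is kept. That ordering dependence is itself another researcher degree of freedom.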

7. How to deal with highly mixed cases

Sometimes some cases just don’t fit the factor structure of the data very well. They are structural outliers or mixed. What should be done with them? 1) Nothing, 2) use rank-order data, 3) use robust regression (many methods), 4) change outlier values (e.g. any value beyond |3| sd gets reduced to 3 sd), 5) exclude them. If (5), which threshold should be used as the exclusion cutoff? [no answers forthcoming]. I tried (1), (2) and (5), and only excluded the most mixed case (Westminster).
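Options (2) and (4) above are one-liners in practice: ranking discards the magnitude of outliers, and capping pulls extreme z-scores back to the limit. A minimal sketch with an invented vector containing one extreme case:

```python
import numpy as np

def cap_z(x, limit=3.0):
    """Option 4: convert to z-scores and cap any |z| > limit at the limit."""
    z = (x - x.mean()) / x.std(ddof=1)
    return np.clip(z, -limit, limit)

# Toy data: 19 ordinary values and one extreme outlier.
x = np.concatenate([np.arange(19.0), [1000.0]])

z_capped = cap_z(x)          # option 4: the outlier's z is reduced to 3
ranks = x.argsort().argsort() + 1   # option 2: ranks ignore its magnitude
```

Both transformations tame the outlier, but in different ways: capping keeps interval information for the non-extreme cases, while ranking discards it for all cases.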

Researcher choices as parameters

I made many more decisions than the ones mentioned above, but they are the most important ones (I think, so maybe!). Normally, research papers don’t mention these kinds of choices. Sometimes they mention them, but don’t report results for the different choices. I suspect much of this is due to the hassle of actually running all the combinations.

However, the hassle is potentially much smaller if one has a general framework for doing it with programming tools. So I propose that, in general, one should consider these kinds of choices as parameters and calculate results for all of them. In the above, this means e.g. results with and without CoL, different variable exclusion thresholds, and different treatments of mixed cases.

Theoretically, one could think of it as a hyperspace where every dimension is one of these choices. One could then examine the distribution of results over all parameter values to gauge the robustness of the results with respect to the analytic choices.
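This hyperspace is just the Cartesian product of the choice sets, which itertools handles directly. A minimal sketch, where the parameter names and values are invented to mirror the choices above and run_analysis is a stand-in for the full pipeline:

```python
import itertools

# Each analytic choice becomes a parameter with its allowed values.
grid = {
    "include_CoL": [True, False],
    "corr_threshold": [0.8, 0.9],
    "crime_vars": ["broad", "detailed"],
    "mixed_cases": ["keep", "rank", "exclude"],
}

def run_analysis(**params):
    """Placeholder for the full analysis; would return e.g. the S factor
    scores or a key result under the given settings."""
    return params  # stub

# One analysis per point in the choice hyperspace.
results = [
    run_analysis(**dict(zip(grid, combo)))
    for combo in itertools.product(*grid.values())
]
```

Here the grid has 2 × 2 × 2 × 3 = 24 points; the distribution of the key result across those 24 runs is the robustness check the text proposes.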

I have already been doing this for the choice of how to deal with mixed cases, but perhaps I should ramp it up and do it more thoroughly for the other choices too. In this case, the threshold for exclusion of variables and which set of crime variables to use are the important choices.