I looked at whether there was evidence for cognitive dysgenics in the OKCupid dataset. The unrepresentativeness of the dataset is not much of a problem here: indeed, we are especially interested in younger people looking to date, since these are the parents of the next generation (and some are already parents, of course).

The cognitive test is made of an ad hoc collection of 14 items. It seems to work fairly well. The items were scored using item response theory.
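For concreteness, IRT scoring of such a test could be done in R with e.g. the **mirt** package. This is a hedged sketch, not the actual analysis code; `d_items`, the data frame of 14 scored items, is a placeholder:

```r
library(mirt)  # item response theory models

# d_items: placeholder data frame with one 0/1 column per item (14 items).
# Fit a unidimensional 2PL model and extract each person's ability score.
fit <- mirt(d_items, model = 1, itemtype = "2PL")
CA <- fscores(fit)[, 1]  # latent cognitive ability estimates
```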

There are a couple of questions related to fertility, either desired or actual. The relationships to cognitive ability look like this:

These are the raw relationships. One may wonder whether they are confounded with, e.g., race or age, so one could try multiple regression. Unfortunately, since the outcomes are ordinal data, standard multiple regression results in a large downward bias. Still, the results look like this:

```
$coefs
                        Beta   SE CI.lower CI.upper
CA                     -0.07 0.01    -0.08    -0.06
d_age                  -0.10 0.01    -0.11    -0.09
gender: Other          -0.69 0.09    -0.86    -0.52
gender: Woman           0.01 0.01    -0.02     0.03
race: Mixed             0.13 0.02     0.09     0.17
race: Asian             0.16 0.03     0.11     0.22
race: Hispanic / Latin  0.12 0.03     0.07     0.18
race: Black             0.29 0.03     0.23     0.34
race: Other             0.06 0.03    -0.01     0.12
race: Indian            0.07 0.06    -0.05     0.18
race: Middle Eastern    0.11 0.08    -0.06     0.27
race: Native American   0.21 0.12    -0.03     0.45
race: Pacific Islander  0.40 0.12     0.16     0.64

$meta
       N   R2 R2 adj.
32537.00 0.03    0.03
```

The dependent variable was the question above about the number of desired children. Still, we see a small negative coefficient for cognitive ability, as expected, similar in size to that found in other studies. Since the cognitive measure was not optimal, the true effect is probably somewhat larger.

Analysis code is in the OSF repository in the *adhoc.R* file.

The dataset reportedly comes from:

- Thompson, A. and Randall-Maciver, R. (1905) Ancient races of the Thebaid, Oxford University Press.

I could not find a copy of this book (I checked Gutenberg and Libgen), though it should be out of copyright by now (!). There is a review of it here.

There are 30 skulls for each time period, 5 periods and 4 measurements.

I calculated means and 95% confidence intervals for each group x measurement combination. It looks like this:

There are no noteworthy changes across about 4000 years aside from perhaps the decrease in basialveolar length, which seems fairly unimportant in relation to brain size.

Factor analysis of the 4 measurements does not reveal a general skull size factor analogous to the one found for brain size.

```
Factor Analysis using method = minres
Call: fa(r = d_skulls[-5])
Standardized loadings (pattern matrix) based upon correlation matrix
                       MR1      h2   u2 com
maximum_breadth      -0.20 3.8e-02 0.96   1
basibregmatic_height  0.33 1.1e-01 0.89   1
basialveolar_length   0.80 6.4e-01 0.36   1
nasal_height          0.00 2.9e-06 1.00   1

               MR1
SS loadings    0.79
Proportion Var 0.20
```

Perhaps the measurements are too unreliable compared to modern MRI measurements.

**Supplementary materials**

The R code: gist.github.com/Deleetdk/6ee734db2292be83ab3c6a055c9adbf0

The data is loaded from the R package **ade4**.

The authors distinguish between three ways of measuring the amount of missing data: 1) by case, 2) by variable, and 3) by cell. Or in plain English: what proportion of cases/variables/cells have missing data? *miss_amount* calculates these three proportions:

```
> miss_amount(iris_miss)
 cases with missing data  vars with missing data cells with missing data
                    0.42                    1.00                    0.10
```

The dataset used, iris_miss, is the famous iris dataset (N=150) where I have removed 10% of the data at random. Reassuringly, the results tell us that indeed 10% of the cells have missing values, and 42% of the cases have at least one missing datapoint.
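For illustration, the three proportions can be computed in base R alone. This is a sketch of what *miss_amount* reports, not the package's actual code, including one way to build an `iris_miss`-style dataset:

```r
# Build an iris_miss-style dataset: iris with 10% of cells set to NA.
set.seed(1)
iris_miss <- iris
n_cells <- prod(dim(iris))
na_idx <- sample(n_cells, round(0.10 * n_cells))
iris_miss[cbind((na_idx - 1) %% nrow(iris) + 1,        # row index
                (na_idx - 1) %/% nrow(iris) + 1)] <- NA  # column index

# Proportions of cases/variables/cells with missing data.
miss_amount_sketch <- function(df) {
  na_mat <- is.na(df)
  c(cases_with_missing = mean(rowSums(na_mat) > 0),
    vars_with_missing  = mean(colSums(na_mat) > 0),
    cells_missing      = mean(na_mat))
}

miss_amount_sketch(iris_miss)  # cells_missing is 0.10 by construction
```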

However, it is usually more interesting to know not just whether there is missing data by case/variable, but also how much. *miss_by_case* counts the missing datapoints by case:

```
> miss_by_case(iris_miss)
  [1] 0 1 0 1 0 0 0 0 0 2 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 2 0 0 0 3 0 0 0 1 1 0 0 1 1 1 0 0 1 0 2 1 1 0 1 0 0 0 0 0 3 2 0
 [59] 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 4 0 0
[117] 1 0 1 1 1 0 0 0 0 2 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1
```

This is most useful when combined with *table*:

```
> miss_by_case(iris_miss) %>% table()
.
 0  1  2  3  4
87 55  5  2  1
```

Thus, we see that most cases that miss data actually only miss 1 datapoint. One can count the missing datapoints by variable with *miss_by_var*:

```
> miss_by_var(iris_miss)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
          16           14           16           12           17
```
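Both counts are one-liners in base R; the following sketches show the computation these functions perform (not necessarily the package's implementation):

```r
# Number of missing datapoints per case (row) and per variable (column).
miss_by_case_sketch <- function(df) rowSums(is.na(df))
miss_by_var_sketch  <- function(df) colSums(is.na(df))
```

Applied to `iris_miss`, tabulating `miss_by_case_sketch(iris_miss)` reproduces the kind of summary shown above.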

The above information is most easily understood with the help of a plot, made with *miss_plot*:

```
> miss_plot(iris_miss)
> ggsave("miss_plot_case.png")
Saving 8.95 x 5.67 in image
> miss_plot(iris_miss, case = F)
> ggsave("miss_plot_var.png")
Saving 8.95 x 5.67 in image
```

The first shows the number of missing datapoints by case, the second by variable.

Statisticians talk about three ways data can be missing, related to the statistical ‘mechanism’ that caused them to be missing. They have confusing names, so I will just call them types 1, 2 and 3 (MCAR, MAR and MNAR). Type 1 is data that is simply missing at random. For instance, if we delete datapoints at random from a dataset, as we did above, they are missing at random. Or if, instead of giving participants all items, we gave each a random subset, the skipped items would also be missing at random.

Type 2 is when the data are missing as some (possibly probabilistic) function of the other data in the dataset. The data are then missing, but not at random. For instance, younger children may not report their own weights because they don't know them; but they do report their age, so the missingness of weight is related to age.

Type 3 is when the data are missing as a function of the very same datapoints that are missing. For instance, if we ask people whether they have a criminal record, people with criminal records may tend not to acknowledge it.

Type 1 missing data we can just ignore: there is no particular pattern in how it is missing. It reduces our sample size but does not bias the estimates of the relationships in any particular direction. Type 2 missing data we can perhaps deal with, because we can impute it based on the other variable(s) in the dataset. For type 3 missing data, however, there is not much we can do directly. If we have other datasets with the same variable (e.g. police records), we can compare the distributions. This would allow us to see that the proportion of people with a criminal record in our sample is too low, so we would suspect that persons with criminal records tend not to report them.

How can we figure out which type of missingness we have? First, we can create a matrix of the missing data. This is not so useful in itself, but it is useful for further analyses. *miss_matrix* creates this:

```
> iris_miss[1:10, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2    <NA>
3           4.7         3.2          1.3         0.2  setosa
4           4.6          NA          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10           NA         3.1          1.5         0.1    <NA>
> miss_matrix(iris_miss[1:10, ])
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1             0           0            0           0       0
2             0           0            0           0       1
3             0           0            0           0       0
4             0           1            0           0       0
5             0           0            0           0       0
6             0           0            0           0       0
7             0           0            0           0       0
8             0           0            0           0       0
9             0           0            0           0       0
10            1           0            0           0       1
```

To save space, we only look at the first 10 rows. In the data we see that some values are missing (NA); in the matrix, those cells have 1’s and the others have 0’s. Simple.

If we inspect the first ten cases with missing data, we can see that they tend to miss data in different ways:

```
> iris_miss[miss_by_case(iris_miss) >= 1, ] %>% head(10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2           4.9         3.0          1.4         0.2    <NA>
4           4.6          NA          1.5         0.2  setosa
10           NA         3.1          1.5         0.1    <NA>
12          4.8         3.4           NA         0.2  setosa
14          4.3         3.0          1.1         0.1    <NA>
16          5.7          NA          1.5         0.4  setosa
19          5.7          NA          1.7         0.3  setosa
27           NA         3.4          1.6          NA  setosa
31           NA         3.1           NA         0.2    <NA>
35           NA         3.1          1.5         0.2  setosa
```

We see that case 2 is missing 1 datapoint in the 5th variable, case 4 is missing 1 datapoint in the 2nd variable. Cases 27 and 31 are missing multiple datapoints but in different ways. How can we measure this? We can use the powers of 2 method. We create a vector of powers of 2 that is the same length as the number of variables, so in this case {2, 4, 8, 16, 32}. Then we multiply this vector row-wise with the missing data matrix, and this gives us:

```
> miss_matrix(iris_miss)[miss_by_case(iris_miss) >= 1, ] %>% head(10) %>%
+   t() %>% multiply_by(c(2, 4, 8, 16, 32)) %>% t()
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2             0           0            0           0      32
4             0           4            0           0       0
10            2           0            0           0      32
12            0           0            8           0       0
14            0           0            0           0      32
16            0           4            0           0       0
19            0           4            0           0       0
27            2           0            0          16       0
31            2           0            8           0      32
35            2           0            0           0       0
```

Compare this with the missing matrix above. It is the same, except that the cells with 1’s have been replaced with powers of 2 according to the variable they are in. Cells in the first variable are replaced with 2’s (2^1 = 2), those in the second with 4’s (2^2 = 4), and so on. By itself this is not useful either, but when we sum the numbers by row, we get a unique ID for each way the data can be missing. These IDs are calculated by *miss_pattern*:

```
> miss_pattern(iris_miss)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
  0  32   0   4   0   0   0   0   0  34   0   8   0  32   0   4   0   0   4   0   0   0   0   0   0   0  18   0   0   0
 31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
 42   0   0   0   2   4   0   0  32  32   8   0   0  32   0  18   2   2   0   2   0   0   0   0   0  52  20   0   8   0
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  0   2   0   2   8   8  32   2   0   8  32  32   0   2  16   0   0   0   2   0   0   0   0   0   0  16   4   0   0   4
 91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
  0  16   0   4  32   8   0   2   2   0   0   8  32   2  16   4   0   0   0   0  32   4   0  60   0   0   8   0   8   4
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
  4   0   0   0   0  40   0   0   0   0   0   0   0  32   0  16   8   0   0   0   8  16   0   0   0   8   0   0   0  16
```

The results given this way are not so understandable, but we can calculate the table from them:

```
> miss_pattern(iris_miss) %>% table()
.
 0  2  4  8 16 18 20 32 34 40 42 52 60
87 12 11 13  7  2  1 12  1  1  1  1  1
```

Now we see it more clearly. These are the patterns of missing data: the ways data can be missing. The powers of 2 are the simple cases of 1 datapoint missing in different variables, which we saw earlier. The others are composites. For instance, 18 is 2 + 16, representing cases where data are missing in variable 1 (2) and variable 4 (16).
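The pattern IDs can be computed with a single matrix product; a sketch of the idea, using the same {2, 4, 8, 16, 32} weights as above (the function name is mine, not the package's):

```r
# Pattern ID per case: missingness indicator matrix times powers of 2,
# so each combination of missing variables gets a unique sum.
miss_pattern_sketch <- function(df) {
  na_mat <- is.na(df)                # logical matrix, used as 0/1
  weights <- 2 ^ seq_len(ncol(df))   # {2, 4, 8, 16, 32} for 5 variables
  as.vector(na_mat %*% weights)
}

# A case missing variables 1 and 4 gets ID 2 + 16 = 18:
d <- data.frame(a = c(NA, 1), b = 1:2, c = 1:2, d = c(NA, 2), e = 1:2)
miss_pattern_sketch(d)  # 18 0
```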

This provides us with a measure of the complexity of the missing data: we relate the number of patterns found in the data to the number of cases. *miss_complexity* calculates this:

```
> miss_complexity(iris_miss)
[1] 0.04666667
```

Thus, in our case, the pattern of missing data is relatively simple.

Sometimes we don’t know how the data came to be missing. It may be type 1, 2, 3, or some combination. How do we find out? We can distinguish between types 1 and 2 by analyzing how the missingness relates to the other variables. If the data are missing at random, there should be no relationships, beyond chance, between missingness in one variable and the values of the other variables. Examining an entire dataset for non-random missingness thus requires a lot of calculations, because we must compare the missingness in each variable with every other variable’s values: p*(p-1) comparisons. *miss_analyze* does this for us:

```
> miss_analyze(iris_miss)
             Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Sepal.Length           NA  0.06736091   0.41567496  0.46781290    0.40
Sepal.Width   -0.02591574          NA  -0.08697338  0.02125442    0.01
Petal.Length  -0.37577044  0.09577354           NA -0.42815844    0.25
Petal.Width   -0.55242759 -0.06717756  -0.40272417          NA    0.15
Species        0.27082506 -0.01113144   0.22816556  0.25833418      NA
```

Each cell is the effect size (Cohen’s d) obtained by splitting the values of the column variable into 2 groups based on whether the row variable is missing. The grouping variable is on the left and the outcome variable is on the top. Thus, we see that there seem to be some patterns in how the data are missing for this dataset. For instance, cases with missing data for Petal.Length have smaller Sepal.Lengths (d = -0.38). Since we know this dataset actually has data missing at random, these must be chance findings. Looking at all the numbers together, we do see that the values are generally smallish.

Note that because one cannot calculate Cohen’s d/correlations for nominal variables (like Species), I have used Cramer’s V converted to Cohen’s d instead. This has the implication that the value cannot be negative.

Our example dataset has N=150. With 10% missing per variable, the Cohen’s d values will then be calculated from groups with sizes of roughly 15 and 135, so if we do 20 comparisons as we did above, we are likely to observe some largish values just by chance. The largest value above is .55, representing a medium-sized effect. The mean absolute d is .23.

If we examined a larger dataset that also has data missing at random, we should see much smaller values:

```
> set.seed(1)
> rnorm(5e3) %>% matrix(nrow = 1000) %>% as.data.frame() %>% df_addNA() %>% miss_analyze()
           V1          V2          V3          V4          V5
V1         NA -0.07951442  0.19249439 -0.01830883 -0.11540971
V2 -0.09698717          NA -0.11780557 -0.05403185  0.09778738
V3  0.11954180  0.03310052          NA -0.01917229  0.13087361
V4 -0.06356078  0.12931941  0.02131987          NA  0.26977611
V5 -0.16705158 -0.01179713  0.01402070  0.03015518          NA
```

Here, we generated a dataset with 1000 cases and 5 variables, then added 10% missing data at random as before. We do see that the values have become much smaller. The mean abs. d is .07.
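The computation behind each cell can be sketched as follows, here using the pooled-SD form of Cohen's d (the package may differ in details such as the SD formula or the sign convention):

```r
# Cohen's d of `y` split by whether `x` is missing: positive values mean
# that cases missing x tend to have higher values of y.
d_by_missingness <- function(x, y) {
  g1 <- y[is.na(x)]
  g0 <- y[!is.na(x)]
  pooled_sd <- sqrt((var(g1, na.rm = TRUE) + var(g0, na.rm = TRUE)) / 2)
  (mean(g1, na.rm = TRUE) - mean(g0, na.rm = TRUE)) / pooled_sd
}
```

One cell of the matrix above would then be e.g. `d_by_missingness(iris_miss$Petal.Length, iris_miss$Sepal.Length)`.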

The functions can be found in my github package: github.com/Deleetdk/kirkegaard

The example code: gist.github.com/Deleetdk/fcfcc6ba6f1c39a234717dc3e0c83369

- Find all the ways, w, to chop up the dataset into 3 groups, each with at least 3 datapoints in it.
- For each w, fit a linear model to each sub-dataset. Calculate the n-weighted R2 for the three fits. Save this.
- Sort the ways of splitting the data by the weighted R2 value.

The reason to require at least 3 datapoints per piece is that a line fits any 2 points perfectly, making such a fit meaningless. The reason to weight the R2s by n is that otherwise the function would tend to overfit by choosing tiny pieces with near-perfect fit.
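A sketch of the core scoring step: for two candidate breakpoints, fit a line to each piece and average the R2 values weighted by piece size. The function name and toy data are mine, not the original code's:

```r
# n-weighted R2 for a 3-piece linear fit at breakpoints c1 < c2.
weighted_r2 <- function(x, y, c1, c2) {
  piece <- cut(x, breaks = c(-Inf, c1, c2, Inf))
  if (any(table(piece) < 3)) return(NA)  # each piece needs >= 3 points
  r2 <- sapply(split(data.frame(x, y), piece),
               function(d) summary(lm(y ~ x, data = d))$r.squared)
  weighted.mean(r2, w = table(piece))
}

# Toy data: continuous, with slopes 1/2/3 and breakpoints at x = 10 and 20.
set.seed(1)
x <- 1:30
y <- ifelse(x <= 10, x, ifelse(x <= 20, 2 * x - 10, 3 * x - 30)) + rnorm(30)

# Brute-force search over all ordered breakpoint pairs.
grid <- subset(expand.grid(c1 = x, c2 = x), c2 > c1)
grid$r2 <- mapply(weighted_r2, c1 = grid$c1, c2 = grid$c2,
                  MoreArgs = list(x = x, y = y))
head(grid[order(-grid$r2), ])  # best splits cluster near the true breakpoints
```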

Does the method work? Well, to find out, I also had to write a test data generator. This will generate data following the above model, and also with specific and predictable slopes for each of the pieces. It can also add noise. For example, here’s an example dataset, colored for convenience.

The generator function always produces data in this format, i.e. with a constant increase in the slopes, because this is approximately how his data look. This kind of data is more difficult to handle; if the data instead had a /\/ shape, finding the breakpoints would be much easier.

Fitting the piecewise linear models to this, we get the following estimated best fits:

| Split# | cutoff_1 | cutoff_2 | r2 |
| --- | --- | --- | --- |
| 7539 | 39 | 76 | 0.98985 |
| 6638 | 38 | 67 | 0.989844 |
| 6639 | 39 | 67 | 0.989829 |
| 7538 | 38 | 76 | 0.989816 |
| 6438 | 38 | 65 | 0.98981 |
| 6439 | 39 | 65 | 0.989761 |
| 4846 | 46 | 49 | 0.989736 |
| 6435 | 35 | 65 | 0.989722 |
| 6538 | 38 | 66 | 0.989706 |
| 7439 | 39 | 75 | 0.989703 |

So, we see that the R2 values are all nearly the same. Still, the estimates generally agree that the best way to split the data is around values 39 and 67-76. Looking back at the plot, we see that these are about right: the true breakpoints are 35 and 68.

Split #4846 is clearly the type of solution the weighted R2 was meant to avoid; apparently it was not entirely successful.

One problem with the developed method is that it quickly gets very slow. With 100 observations and 3 pieces, there are initially 100^2 = 10,000 candidate breakpoint pairs to check. Some of these are impossible because the breaks are out of order: the breakpoint that separates pieces 1 and 2 must come before the breakpoint that separates pieces 2 and 3. After removing these, 4,950 splits remain. Others are impossible because they result in sub-datasets with fewer than 3 observations; this leaves us with 4,278. For each of these, one must fit 3 linear models, so that's about 13,000 linear models to fit. With parallel processing (from the **plyr** package with **doParallel**) this takes about 10 seconds in total. However, it quickly takes much longer if we specify more pieces or more data.

To get around this problem for larger datasets, one could trim the possible splits. Or one could use genetic algorithms to explore the space faster.

Can someone intuitively explain the correlation formula?

I know what Cov(X,Y) means. It tells you whether the relationship between the variables X and Y is positive or negative (although I must admit I don’t really know what the actual number means; I only look at the sign). I know what the standard deviation of X & Y means: it’s the average distance of every observation from its mean. But I cannot for the life of me understand how Cov(X,Y)/(sd(X)*sd(Y)) gives a number between -1 and 1 which tells me the strength of the relationship between X & Y.

When I see the formula Cov(X,Y)/(sd(X)*sd(Y)), I think to myself: “Ok, I’m taking some number, which basically only tells me if the relationship between X & Y is positive or negative, and then dividing that number by the average distance of every observation of X from its mean times the average distance of every observation of Y from its mean. And this then somehow always gives me a number between -1 and 1.” I just don’t understand how this makes sense. Can someone try to explain? Thanks!

This post is a repost of my answer there.

The covariance of X and Y, Cov(X, Y), is expressed in units that depend on the amount of variation in X and Y. That’s why the raw number does not tell us much beyond its direction. By dividing by sd(X) and sd(Y), we standardize the relationship so that it is bounded between -1 and 1.

It is easier to understand the correlation if we standardize the numbers before doing the calculation. The correlation is a measure of how often the two Z scores go in the same direction and to what relative degree. The correlation does not depend on which units we used.

E.g. suppose we have the data:

X = 5, 10, 15, 10, 5, 10

Y = 1, 2, 4, 1, 1, 3

The standard deviations (which is almost the same as the mean absolute distance to the mean) are:

sd(X) ≈ 3.8

sd(Y) ≈ 1.3

Their covariance, Cov(X, Y), is 4.

This number does not make much sense, but we see that it is positive, so the linear relationship is positive.

If we divide it by the standard deviations, we get:

Cov(X, Y)/(sd(X)*sd(Y)) = cor(X, Y) ≈ 0.84

This number is bounded between -1 and 1 because |Cov(X, Y)| can never exceed sd(X)*sd(Y) (the Cauchy–Schwarz inequality).

To see it more clearly, it is better to standardize the X and Y values first. Then we get:

X_std = -1.11, 0.22, 1.55, 0.22, -1.11, 0.22

Y_std = -0.79, 0.00, 1.58, -0.79, -0.79, 0.79

Now we can more easily see the pattern. The low/high values of X tend to go with the low/high values of Y. The standardization puts them on the same scale.

One can calculate the correlation in a more intuitive way using the Z scores. We begin by multiplying each pair of Z values. Because negative times negative gives positive and so does positive times positive, the numbers that go in the same direction have positive products. Only the numbers that do not go in the same direction have negative products or zero.

The pairwise products are:

X_std*Y_std = 0.88, 0.00, 2.45, -0.18, 0.88, 0.18

Thus, we see that they tend to be positive. If we sum them, we get ≈ 4.2. This number is also not interpretable, because it depends on the number of pairs we summed up; had we used more pairs, the number would have been larger. To standardize for this, we divide by the number of pairs minus 1 (the degrees of freedom). We have 6 pairs, so we divide by 5. Thus, cor(X, Y) ≈ 4.2/5 ≈ 0.84. The numbers fit.
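The whole calculation is easy to verify in R:

```r
X <- c(5, 10, 15, 10, 5, 10)
Y <- c(1, 2, 4, 1, 1, 3)

cov(X, Y)                    # 4
cov(X, Y) / (sd(X) * sd(Y))  # 0.84...
cor(X, Y)                    # the same

# The z-score route: mean of the pairwise products, dividing by n - 1.
z_x <- (X - mean(X)) / sd(X)
z_y <- (Y - mean(Y)) / sd(Y)
sum(z_x * z_y) / (length(X) - 1)  # 0.84... again
```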

The job initially seemed easy. If we click on a state and look in the network menu (F12 in Firefox), we see that we received a json file that has the list of all stations in that state. Looks like this (for the smallest ‘state’):

```json
{"metadata":{"resultset":{"offset":1,"count":2,"limit":1000}},
 "results":[
   {"elevation":45.7,"mindate":"2010-01-01","maxdate":"2010-12-01","latitude":38.9385,"name":"DALECARLIA RESERVOIR, DC US","datacoverage":1,"id":"GHCND:USC00182325","elevationUnit":"METERS","longitude":-77.1134},
   {"elevation":15.2,"mindate":"2010-01-01","maxdate":"2010-12-01","latitude":38.9133,"name":"NATIONAL ARBORETUM DC, DC US","datacoverage":1,"id":"GHCND:USC00186350","elevationUnit":"METERS","longitude":-76.97}
 ]}
```

So, this is easy to parse in R with **jsonlite**. But, suppose we want to download the json for each state automatically (too lazy to click 51 times). We get the URL and try:
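In R, the naive attempt might look like this (a sketch; the real request carries more query parameters than shown here):

```r
library(jsonlite)

# Hypothetical direct request for the District of Columbia station list.
url <- "http://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000&datasetid=NORMAL_MLY&locationid=FIPS:11"
stations <- fromJSON(url)
```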

But no, it gives us another json file essentially saying “no, no”:

{"status" : "400", "message" : "Token parameter is required."}

But at least it tells us what it wants. A token. Actually, it wants some headers on the request. If we click *copy as cURL,* we get the necessary call:

```
curl 'http://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000&offset=1&datasetid=normal_mly&locationid=FIPS%3A11&sortfield=name' \
  -H 'Host: www.ncdc.noaa.gov' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  --compressed \
  -H 'token: 0x2a' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Referer: www.ncdc.noaa.gov/cdo-web/datatools/normals' \
  -H 'Cookie: JSESSIONID=A94A1B7C0A8FF43E0E43D20F6E4D2F0E; _ga=GA1.2.1535824352.1459267703; _ga=GA1.3.1535824352.1459267703; __utma=198104496.1535824352.1459267703.1459267703.1459527590.2; __utmz=198104496.1459527590.2.2.utmcsr=t.co|utmccn=(referral)|utmcmd=referral|utmcct=/CJ7yyAO3rn; has_js=1; __utmc=198104496; __utmb=198104496.10.9.1459528698517; _gat=1; _gat_GSA_ENOR0=1; __utmt_GSA_CP=1' \
  -H 'Connection: keep-alive'
```

The first part is the URL we already saw. The others are the headers that my browser sent. So we simply set up a script to get each state’s json using these headers. Then we get a folder full of these state json files. Each state json has the IDs of the stations, e.g. *GHCND:USC00182325*. If we go back to the website and click one station, we can see that we receive a json file in the background:

```json
{"metadata":{"resultset":{"offset":1,"count":20,"limit":100}},"results":[{"date":"2010-01-01T00:00:00","datatype":"ANN-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":4566},{"date":"2010-01-01T00:00:00","datatype":"ANN-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":568},{"date":"2010-01-01T00:00:00","datatype":"ANN-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":671},{"date":"2010-01-01T00:00:00","datatype":"ANN-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":466},{"date":"2010-01-01T00:00:00","datatype":"DJF-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":927},{"date":"2010-01-01T00:00:00","datatype":"DJF-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":365},{"date":"2010-01-01T00:00:00","datatype":"DJF-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"C","value":455},{"date":"2010-01-01T00:00:00","datatype":"DJF-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":274},{"date":"2010-01-01T00:00:00","datatype":"JJA-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":1266},{"date":"2010-01-01T00:00:00","datatype":"JJA-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":764},{"date":"2010-01-01T00:00:00","datatype":"JJA-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":867},{"date":"2010-01-01T00:00:00","datatype":"JJA-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":662},{"date":"2010-01-01T00:00:00","datatype":"MAM-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":1210},{"date":"2010-01-01T00:00:00","datatype":"MAM-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":556},{"date":"2010-01-01T00:00:00","datatype":"MAM-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":670},{"date":"2010-01-01T00:00:00","datatype":"MAM-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":442},{"date":"2010-01-01T00:00:00","datatype":"SON-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":1163},{"date":"2010-01-01T00:00:00","datatype":"SON-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":584},{"date":"2010-01-01T00:00:00","datatype":"SON-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":686},{"date":"2010-01-01T00:00:00","datatype":"SON-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":481}]}
```

This file has all the data for that station. So, we use the header curl approach from before and simply request all these json files using the IDs in the state files. This gives us some >8000 json files. Then it’s just a matter of parsing these to get what we want. Skipping some parsing headache steps, the end result is a datafile for each station with the following variables:

- longitude
- latitude
- elevation
- name
- annual, winter, spring, summer, autumn:
- precipitation
- mean temperature
- max temperature
- min temperature
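The header-carrying requests can be scripted with **httr**; a sketch under the assumption that the token header is what matters (the token value is a placeholder, and the function name is mine):

```r
library(httr)
library(jsonlite)

# Fetch one state's station list, sending the token header the API demands.
get_state_stations <- function(fips, token) {
  resp <- GET("http://www.ncdc.noaa.gov/cdo-web/api/v2/stations",
              query = list(limit = 1000, datasetid = "NORMAL_MLY",
                           locationid = paste0("FIPS:", fips)),
              add_headers(token = token,
                          `X-Requested-With` = "XMLHttpRequest"))
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$results
}

# e.g.: dc_stations <- get_state_stations("11", "YOUR_TOKEN_HERE")
```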

To get county-level data, one can assign each station to the nearest county and average the values within each county. There are ~3100 counties but ~8400 stations, so with some luck all counties or nearly so will be covered.
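The assignment step can be sketched as a nearest-centroid match. Here `counties` (a data frame with centroid coordinates and names) is an assumed input, and plain Euclidean distance on degrees is a crude approximation; real work should use great-circle distances or point-in-polygon tests:

```r
# Assign each station to the county whose centroid is nearest.
# Both data frames are assumed to have `longitude` and `latitude` columns;
# `counties` additionally has a `name` column.
nearest_county <- function(stations, counties) {
  sapply(seq_len(nrow(stations)), function(i) {
    d2 <- (counties$longitude - stations$longitude[i])^2 +
          (counties$latitude  - stations$latitude[i])^2
    counties$name[which.min(d2)]
  })
}

# stations$county <- nearest_county(stations, counties)
# then average the station values within counties, e.g. with aggregate().
```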

The datafile is available on OSF (*stations_data.csv*). Hopefully, it will be useful to climatologists, meteorologists and sociologists alike.

The R code to scrape with is also in the same repository (*scrape_states.R scrape_stations.R*).

Actually, what I really wanted was another way to estimate county-level IQs, since Add Health refuses to share that data. But before I could do that, I needed to validate the estimates against something else. The scatterplot can be seen below. The NAEP data are from Admixture in the Americas, so they are based on a few years of NAEP results.

**R code**

This assumes you have loaded the OKCupid data as d_main and have already calculated the cognitive ability scores.

```r
### VALIDATE STATE-LEVEL IQs

# subset data -------------------------------------------------------------
v_2chars = d_main$d_country %>% str_length() < 3
v_notUK = !d_main$d_country %in% c("UK", "GU", "13", NA)
d_states = d_main[v_2chars & v_notUK, ]

#mean score by state
d_states = ddply(d_states, .(d_country), .fun = plyr::summarize,
                 IQ = mean(CA, na.rm = T))
rownames(d_states) = d_states$d_country

# load comparison data ----------------------------------------------------
#read
d_admix = read.csv("data/Data_All.csv", row.names = 1)
#subset USA
d_admix = d_admix[str_detect(rownames(d_admix), pattern = "USA_"), ]
#rownames
rownames(d_admix) = str_sub(rownames(d_admix), start = 5)
#merge
d_states = merge_datasets2(d_states, d_admix)

# plot --------------------------------------------------------------------
GG_scatter(d_states, "MeisenbergOCT2014ACH", "IQ") +
  xlab("NAEP") + ylab("OKCupid IQ")
ggsave("figures/state_IQ.png")
```

Emil:

Dear Add Health,

I could not find an answer to this question anywhere. Suppose I or someone else has calculated an aggregate value (usually a mean) for each county in the US. Would it be against the Add Health data rules to release this non-personal data to other researchers?

Add Health:

Emil,

Thank you for contacting Add Health about this issue. The following guidelines listed in the contract must be followed when publishing information about Add Health.

To avoid inadvertent disclosure of persons, families, or households by using the following guidelines in the release of statistics derived from the Data Files.

- In no table should all cases in any row or column be found in a single cell.
- In no case should the total for a row or column of a cross-tabulation be fewer than three (3).
- In no case should a cell frequency of a cross-tabulation be fewer than three (3) cases.
- In no case should a quantity figure be based on fewer than three (3) cases.
- Data released should never permit disclosure when used in combination with other known data.
Since the mean for a county is likely to be unique you would not be able to satisfy the above conditions. Also, the data may allow users to identify the county by comparing it with the source data. Therefore, you would not be able to release the aggregate information.

I hope this is helpful.

Best,

Joyce Tabor

Add Health Data Manager

Emil:

Hi Joyce,

How about adding a random number to the county means and excluding the counties with few cases? In this way, one cannot derive the scores of the individuals in the counties. Alternatively, how about binning the data by rounding the mean score to the nearest e.g. 5 (on a scale with a mean of 100 and SD of 15)? This would make the means non-unique.

Add Health:

Emil,

Please provide more details about what you would like to do to help me better understand your project. For example, who do you want to make these data available to? What measures do you want to create at the county level? Would your data be linkable to Add Health data? If so, how would they link?

Best,

Joyce

Emil:

Joyce,

Thank you for your reply. Let me clarify matters. I did not do a study using Add Health data. However, some other researchers published a few studies using Add Health data. The studies in question are:

- www.sciencedirect.com/science/article/pii/S0160289610001340
- www.sciencedirect.com/science/article/pii/S0160289612001201
- www.sciencedirect.com/science/article/pii/S019188691300189X

They used the Add Health data to calculate an average IQ score for each county with data and then correlated this with other variables. I am unable to gain access to the Add Health data myself, but I work in the same field and would like the IQ scores for my own analyses. I asked the authors to send me the numbers, but they refused on account of the Add Health data sharing rules. However, it is my thinking that the privacy rules of Add Health are there to protect individuals from being deanonymized, and so it should be possible to release aggregate-level county data without risking deanonymization. Hence, I wrote to you to investigate whether it would be possible to release the aggregate-level data so that other researchers can use it (open data).

As far as I understand, you had some concerns about this because it would produce unique values by county. I’m not sure I understand why that is a problem, but my proposals to remedy it were either 1) to add a little random noise to the variable, thus obscuring the true values, or 2) to bin the datapoints to the nearest e.g. 2 or 5 points (so e.g. 113.1434 gets turned into 115). Both of these obscure the original data but do not cause serious statistical degradation. The methods can also be combined.

I am not interested in the individual-level data for this analysis, so I’m not sure what you mean by making the data linkable to the Add Health data. To make it perfectly clear, I would like for it to be possible to release the mean IQ scores by county with the county names, so that they can be merged with other datasets (e.g. Measure of America’s) for analytic purposes. If it is not possible to release the real means, then it is my hope that we can release the means with some added noise or binning as described above.

Add Health:

Emil,

The researchers are correct that they cannot provide you with data that they constructed using Add Health. Only Add Health can release and share data based on Add Health. Add Health never releases geographic identifiers smaller than Census region, therefore, county level data are not available. The researchers you cite only have access to pseudo county level codes. They do not have the data you want and Add Health will not release any data by location.

Best,

Joyce

So, the Add Health data sharing rules are extremely strict, making the dataset much less useful. Some other way to estimate county-level IQ must be found.

I figured it should be easy to find someone who had written a ready-made function to plot contingency tables with ggplot2. Not so, apparently!

However, it is a relatively simple matter. So here’s one:

Features:

- Easily plot a decent contingency table with ggplot2. Just input the datafile and the names of the two variables.
- Can also calculate marginal values, i.e. normalize so that either all rows or all columns sum to 1.
- Automatically sets the axis labels to the names you used, which is often enough.
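A minimal version of such a function, as a sketch rather than the package's actual implementation, might look like this:

```r
library(ggplot2)

# Plot the contingency table of two variables as labeled tiles.
# margin = NULL gives overall proportions; margin = 1 or 2 makes each
# row or column sum to 1, as with prop.table().
GG_contingency_sketch <- function(df, var1, var2, margin = NULL) {
  tbl <- as.data.frame(prop.table(table(df[[var1]], df[[var2]]),
                                  margin = margin))
  ggplot(tbl, aes(x = Var1, y = Var2, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = round(Freq, 2))) +
    labs(x = var1, y = var2)
}

# GG_contingency_sketch(mpg, "drv", "cyl")
```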

```r
library(devtools)
install_github("deleetdk/kirkegaard")
library(kirkegaard)

GG_contingency_table(mpg, "drv", "cyl")
```

GG_contingency_table(mpg, "drv", "cyl", margin = 1)

GG_contingency_table(mpg, "drv", "cyl", margin = 2)


As explored in some previous posts, John Fuerst and I have spent about 1.25 years (!) producing a massive article: the published version runs 119 pages; 25k words without the references; 159k characters including spaces. We received 6 critical comments from other scholars, to which we produced a 57-page reply with new analyses. The first article was chosen as a target article in *Mankind Quarterly*, and I recommend reading all the papers. Unfortunately, they are behind a paywall, except for ours: