{"id":5574,"date":"2015-10-26T17:28:43","date_gmt":"2015-10-26T16:28:43","guid":{"rendered":"http:\/\/emilkirkegaard.dk\/en\/?p=5574"},"modified":"2015-10-28T20:22:53","modified_gmt":"2015-10-28T19:22:53","slug":"polygenic-traits-and-the-distribution-of-effect-sizes-years-of-education-from-rietveld-et-al-2013","status":"publish","type":"post","link":"https:\/\/emilkirkegaard.dk\/en\/2015\/10\/polygenic-traits-and-the-distribution-of-effect-sizes-years-of-education-from-rietveld-et-al-2013\/","title":{"rendered":"Polygenic traits and the distribution of effect sizes: years of education from Rietveld et al (2013)"},"content":{"rendered":"<p>It is often said that polygenic traits are based on tons of causal variants, each of which has a very small effect size. What is less often discussed is the distribution of these effect sizes, although it has some implications.<\/p>\n<p>The first implication is statistical: we may want to modify our <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hyperprior\">hyperprior<\/a> if using a Bayesian approach. I&#8217;m not sure what the equivalent adjustment would be in a frequentist approach, which I take to be based on assuming a normal distribution of the effects under study and then testing them against the null hypothesis, i.e. looking at p values. In theory, the detection of SNPs may improve if we use an appropriate model.<\/p>\n<p>The second implication is that to find even most of the causal variants, we need <em>very, very large samples<\/em>. The smallest effects probably can never be found because there are too few humans around to sample! Their <a href=\"https:\/\/en.wikipedia.org\/wiki\/Signal-to-noise_ratio\">signals are too weak in the noise<\/a>. One could get around this by increasing the human population, or simply by collecting data over time as some humans die and new ones are born. 
Both have problems.<\/p>\n<p><strong>But just what does the distribution of betas look like?<\/strong><\/p>\n<p>Based on the current results, what does it look like? To find out, I downloaded <a href=\"http:\/\/www.thessgac.org\/#!data\/kuzq8\">the supplementary materials<\/a> from <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC3751588\/\">Rietveld et al (2013)<\/a>. I used the EduYears results because the college variable is a dichotomized version of EduYears and <a href=\"http:\/\/emilkirkegaard.dk\/understanding_statistics\/?app=dichotomization_cutoff\">dichotomization is bad<\/a>. The data file contains the SNP name (rs-number), effect allele, EAF (&#8220;frequency of the effect allele from the HapMap2-CEU sample&#8221;), beta, standard error and p value for each of the SNPs they examined, N=2.3 x 10<sup>6<\/sup>.<\/p>\n<p>From these values, we calculate the absolute beta because we are interested in effect size, not direction. Direction is irrelevant because one could just &#8216;reverse&#8217; the allele.<\/p>\n<p>One can plot the data in various ways. Perhaps the most obvious is a histogram, shown below.<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/Rietveld_et_al_beta_histogram.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5575\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/Rietveld_et_al_beta_histogram-1024x662.png\" alt=\"Rietveld_et_al_beta_histogram\" width=\"720\" height=\"465\" \/><\/a><\/p>\n<p>We see that most SNPs have effect sizes near zero. 
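<\/p>\n<p>Before binning, it helps to note what pattern we should expect: if the absolute betas roughly followed an exponential distribution, the bin counts would fall on a straight line on a log scale. Here is a minimal sketch of that check using <em>simulated<\/em> betas &#8212; the exponential form and the rate parameter are assumptions for illustration, not the Rietveld estimates:<\/p>\n<pre>#simulate absolute betas from an exponential distribution (assumed shape, made-up rate)\r\nset.seed(1)\r\nsim_beta = rexp(1e6, rate = 1\/0.005)\r\n\r\n#bin them and count, as is done for the real betas below\r\nsim_counts = as.numeric(table(cut(sim_beta, breaks = 20)))\r\n\r\n#log counts should be roughly linear in the bin index\r\nsummary(lm(log(sim_counts + 1) ~ seq_along(sim_counts)))<\/pre>\n<p>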
Another way is to cut the betas into k bins, calculate the midpoint of each bin, and count the number of betas in each.<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cut_20_beta_N_linear.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5602\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cut_20_beta_N_linear-1024x662.png\" alt=\"cut_20_beta_N_linear\" width=\"720\" height=\"465\" \/><\/a><\/p>\n<p>The result is fairly close to the histogram above. It is clear that this is not linear. One can&#8217;t even see the differences between the counts for about half the bins. We can fix this by using a log scale for the y-axis:<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cut_20_beta_N_log.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5580\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cut_20_beta_N_log-1024x662.png\" alt=\"cut_20_beta_N_log\" width=\"720\" height=\"465\" \/><\/a><\/p>\n<p>We get the expected fairly straight line. It is, however, not exactly straight. Should it be? Is it a fluke? How do we quantify straightness\/linearity?<\/p>\n<p>Perhaps if we increase our resolution, we will see something more. Let&#8217;s try 50 bins:<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/50_cuts_beta_N_linear.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5578\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/50_cuts_beta_N_linear-1024x662.png\" alt=\"50_cuts_beta_N_linear\" width=\"720\" height=\"465\" \/><\/a><\/p>\n<p>Now we get a bizarre result. Some of the bins are empty! Usually this means a sampling, coding, or data error. I checked and could not find a problem on my end, and it is not sampling error, at least not for the smaller betas. Perhaps they used some internal rounding scheme that prevents betas from falling in certain regions. It is pretty weird. 
Here&#8217;s what the table output looks like:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">table(r$cut_50)\r\n<\/span>\r\n(-3.5e-05,0.0007]   (0.0007,0.0014]   (0.0014,0.0021]   (0.0021,0.0028]   (0.0028,0.0035]   (0.0035,0.0042] \r\n           174315            340381            321445                 0            292916            258502 \r\n  (0.0042,0.0049]   (0.0049,0.0056]   (0.0056,0.0063]    (0.0063,0.007]    (0.007,0.0077]   (0.0077,0.0084] \r\n                0            217534            177858            139775                 0            107282 \r\n  (0.0084,0.0091]   (0.0091,0.0098]   (0.0098,0.0105]   (0.0105,0.0112]   (0.0112,0.0119]   (0.0119,0.0126] \r\n            80258                 0             58967             42998                 0             30249 \r\n  (0.0126,0.0133]    (0.0133,0.014]    (0.014,0.0147]   (0.0147,0.0154]   (0.0154,0.0161]   (0.0161,0.0168] \r\n            21929             14894                 0              9733              6899                 0 \r\n  (0.0168,0.0175]   (0.0175,0.0182]   (0.0182,0.0189]   (0.0189,0.0196]   (0.0196,0.0203]    (0.0203,0.021] \r\n             4757              3305                 0              2535              1322               912 \r\n   (0.021,0.0217]   (0.0217,0.0224]   (0.0224,0.0231]   (0.0231,0.0238]   (0.0238,0.0245]   (0.0245,0.0252] \r\n                0               502               319                 0               174               133 \r\n  (0.0252,0.0259]   (0.0259,0.0266]   (0.0266,0.0273]    (0.0273,0.028]    (0.028,0.0287]   (0.0287,0.0294] \r\n                0                85                47                33                 0                14 \r\n  (0.0294,0.0301]   (0.0301,0.0308]   (0.0308,0.0315]   (0.0315,0.0322]   (0.0322,0.0329]   (0.0329,0.0336] \r\n                5                 0                 4                 2 
                0                 1 \r\n  (0.0336,0.0343]    (0.0343,0.035] \r\n                1                 1<\/pre>\n<p>Thus we see that some of the bins are inexplicably empty. Why are there no betas with values between .0021 and .0028?<\/p>\n<p>We can try some other numbers of cuts. I tried 10, 20, 30, 40 and 50. Only 40 and 50 have the problem. 30 is fine:<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cut_20_beta_N_log.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5580\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cut_20_beta_N_log-1024x662.png\" alt=\"cut_20_beta_N_log\" width=\"720\" height=\"465\" \/><\/a><\/p>\n<p>The pattern at the 50% higher resolution (30\/20 = 1.5) is still somewhat curved, although probably not to a degree that would yield a low p value.<\/p>\n<p><strong>Frequency-corrected betas?<\/strong><\/p>\n<p>Here&#8217;s an idea I had while writing this post. Correlations and other linear models are affected by base rates as well as betas. Unless they corrected for this (I don&#8217;t remember), some of the SNPs with lower estimated betas probably have stronger effects that merely appear weak because their base rates are too high or too low. One could correct for this <a href=\"http:\/\/emilkirkegaard.dk\/understanding_statistics\/?app=restriction_of_range\">restriction of range<\/a> if desired, which may change the conclusions somewhat. In effect, this would estimate the betas the SNPs would have if they all had the same frequency.<\/p>\n<p>Is there support for this idea? A simple test is to correlate frequency with absolute beta. This value should be negative. 
It is: r = -.006 [CI95: -.007 to -.005].<\/p>\n<p><strong>R code<\/strong><\/p>\n<pre># IO and libs -------------------------------------------------------------\r\nlibrary(pacman)\r\np_load(stringr, kirkegaard, psych, plyr, ggplot2)\r\n\r\n#load data\r\nr = read.table(\"SSGAC_EduYears_Rietveld2013_publicrelease.txt\", sep = \"\\t\", header = T)\r\n\r\n\r\n# calculations ------------------------------------------------------------\r\n#absolute values\r\n#since we don't care about direction\r\nr$Abs_Beta = abs(r$Beta)\r\n\r\n#find cut midpoints\r\n#cut() has no built-in midpoint feature, so parse them out of the bin labels\r\n#http:\/\/www.r-bloggers.com\/finding-the-midpoint-when-creating-intervals\/\r\nmidpoints &lt;- function(x, dp=2){\r\n  lower &lt;- as.numeric(gsub(\",.*\", \"\", gsub(\"\\\\(|\\\\[|\\\\)|\\\\]\", \"\", x)))\r\n  upper &lt;- as.numeric(gsub(\".*,\" , \"\", gsub(\"\\\\(|\\\\[|\\\\)|\\\\]\", \"\", x)))\r\n  return(round(lower+(upper-lower)\/2, dp))\r\n}\r\n\r\n#make new dfs\r\ncut_vec = c(10, 20, 30, 40, 50)\r\nd_list = llply(cut_vec, function(x) {\r\n  #add cuts to r\r\n  tmp_var = str_c(\"cut_\", x)\r\n  r[tmp_var] = cut(r$Abs_Beta, breaks = x)\r\n  \r\n  #make a new df based on the table\r\n  data.frame(N = table(r[[tmp_var]]) %&gt;% as.numeric,\r\n             midpoint = table(r[[tmp_var]]) %&gt;% names %&gt;% midpoints(., dp = 99))\r\n}, .progress = \"text\")\r\nnames(d_list) = str_c(\"cut_\", cut_vec) #add names\r\n\r\n# plots --------------------------------------------------------------------\r\nggplot(r, aes(Abs_Beta)) + geom_histogram() + xlab(\"Absolute beta coefficient\")\r\nggsave(\"beta_histogram.png\")\r\n\r\n#loop plot\r\nfor (i in seq_along(d_list)) {\r\n  #fetch data\r\n  tmp_d = d_list[[i]]\r\n  \r\n  #linear\r\n  ggplot(tmp_d, aes(midpoint, N)) + geom_point() + geom_smooth() + ylab(\"Number of SNPs\") + xlab(\"Midpoint of range\")\r\n  name = 
str_c(names(d_list)[i], \"_beta_N_linear.png\")\r\n  ggsave(name)\r\n  \r\n  #log\r\n  try({ #we try because log transformation can give an error\r\n    ggplot(tmp_d, aes(midpoint, N)) + geom_point() + geom_smooth() + ylab(\"Number of SNPs\") + xlab(\"Midpoint of range\") + scale_y_log10() + geom_smooth(method = \"lm\", se = F, color = \"red\")\r\n    name = str_c(names(d_list)[i], \"_beta_N_log.png\")\r\n    ggsave(name)\r\n  })\r\n  \r\n}\r\n\r\n\r\n# investigate -------------------------------------------------------------\r\ntable(r$cut_50)\r\n\r\n#correlate allele frequency with absolute beta\r\n#(the frequency column is assumed to be named EAF; adjust to match the data file)\r\ncor.test(r$EAF, r$Abs_Beta)<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>It is often said that polygenic traits are based on tons of causal variants each of which has a very small effect size. What is less often discussed is the distribution of these effect sizes, although this has some implications. The first statistical importance is that we may want to modify our hyperprior if using [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1839,1690,2591],"tags":[2245,2234,1979],"class_list":["post-5574","post","type-post","status-publish","format-standard","hentry","category-psychometics","category-genetics","category-intelligence-iq-cognitive-ability","tag-distribution","tag-polygenic","tag-r","entry"],"_links":{"self":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5574","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/comments?post=5574"}],"version-history":[{"count":4
,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5574\/revisions"}],"predecessor-version":[{"id":5603,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5574\/revisions\/5603"}],"wp:attachment":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media?parent=5574"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/categories?post=5574"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/tags?post=5574"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}