SQL server for population frequencies from 1000 genomes

Note: 2018 June 26

Server down right now, investigating.

Note: August 16, 2017

server IP changed to 67.207.92.10.

Original post

We need dplyr for this:

library(dplyr)

First, use the anon user to log into the SQL server (user = “anon”, pass = “”, ip = “67.207.92.10”, port = 3306):

sql = src_mysql("population_freqs", host = "67.207.92.10", user = "anon", port = 3306)

Select the 1000 genomes phase 3 table:

sql_1kg = tbl(sql, "1000genomes_phase3")

#look at the first 10 rows
sql_1kg
## Source:   query [?? x 35]
## Database: mysql 10.0.27-MariaDB-0ubuntu0.16.04.1 [anon@67.207.92.10:/population_freqs]
## 
##      CHR       SNP    A1    A2    ACB    ASW    BEB    CDX    CEU    CHB
##    <int>     <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1 rs1000033     G     T 0.4062 0.3279 0.2151 0.3172 0.2071 0.3107
## 2      1 rs1000050     C     T 0.6302 0.5246 0.2267 0.4247 0.1364 0.3932
## 3      1 rs1000070     T     C 0.6146 0.5902 0.5116 0.4946 0.2828 0.3204
## 4      1 rs1000073     G     A 0.4219 0.3689 0.2093 0.1989 0.6717 0.1456
## 5      1 rs1000075     T     C 0.3750 0.4180 0.2733 0.3172 0.3586 0.2573
## 6      1 rs1000085     C     G 0.0781 0.1393 0.0349 0.0645 0.2121 0.0437
## 7      1 rs1000127     C     T 0.2500 0.2787 0.4012 0.4677 0.3081 0.5728
## 8      1 rs1000184     C     G 0.0521 0.0574 0.2442 0.8333 0.2879 0.7767
## 9      1 rs1000211     T     C 0.0260 0.0246 0.0000 0.0000 0.0000 0.0000
## 10     1 rs1000212     A     G 0.0365 0.0246 0.0000 0.0000 0.0000 0.0000
## # ... with more rows, and 25 more variables: CHS <dbl>, CLM <dbl>,
## #   ESN <dbl>, FIN <dbl>, GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>,
## #   ITU <dbl>, JPT <dbl>, KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>,
## #   PEL <dbl>, PJL <dbl>, PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>,
## #   EAS <dbl>, EUR <dbl>, AFR <dbl>, AMR <dbl>, SAS <dbl>

The entire file is really large, about 3.6 GB in memory. You often only need a few (1-1000) SNPs, so let’s try downloading only a few:

#first 10 of the hits from the latest height GWAS
#http://www.nature.com/ng/journal/v46/n11/full/ng.3097.html
some_snps = c("rs425277", "rs9434723", "rs10779751", "rs2284746", "rs12137162", 
"rs212524", "rs1014987", "rs2806561", "rs4601530", "rs926438")

#fetch from SQL server
(sql_height_freqs = sql_1kg %>% filter(SNP %in% some_snps) %>% collect())
## # A tibble: 10 x 35
##      CHR        SNP    A1    A2    ACB    ASW    BEB    CDX    CEU    CHB
##    <int>      <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1  rs1014987     G     C 0.1458 0.1475 0.1802 0.5000 0.2525 0.4612
## 2      1 rs10779751     A     G 0.6406 0.5984 0.1279 0.0591 0.2677 0.0631
## 3      1 rs12137162     A     C 0.1198 0.1557 0.2384 0.1935 0.2677 0.2621
## 4      1   rs212524     T     C 0.1823 0.1967 0.3605 0.2473 0.4293 0.1796
## 5      1  rs2284746     G     C 0.1927 0.1475 0.4012 0.1882 0.5657 0.2573
## 6      1  rs2806561     A     G 0.3073 0.3934 0.3488 0.5591 0.5909 0.5388
## 7      1   rs425277     T     C 0.0990 0.0984 0.3023 0.1452 0.2576 0.1990
## 8      1  rs4601530     T     C 0.4271 0.3852 0.4535 0.5538 0.2475 0.4515
## 9      1   rs926438     T     C 0.8177 0.7541 0.2093 0.1774 0.5455 0.3010
## 10     1  rs9434723     A     G 0.2240 0.2541 0.1744 0.0538 0.1414 0.0922
## # ... with 25 more variables: CHS <dbl>, CLM <dbl>, ESN <dbl>, FIN <dbl>,
## #   GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>, ITU <dbl>, JPT <dbl>,
## #   KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>, PEL <dbl>, PJL <dbl>,
## #   PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>, EAS <dbl>, EUR <dbl>,
## #   AFR <dbl>, AMR <dbl>, SAS <dbl>

All the atomic populations are there as well as the 5 super populations (‘macro races’). The numbers for the super populations differ slightly from those that can be seen on ensembl because they used weighted means and I used unweighted means.