Categories

Regression Modeling Strategies (2nd ed.) – Frank Harrell (review)

I had heard some good things about this book, and some of it is good. Certainly, the general approach outlined in the introduction is pretty sound. He sets up the following principles:

1. Satisfaction of model assumptions improves precision and increases statistical power.
2. It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong.
3. Graphical methods should be married to formal inference.
4. Overfitting occurs frequently, so data reduction and model validation are important.
5. In most research projects, the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model.
6. The bootstrap is a breakthrough for statistical modeling, and the analyst should use it for many steps of the modeling strategy, including derivation of distribution-free confidence intervals and estimation of optimism in model fit that takes into account variations caused by the modeling strategy.
7. Imputation of missing data is better than discarding incomplete observations.
8. Variance often dominates bias, so biased methods such as penalized maximum likelihood estimation yield models that have a greater chance of accurately predicting future observations.
9. Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly.
10. Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one.
11. Methods that work for all types of regression models are the most valuable.
12. Using the data to guide the data analysis is almost as dangerous as not doing so.
13. There are benefits to modeling by deciding how many degrees of freedom (i.e., number of regression parameters) can be “spent,” deciding where they should be spent, and then spending them.

Readers will recognize many of these from my writings. Not mentioned in the principles is that the book takes a somewhat anti-P-value stance (roughly ‘they have some uses but are widely misused, so beware!’) and a pro-effect-size-estimation stance. And some of the chapters do seem to follow these principles, but IMO the majority of the book does not. Mostly it consists of endless variations on testing for non-linear effects of predictors, whereas in real life a lot of predictors will be boringly linear. There’s some decent material about overfitting, bootstrapping and penalized regression, but it has been done better elsewhere (read An Introduction to Statistical Learning). I did learn some new things, including on the applied side (e.g. the ease of applying cubic splines, something that would have been useful for this study), and the book comes with a complimentary R package (rms), so one can apply the ideas to one’s own research immediately. On the other hand, most of the graphics in the book are terrible base plot ones, and only some are ggplot2.

This edition (published 2015; the first edition came out 2001) needed more work before publication, but it is still worth reading for people with an interest in post-replication-crisis statistics. Frank Harrell should team up with Andy Field, who’s a much better writer, and with someone with good ggplot2 skills (throw in Shiny too for extra quality). Then they could write a really good stats book.


Do you hate Python’s range() and want to use R’s seq()?

I do. So I implemented R’s function in Python.

I wrote a bunch of tests covering most uses I could think of, and they all pass. The exception is the length parameter, which I have not implemented so far; I did not need it. It could be implemented by simply generating the desired number of natural numbers beginning from 1, and then rescaling these to the desired range.

# R-style seq()
def seq(from_=None, to=None, by=None, length=None, along=None):
    # length: not implemented yet
    if length is not None:
        # TODO: implement when needed
        raise NotImplementedError("length is not implemented yet")

    # along an object: return 1-based indices, as R's seq_along() does
    if along is not None:
        return list(range(1, len(along) + 1))

    # if all numeric arguments are None, set defaults
    if from_ is None and to is None and by is None:
        from_ = 1
        to = 1

    # 1 argument: it is actually to
    if from_ is not None and to is None and by is None:
        to = from_
        from_ = 1

    # default step: direction depends on whether the sequence rises or falls
    if by is None:
        by = -1 if from_ > to else 1

    # impossible by
    if by == 0:
        raise ValueError("by cannot be zero")
    if from_ > to and by > 0:
        raise ValueError("by cannot be positive when direction is falling")
    if from_ < to and by < 0:
        raise ValueError("by cannot be negative when direction is rising")

    # R's seq() includes the endpoint, Python's range() does not,
    # so extend the stop value by one step in the direction of travel
    adjustment = 1 if by > 0 else -1
    return list(range(from_, to + adjustment, by))


# convenience wrapper
def seq_along(x):
    return seq(along=x)
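For reference, the rescaling idea for the missing length parameter could be sketched like this (seq_length is a hypothetical standalone helper, not part of the function above):

```python
# Hypothetical sketch of the length parameter: generate length natural
# numbers starting at 1, then map them linearly onto the span [from_, to].
def seq_length(from_, to, length):
    if length == 1:
        return [from_]
    step = (to - from_) / (length - 1)  # rescale unit steps to the target span
    return [from_ + i * step for i in range(length)]

seq_length(0, 1, 5)  # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```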


Do music genres exist? An outline of an empirical approach

I don’t have time to do this project right now. However, since I gathered a bunch of relevant resources that I don’t want to have to re-find, I will write them down somewhere — here. My blog is my extended memory.

Music genres are another fuzzy concept, like races. Lots of genres (and classification systems) exist, but genres seem fairly arbitrary and their usage also seems fairly inconsistent. However, it is possible to clean up the mess somewhat by using statistical methods to find patterns of similarity in music.

Recently, someone asked on Reddit about how to do something like this in R. My answer will serve as this post.

Essentially, what one needs to find/invent are numeric measures of music that vary between songs and which are related to genres. Some immediate ideas:

• Beats per minute (bpm): overall and variation.
• Dynamic range.
• Length.
• Relative presence of vocals.
• Repetitivity.
• Frequency: overall and variation.
• Extract melodies. These can be analyzed extensively.

With these data in hand, one can use cluster analysis or dimension reduction (factor analysis/PCA) methods to try to infer clusters of similar music. One needs to use a large collection of varied music for this to work, e.g. 1k or 10k songs.
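As a toy illustration of the clustering step (made-up feature values and a minimal pure-Python k-means; a real analysis would extract features with the packages below and standardize them so bpm does not dominate the distances):

```python
import random
import statistics

# Made-up feature vectors (bpm, dynamic range) for six hypothetical songs:
# three fast/compressed ones and three slow/dynamic ones.
songs = [(170, 9), (172, 10), (168, 8),
         (90, 20), (95, 22), (88, 19)]

def kmeans(points, k, iters=20, seed=1):
    """Minimal Lloyd's k-means: assign each point to the nearest
    centroid, then recompute the centroids, repeatedly."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [tuple(statistics.mean(d) for d in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return clusters

clusters = kmeans(songs, k=2)  # separates the fast songs from the slow ones
```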

Some previous literature on the topic:

For validation, one can look up the same songs on LastFM to get their tags. I would also look for artist and album-artist effects, as generally speaking songs from one artist sound alike, and especially so when they are on the same album.

R packages for sound analysis

The tuneR library has been used for clustering before:

http://www.vesnam.com/Rblog/sortmymusic/

The seewave package is pretty comprehensive.

http://rug.mnhn.fr/seewave/

The signal package probably has some useful tools.

https://cran.r-project.org/web/packages/signal/index.html


Renaming functions in R packages using roxygen2

Naming things in programming is hard. I can see 2 reasons for this. 1) It is hard to pick a name you won’t want to change later. With large projects, where things are not designed in detail before one begins to write the code, I often find that the names I gave things initially were not quite right and later needed to be replaced so I could remember what my own functions or objects did. Renaming things can be difficult, but generally involves doing a lot of search-and-replace on the codebase. If there are databases too, then it’s even more complicated. 2) The longer the name, the more descriptive it can be, but the longer it takes to type. Even with auto-completion, this adds to the writing time and to hand fatigue. A user on Quora put it well: “The number of things in computer science is hugely greater than the number of words.”

Naming things in R is even harder than usual because, unlike e.g. Python and every other language I’ve used, R attaches every exported function to the global search path. This means that your functions should have names not found in other packages to avoid conflicts. Given that there are some 12,000 R packages as of writing, this can be a little difficult. Name conflicts lead to a number of difficult-to-recognize problems such as this one.

Often I find that I have created a number of similar functions, so it makes sense to give them similar names. A growing number of R packages use prefixes to avoid the namespace problem: str_ in stringr, p_ in pacman, html_ in rvest. Hadley’s packages often have functions with recognizable patterns in their names without strict prefixes: write_* and read_* in readr, **ply in plyr.

So what do you do when you have written a package, used functions from that package in many other projects (such that the search-replacement strategy would be cumbersome), and now find a strong desire to rename a function?

One option is to keep all the old function names working. I did this in roxygen2 using the @export and @aliases tags:

#' Add together two numbers
#'
#' @param x A number
#' @param y A number
#' @return The sum of \code{x} and \code{y}
#' @export
#' @aliases old_name
#' @examples
#' add(1, 1)
add = function(x, y) {
  x + y
}


This clutters the auto-completion feature and the function documentation, and it does not inform users that the old name is outdated.

I would rather have users change to the new names so that I can keep the documentation clean. Thus, I wrote a simple function generator:

#function generator
defunct = function(msg = "This function is deprecated") function(...) stop(msg)

So every time you want to rename a function, simply put lines like these in an R file in the package:

#' @export
old_name = defunct("old_name changed name to new_name")

Users who try to use the old function thus get an informative and easily actionable error. The ... argument means that the stub will not error on the arguments, no matter which are passed. The stub has no documentation, so it does not cause clutter. It still shows up in auto-completion, but fixing that would be more difficult.
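For comparison, the same stub-generator pattern translates directly to other languages. A rough Python analog (hypothetical names): a factory returning a function that swallows any arguments, like R’s ..., and raises with a pointer to the new name.

```python
# Rough Python analog of the R defunct() generator above (hypothetical names)
def defunct(msg="This function is deprecated"):
    def stub(*args, **kwargs):  # accepts any arguments, like R's ... does
        raise RuntimeError(msg)
    return stub

old_name = defunct("old_name changed name to new_name")
```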


PSA: bugfix for the scoreIrt function in psych package, you should re-do your analyses!

I was working on a study, and while doing so I updated all my R packages. To my surprise, this changed the results of my analysis to a degree that altered the conclusion. I suspected it was due to an update to the psych package. The update notes read:

Bug fixed in scoreIrt where the values for the scores for people who gave all of the highest possible scores was incorrect (reported by Roland Leeuwen). In addition, the scoreIrt.poly was not properly doing normal scoring but was in fact just doing logistic scoring. Fixed. This bug was affecting those people with max or min responses, and thus was particularly a problem for short scales.

So, if you used a scale with few items (e.g. ICAR5) and some persons obtained 0 or 5 correct answers, their scores will have been estimated incorrectly which may heavily distort your results. Bill Revelle (package author) notes that this was particularly troubling for himself because he often used short scales.

On a meta-science level, I appreciate the open reporting of troubling errors. This makes it possible to identify sources of inconsistent results between studies, or of failures to reproduce results even with the same data. It is troubling to think about how errors in statistical software affect science, and we generally have no idea about the extent of this problem.


SQL server for population frequencies from 1000 genomes

Note: 2018 June 26

Server down right now, investigating.

Note: August 16, 2017

Server IP changed to 67.207.92.10.

Original post

We need dplyr for this:

library(dplyr)

First, use the anon user to log into the SQL server (user = “anon”, pass = “”, ip = “67.207.92.10”, port = 3306):

sql = src_mysql("population_freqs", host = "67.207.92.10", user = "anon", port = 3306)

Select the 1000 genomes phase 3 table:

sql_1kg = tbl(sql, "1000genomes_phase3")

#look at the first 10 rows
sql_1kg
## Source:   query [?? x 35]
##
##      CHR       SNP    A1    A2    ACB    ASW    BEB    CDX    CEU    CHB
##    <int>     <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1 rs1000033     G     T 0.4062 0.3279 0.2151 0.3172 0.2071 0.3107
## 2      1 rs1000050     C     T 0.6302 0.5246 0.2267 0.4247 0.1364 0.3932
## 3      1 rs1000070     T     C 0.6146 0.5902 0.5116 0.4946 0.2828 0.3204
## 4      1 rs1000073     G     A 0.4219 0.3689 0.2093 0.1989 0.6717 0.1456
## 5      1 rs1000075     T     C 0.3750 0.4180 0.2733 0.3172 0.3586 0.2573
## 6      1 rs1000085     C     G 0.0781 0.1393 0.0349 0.0645 0.2121 0.0437
## 7      1 rs1000127     C     T 0.2500 0.2787 0.4012 0.4677 0.3081 0.5728
## 8      1 rs1000184     C     G 0.0521 0.0574 0.2442 0.8333 0.2879 0.7767
## 9      1 rs1000211     T     C 0.0260 0.0246 0.0000 0.0000 0.0000 0.0000
## 10     1 rs1000212     A     G 0.0365 0.0246 0.0000 0.0000 0.0000 0.0000
## # ... with more rows, and 25 more variables: CHS <dbl>, CLM <dbl>,
## #   ESN <dbl>, FIN <dbl>, GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>,
## #   ITU <dbl>, JPT <dbl>, KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>,
## #   PEL <dbl>, PJL <dbl>, PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>,
## #   EAS <dbl>, EUR <dbl>, AFR <dbl>, AMR <dbl>, SAS <dbl>

The entire file is really large, about 3.6 GB in memory. You often only need a few (1-1000) SNPs, so let’s try downloading only a few:

#first 10 of the hits from the latest height GWAS
#http://www.nature.com/ng/journal/v46/n11/full/ng.3097.html
some_snps = c("rs425277", "rs9434723", "rs10779751", "rs2284746", "rs12137162",
"rs212524", "rs1014987", "rs2806561", "rs4601530", "rs926438")

#fetch from SQL server
(sql_height_freqs = sql_1kg %>% filter(SNP %in% some_snps) %>% collect())
## # A tibble: 10 x 35
##      CHR        SNP    A1    A2    ACB    ASW    BEB    CDX    CEU    CHB
##    <int>      <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1  rs1014987     G     C 0.1458 0.1475 0.1802 0.5000 0.2525 0.4612
## 2      1 rs10779751     A     G 0.6406 0.5984 0.1279 0.0591 0.2677 0.0631
## 3      1 rs12137162     A     C 0.1198 0.1557 0.2384 0.1935 0.2677 0.2621
## 4      1   rs212524     T     C 0.1823 0.1967 0.3605 0.2473 0.4293 0.1796
## 5      1  rs2284746     G     C 0.1927 0.1475 0.4012 0.1882 0.5657 0.2573
## 6      1  rs2806561     A     G 0.3073 0.3934 0.3488 0.5591 0.5909 0.5388
## 7      1   rs425277     T     C 0.0990 0.0984 0.3023 0.1452 0.2576 0.1990
## 8      1  rs4601530     T     C 0.4271 0.3852 0.4535 0.5538 0.2475 0.4515
## 9      1   rs926438     T     C 0.8177 0.7541 0.2093 0.1774 0.5455 0.3010
## 10     1  rs9434723     A     G 0.2240 0.2541 0.1744 0.0538 0.1414 0.0922
## # ... with 25 more variables: CHS <dbl>, CLM <dbl>, ESN <dbl>, FIN <dbl>,
## #   GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>, ITU <dbl>, JPT <dbl>,
## #   KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>, PEL <dbl>, PJL <dbl>,
## #   PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>, EAS <dbl>, EUR <dbl>,
## #   AFR <dbl>, AMR <dbl>, SAS <dbl>

All the atomic populations are there, as well as the 5 super populations (‘macro races’). The numbers for the super populations differ slightly from those shown on Ensembl, because they used weighted means and I used unweighted means.
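The difference is easy to see with made-up numbers (toy frequencies and sample sizes, not real 1000 genomes values):

```python
# Toy allele frequencies for three sub-populations of one super population,
# with unequal sample sizes (all values made up for illustration).
freqs = [0.10, 0.20, 0.60]
sizes = [100, 100, 500]

unweighted = sum(freqs) / len(freqs)  # simple average across sub-populations
weighted = sum(f * n for f, n in zip(freqs, sizes)) / sum(sizes)  # size-weighted

# The large sub-population pulls the weighted mean toward its own frequency.
print(round(unweighted, 3), round(weighted, 3))  # 0.3 0.471
```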


ggplot2 functions in kirkegaard package

http://rpubs.com/EmilOWK/215493


Cognitive ability and political preferences in Denmark: knitr edition

http://rpubs.com/EmilOWK/208757

Analysis of the data collected so far, presented side by side with the R code. It will be expanded into a proper paper and submitted to OQSPS soonish. Soonish here meaning whenever Noah gets around to doing it!

Due to the surprising results, we should probably do a follow-up replication with new subjects. These results are hard to believe in light of earlier findings from other countries based on larger samples.

Summary for the less math inclined:

• In contrast to virtually every other published study, we did not find much correlation between preferences for personal and economic freedom and cognitive ability: r’s of about .05.
• Self-rated preference for personal freedom correlated very poorly with measured preference, r≈.10, but for economic freedom the correlation was alright, r≈.50. This means one can use the latter as an okay proxy, but not the former. People don’t seem to have a good idea of how important they actually think personal freedoms are relative to other people.
• There was little to no correlation between the political axes, r≈.10. This means that one cannot summarize people’s political preferences using only 1 dimension, such as the popular left-right axis.
• Using people’s self-rated agreement with parties, one can estimate the political positions of the parties. Doing so using people’s self-rated preferences recreated the familiar left-right economic axis, but not the personal freedom axis, since all parties were about equally in favor of personal freedoms. However, using measured preferences, the economically left parties were found to be less in favor of personal freedoms than the economically right parties. The result is that the party political dimensions are highly correlated (r≈.80), and so one can summarize the parties using a 1-dimensional model.
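A sketch of the party-position idea with made-up numbers (a guess at the general approach, not necessarily the exact method used in the analysis): score each party as the agreement-weighted mean of its raters’ own positions on an axis.

```python
# Made-up data: each tuple is (respondent's economic freedom score, 0-1;
# respondent's self-rated agreement with the party, 0-10). Hypothetical values.
ratings = [(0.9, 9), (0.8, 8), (0.2, 1), (0.1, 2)]

def party_position(data):
    """Agreement-weighted mean of the raters' own axis scores."""
    total_agreement = sum(agree for _, agree in data)
    return sum(score * agree for score, agree in data) / total_agreement

pos = party_position(ratings)  # high: the party's fans favor economic freedom
```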

Installing the gsl package for R on Linux Mint 18

Today I was trying to install the MBESS package. One of its dependencies is the gsl package. However, trying to install that package gave the following error:

cp: cannot stat '*.gcno': No such file or directory
Makevars:45: recipe for target 'save-gcno' failed
make: [save-gcno] Error 1 (ignored)

Googling it gave 0 results, which is a bad sign, because we solve most bugs by googling for solutions posted by other people who have already solved them. However, in this case, the solution is simple:

sudo apt-get install r-cran-gsl