Clear Language, Clear Mind

November 28, 2016

Installing the latest version of R on Ubuntu/Mint

Filed under: Programming — Emil O. W. Kirkegaard @ 17:56

I wrote about this before, but since this is a frequent problem and my last post wasn’t brief, here’s a shorter version.


The primary way to install software on Linux is to rely on apt-get (apt in Mint) or some other package manager. The way this works is that a central server holds a list of all the software available for the version of Linux you installed. The problem is that your installed version of Linux is based on an older list of this software, so when you try to update, it says there isn't any newer version available when there actually is.

To solve this, you have to add more lists of software so Linux knows where to look for the newer versions. For R, there's a list of CRAN mirrors to choose from, and all you have to do is edit the /etc/apt/sources.list file to include something like this:

deb http://cloud.r-project.org/bin/linux/ubuntu xenial/

The name at the end corresponds to your Ubuntu version's codename; it's always the first word of the release name (e.g. xenial for 16.04 "Xenial Xerus"), and you can look it up in the Ubuntu release list. The first part of the line is one of the CRAN mirrors, of which there is a list to choose from.

After this, update the list of software (run it as root or with sudo):

apt-get update

This may work, or it may not (bug fixing in Linux is inherently stochastic). If you get an error like this:

W: GPG error: http://cloud.r-project.org/bin/linux/ubuntu xenial/ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9
E: The repository 'http://cloud.r-project.org/bin/linux/ubuntu xenial/ InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Then run this:

apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9

Notice the key ID at the end. You must copy the one from your own error message. When you run the above, you should get something like:

Executing: /tmp/tmp.nyQRPoVeus/gpg.1.sh --keyserver
keyserver.ubuntu.com
--recv-keys
51716619E084DAB9
gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
gpg: key E084DAB9: public key "Michael Rutter <marutter@gmail.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)

If so, you're good to go. Update the list of software again and install the newest R:

apt-get update
apt-get install r-base r-base-dev

 

August 10, 2016

Installing R packages on Mint 17.1: cluster, KernSmooth, Matrix, nlme, mgcv

Filed under: Computer science — Emil O. W. Kirkegaard @ 20:24

Installing R packages on Windows is easy: you run the install code and it always works. Not so on Linux! Here one sometimes has to install them through apt-get (or whatever package manager) or install some missing system-level dependencies, and finding out what to do can take a lot of time googling and trial and error. So, in the spirit of contributing to the internet's knowledge of how to get stuff to work, here are the errors I got:

1: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘cluster’ had non-zero exit status
2: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘KernSmooth’ had non-zero exit status
3: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘Matrix’ had non-zero exit status
4: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘nlme’ had non-zero exit status
5: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘mgcv’ had non-zero exit status

These all installed for me after installing:

sudo apt-get install r-base-dev
sudo apt-get install gfortran
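
If you want to verify the fix from inside R, re-running the installation for the same packages should now complete without errors (a quick sketch, using the package names from the errors above):

# re-install the previously failing packages now that the system toolchain is in place
install.packages(c("cluster", "KernSmooth", "Matrix", "nlme", "mgcv"))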

April 14, 2016

R functions for analyzing missing data

Filed under: Math/Statistics, Programming — Emil O. W. Kirkegaard @ 14:46

I’m reading Missing Data: A Gentle Introduction and it mentions various methods to understand how data are missing in a given dataset. The book, however, is light on actual tools. So, since I have already implemented a few functions in my package for handling missing data, I decided to implement a few more. These have the prefix miss_ to easily find them. I have renamed the existing functions to also follow this pattern if necessary (keeping the old names working as well so as not to break existing code).

Amount and distribution of missing data

The authors distinguish between three ways of measuring the amount of missing data: 1) by case, 2) by variable, and 3) by cell. Or in plain English: what proportion of cases/variables have at least one missing datapoint, and what proportion of cells are missing? miss_amount calculates these three proportions:

> miss_amount(iris_miss)
cases with missing data  vars with missing data cells with missing data 
                   0.42                    1.00                    0.10

The dataset used, iris_miss, is the famous iris dataset (N=150) where I have removed 10% of the data at random. Luckily, the results tell us that indeed 10% of the cells have missing values. 42% of the cases have at least one missing datapoint.
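
(The post doesn't show how iris_miss was made. A minimal sketch of one way to create a similar dataset, assuming the df_addNA() helper from the same package, which is used later in this post and adds ~10% missing values at random; the seed is arbitrary, so the exact values will differ.)

library(kirkegaard)
set.seed(1) # arbitrary; the seed used for the example data is not given
iris_miss = df_addNA(iris) # presumably how the example dataset was created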

However, it is usually more interesting to know not just whether data are missing for a given case/variable, but also how many datapoints are missing. miss_by_case calculates the number of missing datapoints by case:

> miss_by_case(iris_miss)
  [1] 0 1 0 1 0 0 0 0 0 2 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 2 0 0 0 3 0 0 0 1 1 0 0 1 1 1 0 0 1 0 2 1 1 0 1 0 0 0 0 0 3 2 0
 [59] 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 4 0 0
[117] 1 0 1 1 1 0 0 0 0 2 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1

This is most useful when combined with table:

> miss_by_case(iris_miss) %>% table()
.
 0  1  2  3  4 
87 55  5  2  1

Thus, we see that most cases that miss data actually only miss 1 datapoint. One can count the missing datapoints by variable with miss_by_var:

> miss_by_var(iris_miss)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          16           14           16           12           17
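
For comparison, the same counts can be had in base R from the NA indicator matrix:

# base R equivalents of the two helpers above
rowSums(is.na(iris_miss)) # missing datapoints per case, like miss_by_case()
colSums(is.na(iris_miss)) # missing datapoints per variable, like miss_by_var()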

The above information is most easily understood with the help of a plot, made with miss_plot:

> miss_plot(iris_miss)
> ggsave("miss_plot_case.png")
Saving 8.95 x 5.67 in image
> miss_plot(iris_miss, case = F)
> ggsave("miss_plot_var.png")
Saving 8.95 x 5.67 in image

[Figures: miss_plot_case and miss_plot_var]

The first shows the number of missing datapoints by case and the second by variable.

Patterns of missing data

Statisticians talk about three ways data can be missing, related to the statistical 'mechanism' that caused them to be missing. They have confusing names, so I will just call them types 1, 2 and 3 (MCAR, MAR and MNAR). Type 1 is data that is just missing at random. For instance, if we delete datapoints at random from a dataset as we did above, they are missing at random. Or, if instead of giving participants all items we gave each of them a random subset, the skipped items would also be missing at random.

Type 2 is when the data are missing and this is some (possibly probabilistic) function of the other data in the dataset. In this way, the data are missing but not at random. For instance, younger children may not report their own weights because they don't know them; they do report their age, however, so the missingness of weight is related to age.

Type 3 is when the missingness is a function of the very same datapoints that are missing. For instance, if we ask people whether they have a criminal record, people with criminal records may tend not to acknowledge this.

Type 1 missing data we can just ignore. There is no particular pattern in how it is missing: it reduces our sample size, but does not bias the estimates of the relationships in any particular direction. Type 2 missing data we can perhaps deal with, because we can impute it based on the other variable(s) in the dataset. About type 3 missing data, however, there is not much we can directly do. If we have other datasets with the same variable (e.g. police records), we can compare the distributions. This would allow us to see that in our sample the proportion of people with a criminal record is too low, so we suspect that the persons with criminal records tend not to report them.

How can we figure out which type of missingness we have? First, we can create a matrix of the missing data. This is not so useful in itself, but it is useful for further analyses. miss_matrix creates this:

> iris_miss[1:10, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2    <NA>
3           4.7         3.2          1.3         0.2  setosa
4           4.6          NA          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10           NA         3.1          1.5         0.1    <NA>
> miss_matrix(iris_miss[1:10, ])
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1             0           0            0           0       0
2             0           0            0           0       1
3             0           0            0           0       0
4             0           1            0           0       0
5             0           0            0           0       0
6             0           0            0           0       0
7             0           0            0           0       0
8             0           0            0           0       0
9             0           0            0           0       0
10            1           0            0           0       1

To save space, we only look at the first 10 rows. We see that some data are missing (NA). In the missingness matrix (the second output above), these cells have 1's and the others have 0's. Simple.

If we inspect the first ten cases with missing data, we can see that they tend to miss data in different ways:

> iris_miss[miss_by_case(iris_miss) >= 1, ] %>% head(10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2           4.9         3.0          1.4         0.2    <NA>
4           4.6          NA          1.5         0.2  setosa
10           NA         3.1          1.5         0.1    <NA>
12          4.8         3.4           NA         0.2  setosa
14          4.3         3.0          1.1         0.1    <NA>
16          5.7          NA          1.5         0.4  setosa
19          5.7          NA          1.7         0.3  setosa
27           NA         3.4          1.6          NA  setosa
31           NA         3.1           NA         0.2    <NA>
35           NA         3.1          1.5         0.2  setosa

We see that case 2 is missing 1 datapoint in the 5th variable, and case 4 is missing 1 datapoint in the 2nd variable. Cases 27 and 31 are missing multiple datapoints, but in different ways. How can we measure this? We can use the powers-of-2 method. We create a vector of powers of 2 that is the same length as the number of variables, so in this case {2, 4, 8, 16, 32}. Then we multiply this vector row-wise with the missing data matrix, which gives us:

> miss_matrix(iris_miss)[miss_by_case(iris_miss) >= 1, ] %>% head(10) %>% t() %>% multiply_by(c(2, 4, 8, 16, 32)) %>% t()
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2             0           0            0           0      32
4             0           4            0           0       0
10            2           0            0           0      32
12            0           0            8           0       0
14            0           0            0           0      32
16            0           4            0           0       0
19            0           4            0           0       0
27            2           0            0          16       0
31            2           0            8           0      32
35            2           0            0           0       0

Compare this with the missingness matrix above. It is the same, except that the cells with 1's have been replaced with powers of 2 based on the variable they are in. Cells in the first variable are replaced with 2's (2^1 = 2), those in the second with 4's (2^2 = 4), and so on. This in itself is not useful, but when we sum the numbers by row, we get a unique ID for each way the data can be missing. These values are calculated by miss_pattern:

> miss_pattern(iris_miss)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
  0  32   0   4   0   0   0   0   0  34   0   8   0  32   0   4   0   0   4   0   0   0   0   0   0   0  18   0   0   0 
 31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
 42   0   0   0   2   4   0   0  32  32   8   0   0  32   0  18   2   2   0   2   0   0   0   0   0  52  20   0   8   0 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
  0   2   0   2   8   8  32   2   0   8  32  32   0   2  16   0   0   0   2   0   0   0   0   0   0  16   4   0   0   4 
 91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
  0  16   0   4  32   8   0   2   2   0   0   8  32   2  16   4   0   0   0   0  32   4   0  60   0   0   8   0   8   4 
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 
  4   0   0   0   0  40   0   0   0   0   0   0   0  32   0  16   8   0   0   0   8  16   0   0   0   8   0   0   0  16

The results given this way are not so understandable, but we can tabulate them:

> miss_pattern(iris_miss) %>% table()
.
 0  2  4  8 16 18 20 32 34 40 42 52 60 
87 12 11 13  7  2  1 12  1  1  1  1  1

Now we see it more clearly. These are the patterns of missing data; the ways data can be missing. The powers of 2 are the simple cases of 1 datapoint missing in different variables which we saw earlier. The others are composites. For instance, 18 is 2+16, thus representing cases where data are missing in variable 1 (2) and variable 4 (16).
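
To make the encoding concrete, here is a minimal sketch that recomputes the pattern IDs by hand, assuming miss_matrix() returns the 0/1 matrix shown earlier:

m = as.matrix(miss_matrix(iris_miss)) # 0/1 missingness matrix
weights = 2^seq_len(ncol(m)) # 2, 4, 8, 16, 32 for the 5 variables
pattern_ids = as.vector(m %*% weights) # weighted row sums = one ID per case
table(pattern_ids) # should match the miss_pattern() table above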

This provides us with a measure of the complexity of the missing data: we count the number of patterns found in the data and the number of cases, and the fraction of the two represents the complexity. miss_complexity calculates this:

> miss_complexity(iris_miss)
[1] 0.04666667

Thus, in our case, the pattern of missing data is relatively simple.

Type of missing data?

Sometimes we don't know how the data are missing. It may be type 1, 2, 3, or some combination. How do we find out? We can distinguish between types 1 and 2 by analyzing how the missing data relate to the other variables. If the data are missing at random, there should be no relationships, beyond chance, between missingness in one variable and the values of the other variables. Examining an entire dataset for non-random missingness thus requires a lot of calculations, because we must compare the missing data in each variable with each other variable's data, i.e. p*(p-1) comparisons. miss_analyze does this for us:

> miss_analyze(iris_miss)
             Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Sepal.Length           NA  0.06736091   0.41567496  0.46781290    0.40
Sepal.Width   -0.02591574          NA  -0.08697338  0.02125442    0.01
Petal.Length  -0.37577044  0.09577354           NA -0.42815844    0.25
Petal.Width   -0.55242759 -0.06717756  -0.40272417          NA    0.15
Species        0.27082506 -0.01113144   0.22816556  0.25833418      NA

Each cell is an effect size (Cohen's d), calculated by splitting the values of the column variable into 2 groups based on whether data are missing in the row variable. The grouping variable is on the left and the outcome variable is on the top. Thus, we see that there seem to be some patterns in how the data are missing for this dataset. For instance, cases that have missing data for Petal.Length have smaller Sepal.Lengths (d = -.38). Since we know this dataset actually has data missing at random, these must represent chance findings. If we look at all the numbers together, we do generally see that the values are smallish.

Note that because one cannot calculate Cohen’s d/correlations for nominal variables (like Species), I have used Cramer’s V converted to Cohen’s d instead. This has the implication that the value cannot be negative.

Our example dataset has N=150 with about 10% missing per variable, so the Cohen's d values will be calculated from groups with sizes of roughly 15 and 135. If we do 20 comparisons as we did above, we are likely to observe some largish values just by chance. The largest absolute value above is .55, representing a medium-sized effect. The mean absolute d is .23.

If we examined a larger dataset that also has data missing at random, we should see much smaller values:

> set.seed(1)
> rnorm(5e3) %>% matrix(nrow = 1000) %>% as.data.frame() %>% df_addNA() %>% miss_analyze()
            V1          V2          V3          V4          V5
V1          NA -0.07951442  0.19249439 -0.01830883 -0.11540971
V2 -0.09698717          NA -0.11780557 -0.05403185  0.09778738
V3  0.11954180  0.03310052          NA -0.01917229  0.13087361
V4 -0.06356078  0.12931941  0.02131987          NA  0.26977611
V5 -0.16705158 -0.01179713  0.01402070  0.03015518          NA

Here, we generated a dataset with 1000 cases and 5 variables, then added 10% missing data at random as before. We do see that the values have become much smaller. The mean abs. d is .07.

R code

The functions can be found in my github package: https://github.com/Deleetdk/kirkegaard

The example code: https://gist.github.com/Deleetdk/fcfcc6ba6f1c39a234717dc3e0c83369

March 24, 2016

kirkegaard: Plot contingency table with ggplot2

Filed under: Programming — Emil O. W. Kirkegaard @ 00:27

This is a post in the ongoing series about stuff in my package: kirkegaard. [I'm not egocentric, but since there is no central theme to the functions in the package other than that I made and use them, there is nothing else to call it.]

I figured it would be easy to find someone who had written a ready-made function to plot contingency tables with ggplot2. Apparently not!

However, it is a relatively simple matter. So here’s one:

Features:

  • Easily plot a decent contingency table with ggplot2. Just input the datafile and the names of the two variables.
  • Can also calculate marginal values, which is to say that either all rows or all columns sum to 1.
  • Automatically sets the axis labels to the names you used, which is often enough.
library(devtools); install_github("deleetdk/kirkegaard"); library(kirkegaard)
GG_contingency_table(mpg, "drv", "cyl")

[Image: contin_table_2]

GG_contingency_table(mpg, "drv", "cyl", margin = 1)

[Image: contin_table_1]

GG_contingency_table(mpg, "drv", "cyl", margin = 2)

[Image: contin_table_2]
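
The package implementation isn't reproduced in the post; for readers who want to see how little ggplot2 code such a plot needs, here is a minimal sketch (not the package code, and with simplified argument handling):

# a bare-bones contingency table plot: counts (or proportions) in a tile grid
library(ggplot2)

GG_contingency_sketch = function(df, var1, var2, margin = NULL) {
  tbl = table(df[[var1]], df[[var2]])
  if (!is.null(margin)) tbl = prop.table(tbl, margin = margin) # margin = 1: rows sum to 1; 2: columns
  d = as.data.frame(tbl)
  names(d) = c(var1, var2, "value")
  ggplot(d, aes_string(x = var2, y = var1)) +
    geom_tile(fill = "white", color = "grey50") +
    geom_text(aes(label = round(value, 2)))
}

GG_contingency_sketch(mpg, "drv", "cyl") # same calls as above
GG_contingency_sketch(mpg, "drv", "cyl", margin = 1)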

 

March 14, 2016

Making use of list-arrays in R

Filed under: Programming — Emil O. W. Kirkegaard @ 13:30

Quick recap of the main object types in R from Advanced R:

       homogeneous      heterogeneous
1-d    atomic vector    list
2-d    matrix           data.frame
n-d    array            ???

So, objects can either store only data of the same type or of any type, and they can have 1, 2 or any number of dimensions. Note that there is a missing cell. But actually, all the object types in the right column are lists. Data.frames are lists that look and behave like matrices in some ways. So, if we can make a list behave like a matrix, can we make a list behave like an n-dimensional array? Yes: list-arrays.

Why would we want an array that can take any type of data? Well, as argued before, one can use this data structure to automate analysis across methods and subsets; in other words, to make it easy to test the effects of researcher degrees of freedom. What if we didn't make that correction to the data to make it more normal? What if we split the data into Whites and Blacks, or analyzed them together? What about splitting the data by gender: both genders, males, females? Examining all these manually requires doing the analysis 2 * 3 * 3 = 18 times. This would probably make for very long code that would break easily and be hard to update. In other words, it would violate the DRY principle.

In some cases, to do the analysis across many settings, one needs to save data.frames for each setting. This is fairly cumbersome using just lists, but can be done. If we only need to enter different parameters, just storing all of them in a data.frame would be sufficient. In my case, however, I needed to repeat an analysis across a large number of (sub-)samples that were derived using two parameters.

So, how to set up the initial list-array? I wrote a simple function to do this:

t = make_list_array(1:2, letters[1:2], LETTERS[1:2]); t
# , , A
# 
#   a  b 
# 1 NA NA
# 2 NA NA
# 
# , , B
# 
#   a  b 
# 1 NA NA
# 2 NA NA

One can give the function either vectors or whole numbers. If one gives it vectors, it uses their lengths and their values as the dimension names to construct the list-array. If one gives it whole numbers, it uses them as the lengths and just uses the natural numbers as the names.
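
The function itself isn't reproduced in the post; here is a minimal sketch of how such a constructor could look in base R (my guess at the logic, not the package code):

# build an array whose cells are list elements, so each cell can hold any R object
make_list_array_sketch = function(...) {
  dims = list(...)
  # a single whole number becomes 1..n with natural-number names; a vector contributes its values as names
  dim_names = lapply(dims, function(d) {
    if (is.numeric(d) && length(d) == 1) as.character(seq_len(d)) else as.character(d)
  })
  sizes = vapply(dim_names, length, integer(1))
  array(rep(list(NA), prod(sizes)), dim = sizes, dimnames = dim_names) # cells start out as NA, as above
}

t = make_list_array_sketch(1:2, letters[1:2], LETTERS[1:2])
t["1", "a", "A"] = list(data.frame(x = 1:3)) # any object can be stored in a cell
t[1, 1, 1][[1]] # and retrieved again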

I’m still not sure if using list-arrays is superior to just using one long 1-d list and storing the parameter values in a data.frame made using expand.grid. For instance:

do.call("expand.grid", args = dimnames(t))
#   Var1 Var2 Var3
# 1    1    a    A
# 2    2    a    A
# 3    1    b    A
# 4    2    b    A
# 5    1    a    B
# 6    2    a    B
# 7    1    b    B
# 8    2    b    B

Time will tell.

March 9, 2016

Excluding missing or bad data in R: not as easy as it should be!

Filed under: Programming — Emil O. W. Kirkegaard @ 17:47

Ongoing series of posts about functions in my R package (https://github.com/Deleetdk/kirkegaard).

Suppose you have a list or a simple vector (lists are vectors) with some data. However, some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually, we want to get rid of these datapoints, but it can be difficult with the built-in functions. R’s built-in functions for handling missing (or bad) data are:

  • is.na
  • is.nan
  • is.infinite / is.finite
  • is.null

Unfortunately, they are not consistently vectorized and some of them match multiple types. For instance:

x = list(1, NA, 2, NULL, 3, NaN, 4, Inf) #example list
is.na(x)
#> [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

So, is.na actually matches NaN as well. What about is.nan?

is.nan(x)
#> Error in is.nan(x) : default method not implemented for type 'list'

So that one turns out not to be vectorized over lists at all. And it gets worse:

sapply(x, is.nan)
#> [[1]]
#> [1] FALSE
#> 
#> [[2]]
#> [1] FALSE
#> 
#> [[3]]
#> [1] FALSE
#> 
#> [[4]]
#> logical(0)
#> 
#> [[5]]
#> [1] FALSE
#> 
#> [[6]]
#> [1] TRUE
#> 
#> [[7]]
#> [1] FALSE
#> 
#> [[8]]
#> [1] FALSE

Note that calling is.nan on NULL returns an empty logical vector (logical(0)) instead of FALSE. This also changes the output from sapply to a list instead of a vector we can subset with. is.infinite behaves the same way: not vectorized and gives logical(0) for NULL.

But suppose you want a robust function for handling missing data, and one with specificity, i.e. one that excludes only the kinds of bad values you ask it to. I could not find such a function, so I wrote one. Testing it:

are_equal(exclude_missing(x), list(1, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NA = F), list(1, NA, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NULL = F), list(1, 2, NULL, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NaN = F), list(1, 2, 3, NaN, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .Inf = F), list(1, 2, 3, 4, Inf))
#> [1] TRUE

So, in all cases it excludes only the type that we want to exclude, and it does not fail due to the lack of vectorization in the base R functions.
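
exclude_missing() itself isn't shown in the post; here is a minimal sketch of how such a function can be written (the argument names follow the tests above, but this is my reconstruction, not the package code):

exclude_missing_sketch = function(x, .NA = TRUE, .NULL = TRUE, .NaN = TRUE, .Inf = TRUE) {
  # test each element separately so a NULL cannot collapse the result to logical(0)
  bad = vapply(x, function(el) {
    if (is.null(el)) return(.NULL)
    if (length(el) != 1) return(FALSE) # only length-1 elements are checked further
    if (is.nan(el)) return(.NaN) # check NaN before NA, since is.na(NaN) is TRUE
    if (is.na(el)) return(.NA)
    if (is.infinite(el)) return(.Inf)
    FALSE
  }, logical(1))
  x[!bad]
}

x = list(1, NA, 2, NULL, 3, NaN, 4, Inf)
exclude_missing_sketch(x) # list(1, 2, 3, 4)
exclude_missing_sketch(x, .NaN = FALSE) # keeps the NaN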

Edited

Turns out that there are more problems:

is.na(list(NA))
#> [1] TRUE

So, for some reason, is.na returns TRUE when given a list containing NA. This shouldn't happen, I think.

January 1, 2016

R: fastest way of finding out whether all elements of a vector are identical?

Filed under: Programming — Emil O. W. Kirkegaard @ 23:25

There is a question on SO about this: http://stackoverflow.com/questions/4752275/test-for-equality-among-all-elements-of-a-single-vector

But I was a bit more curious, so!

#test data, large vectors
v1 = rep(1234, 1e6)
v2 = runif(1e6)

#functions to try
all_the_same1 = function(x) {
  diff(range(x)) == 0 # range() returns c(min, max), so compare their difference to zero
}

all_the_same2 = function(x) {
  max(x) == min(x)
}

all_the_same3 = function(x) {
  sd(x) == 0
}

all_the_same4 = function(x) {
  var(x) == 0
}

all_the_same5 = function(x) {
  x = x - mean(x)
  all(x == 0)
}

all_the_same6 = function(x) {
  length(unique(x)) == 1
}

Simple enough, 6 functions to try and some test data. Then we benchmark:

library("microbenchmark")

microbenchmark(all_the_same1(v1),
               all_the_same2(v1),
               all_the_same3(v1),
               all_the_same4(v1),
               all_the_same5(v1),
               all_the_same6(v1))

microbenchmark(all_the_same1(v2),
               all_the_same2(v2),
               all_the_same3(v2),
               all_the_same4(v2),
               all_the_same5(v2),
               all_the_same6(v2))

Results (for me) look like this, for the first vector:

Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval
 all_the_same1(v1)  4.321870  4.393697  5.554033  4.442849  4.690950 73.166929   100
 all_the_same2(v1)  2.467258  2.494175  2.522648  2.509389  2.534696  2.706289   100
 all_the_same3(v1)  3.661536  3.701472  3.783067  3.736434  3.810016  4.496828   100
 all_the_same4(v1)  3.657147  3.708786  3.774908  3.746528  3.804603  4.343520   100
 all_the_same5(v1)  6.850276  7.029768  8.515351  7.227547  9.164957 73.208182   100
 all_the_same6(v1) 15.083830 15.217973 15.977563 15.400977 17.114863 18.679829   100

And the second:

Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval
 all_the_same1(v2)  4.304317  4.393111  4.868236  4.487904  4.730886  6.729151   100
 all_the_same2(v2)  2.468428  2.498563  2.556797  2.521823  2.570829  3.447666   100
 all_the_same3(v2)  3.643104  3.715955  3.822649  3.756476  3.866921  5.887129   100
 all_the_same4(v2)  3.634034  3.704983  3.793290  3.765984  3.827717  4.488051   100
 all_the_same5(v2)  6.031952  6.206910  9.855640  6.447258  8.357751 80.946411   100
 all_the_same6(v2) 67.558621 69.806449 72.405274 71.504097 73.557513 88.433322   100

In both cases, function #2 is the fastest one. It also does not depend much on the actual data given.

December 30, 2015

R: assign() inside nested functions

Filed under: Programming — Emil O. W. Kirkegaard @ 21:06

Recently, I wrote a function called copy_names(). It does what you think and a little more: it copies names from one object to another. But it can also attempt to do so even when the sizes of the objects’ dimensions do not match up perfectly. For instance:

> t = matrix(1:9, nrow=3)
> t2 = t
> rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
> t
  a b c
A 1 4 7
B 2 5 8
C 3 6 9
> t2
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> copy_names(t, t2)
> t2
  a b c
A 1 4 7
B 2 5 8
C 3 6 9

Here we create a matrix and make a copy of it. We then assign dimension names to the first object and inspect both of them. Unsurprisingly, only the first has names (because R uses copy-on-modify semantics). Then we call the copy function, and afterwards we see that the second object has gotten the names copied. Hooray!

What if there is imperfect matching? The function first checks whether the number of dimensions is the same, and if so, it checks each dimension to see whether the lengths match in that dimension. For each dimension where they match, the names are copied; where they don't, nothing is done. For instance:

> t = matrix(1:6, nrow=3)
> t2 = matrix(1:9, nrow=3)
> rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:2]
> t
  a b
A 1 4
B 2 5
C 3 6
> t2
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> copy_names(t, t2)
> t2
  [,1] [,2] [,3]
A    1    4    7
B    2    5    8
C    3    6    9

Here we create two matrices, but not of exactly the same sizes: the first is 3×2 and the second is 3×3. Then we assign dimnames to the first, copy to the second and inspect. We see that only the dimension that matched in length (i.e. the first) had its names copied.

How does it work?

Before I changed it, the code looked like this (including roxygen2 documentation):

#' Copy names from one object to another.
#'
#' Attempts to copy names that fit the dimensions of vectors, lists, matrices and data.frames.
#' @param x (an object) An object whose dimnames should be copied.
#' @param y (an object) An object whose dimensions that should be renamed.
#' @keywords names, rownames, colnames, copy
#' @export
#' @examples
#' m = matrix(1:9, nrow=3)
#' n = m
#' rownames(m) = letters[1:3]
#' colnames(m) = LETTERS[1:3]
#' copy_names(m, n)
#' n
copy_names = function(x, y, partialmatching = T) {
  library(stringr)
  #find object dimensions
  x_dims = get_dims(x)
  y_dims = get_dims(y)
  same_n_dimensions = length(x_dims) == length(y_dims)

  #what is the object in y parameter?
  y_obj_name = deparse(substitute(y))

  #perfect matching
  if (!partialmatching) {
    #set names if matching dims
    if (all(x_dims == y_dims)) {
      attr(y, "dimnames") = attr(x, "dimnames")
    } else {
      stop(str_c("Dimensions did not match! ", x_dims, " vs. ", y_dims))
    }
  }

  #if using partial matching and dimensions match in number
  if (same_n_dimensions && partialmatching) {
    #loop over each dimension
    for (dim in 1:length(dimnames(x))) {
      #do lengths match?
      if (x_dims[dim] == y_dims[dim]) {
        dimnames(y)[[dim]] = dimnames(x)[[dim]]
      }
    }
  }

  #assign in the outer envir
  assign(y_obj_name, value = y, pos = 1)
}

The call that does the trick is the last one, namely the one using assign(). Here we modify an object outside the function’s own environment. How do we know which one to modify? Well, we take one step back (pos = 1). Alternatively, one could have used <<-.

Inside nested functions

However, consider this scenario:

> x = 1
> func1 = function() {
+   x = 2
+   print(paste0("x inside func1 before running func2 is ", x))
+   func2()
+   print(paste0("x inside func1 after running func2 is ", x))
+ }
> 
> func2 = function() {
+   print(paste0("x inside func2 is ", x))
+   print(where("x"))
+   assign("x", value = 3, pos = 1)
+   #x <<- 3
+ }
> 
> x
[1] 1
> func1()
[1] "x inside func1 before running func2 is 2"
[1] "x inside func2 is 1"
<environment: R_GlobalEnv>
[1] "x inside func1 after running func2 is 2"
> x
[1] 3
> 
> x = 1
> func2()
[1] "x inside func2 is 1"
<environment: R_GlobalEnv>
> x
[1] 3

Here we define two functions, one of which calls the other. We also define x outside (in the global environment). Inside func1() we also define x with another value. However, note the strange result inside func2(). When asked to fetch x, which doesn't exist in that function's environment, it returns the value from the… global environment (i.e. x=1), not the func1() environment (x=2)! This is odd because func2() was called from func1(), so one would expect it to try getting it from there before trying the global environment. When we then inspect x in the global environment after the functions finish, we see that x has been changed there, not inside func1() as might be expected. This is a problem because if we call copy_names() inside a function, it is supposed to change the names of the object inside that function, not inside the global environment.

Why is this? It is complicated, but as far as I can make out, it is due to the difference between the calling environment (where the function is called from) and the enclosing environment (where it was created; in the case above, the global environment). R by default looks up variables in the enclosing environment, not the calling environment. assign() using pos = 1 apparently does not work with the calling environment but with the enclosing one, and hence it changes the value in the global environment, not in the environment of the function that called it, as intended.

The fix is to use the following line instead:

assign("x", value = 3, envir = parent.frame())

which then assigns the value to the object in the right environment, namely in func1()’s.
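
For concreteness, here is the nested-function scenario from above with the fix applied (a sketch; the _fixed names are mine):

# assign into the calling environment, so calling this from another function
# changes that function's x rather than the global x
func2_fixed = function() {
  assign("x", value = 3, envir = parent.frame())
}

func1_fixed = function() {
  x = 2
  func2_fixed()
  x # now 3: the assignment landed in func1_fixed's environment
}

func1_fixed() # returns 3, and the global x is left untouched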

copy_names() part 2

This also means that copy_names() does not work within functions. For instance:

> get_loadings = function(fa) {
+ library(magrittr)
+ df = loadings(fa) %>% as.vector %>% matrix(nrow=nrow(fa$loadings)) %>% as.data.frame
+ loads = loadings(fa)
+ copy_names(loads, df)
+ return(df)
+ }
> library("psych")
> iris_fa = fa(iris[-5])
> get_loadings(iris_fa)
          V1
1  0.8713121
2 -0.4225686
3  0.9975472
4  0.9646774

Above, we define a new function, get_loadings(), that fetches the loadings from a factor analysis object and transforms them into a clean data.frame in a roundabout way.* We see that the returned object did not keep the dimnames despite copy_names() being called. The fix is the same as above: calling assign() with envir = parent.frame().

* The reason to use the roundabout way is that the loadings extracted have some odd properties that make them unusable in many functions and they also refuse to be converted to a data.frame. But it turns out that one can just change the class to “matrix” and then they are fine! So one doesn’t actually need copy_names() in this case after all.
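
For completeness, a sketch of that simpler route (using the same psych::fa call as above):

library(psych)
iris_fa = fa(iris[-5])
loads = loadings(iris_fa)
class(loads) = "matrix" # drop the "loadings" class (unclass(loads) also works)
as.data.frame(loads) # now converts cleanly, with the names intact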

December 13, 2015

kirkegaard: conditional recoding with conditional_change()

Filed under: Programming — Emil O. W. Kirkegaard @ 15:02

Usually, working with large public datasets requires that one recode variables. This can be quite repetitive. When variables only have a few possible values, one can use something like plyr's mapvalues() to great benefit (see my answer on SO). However, when there is an indefinite number of different values, that is not useful. What one wants is conditional recoding, that is, to apply some test to each value and use the result to determine whether to change the value or not. It can be somewhat messy with the base R approach:

d$R0006500[d$R0006500<0] <- NA
d$R0007900[d$R0007900<0] <- NA
d$R0002200[d$R0002200<0] <- NA
d$R0002500[d$R0002500<0] <- NA
d$R0217900[d$R0217900<0] <- NA
d$R0618301[d$R0618301<0] <- NA
d$R7007300[d$R7007300<0] <- NA

Above we recode 7 variables so that values below 0 become NA (the variables are from the NLSY dataset). There is a clear violation of DRY, so one should come up with a better approach. I came up with this:

v_vars = c("R0006500", "R0007900", "R0002200", "R0002500", "R0217900", "R0618301", "R7007300")
for (var in v_vars){
  d[var] = conditional_change(d[var], func_str = "<0", new_value = NA)
}

This is somewhat shorter (191 vs. 222 characters) and could easily handle more variables. It is built using a functional programming approach to remapping, where one supplies a function that gives a boolean output, which is then used to decide which values to remap. To understand how it works, I will go over the functions it makes use of.

Condition functions

These are simple functions that return a boolean for each element when applied to a vector. For instance, I have made is_negative(), which works as expected:

> is_negative(-1:1)
[1]  TRUE FALSE FALSE

There are also companion functions: is_positive() and is_zero(). However, those are only three options, and ideally we would not want to have to write anonymous functions every time we want something slightly different, like a test of whether a value is below 5. Therefore, we need a function factory: math_to_function(). This function takes a string and returns a condition function like the ones above. For instance:

> less_than_five = math_to_function("<5")
> less_than_five(1:10)
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

First we make a new function that tests for less than five, then we test it on a vector.

After that, we are ready to understand conditional_change(). It takes some kind of object whose values are to be changed, either a function or a function string (passed to math_to_function()), and the new value to use. For instance:

> conditional_change(1:10, func_str = "<5", new_value = NA)
 [1] NA NA NA NA  5  6  7  8  9 10
> conditional_change(1:10, func_str = "> 9", new_value = NA)
 [1]  1  2  3  4  5  6  7  8  9 NA
> conditional_change(data.frame(1:10), func_str = "> 9", new_value = NA)
   X1.10
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    NA
> conditional_change(list(1:10, 1:10), func_str = "> 9", new_value = NA)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 NA
[[2]]
 [1]  1  2  3  4  5  6  7  8  9 NA

And so on.
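
Neither math_to_function() nor conditional_change() is reproduced in the post; here is a minimal sketch of the two building blocks (my reconstruction, not the package code):

# turn a string like "<5" into a condition function
math_to_function_sketch = function(s) {
  function(x) eval(parse(text = paste0("x ", s)), envir = list(x = x))
}

# remap values for which the condition function returns TRUE
conditional_change_sketch = function(x, func = NULL, func_str = NULL, new_value) {
  if (is.null(func)) func = math_to_function_sketch(func_str)
  if (is.list(x)) { # covers data.frames too
    x[] = lapply(x, function(col) { col[func(col)] = new_value; col })
    return(x)
  }
  x[func(x)] = new_value
  x
}

conditional_change_sketch(1:10, func_str = "<5", new_value = NA) # NA NA NA NA 5 6 7 8 9 10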

Future improvements

There are various ways one could improve upon the approach taken here. First, one could vectorize the function such that conditional_change() could take a list of functions/vector of function strings and a vector of return values and apply each in turn. This would save some writing in some cases. Second, instead of condition functions, one could use functions that return the new value. So one could use functions like:

recode_outliers = function(x) {
  x[x < -2] = -2
  x[x > 2] = 2
  return(x)
}
> recode_outliers(-5:5)
 [1] -2 -2 -2 -2 -1  0  1  2  2  2  2
> recode_outliers(matrix(-4:4, nrow=3))
     [,1] [,2] [,3]
[1,]   -2   -1    2
[2,]   -2    0    2
[3,]   -2    1    2

Perhaps this is a better approach.

As always, code is on github.

December 10, 2015

Ethnic heterogeneity and tail effects

Filed under: Differential psychology/psychometrics — Emil O. W. Kirkegaard @ 15:38

Chisala has his 3rd installment up: http://www.unz.com/article/closing-the-black-white-iq-gap-debate-part-3/

One idea I had while reading it was that tail effects interact with population ethnic/racial heterogeneity. To show this, I did a simulation experiment. Population 1 is a regular population with a mean of 0 and an sd of 1. Population 2 is a composite population of three sub-populations: one with a mean of 0 (80%; "normals"), one with a mean of -1 (10%; "dullards") and one with a mean of 1 (10%; "brights"). Population 3 is a normal population but with a slightly increased sd so that it is equal to the sd of population 2.

Descriptive stats:

> describe(df, skew = F, ranges = T)
     vars     n mean  sd median trimmed  mad   min  max range se
pop1    1 1e+06    0 1.0      0       0 1.00 -4.88 4.65  9.53  0
pop2    2 1e+06    0 1.1      0       0 1.09 -5.43 5.37 10.80  0
pop3    3 1e+06    0 1.1      0       0 1.09 -5.30 5.13 10.44  0

We see that the sd is increased a bit in the composite population (2) as expected. We also see that the range is somewhat increased, even compared to population 3 which has the same sd.

What do the tails look like?

> sapply(df, percent_cutoff, cutoff = 1:4)
      pop1     pop2     pop3
1 0.158830 0.179495 0.180856
2 0.022903 0.034342 0.034074
3 0.001314 0.003326 0.003126
4 0.000036 0.000160 0.000150

We are looking at the proportions of persons with scores above 1-4 (rows) in each population (columns). What do we see? Populations 2 and 3 have clear advantages over population 1, but population 2 also has a slight advantage over population 3 at the higher cutoffs.

Simulation 2

In the above, the composite population is made out of 3 populations. But what if it were instead made out of 5?

Descriptives:

> describe(df, skew = F)
     vars     n mean   sd median trimmed  mad   min  max range se
pop1    1 1e+06    0 1.00      0       0 1.00 -4.88 4.65  9.53  0
pop2    2 1e+06    0 1.27      0       0 1.21 -5.91 6.03 11.94  0
pop3    3 1e+06    0 1.27      0       0 1.26 -6.12 5.92 12.04  0

The sd is clearly increased. There is not much difference in the range, but the range is very susceptible to sampling error. What do the tails look like?

> sapply(df, percent_cutoff, cutoff = 1:4)
      pop1     pop2     pop3
1 0.158830 0.205814 0.214353
2 0.022903 0.057077 0.056874
3 0.001314 0.011057 0.008872
4 0.000036 0.001246 0.000804

We see strong effects. At the +3 level, there are roughly 8 times as many persons in the composite population as in the normal population. Population 3 also has more, but clearly fewer than the composite population.

We can conclude that one must take heterogeneity of populations into account when thinking about the tails.

R code

You can re-do the experiment yourself with this code, or try out some other numbers.

library(pacman)
p_load(reshape, kirkegaard, psych)

n = 1e6

# first simulation --------------------------------------------------------
set.seed(1)
pop1 = rnorm(n)
pop2 = c(rnorm(n * .8), rnorm(n * .1, 1), rnorm(n * .1, -1))
pop3 = rnorm(n, sd = sd(pop2))

# df
df = data.frame(pop1, pop2, pop3)

# stats
describe(df, skew = F)
sapply(df, percent_cutoff, cutoff = 1:4)

# second simulation -------------------------------------------------------
set.seed(1)
pop1 = rnorm(n)
pop2 = c(rnorm(n * .70), rnorm(n * .10, 1), rnorm(n * .10, -1), rnorm(n * .05, 2), rnorm(n * .05, -2))
pop3 = rnorm(n, sd = sd(pop2))

# df
df = data.frame(pop1, pop2, pop3)

# stats
describe(df, skew = F)
sapply(df, percent_cutoff, cutoff = 1:4)