Clear Language, Clear Mind

January 19, 2015

Sometimes doing elementary things in R is a pain

Filed under: Uncategorized — Tags: , — Emil O. W. Kirkegaard @ 05:51

Getting a percentage table from a dataframe

A reviewer asked me to:

1) As I said earlier, there should be some data on the countries of origin of the immigrant population. Most readers have no idea who actually moves to Denmark. At the very least, there should be basic information like “x% of the immigrant population is of non-European origin and y% of European origin as of 2014.” Generally, non-European immigration would be expected to increase inequality more, given that IQ levels are relatively uniform across Europe.

I have population counts for each year from 1980 through 2014 in a dataframe, and I'd like to express them as percentages of each year's total so as to get the relative sizes of the country groups. There is a premade function for this, prop.table; however, it works quite strangely. If one gives it a dataframe and no margin, it will normalize by the total sum of the dataframe instead of by column. This is sometimes useful, but not in this case. However, if one gives it a dataframe and margin=2, it complains that:

Error in margin.table(x, margin) : 'x' is not an array

which is odd, since it accepted the very same object just before. The relative lack of documentation made it not quite easy to figure out how to make it work. It turns out that one just has to convert the dataframe to a matrix when passing it:

census.percent = prop.table(as.matrix(census), margin=2)

and then one can convert it back and also multiply by 100 to get percentages instead of fractions:

census.percent = as.data.frame(prop.table(as.matrix(census), margin=2)*100)
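For comparison, here is the same column-wise percentage computation sketched in plain Python, with made-up counts standing in for the real census data:

```python
# Hypothetical population counts (rows = countries, columns = years),
# standing in for the real census dataframe.
census = {
    "X2010": {"Poland": 300, "Turkey": 500, "Germany": 1200},
    "X2014": {"Poland": 400, "Turkey": 600, "Germany": 1000},
}

# Divide every count by its column (year) total and scale by 100 --
# the same thing prop.table(as.matrix(census), margin=2) * 100 computes in R.
census_percent = {
    year: {country: 100.0 * n / sum(counts.values())
           for country, n in counts.items()}
    for year, counts in census.items()
}

print(census_percent["X2014"]["Germany"])  # 50.0
```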

Getting the top 10 countries with names for selected years

This one was harder. Here’s the code I ended up with:

selected.years = c("X1980","X1990","X2000","X2010","X2014") #years of interest
for (year in selected.years){ #loop over each year of interest
  vector = census.percent[,year,drop=FALSE] #get the column, DON'T DROP!
  View(round(vector[order(vector, decreasing = TRUE),,drop=FALSE][1:10,,drop=FALSE],1)) #sort the column, take rows 1:10, still without dropping
}

First we choose the years we want (note the X in front, because R has trouble handling column names that begin with a number). Then we loop over each year of interest and pick out its column, to avoid having to select the same column over and over. However, when picking out one column from a dataframe, R will normally drop it to a plain numeric vector, which is very bad because this removes the rownames. That means that even though we can find the top 10 countries, we don't know which ones they are. The solution is to set drop=FALSE. The next part consists of first ordering the column (without dropping!) and then selecting the top 10 countries, again without dropping. I open them in View() (in RStudio) because this makes it easier to copy the values for further use (e.g. in a table for a paper).

So, drop=FALSE is another one of those pesky small things to remember. It is just like stringsAsFactors=FALSE when using read.table (or read.csv).
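The "sort but keep the labels" part can be sketched in plain Python too (made-up percentages): sorting (name, value) pairs keeps the country names attached, which is what drop=FALSE buys us in R:

```python
# Hypothetical percentages for one year, keyed by country name.
year_data = {"Turkey": 12.1, "Poland": 9.5, "Germany": 8.2,
             "Norway": 3.3, "Iraq": 14.8}

# Sorting pairs by value keeps each label glued to its number,
# so the top entries remain identifiable.
top = sorted(year_data.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # [('Iraq', 14.8), ('Turkey', 12.1), ('Poland', 9.5)]
```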


October 21, 2014

Is the sum of cubes equal to the squared sum for the counting integers?

Filed under: Math/Statistics — Tags: , — Emil O. W. Kirkegaard @ 12:19

R can tell us:

DF.numbers = data.frame(cubesum=numeric(),sumsquare=numeric()) #initial dataframe
for (n in 1:100){ #loop and fill in
  DF.numbers[n,"cubesum"] = sum((1:n)^3)
  DF.numbers[n,"sumsquare"] = sum(1:n)^2
}

library(car) #for the scatterplot() function
scatterplot(cubesum ~ sumsquare, DF.numbers,
            smoother=FALSE, #no moving average
            labels = rownames(DF.numbers), id.n = nrow(DF.numbers), #labels
            log = "xy", #logscales
            main = "Cubesum is identical to sumsquare for n in 1:100")

#checks that they are identical, except for the name
all.equal(DF.numbers["cubesum"],DF.numbers["sumsquare"], check.names=FALSE)



One can increase the number in the loop to test more numbers. I did test it with 1:10000, and it was still true.
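For what it's worth, the same check can be done in Python with exact integer arithmetic, no plotting required:

```python
# Check that 1^3 + 2^3 + ... + n^3 equals (1 + 2 + ... + n)^2 for n up to 1000.
for n in range(1, 1001):
    cubesum = sum(k**3 for k in range(1, n + 1))
    sumsquare = sum(range(1, n + 1)) ** 2
    assert cubesum == sumsquare

# Both match the closed form (n(n+1)/2)^2; e.g. for n = 3: 1 + 8 + 27 = 36 = 6^2.
print(cubesum == (1000 * 1001 // 2) ** 2)  # True
```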

August 31, 2014

Comments on Learning Statistics with R

Filed under: Differential psychology/psychometrics,Psychology — Tags: , , , — Emil O. W. Kirkegaard @ 23:15

So I found a textbook for learning both elementary statistics (much of which I knew but hadn't read a textbook about) and R. The book is legally free.

Numbers refer to page numbers in the book. The book is in an early version (“0.4”), so many of these are small errors I stumbled upon while going through virtually all the commands in the book in my own R window.



The modeOf() and maxFreq() functions do not work. This is because afl.finalists is a factor and they demand a vector. One can use as.vector() to make them work.



Worth noting that summary() is the same as quantile() except that it also includes the mean.



Actually, the output of describe() does not tell us the number of NAs. It is only because the author knows that there are 100 total cases that he can do 100-n and get the number of NAs for each variable.



The cakes.Rdata is already transposed.



as.logical() also converts numeric 0 and 1 to FALSE and TRUE. However, oddly, it does not understand the strings “0” and “1”.
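Python has an analogous gotcha, incidentally: bool() treats any nonempty string as True, so the string "0" has to go through int() first:

```python
# Numeric 0 and 1 convert as expected.
print(bool(0), bool(1))   # False True

# But "0" is a nonempty string, so bool() calls it True.
print(bool("0"))          # True

# Going through int() first gives the intended result.
print(bool(int("0")), bool(int("1")))  # False True
```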



Actually, P(0) is not equivalent to impossible. See:



Actually, 100 simulations with N=20 will generally not result in a histogram like the one above. Perhaps it is better to change the command to K=1000. And why not wrap it in hist() so it can be visually compared to the theoretical one?


hist(rbinom( n = 1000, size = 20, prob = 1/6 ))


It would be nice if the code for making these simulations was shown.



“This is just bizarre: σ ˆ 2 is and unbiased estimate of the population variance”





Typo in Figure 11.6 text. “Notice that when θ actually is equal to .05 (plotted as a black dot)”




“That is, what values of X2 would lead is to reject the null hypothesis.”



It is most annoying that the author doesn't include the code for reproducing his plots. I spent 15 minutes trying to find a function to create histograms by group.





“It works for t-tests, but it wouldn’t be meaningful for chi-square testsm F -tests or indeed for most of the tests I talk about in this book.”



“we see that it is 95% certain that the true (population-wide) average improvement would lie between 0.95% and 1.86%.”


This wording is dangerous because there are two interpretations of the percent sign. In the relative sense, the numbers are wrong. The author means absolute percentages.



The code has +'s in it, which means it cannot just be copied and run. This usually isn't the case, but it happens a few times in the book.



In the description of the test, we are told to tick when the values are larger than the reference value. However, in the one-sample version, the author ticks when the value is equal to it. I guess this means that we tick when it is equal to or larger than.



This command doesn't work because the dataframe isn't attached, as the author assumes it is.

> mood.gain <- list( placebo, joyzepam, anxifree)



First the author says he wants to use the non-adjusted R^2, but then in the text he uses the adjusted value.



Typo with “Unless” capitalized.



“(3.45 for drug and 0.92 for therapy),”

He must mean .47 for therapy. .92 is the number for residuals.



In the alternative hypothesis, the author uses “u_ij” instead of the “u_rc” which is used in the null hypothesis. I'm guessing the null hypothesis is right.



As earlier, it is ambiguous when the author talks about increases in percent; it could be relative or absolute. Again, in this case it is absolute. The author should use “percentage points” or something similar to avoid confusion.





“I find it amusing to note that the default in R is Type I and the default in SPSS is Type III (with Helmert contrasts). Neither of these appeals to me all that much. Relatedly, I find it depressing that almost nobody in the psychological literature ever bothers to report which Type of tests they ran, much less the order of variables (for Type I) or the contrasts used (for Type III). Often they don’t report what software they used either. The only way I can ever make any sense of what people typically report is to try to guess from auxiliary cues which software they were using, and to assume that they never changed the default settings. Please don’t do this… now that you know about these issues, make sure you indicate what software you used, and if you’re reporting ANOVA results for unbalanced data, then specify what Type of tests you ran, specify order information if you’ve done Type I tests and specify contrasts if you’ve done Type III tests. Or, even better, do hypotheses tests that correspond to things you really care about, and then report those!”


An example of the necessity of open methods along with open data. Science must be reproducible. The best approach is to simply share the exact source code of the analyses in a paper.

June 13, 2013

Ripping books from UMDL Text: Leta S. Hollingworth’s Gifted children, their nature and nurture

Filed under: Differential psychology/psychometrics,Education — Tags: — Emil O. W. Kirkegaard @ 22:03

Due to this book repeatedly coming up in conversations regarding super smart people, it seems to be worth reading. It is really old and should obviously be out of copyright (thanks, Disney!), however it possibly isn't, and in any case I couldn't find a usable PDF.

I did, however, find the above. Now, it seems to lack a download-all function, and it's too much of a hassle to download all 398 pictures manually. They also lack OCR. So I set out to write a Python script to download them.

First I had to find a function to actually download files. So far I had only read the page source of pages and worked with that. This time I needed to save a file (a picture) repeatedly.

Googling gave me: urllib.urlretrieve

So far so good right? Must be easy to just do a for loop now and get it overwith.

Not entirely. It turns out that the pictures are only stored temporarily in the website cache when one visits the page associated with the picture. So I had to make the script load the page before getting the picture. This slows it down a bit, but it's not too much trouble.

Next problem: sometimes a picture wasn't downloaded correctly for some reason. The file size, however, was a useful proxy for this purpose, so I had to find a way to get it. Google gave me os.stat (import os). Problem solved.

After doing that as well, some pictures were still not being downloaded correctly. Weird. After debugging, it turned out that some of the pictures were not .gif but .jpg files, located in a slightly different place. So I had to modify the code to handle that as well.

Finally it worked for all the pictures.
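In outline, the script ended up doing something like the following (a modern Python 3 sketch; the URL pattern, folder name, and size threshold here are made up for illustration, not the real UMDL ones):

```python
import os
import urllib.request

BASE = "https://example.org/book/page"  # hypothetical URL pattern, not the real one

def candidate_urls(page_number):
    """Most pages are .gif, but some are .jpg at a slightly different location."""
    return [BASE + "/%03d.gif" % page_number,
            BASE + "/jpg/%03d.jpg" % page_number]

def looks_complete(path, min_bytes=1024):
    """Use the file size (via os.stat) as a proxy for a successful download."""
    return os.path.exists(path) and os.stat(path).st_size >= min_bytes

def fetch_page(page_number, out_dir="pages"):
    """Try each candidate URL until one yields a plausibly complete file."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for url in candidate_urls(page_number):
        out = os.path.join(out_dir, os.path.basename(url))
        # On the real site, the page had to be visited first so the image
        # landed in the website cache; that step is omitted in this sketch.
        urllib.request.urlretrieve(url, out)
        if looks_complete(out):
            return out
    return None
```

Looping fetch_page over range(1, 399) would then grab all 398 pages.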

I OCR’d it with ABBYY FineReader (best OCR on the market AFAIK).



The python code is here:

January 18, 2013

Ripping threads from able2know forums

Filed under: Uncategorized — Tags: — Emil O. W. Kirkegaard @ 05:19

So, I thought it was a good idea to take a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding and made a crawler. After spending a couple of hours, I now have a crawler that reads lines from a txt file and downloads pages into folders based on that.

code is here: Forum

or here:

import urllib2
import os

def isodd(num):
    return bool(num & 1)

#open data about what to rip
file0 = open("Forum threads/threads.txt",'r')

#assign data to var
data0 = file0.readlines()
file0.close()

#line counter: even lines hold folder names, odd lines hold thread urls
number = 0

for line in data0:

    #if the line number is even, set the folder name and continue to the next line
    if not isodd(number):
        outputfolder = line[:-1] #strip the linebreak
        number = number+1
        continue

    #get thread url and remove the last two chars (linebreak + one trailing character)
    threadurl = line[:-2]
    number = number+1
    print "starting to crawl thread "+threadurl

    #create folder
    if not os.path.isdir("Forum threads/"+outputfolder):
        os.makedirs("Forum threads/"+outputfolder)

    #var introduction
    lastdata2 = "ijdsfijkds"
    lastdata = "kjdsfsa"

    #looping over all the pages
    for page in range(999):
        #range starts at 0, so +1 is needed
        response = urllib2.urlopen(threadurl+str(page+1))

        #assign the data to a var
        page_source = response.read()

        #used for detection of identical output:
        #keep the previous page in var2, load the new page into var1
        lastdata2 = lastdata
        lastdata = page_source

        #check if the last two pages are identical, i.e. we have run out of pages
        if page>0 and lastdata == lastdata2:
            print "data identical, stopping loop"
            break

        #alternative check: identical length
        if page>0 and len(lastdata) == len(lastdata2):
            print "length identical, stopping loop"
            break

        #create a file for each page, and save the data in it
        output = open("Forum threads/"+outputfolder+"/"+str(page+1)+".html",'w')
        output.write(page_source)
        output.close()

        print "wrote page "+str(page+1)+" in "+outputfolder+"/"
        print "length of file is "+str(len(page_source))

October 4, 2012

Review of python book and some other thoughts

Filed under: Uncategorized — Tags: — Emil O. W. Kirkegaard @ 22:12

The book was mentioned by TechDirt in their reporting on an absurd copyright case (so, pretty normal).



The author makes the strange choice of using indentation to mark borders between paragraphs, even though indents are very important in Python. He could just have used newlines for that.



The code is not easily copyable. If one tries, one gets spaces between every character or so, like this: >>>cho i c e = ’ham’. This seems to be due to the font used.



Sometimes the examples are not explained clearly enough. For instance, elif is explained as an “optional condition”, which is not all that clear. Fortunately, this is not much of a problem if one has an IDE ready to test things in. For the record, elif works as an alternative condition that is tried if the first one isn't true. Example:

a = 1
b = 2
c = 3

if a == 1:
    print "a holds"
elif b == 2:
    print "b holds and a doesnt"
elif c == c:
    print "neither a or b holds, but c does"

>>a holds


a = 0
b = 2
c = 3

if a == 1:
    print "a holds"
elif b == 2:
    print "b holds and a doesnt"
elif c == c:
    print "neither a or b holds, but c does"

>>b holds and a doesnt


a = 0
b = 4
c = 3

if a == 1:
    print "a holds"
elif b == 2:
    print "b holds and a doesnt"
elif c == c:
    print "neither a or b holds, but c does"

>>neither a or b holds, but c does


Note how the order of the elifs matters. An elif only activates when all the previous if and elif conditions have failed.



Python apparently does not understand how to add numbers and strings. So things like:

a = "string"
b = 1

print a+b

gives an error. It seems to me that one should just have Python autoconvert numbers to strings (using the str function), just as Python converts integers to floats when adding two such objects together. Perhaps this has changed in Python 3; I'm running 2.7.
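The workaround is explicit conversion with str() (shown here in Python 3 print syntax):

```python
a = "string"
b = 1

# a + b raises TypeError: Python refuses to guess whether we want
# concatenation or arithmetic.
try:
    a + b
except TypeError as e:
    print("error:", e)

# Explicit conversion with str() makes the intent unambiguous.
print(a + str(b))  # string1
```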



