It’s a really annoying ‘feature’.

I’m searching for a way to disable the very annoying number formatting in LibreOffice Calc. Whenever I enter a number or a string containing numbers, LibreOffice tries to turn it into a date.

Pseudo-solutions:

I’ve found some so-called solutions, but none of them works.

  • Format cells as “text” – works, but only until one pastes into or deletes from the cell. For example, if I paste some content, the formatting is lost again.
  • Start the entry with a single quote ('). Not an option in a daily working routine just to enter a numeric string.
  • Deselect Tools > AutoCorrect > Options > Apply Numbering – there is no such option in LibreOffice (at least not in version 3.5).

Use cases:

  • 2-3 means “two to three whatever”
  • 5.2 is the code following 5.1
  • 6. refers to the sixth item

All those values are translated to some random date in LibreOffice by default. I guess they were on drugs when they implemented such a bug into the program.

Is there a global setting to turn off that featurebug?

There are two main workarounds: 1) manually adding ' in front of numbers, which forces them to be treated as character strings; 2) setting the cell format to “text” before entering the data.

Neither of these is a good solution for everyday work. For instance, if you paste in data from somewhere else, it will not generally have ' in front, and the paste will also override the format you chose. A last trick here is to use “paste special” and then choose the types, which can be a good workaround too.

It’s not a new complaint:

Developers don’t seem to understand the users’ frustration. Instead they write stuff like this:

At first I thought you meant you must handle cell formatting for each cell individually, which of course is false. (You select all the cells and choose “text” as a data type.)
Now if I understand correctly, you want a way to disable automatic date recognition globally, for all spreadsheets? Or at least, you want to know why that is not in the preferences section?
I can at least make a guess at the last question. It is probably for the same reason why there’s not an option to globally turn off automatic recognition of formulas. Because that’s what Calc is for….
In the vast, vast majority of cases, a user will expect that if he types in a date, it will be “understood” by his software as a date. If it doesn’t get “recognized”, he’s going to think “Hmmm, this software isn’t very good.” He’s not going to think, “Hmmm, there must be a global setting somewhere that’s been switched off so that dates don’t get recognized,” and then go hunting for that setting. (If that did happen and he actually found the setting, his inevitable question would be, “Why on earth is that even an option? Who would want to turn off the date recognition for all spreadsheets?” And we’d have to tell him, “Well, it was this guy Swingletree….” ;)

This is a typical example of developers being out of touch with normal users. For normal users (>95% of users), this auto-conversion is more of a bug than a feature, which is why people want to turn it off completely and then just manually tell Calc when to interpret something as a date.

DMCA email:

VIA EMAIL:

Demand for Immediate Take-Down: Notice of Infringing Activity

Date:      April 24, 2015

URL:      filepost.com/

Dear Sir or Madam,

We have been made aware that the domain listed above, which appears to be on servers under your control, is offering unlicensed copies of, or is engaged in other unauthorized activities relating to copyrighted works published by, Wiley-VCH Verlag GmbH & Co. KGaA.

  1. Identification of copyrighted work(s):

Copyrighted work(s):

Edwards: Human genetic diversity: Lewontin’s fallacy, BioEssays 25:798–801, © 2003 Wiley Periodicals, Inc.

Copyright owner or exclusive licensee:

Wiley-VCH Verlag GmbH & Co. KGaA.

  2. Copyright infringing material or activity found at the following location(s):

emilkirkegaard.dk/en/wp-content/uploads/A.W.F.-Edwards-Human-genetic-diversity-Lewontin%E2%80%99s-fallacy.pdf

The above copyrighted work(s) is being made available for copying, through downloading, at the above location without authorization of the copyright owner(s) or exclusive licensee.

  3. Statement of authority:

The information in this notice is accurate, and I hereby certify under penalty of perjury that I am authorized to act on behalf of, Wiley-VCH Verlag GmbH & Co. KGaA., the owner or exclusive licensee of the copyright(s) in the work(s) identified above. I have a good faith belief that none of the materials or activities listed above have been authorized by, Wiley-VCH Verlag GmbH & Co. KGaA., its agents, or the law.

We hereby give notice of these activities to you and request that you take expeditious action to remove or disable access to the material described above, and thereby prevent the illegal reproduction and distribution of this copyrighted work(s) via your company’s network.

We appreciate your cooperation in this matter. Please advise us regarding what actions you take.

Yours sincerely,

Bettina Loycke
Senior Rights Manager
Rights & Licenses

Wiley-VCH Verlag GmbH & Co. KGaA
Boschstraße 12
69469 Weinheim
Germany

www.wiley-vch.de

T          +(49) 6201 606-280
F          +(49) 6201 606-332
rightsDE@wiley.com

My name, typed above, constitutes an electronic signature under Federal law, and is intended to be binding.

Deutsch:
Wiley-VCH Verlag GmbH & Co. KGaA – A company of John Wiley & Sons, Inc. – Sitz der Gesellschaft: Weinheim – Amtsgericht Mannheim, HRB 432833 – Vorsitzender des Aufsichtsrates:
Stephen Michael Smith. Persönlich haftender Gesellschafter: John Wiley & Sons GmbH – Sitz der Gesellschaft: Weinheim – Amtsgericht Mannheim, HRB 432296 – Geschäftsführer: Sabine Steinbach, Dr. Jon Walmsley.

English:
Wiley-VCH Verlag GmbH & Co. KGaA – A company of John Wiley & Sons, Inc. – Location of the Company: Weinheim – Trade Register: Mannheim, HRB 432833.
Chairman of the Supervisory Board: Stephen Michael Smith. General Partner: John Wiley & Sons GmbH, Location: Weinheim – Trade Register Mannheim, HRB 432296 –
Managing Directors: Sabine Steinbach, Dr. Jon Walmsley.

Since I forgot about this and couldn’t find the email later, I got another one, identical to the first.


So you want to run some code that may throw an error? This is somewhat less common in R than in e.g. PHP.

It is quite simple in Python:

The try statement works as follows.

  • First, the try clause (the statement(s) between the try and except keywords) is executed.
  • If no exception occurs, the except clause is skipped and execution of the try statement is finished.
  • If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try statement.
  • If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message as shown above.

Let’s try something simple in IPython:

In [2]: try:
   ...:     "string"/7
   ...: except:
   ...:     print("Can't divide a string")
   ...:
Can't divide a string

Simple stuff.

Now, let’s do the same in R:

tryCatch("string"/7,
         error = function(e) {
           print("Can't divide a string")
         }
)

Notice how I had to add an anonymous function? Well, apparently this is how it has to be done in R. The parameter to the function, e, is not even used. It would be better if one could simply do this:

tryCatch("string"/7,
         error = {
           print("Can't divide a string")
         }
)

But no, then you get:

[1] "Can't divide a string"
Error in tryCatchOne(expr, names, parentenv, handlers[[1L]]) : 
  attempt to apply non-function
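
That said, the handler’s parameter is not useless: the condition object passed to it carries the original error message, which can be pulled out with conditionMessage(). A minimal sketch in base R (returning NA as the fallback value is just an example choice):

result = tryCatch("string"/7,
         error = function(e) {
           print(paste("Caught:", conditionMessage(e))) #show the original error message
           NA #value returned by tryCatch when the handler runs
         }
)
result

This prints “Caught: non-numeric argument to binary operator” and then returns NA.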

After playing War for the Overworld (unofficial DK3), I felt that I needed to play the real thing. So I downloaded a version of DK2 from here. The game opens and the menus work fine, but once you get into a game, the FPS drops to unplayable levels. I tried various compatibility settings, but that didn’t work. However, the advice given here works:

HOW TO FIX THE INGAME LAGGGGS !!! —— Windows 8.1

OK, even if this is not the right post here, I’ll say it right away.

I bought the game from Origin, installed it, and the problems began right at the start. When I started the game, the Bullfrog logo froze.

Just tab out to Windows once and tab back into the game and it starts running.

When I was in the menu of the game, everything worked just fine. But after I started a game in the campaign, I had like 2 fps and horrible mouse lag.

So here is what I did, and now it runs perfectly without any problems.

Go into the game menu under OPTIONS – GRAPHICS OPTIONS and change the resolution to 640 x 480. (Maybe you have to deactivate hardware acceleration too.)

You won’t believe it, but it was that easy and the game runs fine now.

I hope this helps a few players who bought this game.

I’m just putting it here for future reference in case I forget or someone else has the same problem.

In reply to www.ljzigerell.com/?p=534 and his working paper here: www.ljzigerell.com/?p=2376

We are discussing his working paper over email, and I had some reservations about his factor analysis. I decided to run the analyses I wanted myself, but it turned into a longer project that belongs in a short paper rather than in a private email.

I fetched the data from his source. The raw data did not have variable names, so it was unwieldy to work with. I opened the SPSS file instead, which did have variable names. Then I exported a CSV with the desired variables (see supp. material). Then I had to recode the variables so that true answers are coded as 1, false answers as 0, and missing as NA. This took some time. I followed his coding procedure for most cases (see his Stata file and my R code below).

How many factors to extract

It seems that he relies on some method for determining the number of factors to extract, presumably eigenvalue > 1. I always use three different methods via the nFactors package. Using all 22 variables (note that he did not use all of them at once), all methods agreed on extracting at most 5 factors. Here are the factor solutions for extracting 1 through 5 factors and their intercorrelations:

Factor analyses with 1-5 factors and their correlations

[1] "Factor analysis, extracting 1 factors using oblimin and MinRes"

Loadings:
         MR1   
smokheal  0.129
condrift  0.347
rmanmade  0.445
earthhot  0.348
oxyplant  0.189
lasers    0.514
atomsize  0.441
antibiot  0.401
dinosaur  0.323
light     0.384
earthsun  0.515
suntime   0.581
dadgene   0.227
getdrug   0.290
whytest   0.423
probno4   0.396
problast  0.423
probreq   0.349
probif3   0.416
evolved   0.306
bigbang   0.315
onfaith  -0.296

                 MR1
SS loadings    3.191
Proportion Var 0.145
[1] "Factor analysis, extracting 2 factors using oblimin and MinRes"

Loadings:
         MR1    MR2   
smokheal  0.121       
condrift  0.345       
rmanmade  0.368  0.136
earthhot  0.363       
oxyplant  0.172       
lasers    0.518       
atomsize  0.461       
antibiot  0.323  0.133
dinosaur  0.323       
light     0.375       
earthsun  0.587       
suntime   0.658       
dadgene   0.145  0.130
getdrug   0.211  0.130
whytest   0.386       
probno4          0.705
problast         0.789
probreq   0.162  0.305
probif3   0.108  0.514
evolved   0.348       
bigbang   0.367       
onfaith  -0.266       

                 MR1   MR2
SS loadings    2.617 1.569
Proportion Var 0.119 0.071
Cumulative Var 0.119 0.190
     MR1  MR2
MR1 1.00 0.35
MR2 0.35 1.00
[1] "Factor analysis, extracting 3 factors using oblimin and MinRes"

Loadings:
         MR2    MR1    MR3   
smokheal                     
condrift                0.346
rmanmade  0.173  0.170  0.232
earthhot         0.187  0.220
oxyplant                0.100
lasers           0.256  0.320
atomsize         0.208  0.312
antibiot  0.168  0.150  0.198
dinosaur         0.119  0.250
light            0.240  0.169
earthsun         0.737       
suntime          0.754       
dadgene   0.147              
getdrug   0.152         0.149
whytest   0.108  0.143  0.294
probno4   0.708              
problast  0.781              
probreq   0.324              
probif3   0.532              
evolved                 0.562
bigbang                 0.525
onfaith                -0.307

                 MR2   MR1   MR3
SS loadings    1.646 1.444 1.389
Proportion Var 0.075 0.066 0.063
Cumulative Var 0.075 0.140 0.204
     MR2  MR1  MR3
MR2 1.00 0.29 0.25
MR1 0.29 1.00 0.43
MR3 0.25 0.43 1.00
[1] "Factor analysis, extracting 4 factors using oblimin and MinRes"

Loadings:
         MR4    MR2    MR1    MR3   
smokheal                            
condrift  0.180                0.234
rmanmade  0.387                     
earthhot  0.262         0.102       
oxyplant  0.116                     
lasers    0.490                     
atomsize  0.435                     
antibiot  0.485                     
dinosaur  0.312                     
light     0.274         0.142       
earthsun                0.797       
suntime                 0.719       
dadgene   0.234                     
getdrug   0.273                     
whytest   0.438                     
probno4          0.695              
problast         0.817              
probreq   0.180  0.275              
probif3   0.139  0.487              
evolved                        0.685
bigbang                        0.554
onfaith  -0.141               -0.230

                 MR4   MR2   MR1   MR3
SS loadings    1.511 1.501 1.204 0.915
Proportion Var 0.069 0.068 0.055 0.042
Cumulative Var 0.069 0.137 0.192 0.233
     MR4  MR2  MR1  MR3
MR4 1.00 0.39 0.57 0.42
MR2 0.39 1.00 0.23 0.12
MR1 0.57 0.23 1.00 0.27
MR3 0.42 0.12 0.27 1.00
[1] "Factor analysis, extracting 5 factors using oblimin and MinRes"

Loadings:
         MR2    MR1    MR3    MR5    MR4   
smokheal                                   
condrift                0.209         0.299
rmanmade  0.104                0.120  0.379
earthhot                              0.367
oxyplant                              0.220
lasers                         0.195  0.361
atomsize                       0.273  0.207
antibiot                       0.401  0.108
dinosaur                       0.204  0.131
light                                 0.423
earthsun         0.504                0.186
suntime          1.007                     
dadgene                        0.277       
getdrug                        0.373       
whytest                        0.504       
probno4   0.701                            
problast  0.816                            
probreq   0.272                0.174       
probif3   0.487                0.107       
evolved                 0.753              
bigbang                 0.483         0.165
onfaith                -0.225 -0.152       

                 MR2   MR1   MR3   MR5   MR4
SS loadings    1.501 1.291 0.919 0.874 0.871
Proportion Var 0.068 0.059 0.042 0.040 0.040
Cumulative Var 0.068 0.127 0.169 0.208 0.248
     MR2  MR1  MR3  MR5  MR4
MR2 1.00 0.20 0.11 0.38 0.28
MR1 0.20 1.00 0.21 0.41 0.44
MR3 0.11 0.21 1.00 0.32 0.30
MR5 0.38 0.41 0.32 1.00 0.50
MR4 0.28 0.44 0.30 0.50 1.00

Interpretation

We see that in the 1-factor solution, all variables load in the expected direction, and we can speak of a general scientific knowledge factor. This is the one we want to use for other analyses. We see that onfaith loads negatively. This variable is not a true/false question and thus should be excluded from any actual measurement of the general scientific knowledge factor.

Increasing the number of factors to extract simply divides this general factor into correlated parts. E.g. in the 2-factor solution, we see a probability factor that correlates .35 with the remaining semi-general factor. In the 3-factor solution, MR2 is the probability factor, MR3 is the factor for knowledge related to religious beliefs, and MR1 covers the remaining items. The intercorrelations are .29, .25 and .43. This pattern continues up to the 5-factor solution, which still produces 5 correlated factors: MR2 is the probability factor, MR1 is an astronomy factor, MR3 is the one having to do with religious beliefs, MR5 looks like a medicine/genetics factor, and MR4 is the rest.

Just because scree tests etc. tell you to extract more than 1 factor does not mean that there is no general factor. This is the old fallacy from the study of cognitive ability; see the discussion in Jensen (1998, chapter 3). It is sometimes still made, e.g. by Hampshire et al. (2012). Generally, as one increases the number of variables, the suggested number of factors to extract goes up. This does not mean that there is no general factor, just that with more variables one can see a more fine-grained structure in the data than one can with only, say, 5 variables.

Should we use them or not?

Before discussing whether one should use them on theoretical grounds, one can measure whether it makes much of a difference. One can do this by extracting the general factor with and without the items in question. I did this, also excluding the onfaith item. Then I correlated the scores from these two analyses: r = .992. In other words, it hardly matters whether one includes these religion-tinged items or not. The general factor is measured quite well without them, and they do not substantially change the factor scores. However, since adding more indicator items/variables generally reduces the measurement error of a latent trait/factor, I would include them in my analyses.

How many factors should we extract and use?

There is also the question of how many factors one should extract. The answer is that it depends on what one wants to do. As Zigerell points out in a review comment of this paper on Winnower:

For example, for diagnostic purposes, if we know only that students A, B, and C miss 3 items on a test of general science knowledge, then the only remediation is more science; but we can provide more tailored remediation if we have separate components so that we observe that, say, A did poorly only on the religion-tinged items, B did poorly only on the probability items, and C did poorly only on the astronomy items.

For remedial education, it is clearly preferable to extract the highest number of interpretable factors, because this gives the most precise information about where knowledge is lacking for a given person. In regression analysis, where we want to control for scientific knowledge, one should use the general factor.

References

Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225-1237.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

Supplementary material

Datafile: science_data

R code

library(plyr) #for mapvalues

data = read.csv("science_data.csv") #load data

#Coding so that 1 = true, 0 = false
data$smokheal = mapvalues(data$smokheal, c(9,7,8,2),c(NA,0,0,0))
data$condrift = mapvalues(data$condrift, c(9,7,8,2),c(NA,0,0,0))
data$earthhot = mapvalues(data$earthhot, c(9,7,8,2),c(NA,0,0,0))
data$rmanmade = mapvalues(data$rmanmade, c(9,7,8,1,2),c(NA,0,0,0,1)) #reverse
data$oxyplant = mapvalues(data$oxyplant, c(9,7,8,2),c(NA,0,0,0))
data$lasers = mapvalues(data$lasers, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$atomsize = mapvalues(data$atomsize, c(9,7,8,2),c(NA,0,0,0))
data$antibiot = mapvalues(data$antibiot, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$dinosaur = mapvalues(data$dinosaur, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$light = mapvalues(data$light, c(9,7,8,2,3),c(NA,0,0,0,0))
data$earthsun = mapvalues(data$earthsun, c(9,7,8,2),c(NA,0,0,0))
data$suntime = mapvalues(data$suntime, c(9,7,8,2,3,1,4,99),c(0,0,0,0,1,0,0,NA))
data$dadgene = mapvalues(data$dadgene, c(9,7,8,2),c(NA,0,0,0))
data$getdrug = mapvalues(data$getdrug, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$whytest = mapvalues(data$whytest, c(1,2,3,4,5,6,7,8,9,99),c(1,0,0,0,0,0,0,0,0,NA))
data$probno4 = mapvalues(data$probno4, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$problast = mapvalues(data$problast, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$probreq = mapvalues(data$probreq, c(9,8,2),c(NA,0,0))
data$probif3 = mapvalues(data$probif3, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$evolved = mapvalues(data$evolved, c(9,7,8,2),c(NA,0,0,0))
data$bigbang = mapvalues(data$bigbang, c(9,7,8,2),c(NA,0,0,0))
data$onfaith = mapvalues(data$onfaith, c(9,1,2,3,4,7,8),c(NA,1,1,0,0,0,0))

#How many factors to extract?
library(nFactors)
nScree(data[complete.cases(data),]) #use complete cases only

#extract factors
library(psych) #for factor analysis
for (num in 1:5) {
  print(paste0("Factor analysis, extracting ",num," factors using oblimin and MinRes"))
  fa = fa(data,num) #extract factors
  print(fa$loadings) #print
  if (num>1){ #print factor cors
    phi = round(fa$Phi,2) #round to 2 decimals
    colnames(phi) = rownames(phi) = colnames(fa$scores) #set names
    print(phi) #print
  }
}

#Does it make a difference?
fa.all = fa(data[1:21]) #no onfaith
fa.noreligious = fa(data[1:19]) #no onfaith, bigbang, evolved
cor(fa.all$scores,fa.noreligious$scores, use="pair") #correlation, ignore missing cases

So I was installing something (forgot what), and had problems with RCurl:

* installing *source* package ‘RCurl’ …
** package ‘RCurl’ successfully unpacked and MD5 sums checked
checking for curl-config… no
Cannot find curl-config
ERROR: configuration failed for package ‘RCurl’
* removing ‘/home/lenovo/R/x86_64-pc-linux-gnu-library/3.1/RCurl’
Warning in install.packages :
installation of package ‘RCurl’ had non-zero exit status

The solution was to run sudo apt-get install libcurl4-openssl-dev (found in a comment here).

Getting a percentage table from a dataframe

A reviewer asked me to:

1) As I said earlier, there should be some data on the countries of origin of the immigrant population. Most readers have no idea who actually moves to Denmark. At the very least, there should be basic information like “x% of the immigrant population is of non-European origin and y% of European origin as of 2014.” Generally, non-European immigration would be expected to increase inequality more, given that IQ levels are relatively uniform across Europe.

I have population counts for each year 1980 through 2014 in a dataframe, and I’d like to express them as percentages within each year so as to get the relative sizes of the countries. There is a premade function for this, prop.table; however, it works quite strangely. If one gives it a dataframe and no margin, it will divide by the total sum of the dataframe instead of by column. This is sometimes useful, but not in this case. However, if one gives it a dataframe and margin=2, it complains that:

Error in margin.table(x, margin) : 'x' is not an array

Which is odd, given that it just accepted a dataframe before. The relative lack of documentation made it not quite easy to figure out how to make it work. It turns out that one just has to convert the dataframe to a matrix when passing it:

census.percent = prop.table(as.matrix(census), margin=2)

and then one can convert it back and also multiply by 100 to get percentages instead of fractions:

census.percent = as.data.frame(prop.table(as.matrix(census), margin=2)*100)
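
As a quick self-contained check of the same approach (the census dataframe itself is not included here, so this uses a made-up frame with hypothetical country names and counts):

census.toy = data.frame(X1980 = c(10, 30, 60), X1990 = c(25, 25, 50),
                        row.names = c("Turkey", "Poland", "Germany")) #made-up counts
prop.table(as.matrix(census.toy), margin=2)*100 #column-wise percentages; each column sums to 100

The rownames survive the as.matrix conversion, so the country labels stay attached to the percentages.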

Getting the top 10 countries with names for selected years

This one was harder. Here’s the code I ended up with:

selected.years = c("X1980","X1990","X2000","X2010","X2014") #years of interest
for (year in selected.years){ #loop over each year of interest
  vector = census.percent[,year,drop=FALSE] #get the vector, DONT DROP!
  View(round(vector[order(vector, decreasing = TRUE),,drop=FALSE][1:10,,drop=FALSE],1)) #sort vector, DONT drop! and get 1:10 and DONT DROP!
}

First we choose the years we want (note that X goes in front because R has trouble handling column names that begin with a number). Then we loop over each year of interest. Then we pick out that column to avoid having to select the same column over and over. However, normally when picking out 1 column from a dataframe, R drops it to a plain vector, which is very bad because this removes the rownames. That means that even though we could find the top 10 countries, we wouldn’t know which ones they are. The solution for this is to set drop=FALSE. The next part consists of first ordering the vector (without dropping!) and then selecting the top 10 countries, again without dropping. I open them in View (in RStudio) because this makes it easier to copy the values for further use (e.g. in a table for a paper).

So, drop=FALSE is another one of those pesky small things to remember. It is just like stringsAsFactors=FALSE when using read.table (or read.csv).
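
Here is a minimal illustration of the drop behavior with a made-up one-column dataframe (the names are hypothetical):

toy = data.frame(X1980 = c(5, 1, 3), row.names = c("Turkey", "Poland", "Germany")) #made-up counts
toy[,"X1980"] #dropped to a plain numeric vector; the rownames are gone
toy[,"X1980",drop=FALSE] #stays a dataframe; rownames kept
toy[order(toy$X1980, decreasing=TRUE),,drop=FALSE] #sorted, country names still attached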

 

For a mathematical explanation of the Shapiro-Wilk test, see e.g. here. However, such an explanation is not very useful for using the test in practice. Just what does a W value of .95 mean? What about .90 or .99? One way to get a feel for it is to simulate datasets, plot them and calculate the W values. Additionally, one can check the sensitivity of the test, i.e. the p value.

All the code is in R.

#random numbers from normal distribution
set.seed(42) #for reproducible numbers
x = rnorm(5000) #generate random numbers from normal dist
hist(x,breaks=50, main="Normal distribution, N=5000") #plot
shapiro.test(x) #SW test
>W = 0.9997, p-value = 0.744

[Figure: histogram of the normal sample, N=5000]

So, as expected, W was very close to 1, and p was large. In other words, SW did not reject a normal distribution just because N is large. But maybe it was a freak accident. What if we were to repeat this experiment 1000 times?

#repeat sampling + test 1000 times
Ws = numeric(); Ps = numeric() #empty vectors
for (n in 1:1000){ #number of simulations
  x = rnorm(5000) #generate random numbers from normal dist
  sw = shapiro.test(x)
  Ws = c(Ws,sw$statistic)
  Ps = c(Ps,sw$p.value)
}
hist(Ws,breaks=50) #plot W distribution
hist(Ps,breaks=50) #plot P distribution
sum(Ps<.05) #how many Ps below .05?

The number of Ps below .05 was in fact 43, or 4.3%. I ran the code with 100,000 simulations too, which takes 10 minutes or something. The value was 4389, i.e. 4.4%. So it seems that the method used to estimate the P value is slightly off in that the false positive rate is lower than expected.
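
One way to check whether these counts are consistent with the nominal 5% rate is a simple binomial test (a quick sanity check on top of the simulations above, not part of the original runs):

binom.test(43, 1000, p=0.05) #43/1000 is not significantly different from 5%
binom.test(4389, 100000, p=0.05) #4389/100000 is significantly below 5%

So the 1000-run result is still compatible with 5%, but with 100,000 runs the shortfall is clearly systematic.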

What about the W statistic? Is it sensitive to fairly small deviations from normality?

#random numbers from normal distribution, slight deviation
x = c(rnorm(4900),rnorm(100,2))
hist(x,breaks=50, main="Normal distribution N=4900 + normal distribution N=200, mean=2")
shapiro.test(x)
>W = 0.9965, p-value = 1.484e-09


Here I started with a very large normal distribution and added a small normal distribution with a different mean. The difference is hardly visible to the eye, but the P value is very small. The reason is that the large sample size makes it possible to detect even very small deviations from normality. W was again very close to 1, indicating that the distribution was close to normal.

What about a decidedly non-normal distribution?

#random numbers between -10 and 10
x = runif(5000, min=-10, max=10)
hist(x,breaks=50,main="evenly distributed numbers [-10;10], N=5000")
shapiro.test(x)
>W = 0.9541, p-value < 2.2e-16

 

[Figure: histogram of the uniform sample, N=5000]

SW wisely rejects normality here with great certainty. However, W is still near 1 (.95). This tells us that the W value does not vary very much even when the distribution is decidedly non-normal. For interpretation then, we should probably bark when W drops just under .99 or so.

As a further test of the W values, here are two equal-sized normal distributions plotted together.

#normal distributions, 2 sd apart (unimodal fat normal distribution)
x = c(rnorm(2500, -1, 1),rnorm(2500, 1, 1))
hist(x,breaks=50,main="Mormal distributions, 2 sd apart")
shapiro.test(x)
>W = 0.9957, p-value = 6.816e-11
sd(x)
>1.436026

[Figure: histogram of the normal distributions 2 sd apart]

It still looks fairly normal, although too fat. The standard deviation is in fact 1.44, or 44% larger than it is supposed to be. The W value is still fairly close to 1, however, and only a little lower than for the distribution that was only slightly non-normal (Ws = .9957 and .9965). What about clearly bimodal distributions?

#bimodal normal distributions, 4 sd apart
x = c(rnorm(2500, -2, 1),rnorm(2500, 2, 1))
hist(x,breaks=50,main="Normal distributions, 4 sd apart")
shapiro.test(x)
>W = 0.9464, p-value < 2.2e-16

[Figure: histogram of the normal distributions 4 sd apart]

This clearly looks non-normal. SW rightly rejects it, and W is about .95 (W = 0.9464). This is a bit lower than for the uniformly distributed numbers (W = 0.9541).

What about an extreme case of nonnormality?

#bimodal normal distributions, 20 sd apart
x = c(rnorm(2500, -10, 1),rnorm(2500, 10, 1))
hist(x,breaks=50,main="Normal distributions, 20 sd apart")
shapiro.test(x)
>W = 0.7248, p-value < 2.2e-16

[Figure: histogram of the normal distributions 20 sd apart]

Finally we see a big reduction in the W value.

What about some more moderate deviations from normality?

#random numbers from normal distribution, moderate deviation
x = c(rnorm(4500),rnorm(500,2))
hist(x,breaks=50, main="Normal distribution N=4500 + normal distribution N=500, mean=2")
shapiro.test(x)
>W = 0.9934, p-value = 1.646e-14

[Figure: histogram of the N=4500 + N=500 mixture]

This one has a longer tail on the right side, but it still looks fairly normal. W=.9934.

#random numbers from normal distribution, large deviation
x = c(rnorm(4000),rnorm(1000,2))
hist(x,breaks=50, main="Normal distribution N=4000 + normal distribution N=1000, mean=2")
shapiro.test(x)
>W = 0.991, p-value < 2.2e-16

[Figure: histogram of the N=4000 + N=1000 mixture]

This one has a very long right tail. W=.991.

In conclusion

Generally we see that given a large sample, SW is sensitive to departures from normality. If the departure is very small, however, it is not very important.

We also see that it is hard to reduce the W value even if one deliberately tries. One needs an extremely non-normal distribution for it to fall appreciably below .99.

Cengage Learning
27500 Drake Road
Farmington Hills, MI 48331
 
 
Tuesday, February 11, 2014
 
 
RE: Unauthorized Use of Cengage Learning Material
 
In reference to the following Cengage Learning product(s): An Introduction to Language 9th Edition by Victoria Fromkin
 
Dear Sir:
 
It has been brought to our attention that material belonging to a Cengage Learning company has been used without acquiring permission. Copyrighted material from the title listed above is posted to the following unprotected URL: emilkirkegaard.dk/en/wp-content/uploads/Victoria-Fromkin-Robert-Rodman-Nina-Hyams-An-Introduction-to-Language.pdf
 
We can find no records to indicate that Cengage Learning granted you permission to reproduce its material to a publically accessible website. As such, we ask that you remove this material immediately and confirm that you have done so.
 
This letter is strictly without waiver of, or prejudice to, our rights, claims or remedies, all of which are hereby expressly reserved.
 
Sincerely,
 
Heather Ungarten
Infringement and Anti-Piracy Paralegal
Cengage Learning
27500 Drake Rd., Farmington Hills, MI 48331