Previous | Home

 

R: Exercise 7

This exercise is part of a collection of exercises involving the statistics software package R.

7. Statistical tests in R

In order to perform this exercise you need to be familiar with the lexdec dataset. If this is not the case then please take a look at exercise 6, especially section 6.1.

7.1. Basic significance tests

In exercise 6, we used the tapply function in order to find out the average reaction times for native and nonnative speakers while giving correct and incorrect answers:

   > tapply(lexdec$RT,list(lexdec$NativeLanguage,lexdec$Correct),mean)
            correct incorrect
   English 6.322063  6.194966
   Other   6.468275  6.580792

We obtained four different numbers. The question was if the differences between the numbers are significant. R offers functions for finding out if this is the case.

The biggest observed difference is for incorrect answers between native speakers of English (6.194966) and nonnative speakers (6.580792). Let's take the mean of native speakers as the norm and use the reaction time values for other subjects as new observations. Are these observation significantly different from the norm? You can find this out with R's t.test function:

   > t.test(lexdec[lexdec$NativeLanguage=="Other" &
     lexdec$Correct=="incorrect",]$RT, mu = 6.194966 )
   
   One Sample t-test
   
   data:  lexdec[lexdec$NativeLanguage == "Other" & 
          lexdec$Correct == "incorrect",      ]$RT 
   t = 5.7818, df = 36, p-value = 1.360e-06
   alternative hypothesis: true mean is not equal to 6.194966 
    95 percent confidence interval:
    6.445455 6.716130 
   sample estimates:
   mean of x 
    6.580792 

This instruction compares the reaction time values of nonnative speakers (NativeLanguage==Other) for incorrect answers (Correct==incorrect) with the average of English speakers for incorrect answers (mu=6.194966). Its output contains a lot of information. The most important number is the p-value score mentioned in the middle (p-value=1.360e-06). The t-test tests the null hypothesis: that the data originates from data set with an average of 6.194966. The p-value is the confidence in accepting this hypothesis. A high value (close to 1) or a low value (close to 0) indicates that the hypothesis is rejected.

The obtained p value is very low (about one in a million) so the test is quite certain about the null hypothesis being false. The question whether the response time differences between the two groups are significant, can be answered with yes. In practice, different threshold values for p can be chosen to decide between significant differences or not, for example p<0.05, p<0.01 or p<0.001. Our p value lies below all of these thresholds.

Not all of the differences between the four averages in the table are significant:

   > t.test(lexdec[lexdec$NativeLanguage=="Other" &
     lexdec$Correct=="incorrect",]$RT, mu = 6.468275 )
   
   One Sample t-test
   
   data:  lexdec[lexdec$NativeLanguage == "Other" & 
          lexdec$Correct == "incorrect",      ]$RT 
   t = 1.6861, df = 36, p-value = 0.1004
   alternative hypothesis: true mean is not equal to 6.468275 
   95 percent confidence interval:
    6.445455 6.716130 
   sample estimates:
   mean of x 
    6.580792 

The test cannot show that the response times for correct answers by other speakers are significantly different from for their incorrect answers. The corresponding p value is 0.1004 and this is higher than any of the thresholds we mentioned earlier.

There is a catch here: the t test only works well for data that is distributed normally. We should have checked first if the values in the lists that we feed to the t test, have a normal distribution. We can get an idea about this by sorting the values and plotting their density. We hope to obtain the characteristic plot of the normal distribution:

   > plot(density(sort(lexdec[lexdec$NativeLanguage=="Other" &
     lexdec$Correct=="incorrect",]$RT)))

Unfortunately, the resulting plot does not completely look like a normal distribution. It has two long tails but its center looks more like a crater than like a hill. We are not sure if it would be appropriate to use the t test for this set of values. The Wilcoxon test can be applied for nonnormal distributions. We apply this test to our set of values just to make sure that the difference is indeed significant:

   > wilcox.test(lexdec[lexdec$NativeLanguage=="Other" &
     lexdec$Correct=="incorrect",]$RT, mu = 6.194966 )
   
           Wilcoxon signed rank test with continuity correction
   
   data:  lexdec[lexdec$NativeLanguage == "Other" &
          lexdec$Correct == "incorrect",      ]$RT 
   V = 621, p-value = 4.944e-05
   alternative hypothesis: true location is not equal to 6.194966 

The p value remains very small. The Wilcoxon test confirms that the the values are different from the supplied mean.

7.2. Correlation between sets of numbers

Often we want to find out whether two sets of values are correlated. Two sets of values (variables) are said to be correlated if knowledge of of one value enables us to predict approximately what the corresponding value in the other set will be. An example of this are heights and weights of people. Tall people are usually heavy while small people are usually light.

Correlation is expressed with the letter r. It can take any value between -1 and 1 (inclusive). An r value of 1 corresponds with a strong positive correlation, -1 corresponds with a strong negative correlation while an r value of 0 means that the two variables are uncorrelated. In R you can use the cor function for computing r values:

   > cor(lexdec$Trial,lexdec$meanRT)
   [1] -0.004461687
   > cor(lexdec$Frequency,lexdec$meanRT)
   [1] -0.6250747

This shows that the columns Trial (numeric id of experiment) and meanRT (mean reaction time) are uncorrelated while there is a negative correlation between Frequency (frequency of word in a big text) and meanRT. Let's examine plots of the two pairs of variables:

   > plot(lexdec$Trial,lexdec$meanRT)
   > plot(lexdec$Frequency,lexdec$meanRT)

In the first plot, many of the data points are at the left side of the graph. Otherwise, their locations seem to be quite random. For example, it is hard to predict the y values from the x values. These two data sets are indeed uncorrelated. However, in the second plot there seems to be a clear relation between the two data sets: higher frequency words in general correspond with shorter response times. These two data sets are correlated.

7.3. Summarizing correlated data

A relation between two data sets can be summarized with a model. A popular model for numeric data is the linear model. This model assumes that every value of the first data set corresponds with exactly one value of the other data set. It also assumes that the relation between the two variables is linear, that is that it can be represented with a function like y = s*x + i where x represents the values of the first data set and y the values of the second.

Let's display data points together with a linear model for the points:

   > plot(lexdec$Frequency,lexdec$meanRT)
   > abline(lm(data=lexdec , meanRT ~ Frequency))

The first instruction plots the average response time (meanRT) against the frequency of the words (Frequency). On the second line, there are two instructions. The lm instruction creates a linear model for the average response time compared to the frequency of the current word. This model is drawn in the graph with the command abline. When you run the lm command by itself, you will see the internal representation of the linear model. It consists of two numbers representing a line: the intercept, the position where the line intersects the y axis, and the slope, the amount of y points that the line raises for every extra x point.

A linear model works well for summarizing the relation between the data sets because they are correlated. Linear models do not work very well for uncorrelated data. Also, data might be related but the relation might not be linear. This is also the case for this data set: when we summarize the data with a smoothing function, we do not obtain a straight line:

   > plot(lexdec$Frequency,lexdec$meanRT)
   > abline(lm(data=lexdec , meanRT ~ Frequency))
   > lines(lowess(lexdec$Frequency,lexdec$meanRT),lty="dashed")

The lowess function creates local summaries of sets of data points. When we print its output as a line (the dashed line in the plot), we see that the line is not straight. So the relation between the two data sets is not completely linear. However the linear model remains a good and simple method for summarizing the relation between these two columns.

7.4. Exercises

  1. Find out what the average response times are for the words apple and vulture. Is the response time for vulture significantly (p<0.01) slower than that of apple?
  2. Compute the correlation r between first the values in the columns meanRT and SubjFreq, and then between the columns meanRT and meanSize. Draw point graphs for the two pairs of columns. Do the graphs correspond with what you could expect from the correlation coefficient r?
  3. Make a linear model for the relation between the columns meanRT and SubjFreq. What is its slope and what is its intercept? Draw the data points of the two columns together with the linear model. Add a dashed line representing the lowess smoothing of the data points to the plot. Does it correspond with a straight line?

Please continue with the case studies when you are finished with this exercise.

If you want to learn even more about R, you can take a look at Harald Baayen's book "Analyzing Linguistic Data: A practical introduction to Statistics" (pdf). The book contains several examples of R which you can try out. The data files mentioned in the book are available in the directory /storage3/data/erikt/r . Please note that these files are in a different format than used in this exercise (rda). For loading them in R you need the command load(), for example load("/storage3/data/erikt/r/verbs.rda"). This will load the data of the file and store it in a table verbs.


Previous | Home | Answers
Last update: March 24, 2011. erikt(at)xs4all.nl