In this section we will clean the data from an acceptability judgment task in Spanish run with IbexFarm. You can download the data here. This file contains data from 39 participants and there were 24 experimental items and 48 filler items. The experimental items follow a 2x2 factorial design: “dependency” (wh vs. local) and “length” (short vs. long).
Note that the purpose of this tutorial is not to analyze the data. I will focus on what the output file looks like and on what it should look like before it can be analyzed, and I will explain all the steps that need to be followed in R.
We need to load the following packages. If you don't have them installed on your computer, you can do so using the install.packages() function.
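For example, to install plyr (you only need to do this once per machine):
install.packages("plyr")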
library(plyr)
Next, we will load our data. In the read.table() function, we must specify head=FALSE, sep=",", and fill=TRUE. If we don't do this, R won't be able to read our file.
dat <- read.table("data/results.txt", head=FALSE, sep=",", fill=TRUE)
write.csv(dat,"data/results.csv", row.names=FALSE)
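To make sure the file was read in correctly, we can quickly check its dimensions and peek at the first rows (output not shown here):
dim(dat)
head(dat, 3)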
Right now our columns don’t have names:
colnames(dat)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9"
We can rename these columns so that we can work with our data frame more intuitively:
colnames(dat) <- c("subj_ID", "MD5", "controller", "item_abs", "element", "type", "item",
"sentence", "rating")
colnames(dat)
## [1] "subj_ID" "MD5" "controller" "item_abs" "element"
## [6] "type" "item" "sentence" "rating"
We also need to get rid of all rows containing NAs:
dat <- na.omit(dat)
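If we want to know how many rows were dropped, we can inspect the "na.action" attribute that na.omit() attaches to its result (output not shown):
length(na.action(dat))  # number of rows removed by na.omit()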
In our intro HTML document, we may have added some questionnaires to gather information about our participants (age, gender, native language, etc.). We need to have that information in columns in our data frame. However, IbexFarm provides it at the beginning of each participant's results. We can easily extract the information into a vector and paste it into our data frame. Instead of doing this individually for each item in our questionnaire, we can create a function to do it more automatically. I've named it addcolumn.
addcolumn <- function(fieldname) {
  # rows per participant (this assumes every participant has the same number of rows;
  # here the count is taken from the second participant in the table)
  repnumber <- count(dat, vars="subj_ID")[2,2]
  # keep the rows where the "sentence" column contains the questionnaire field name
  x <- droplevels(subset(dat, dat$sentence == fieldname))
  # the answers to that field are stored in the "rating" column
  x <- x$rating
  # repeat each participant's answer once per row of that participant
  y <- rep(x, each=repnumber)
  return(y)
}
Now we can add the new columns with the information from the questionnaire:
dat$age <- as.numeric(paste(addcolumn("age")))
dat$sex <- as.factor(paste(addcolumn("sex")))
dat$spain <- as.factor(paste(addcolumn("Spain")))
dat$spanish <- as.factor(paste(addcolumn("Spanish")))
dat$consent <- as.factor(paste(addcolumn("consent")))
If we have a look at the structure of the new data frame, we’ll see that the new variables have been added:
str(dat)
## 'data.frame': 8151 obs. of 14 variables:
## $ subj_ID : Factor w/ 40 levels "1550858942","1550859288",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ MD5 : Factor w/ 3473 levels "10004","10010",..: 3008 3008 3008 3008 3008 3008 3008 3008 3008 3008 ...
## $ controller: Factor w/ 3 levels "","AcceptabilityJudgment",..: 3 3 3 3 3 2 2 2 2 2 ...
## $ item_abs : int 0 0 0 0 0 5 5 2 2 4 ...
## $ element : int 0 0 0 0 0 0 0 0 0 0 ...
## $ type : Factor w/ 9 levels "","filler_gram",..: 4 4 4 4 4 7 7 7 7 7 ...
## $ item : Factor w/ 26 levels "","1","10","11",..: 26 26 26 26 26 26 26 26 26 26 ...
## $ sentence : Factor w/ 187 levels "","¿A cuál de los camareros que trabajan en la cervecería han visto borracho por la calle Dato?",..: 146 154 182 183 179 184 175 87 175 85 ...
## $ rating : Factor w/ 34 levels "","1","19","2",..: 10 34 33 33 30 1 27 1 2 1 ...
## $ age : num 27 27 27 27 27 27 27 27 27 27 ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
## $ spain : Factor w/ 1 level "Si": 1 1 1 1 1 1 1 1 1 1 ...
## $ spanish : Factor w/ 2 levels "No","Si": 2 2 2 2 2 2 2 2 2 2 ...
## $ consent : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int 8 11 14 17 20 23 26 29 32 35 ...
## ..- attr(*, "names")= chr "8" "11" "14" "17" ...
When creating the HTML file, I coded some of the answers in Spanish:
levels(dat$spain)
## [1] "Si"
levels(dat$spanish)
## [1] "No" "Si"
This is not a problem for us. However, I'm going to translate them into English in case we want to share our data:
dat$spain <- as.factor(ifelse(dat$spain=="Si","yes","no"))
dat$spanish <- as.factor(ifelse(dat$spanish=="Si","yes","no"))
levels(dat$spain)
## [1] "yes"
levels(dat$spanish)
## [1] "no" "yes"
We can now delete the rows we are not interested in: the intro and practice trials, and the rows that contain the sentences rather than the responses (each judgment produces one row with the sentence and one row with the rating, so keeping only the rows where sentence is "NULL" keeps the ratings):
dat <- droplevels(subset(dat, type != "intro" & type != "practice" &
sentence == "NULL"))
We are going to add a new column naming our participants (S1, S2 … Sn):
subj <- length(levels(as.factor(dat$subj_ID)))  # number of participants
resp <- count(dat$subj_ID)[1,2]                 # number of responses per participant
dat$subject <- as.factor(paste("S", rep(1:subj, each=resp), sep=""))
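This relies on the rows being grouped by participant, which is how IbexFarm writes its results. As a quick check that each new label corresponds to exactly one original ID, something like this should return TRUE:
all(rowSums(table(dat$subject, dat$subj_ID) > 0) == 1)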
As another sanity check, we can make sure each subject has only one response per item. If we have more than one response per item, or none, we should try to find out why.
xtabs(~ subject + item, dat)
## item
## subject 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 3 4 5 6 7
## S1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S17 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S18 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S23 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S24 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S26 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S27 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S28 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S29 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S30 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S31 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S34 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S35 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S36 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S37 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S38 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S39 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## S9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## item
## subject 8 9 NULL
## S1 1 1 72
## S10 1 1 72
## S11 1 1 72
## S12 1 1 72
## S13 1 1 72
## S14 1 1 72
## S15 1 1 72
## S16 1 1 72
## S17 1 1 72
## S18 1 1 72
## S19 1 1 72
## S2 1 1 72
## S20 1 1 72
## S21 1 1 72
## S22 1 1 72
## S23 1 1 72
## S24 1 1 72
## S25 1 1 72
## S26 1 1 72
## S27 1 1 72
## S28 1 1 72
## S29 1 1 72
## S3 1 1 72
## S30 1 1 72
## S31 1 1 72
## S32 1 1 72
## S33 1 1 72
## S34 1 1 72
## S35 1 1 72
## S36 1 1 72
## S37 1 1 72
## S38 1 1 72
## S39 1 1 72
## S4 1 1 72
## S5 1 1 72
## S6 1 1 72
## S7 1 1 72
## S8 1 1 72
## S9 1 1 72
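If the table is too long to scan by eye, we can also check it programmatically. A sketch that looks only at the experimental items (the NULL column corresponds to trials without an item number, i.e. the fillers):
tab <- xtabs(~ subject + item, droplevels(subset(dat, item != "NULL")))
any(tab != 1)  # should be FALSE: exactly one response per subject and item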
If we have a look at our data frame using the head() function, we can see a variable called "type" in which the conditions are specified: filler_gram, filler_ungram, wh_long, wh_short, local_long, local_short.
head(dat)
## subj_ID MD5 controller
## 25 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 28 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 31 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 34 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 37 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 40 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## item_abs element type item sentence rating age sex spain
## 25 117 0 filler_gram NULL NULL 7 27 female yes
## 28 136 0 filler_ungram NULL NULL 3 27 female yes
## 31 82 0 wh_short 4 NULL 5 27 female yes
## 34 105 0 filler_gram NULL NULL 7 27 female yes
## 37 164 0 filler_ungram NULL NULL 4 27 female yes
## 40 156 0 filler_ungram NULL NULL 3 27 female yes
## spanish consent subject
## 25 yes yes S1
## 28 yes yes S1
## 31 yes yes S1
## 34 yes yes S1
## 37 yes yes S1
## 40 yes yes S1
We need to code this information into three new columns. First, we are going to add a column specifying whether a particular item is a filler or an experimental sentence. This way, we can subset our data frame and analyze either filler or experimental items.
dat$exp <- as.factor(ifelse(dat$type=="filler_gram", "filler",
ifelse(dat$type=="filler_ungram", "filler",
"exp")))
Then, we are going to add two columns to code our two factors: “dependency” (wh vs. local) and “length” (short vs. long):
dat$dependency <- as.factor(ifelse(dat$type=="wh_short", "wh",
ifelse(dat$type=="wh_long", "wh",
ifelse(dat$type=="local_short","local",
ifelse(dat$type=="local_long","local",
NA)))))
dat$length <- as.factor(ifelse(dat$type=="wh_short", "short",
ifelse(dat$type=="wh_long", "long",
ifelse(dat$type=="local_short","short",
ifelse(dat$type=="local_long","long",
NA)))))
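As an aside, since the experimental condition labels follow a "dependency_length" pattern, the same two columns could also be derived by splitting the label at the underscore. A sketch (dependency2 and length2 are just placeholder names so we don't overwrite the columns above):
parts <- strsplit(as.character(dat$type), "_")
dat$dependency2 <- as.factor(ifelse(dat$exp == "exp", sapply(parts, `[`, 1), NA))
dat$length2 <- as.factor(ifelse(dat$exp == "exp", sapply(parts, `[`, 2), NA))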
We can also add a new column with trial number information:
dat$trial <- as.factor(seq(1, resp))  # the sequence 1:resp is recycled for each participant
There are different approaches to analyzing Likert-scale data. An approach that has been widely used in experimental syntax is fitting a linear mixed model. Our dependent variable is ordinal: we have 7 ordered categories (from 1 to 7) and the distance between categories is unknown. The problem is that we cannot fit a linear mixed model with an ordinal dependent variable, so we need to transform our data into z-scores.
The population z-score of a measurement x is: \[z=\frac{x-\mu}{\sigma}\] where \(\mu\) is the population mean and \(\sigma\) is the population standard deviation.
The sample z-score of a measurement x is:
\[z=\frac{x-\bar{x}}{s}\] where \(\bar{x}\) is the sample mean and \(s\) is the sample standard deviation.
In short, a z-score tells us how many standard deviations a data point deviates from the sample mean. For example, a z-score of 1.6 means that particular data point is 1.6 standard deviations above the mean, whereas a z-score of -0.5 means that data point is 0.5 standard deviations below the mean.
When we transform our data into z-scores, our dependent variable becomes continuous and centered around 0. In our case, we are going to calculate z-scores by subject. That is, we are going to calculate each subject's mean and standard deviation, and use these values to compute that subject's z-scores. This approach eliminates individual scale biases and makes the data more comparable across subjects. Participants behave idiosyncratically: some participants only use the extreme values when rating sentences (1 and 7), others never use extreme values and stick to intermediate ratings (3, 4, 5), and others use the whole scale. By computing z-scores with each participant's own mean and standard deviation, individual ratings can be compared on the same scale.
Even though this seems complicated, we can easily compute z-scores by subject and store them in a new column called “z.score.rating” with the following code:
# as.character() first, so that the factor levels are converted to their numeric values
dat$z.score.rating <- ave(as.numeric(as.character(dat$rating)), dat$subject, FUN=scale)
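To convince ourselves that the by-subject scaling did what we expect, we can check that each participant's z-scores have a mean of (approximately) 0 and a standard deviation of 1 (output not shown):
tapply(dat$z.score.rating, dat$subject, mean)[1:3]
tapply(dat$z.score.rating, dat$subject, sd)[1:3]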
Now we have all the information we need in our data frame, but it looks kind of messy:
head(dat)
## subj_ID MD5 controller
## 25 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 28 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 31 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 34 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 37 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 40 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## item_abs element type item sentence rating age sex spain
## 25 117 0 filler_gram NULL NULL 7 27 female yes
## 28 136 0 filler_ungram NULL NULL 3 27 female yes
## 31 82 0 wh_short 4 NULL 5 27 female yes
## 34 105 0 filler_gram NULL NULL 7 27 female yes
## 37 164 0 filler_ungram NULL NULL 4 27 female yes
## 40 156 0 filler_ungram NULL NULL 3 27 female yes
## spanish consent subject exp dependency length trial z.score.rating
## 25 yes yes S1 filler <NA> <NA> 1 1.52216717
## 28 yes yes S1 filler <NA> <NA> 2 -0.43272311
## 31 yes yes S1 exp wh short 3 0.54472203
## 34 yes yes S1 filler <NA> <NA> 4 1.52216717
## 37 yes yes S1 filler <NA> <NA> 5 0.05599946
## 40 yes yes S1 filler <NA> <NA> 6 -0.43272311
First, we are going to transform all the NULL cells into NAs:
dat[dat == "NULL"] = NA
Then, we are going to get rid of the columns we don’t need anymore:
dat <- droplevels(subset(dat, select=-c(subj_ID, MD5, item_abs,
element, controller, sentence)))
We are going to change the order of columns to make it clearer:
colnames(dat)
## [1] "type" "item" "rating" "age"
## [5] "sex" "spain" "spanish" "consent"
## [9] "subject" "exp" "dependency" "length"
## [13] "trial" "z.score.rating"
dat <- dat[, c(9, 4, 5, 6, 7, 8, 1, 13, 10, 2, 11, 12, 3, 14)]
colnames(dat)
## [1] "subject" "age" "sex" "spain"
## [5] "spanish" "consent" "type" "trial"
## [9] "exp" "item" "dependency" "length"
## [13] "rating" "z.score.rating"
And that’s it! Our data frame is ready to be analyzed. We can have a look at the first rows to see what it looks like:
head(dat)
## subject age sex spain spanish consent type trial exp
## 25 S1 27 female yes yes yes filler_gram 1 filler
## 28 S1 27 female yes yes yes filler_ungram 2 filler
## 31 S1 27 female yes yes yes wh_short 3 exp
## 34 S1 27 female yes yes yes filler_gram 4 filler
## 37 S1 27 female yes yes yes filler_ungram 5 filler
## 40 S1 27 female yes yes yes filler_ungram 6 filler
## item dependency length rating z.score.rating
## 25 <NA> <NA> <NA> 7 1.52216717
## 28 <NA> <NA> <NA> 3 -0.43272311
## 31 4 wh short 5 0.54472203
## 34 <NA> <NA> <NA> 7 1.52216717
## 37 <NA> <NA> <NA> 4 0.05599946
## 40 <NA> <NA> <NA> 3 -0.43272311
We can also save it as a .csv file in case we want to share it with other researchers or upload it to an online repository:
write.csv(dat,"data/results_clean.csv", row.names=FALSE)
Now that we have our data frame, we can have a look at the participants’ information. We can calculate the mean age, standard deviation and age range:
round(mean(dat$age),2)
## [1] 32.64
round(sd(dat$age),2)
## [1] 9.92
range(dat$age)
## [1] 19 60
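Note that these summaries are computed over every row of the data frame. Since each participant contributes the same number of rows, the mean and range are unaffected, but it is arguably cleaner to compute them on one row per participant, for example (a sketch; ppt is just a temporary name):
ppt <- droplevels(subset(dat, item == "1"))  # one row per participant
round(mean(ppt$age), 2)
round(sd(ppt$age), 2)
range(ppt$age)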
We can also break down the participants by gender:
gender<-subset(dat, dat$item=="1")
xtabs(~ sex, gender)
## sex
## female male
## 27 12
We can now check whether they are native speakers. First, we subset those who meet the two criteria we set in our HTML form:
datsub <- droplevels(subset(dat, spanish=="yes" & spain=="yes"))
This is the total number of participants:
nlevels(dat$subject)
## [1] 39
And this is the total number of participants who are native speakers according to our criteria:
nlevels(datsub$subject)
## [1] 36
This is the difference, i.e. the number of participants who are not native speakers according to our criteria and will therefore be removed from the analysis:
nlevels(dat$subject)-nlevels(datsub$subject)
## [1] 3
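If we want to see which participants these are, we can compare the two sets of labels (output not shown):
setdiff(levels(dat$subject), levels(datsub$subject))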