In this section we will clean the data from an acceptability judgment task in Spanish run with IbexFarm. You can download the data here. The file contains data from 39 participants, with 24 experimental items and 48 filler items. The experimental items follow a 2x2 factorial design: “dependency” (wh vs. local) and “length” (short vs. long).

Note that the purpose of this tutorial is not to analyze the data. Instead, I will focus on what the raw output file looks like, what it should look like before analysis, and all the steps needed in R to get from one to the other.

Getting started

We need to load the following packages. If you don’t have them installed on your computer, you can do so using the install.packages() function.

library(plyr)

Next, we will load our data. In the read.table() function, we must specify that:

  • there is no header row (header=FALSE),
  • columns are separated by commas (sep=","), and
  • additional cells should be added to pad out rows with fewer columns (fill=TRUE). If we don’t do this, R won’t be able to read our file, because the number of columns varies across rows.
dat <- read.table("data/results.txt", header=FALSE, sep=",", fill=TRUE)
write.csv(dat, "data/results.csv", row.names=FALSE)

Cleaning the data

Right now our columns don’t have names:

colnames(dat)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9"

We can rename these columns so that we can work with our data frame more intuitively:

colnames(dat) <- c("subj_ID", "MD5", "controller", "item_abs", "element", "type", "item",
                   "sentence", "rating")
colnames(dat)
## [1] "subj_ID"    "MD5"        "controller" "item_abs"   "element"   
## [6] "type"       "item"       "sentence"   "rating"

We also need to get rid of all rows containing NAs:

dat <- na.omit(dat)

Adding participants’ information

In our intro HTML document, we may have added some questionnaires to gather information about our participants (age, gender, native language, etc.). We need that information as columns in our data frame. However, IbexFarm provides it as rows at the beginning of each participant’s results. We can extract the information into a vector and add it to our data frame as a column. Instead of doing this individually for each item in our questionnaire, we can write a function that automates the process. I’ve named it addcolumn.

addcolumn <- function(fieldname) {
  # Number of rows each participant contributes (count of the second subject's rows)
  repnumber <- count(dat, vars="subj_ID")[2,2]
  # Keep only the rows where the "sentence" column matches the questionnaire field
  x <- droplevels(subset(dat, dat$sentence == fieldname))
  # The participant's answer is stored in the "rating" column
  x <- x$rating
  # Repeat each participant's answer once per row that participant contributes
  y <- rep(x, each=repnumber)
  return(y)
}
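To see what the rep(..., each=...) step is doing, here is a toy illustration with two hypothetical participants who each contribute three rows (the values are made up, not from the actual data):

ages <- c("27", "31")     # one questionnaire answer per participant (hypothetical)
rows_per_subject <- 3     # rows each participant contributes
rep(ages, each=rows_per_subject)
## [1] "27" "27" "27" "31" "31" "31"

Each answer is stretched out to match that participant’s rows, so the resulting vector lines up with the data frame row by row.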

Now we can add the new columns with the information from the questionnaire. Note the paste() call: it converts the factor values to character first, so that as.numeric() parses the actual numbers rather than the factor level codes:

dat$age <- as.numeric(paste(addcolumn("age")))
dat$sex <- as.factor(paste(addcolumn("sex")))
dat$spain <- as.factor(paste(addcolumn("Spain")))
dat$spanish <- as.factor(paste(addcolumn("Spanish")))
dat$consent <- as.factor(paste(addcolumn("consent")))

If we have a look at the structure of the new data frame, we’ll see that the new variables have been added:

str(dat)
## 'data.frame':    8151 obs. of  14 variables:
##  $ subj_ID   : Factor w/ 40 levels "1550858942","1550859288",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MD5       : Factor w/ 3473 levels "10004","10010",..: 3008 3008 3008 3008 3008 3008 3008 3008 3008 3008 ...
##  $ controller: Factor w/ 3 levels "","AcceptabilityJudgment",..: 3 3 3 3 3 2 2 2 2 2 ...
##  $ item_abs  : int  0 0 0 0 0 5 5 2 2 4 ...
##  $ element   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ type      : Factor w/ 9 levels "","filler_gram",..: 4 4 4 4 4 7 7 7 7 7 ...
##  $ item      : Factor w/ 26 levels "","1","10","11",..: 26 26 26 26 26 26 26 26 26 26 ...
##  $ sentence  : Factor w/ 187 levels "","¿A cuál de los camareros que trabajan en la cervecería han visto borracho por la calle Dato?",..: 146 154 182 183 179 184 175 87 175 85 ...
##  $ rating    : Factor w/ 34 levels "","1","19","2",..: 10 34 33 33 30 1 27 1 2 1 ...
##  $ age       : num  27 27 27 27 27 27 27 27 27 27 ...
##  $ sex       : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ spain     : Factor w/ 1 level "Si": 1 1 1 1 1 1 1 1 1 1 ...
##  $ spanish   : Factor w/ 2 levels "No","Si": 2 2 2 2 2 2 2 2 2 2 ...
##  $ consent   : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int  8 11 14 17 20 23 26 29 32 35 ...
##   ..- attr(*, "names")= chr  "8" "11" "14" "17" ...

When creating the HTML file, I coded some of the answers in Spanish:

levels(dat$spain)
## [1] "Si"
levels(dat$spanish)
## [1] "No" "Si"

This is not a problem for us. However, I’m going to translate them into English in case we want to share our data:

dat$spain <- as.factor(ifelse(dat$spain=="Si","yes","no"))
dat$spanish <- as.factor(ifelse(dat$spanish=="Si","yes","no"))

levels(dat$spain)
## [1] "yes"
levels(dat$spanish)
## [1] "no"  "yes"

We can now delete the rows we are not interested in. We are going to delete the rows containing:

  • the participants’ information (age, gender, etc.) because we already have coded that information in columns.
  • the ratings of the practice items because they are uninformative.
  • the second row of each response. IbexFarm provides two rows of information per response, but we are only interested in the one containing the rating.
dat <- droplevels(subset(dat, type != "intro" & type != "practice" &
                           sentence == "NULL"))

We are going to add a new column naming our participants (S1, S2 … Sn):

subj <- length(levels(as.factor(dat$subj_ID)))
resp <- count(dat$subj_ID)[1,2]

dat$subject <- as.factor(paste("S", rep(1:subj, each=resp), sep=""))

As a sanity check, we can make sure each subject has exactly one response per item. If we find more than one response per item, or none, we should try to find out why.

xtabs(~ subject + item, dat)
##        item
## subject  1 10 11 12 13 14 15 16 17 18 19  2 20 21 22 23 24  3  4  5  6  7
##     S1   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S10  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S11  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S12  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S13  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S14  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S15  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S16  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S17  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S18  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S19  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S2   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S20  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S21  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S22  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S23  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S24  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S25  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S26  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S27  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S28  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S29  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S3   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S30  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S31  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S32  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S33  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S34  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S35  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S36  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S37  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S38  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S39  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S4   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S5   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S6   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S7   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S8   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     S9   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##        item
## subject  8  9 NULL
##     S1   1  1   72
##     S10  1  1   72
##     S11  1  1   72
##     S12  1  1   72
##     S13  1  1   72
##     S14  1  1   72
##     S15  1  1   72
##     S16  1  1   72
##     S17  1  1   72
##     S18  1  1   72
##     S19  1  1   72
##     S2   1  1   72
##     S20  1  1   72
##     S21  1  1   72
##     S22  1  1   72
##     S23  1  1   72
##     S24  1  1   72
##     S25  1  1   72
##     S26  1  1   72
##     S27  1  1   72
##     S28  1  1   72
##     S29  1  1   72
##     S3   1  1   72
##     S30  1  1   72
##     S31  1  1   72
##     S32  1  1   72
##     S33  1  1   72
##     S34  1  1   72
##     S35  1  1   72
##     S36  1  1   72
##     S37  1  1   72
##     S38  1  1   72
##     S39  1  1   72
##     S4   1  1   72
##     S5   1  1   72
##     S6   1  1   72
##     S7   1  1   72
##     S8   1  1   72
##     S9   1  1   72

Add conditions

If we have a look at our data frame using the head function, there is a variable called “type” in which we have the conditions specified: filler_gram, filler_ungram, wh_long, wh_short, local_long, local_short.

head(dat)
##       subj_ID                              MD5            controller
## 25 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 28 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 31 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 34 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 37 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 40 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
##    item_abs element          type item sentence rating age    sex spain
## 25      117       0   filler_gram NULL     NULL      7  27 female   yes
## 28      136       0 filler_ungram NULL     NULL      3  27 female   yes
## 31       82       0      wh_short    4     NULL      5  27 female   yes
## 34      105       0   filler_gram NULL     NULL      7  27 female   yes
## 37      164       0 filler_ungram NULL     NULL      4  27 female   yes
## 40      156       0 filler_ungram NULL     NULL      3  27 female   yes
##    spanish consent subject
## 25     yes     yes      S1
## 28     yes     yes      S1
## 31     yes     yes      S1
## 34     yes     yes      S1
## 37     yes     yes      S1
## 40     yes     yes      S1

We need to code this information into three new columns. First, we are going to add a column specifying whether a particular item is a filler or an experimental sentence. This way, we can subset our data frame and analyze either filler or experimental items.

dat$exp <- as.factor(ifelse(dat$type=="filler_gram", "filler",
                     ifelse(dat$type=="filler_ungram", "filler",
                            "exp")))

Then, we are going to add two columns to code our two factors: “dependency” (wh vs. local) and “length” (short vs. long):

dat$dependency <- as.factor(ifelse(dat$type=="wh_short", "wh",
                            ifelse(dat$type=="wh_long", "wh",
                            ifelse(dat$type=="local_short","local",
                            ifelse(dat$type=="local_long","local",
                            NA)))))

dat$length <- as.factor(ifelse(dat$type=="wh_short", "short",
                            ifelse(dat$type=="wh_long", "long",
                            ifelse(dat$type=="local_short","short",
                            ifelse(dat$type=="local_long","long",
                            NA)))))

We can also add a new column with trial number information:

dat$trial <- as.factor(seq(1,resp))

Calculate z-scores

There are different approaches to analyzing Likert-scale data. One that has been widely used in experimental syntax is fitting a linear mixed model. However, our dependent variable is ordinal: it has 7 ordered categories (from 1 to 7), and the distance between adjacent categories is unknown. Since a linear mixed model assumes a continuous dependent variable, we need to transform our ratings into z-scores first.

The population z-score of a measurement x is: \[z=\frac{x-\mu}{\sigma}\] where \(\mu\) is the population mean and \(\sigma\) is the population standard deviation.

The sample z-score of a measurement x is:

\[z=\frac{x-\bar{x}}{s}\] where \(\bar{x}\) is the sample mean and \(s\) is the sample standard deviation.

In short, a z-score tells us how many standard deviations a data point deviates from the sample mean. For example, a z-score of 1.6 means that particular data point is 1.6 standard deviations above the mean, whereas a z-score of -0.5 means that data point is 0.5 standard deviations below the mean.
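As a quick check on the formula, we can compute z-scores by hand on a toy vector of ratings (made-up values) and confirm that R’s built-in scale() function gives the same result:

x <- c(1, 4, 4, 5, 7)          # toy ratings (hypothetical)
z <- (x - mean(x)) / sd(x)     # the sample z-score formula above
all.equal(as.numeric(scale(x)), z)
## [1] TRUE

Note that scale() returns a matrix with extra attributes, which is why we wrap it in as.numeric() for the comparison.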

When we transform our data into z-scores, our dependent variable becomes continuous and centered around 0. In our case, we are going to calculate z-scores by subject. That is, we are going to calculate the mean and standard deviation of each subject, and use these values to calculate z-scores for each subject individually. This approach eliminates individual scale biases and makes the data more comparable across subjects. Participants behave idiosyncratically: some participants only use extreme values when rating sentences (1 and 7), others never use extreme values and stick to intermediate ratings (3, 4, 5), and others use the whole scale. By computing z-scores from each participant’s own mean and standard deviation, individual ratings can be compared on the same scale.

Even though this seems complicated, we can easily compute z-scores by subject and store them in a new column called “z.score.rating” with the following code:

dat$z.score.rating <- ave(as.numeric(dat$rating), dat$subject, FUN=scale)

Organizing the data frame

Now we have all the information we need in our data frame, but it looks kind of messy:

head(dat)
##       subj_ID                              MD5            controller
## 25 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 28 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 31 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 34 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 37 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
## 40 1550858942 845661b29bd302686e29263fa1cf016c AcceptabilityJudgment
##    item_abs element          type item sentence rating age    sex spain
## 25      117       0   filler_gram NULL     NULL      7  27 female   yes
## 28      136       0 filler_ungram NULL     NULL      3  27 female   yes
## 31       82       0      wh_short    4     NULL      5  27 female   yes
## 34      105       0   filler_gram NULL     NULL      7  27 female   yes
## 37      164       0 filler_ungram NULL     NULL      4  27 female   yes
## 40      156       0 filler_ungram NULL     NULL      3  27 female   yes
##    spanish consent subject    exp dependency length trial z.score.rating
## 25     yes     yes      S1 filler       <NA>   <NA>     1     1.52216717
## 28     yes     yes      S1 filler       <NA>   <NA>     2    -0.43272311
## 31     yes     yes      S1    exp         wh  short     3     0.54472203
## 34     yes     yes      S1 filler       <NA>   <NA>     4     1.52216717
## 37     yes     yes      S1 filler       <NA>   <NA>     5     0.05599946
## 40     yes     yes      S1 filler       <NA>   <NA>     6    -0.43272311

First, we are going to transform all the NULL cells into NAs:

dat[dat == "NULL"] <- NA

Then, we are going to get rid of the columns we don’t need anymore:

dat <- droplevels(subset(dat, select=-c(subj_ID, MD5, item_abs, 
                                         element, controller, sentence)))

We are going to change the order of columns to make it clearer:

colnames(dat)
##  [1] "type"           "item"           "rating"         "age"           
##  [5] "sex"            "spain"          "spanish"        "consent"       
##  [9] "subject"        "exp"            "dependency"     "length"        
## [13] "trial"          "z.score.rating"
dat <- dat[, c(9, 4, 5, 6, 7, 8, 1, 13, 10, 2, 11, 12, 3, 14)]
colnames(dat)
##  [1] "subject"        "age"            "sex"            "spain"         
##  [5] "spanish"        "consent"        "type"           "trial"         
##  [9] "exp"            "item"           "dependency"     "length"        
## [13] "rating"         "z.score.rating"

And that’s it! Our data frame is ready to be analyzed. We can have a look at the first rows to see what it looks like:

head(dat)
##    subject age    sex spain spanish consent          type trial    exp
## 25      S1  27 female   yes     yes     yes   filler_gram     1 filler
## 28      S1  27 female   yes     yes     yes filler_ungram     2 filler
## 31      S1  27 female   yes     yes     yes      wh_short     3    exp
## 34      S1  27 female   yes     yes     yes   filler_gram     4 filler
## 37      S1  27 female   yes     yes     yes filler_ungram     5 filler
## 40      S1  27 female   yes     yes     yes filler_ungram     6 filler
##    item dependency length rating z.score.rating
## 25 <NA>       <NA>   <NA>      7     1.52216717
## 28 <NA>       <NA>   <NA>      3    -0.43272311
## 31    4         wh  short      5     0.54472203
## 34 <NA>       <NA>   <NA>      7     1.52216717
## 37 <NA>       <NA>   <NA>      4     0.05599946
## 40 <NA>       <NA>   <NA>      3    -0.43272311

We can also save it as a .csv file in case we want to share it with other researchers or upload it to an online repository:

write.csv(dat,"data/results_clean.csv", row.names=FALSE)

Participants’ information

Now that we have our data frame, we can have a look at the participants’ information. We can calculate the mean age, its standard deviation, and the age range:

round(mean(dat$age),2)
## [1] 32.64
round(sd(dat$age),2)
## [1] 9.92
range(dat$age)
## [1] 19 60

We can also break participants down by gender:

gender<-subset(dat, dat$item=="1")
xtabs(~ sex, gender)
## sex
## female   male 
##     27     12

We can now check whether they are native speakers. First, we subset those who meet the two criteria we set in our HTML form:

datsub <- droplevels(subset(dat, spanish=="yes" & spain=="yes"))

This is the total number of participants:

nlevels(dat$subject)
## [1] 39

And this is the total number of participants who are native speakers according to our criteria:

nlevels(datsub$subject)
## [1] 36

This is the difference, i.e. the participants who are not native speakers according to our criteria and will therefore be removed from the analysis:

nlevels(dat$subject)-nlevels(datsub$subject)
## [1] 3