First, as always, let’s load in the packages and dataset we’ll be working with for these examples. To start, let’s use the jury dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
jury <- read.csv("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/Jury/jury_data_basic.csv")

& and |

Something important we haven’t covered yet is the AND and OR operators. The AND operator (typed as &, called “ampersand”) takes two logical values and returns TRUE only if both values are TRUE themselves. For instance, suppose we have a variable x that is equal to 12, and another variable y equal to 17. We can then check if each of these variables is greater than 5 and less than 15:

x <- 12
x > 5 & x < 15
## [1] TRUE
y <- 17
y > 5 & y < 15
## [1] FALSE

The OR operator (typed as |, usually called “pipe” or “vertical bar”) takes two logical values and returns TRUE if either one (or both) of the values is TRUE. If we replace the ANDs in the previous example with ORs, here’s what we get:

x <- 12
x > 5 | x < 15
## [1] TRUE
y <- 17
y > 5 | y < 15
## [1] TRUE

Since y equals 17, and 17 is greater than 5, the first condition is TRUE – so the answer to the question “Is y greater than 5 or less than 15?” is TRUE, even though the second condition (y < 15) is FALSE.

These operators are very useful when you’re filtering/subsetting your data, too. You can combine subset conditions using AND or OR. For instance, here we can subset out a dataframe of just the observations where the participants rated themselves under 50 on all three of the familiarity scales – that is, they rated themselves under 50 for SSBE familiarity AND under 50 for Middlesbrough familiarity AND under 50 for Newcastle familiarity.

lowfamiliarity <- subset(jury, SSBEFamiliarity < 50 & MiddlesbroughFamiliarity < 50 & NewcastleFamiliarity < 50)
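
If you prefer tidyverse style, the same subset can be written with dplyr’s filter() function – an equivalent alternative, not a required step:

lowfamiliarity <- jury %>% 
  filter(SSBEFamiliarity < 50 & MiddlesbroughFamiliarity < 50 & NewcastleFamiliarity < 50)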

…or maybe we want to make a new column that tells us whether the first sound clip in each pair is from a northern, rather than southern, voice (that is, either Middlesbrough or Newcastle is in the Audio1Accent column).

jury$northern <- jury$Audio1Accent == "Newcastle" | jury$Audio1Accent == "Middlesbrough"
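
An equivalent, slightly more compact way to write this check is with the %in% operator, which tests whether each value belongs to a set:

jury$northern <- jury$Audio1Accent %in% c("Newcastle", "Middlesbrough")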

Cross tabulation

For character vectors, we can get the number of rows per category using the xtabs() function. xtabs() tabulates, or counts up, the number of rows with each unique value. Here we can look at the number of rows by Level – that is, how many pairs of voices were compared in each level of this experiment. Note the use of the tilde (~) in this formula. The tilde is a bit idiosyncratic in R, and frequently means “as a function of”, but just go with it for now. Like natural languages, computer languages sometimes have idiosyncratic features.

xtabs(~Level, jury)
## Level
##           1           2 3: evidence   3: expert 
##       12040       12040        2384        9656

We can get further breakdowns of the counts using the plus sign. The following example breaks down the number of pairs in each level that were different-speaker pairs vs. same-speaker pairs.

xtabs(~Level + CorrectAnswer, jury)
##              CorrectAnswer
## Level         different same
##   1                7525 4515
##   2                7525 4515
##   3: evidence      1490  894
##   3: expert        6035 3621
# or, if we wanted our table flipped around the other way:

xtabs(~CorrectAnswer + Level, jury)
##              Level
## CorrectAnswer    1    2 3: evidence 3: expert
##     different 7525 7525        1490      6035
##     same      4515 4515         894      3621

You can also obtain counts (and, from those, proportions) of categorical variables with tidyverse code; a proportions example follows the count examples below.

counts <- jury %>% 
  group_by(Level) %>% 
  summarize(count = n())

# This should give you an ERROR because length() requires you to type out an argument, unlike n()
counts <- jury %>% 
  group_by(Level) %>% 
  summarize(count = length())

# This code should work. Note you can replace the argument inside length() with *any* of the column names (Level, CorrectAnswer, SSBEFamiliarity, Participant, etc.) and it will generate the same by-Level row count.
counts <- jury %>% 
  group_by(Level) %>% 
  summarize(count = length(Level))

# If you want to group by two variables, you just put them both in group_by(), separated by a comma:
counts <- jury %>% 
  group_by(Level, CorrectAnswer) %>% 
  summarize(count = n())
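
And here’s the promised proportions example: once we have the counts, we can divide each count by the total with mutate(). (A small sketch – the name level_props is just illustrative, and dividing by sum(count) assumes we want proportions out of the whole dataframe.)

level_props <- jury %>% 
  group_by(Level) %>% 
  summarize(count = n()) %>% 
  mutate(prop = count / sum(count))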

Selecting columns and getting unique values

If we run the following code, we see the total number of ratings completed for each level.

jury %>% 
  group_by(Level) %>% 
  summarize(count = n())
## # A tibble: 4 × 2
##   Level       count
##   <chr>       <int>
## 1 1           12040
## 2 2           12040
## 3 3: evidence  2384
## 4 3: expert    9656

Suppose we wanted to find out how many participants completed each level instead of how many observations we have per level (in this case, each participant rated eight pairs of voices per level). If we add “Participant” to the group_by() argument, that doesn’t give us what we want – it’ll just tell us how many ratings were done by each participant for each level:

jury %>% 
  group_by(Level, Participant) %>% 
  summarize(count = n())
## `summarise()` has grouped output by 'Level'. You can override using the
## `.groups` argument.
## # A tibble: 4,515 × 3
## # Groups:   Level [4]
##    Level Participant count
##    <chr> <chr>       <int>
##  1 1     5422ae1afd      8
##  2 1     5460ef52fd      8
##  3 1     54ad30a4fd      8
##  4 1     55101de0fd      8
##  5 1     55146e64fd      8
##  6 1     5516e759fd      8
##  7 1     556dcc18fd      8
##  8 1     557d8066fd      8
##  9 1     55a29d4dfd      8
## 10 1     55b0d269fd      8
## # ℹ 4,505 more rows

One solution is to first collapse the dataframe so that we have only one row per participant per level, and then count those rows. In this case, it’s useful to select just the columns we’re interested in: we can ignore all the other columns for now as we add up how many participants did each level. To make a new dataframe with just these two columns, we can use the select() function:

participant_level <- jury %>% 
  select(Participant, Level)

If you view the resulting dataframe, you can see that we now have lots of duplicate rows. We don’t want to stop there! We can then use the unique() function to delete the duplicate rows, keeping only one copy of each unique row:

participant_level <- jury %>% 
  select(Participant, Level) %>%
  unique()
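
As an aside, dplyr’s distinct() function combines these two steps, selecting the named columns and keeping one row per unique combination – an equivalent alternative, not a required step:

participant_level <- jury %>% 
  distinct(Participant, Level)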

We can now add our group_by() and summarize() functions to group by Level and then get the count for how many individual participants did each one:

jury %>% 
  select(Participant, Level) %>%
  unique() %>%
  group_by(Level) %>%
  summarize(count = n())
## # A tibble: 4 × 2
##   Level       count
##   <chr>       <int>
## 1 1            1505
## 2 2            1505
## 3 3: evidence   298
## 4 3: expert    1207
# or, if we want to create a dataframe with this table rather than just spitting out the answer in the console...

participant_level <- jury %>% 
  select(Participant, Level) %>%
  unique() %>%
  group_by(Level) %>%
  summarize(count = n())

Here we see that all 1,505 participants did levels 1 and 2, but some did a version of level 3 involving expert witness testimony, and some did a version of level 3 involving the introduction of a new piece of evidence.
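
Another route to the same answer, if you’d rather skip the select()/unique() steps, is dplyr’s n_distinct() function, which counts the number of unique values in a vector:

jury %>% 
  group_by(Level) %>% 
  summarize(count = n_distinct(Participant))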

NA values

For this example, let’s load in another dataset. Download the mandarin_f0.csv file from eClass and load it in to your RStudio environment.

mandarin <- read.csv("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/From UoY QM course/datasets/mandarin_f0.csv")

This is a dataset containing f0 (pitch) measurements. Each sound file (named in the “file” column) contained several words. Each vowel was located (named in the “vowel” column, along with the preceding “prec” and following “foll” sounds) and its start and end points were marked; from these start and end points, the duration was calculated (“dur”), and f0 was measured at 10 time points through the vowel (so “f0_1” is the f0 at the beginning of the vowel, “f0_2” is the f0 at the next time point, and so on up to “f0_10” at the end).

Suppose we want to find the mean of all the f0 measurements at timepoint 5 (that is, f0_5), grouped by file, so that we get a sense of the average f0 within each file at about the midpoint of each vowel.

mandarin %>% 
  group_by(file) %>% 
  summarise(mean_f0 = mean(f0_5))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav      NA
##  2 ALL_005_M_CMN_CMN_HT1.wav      NA
##  3 ALL_005_M_CMN_CMN_HT2.wav      NA
##  4 ALL_005_M_CMN_CMN_LPP.wav      NA
##  5 ALL_005_M_CMN_CMN_NWS.wav      NA
##  6 ALL_012_M_CMN_CMN_DHR.wav      NA
##  7 ALL_012_M_CMN_CMN_HT1.wav      NA
##  8 ALL_012_M_CMN_CMN_HT2.wav      NA
##  9 ALL_012_M_CMN_CMN_LPP.wav      NA
## 10 ALL_012_M_CMN_CMN_NWS.wav      NA
## # ℹ 53 more rows

If you look at the resulting table, everything is NA! That’s not good. The reason is that if a vector (column) contains any NA values, mean() and most other standard mathematical functions will by default return NA rather than a number. You’ll know this has happened because the returned value (e.g., for the mean) will be NA.
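
Here’s a minimal illustration with made-up numbers (any vector containing an NA behaves the same way):

mean(c(100, 150, NA))
## [1] NA
mean(na.omit(c(100, 150, NA)))
## [1] 125

One way to omit NA values within our pipeline is to wrap the column within the function na.omit():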

mandarin %>% 
  group_by(file) %>% 
  summarise(mean_f0 = mean(na.omit(f0_5)))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav    152.
##  2 ALL_005_M_CMN_CMN_HT1.wav    164.
##  3 ALL_005_M_CMN_CMN_HT2.wav    164.
##  4 ALL_005_M_CMN_CMN_LPP.wav    152.
##  5 ALL_005_M_CMN_CMN_NWS.wav    139.
##  6 ALL_012_M_CMN_CMN_DHR.wav    117.
##  7 ALL_012_M_CMN_CMN_HT1.wav    120.
##  8 ALL_012_M_CMN_CMN_HT2.wav    126.
##  9 ALL_012_M_CMN_CMN_LPP.wav    120.
## 10 ALL_012_M_CMN_CMN_NWS.wav    112.
## # ℹ 53 more rows

Another strategy that does the same thing is to use the optional argument na.rm = TRUE, where na.rm means “remove the NA values”.

mandarin %>% 
  group_by(file) %>% 
  summarise(mean_f0 = mean(f0_5, na.rm = TRUE))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav    152.
##  2 ALL_005_M_CMN_CMN_HT1.wav    164.
##  3 ALL_005_M_CMN_CMN_HT2.wav    164.
##  4 ALL_005_M_CMN_CMN_LPP.wav    152.
##  5 ALL_005_M_CMN_CMN_NWS.wav    139.
##  6 ALL_012_M_CMN_CMN_DHR.wav    117.
##  7 ALL_012_M_CMN_CMN_HT1.wav    120.
##  8 ALL_012_M_CMN_CMN_HT2.wav    126.
##  9 ALL_012_M_CMN_CMN_LPP.wav    120.
## 10 ALL_012_M_CMN_CMN_NWS.wav    112.
## # ℹ 53 more rows

If we don’t want to type na.omit() or na.rm = TRUE over and over again, we can potentially use the drop_na() function. Called with no arguments, it removes every row that contains any NA value – not just rows with an NA in the f0_5 column, but rows with an NA in any column. If our data contained rows with a number in the f0_5 column but an NA in the f0_4 column, we probably wouldn’t want to use drop_na() this way, because it would remove too many rows (though, as we’ll see below, drop_na() can also be told to look only at specific columns).

na_dropped <- mandarin %>%
  drop_na()

nrow(mandarin)
## [1] 27352
nrow(na_dropped)
## [1] 11463

Here you can see that we’ve eliminated about half the rows, because so many of them have one or more NAs in them! Supposing that was something we wanted to do, we could make drop_na() our first step, and then use the group_by() and summarize() functions.

mandarin %>% 
  drop_na() %>%
  group_by(file) %>% 
  summarise(mean_f0 = mean(f0_5))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav    160.
##  2 ALL_005_M_CMN_CMN_HT1.wav    164.
##  3 ALL_005_M_CMN_CMN_HT2.wav    169.
##  4 ALL_005_M_CMN_CMN_LPP.wav    159.
##  5 ALL_005_M_CMN_CMN_NWS.wav    147.
##  6 ALL_012_M_CMN_CMN_DHR.wav    119.
##  7 ALL_012_M_CMN_CMN_HT1.wav    119.
##  8 ALL_012_M_CMN_CMN_HT2.wav    126.
##  9 ALL_012_M_CMN_CMN_LPP.wav    122.
## 10 ALL_012_M_CMN_CMN_NWS.wav    115.
## # ℹ 53 more rows
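
If we only care about NAs in one particular column, there’s a middle ground: drop_na() can take column names, in which case it only drops rows that have an NA in those columns (here, just f0_5). This should match the na.rm = TRUE results above:

mandarin %>% 
  drop_na(f0_5) %>%
  group_by(file) %>% 
  summarise(mean_f0 = mean(f0_5))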

Calculating standard error

We’ve seen before in Tutorial 2 how to get a dataframe with means, standard deviations, and counts of observations:

same_means <- jury %>% 
  group_by(Level, CorrectAnswer) %>% 
  summarize(mean_sameness = mean(SameAnswer), 
            sd_sameness = sd(SameAnswer),
            count = length(SameAnswer))
## `summarise()` has grouped output by 'Level'. You can override using the
## `.groups` argument.
f0_means <- mandarin %>% 
  group_by(file) %>% 
  summarize(mean_f0 = mean(na.omit(f0_5)), 
            sd_f0 = sd(na.omit(f0_5)),
            count = length(na.omit(f0_5)))

We’ve also seen in our class lecture that the standard error is calculated by taking the standard deviation and dividing by the square root of the number of observations. That means that once we have a dataframe saved with the means, standard deviations, and counts, all we need to do is add a column that does the standard error calculation:

same_means$se <- same_means$sd_sameness / sqrt(same_means$count)

f0_means$se <- f0_means$sd_f0 / sqrt(f0_means$count)
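
Alternatively, you can fold the standard error into a single summarize() call, since later arguments to summarize() can refer to columns created by earlier ones. Here’s the f0 version written that way – an equivalent alternative to the two-step approach above:

f0_means <- mandarin %>% 
  group_by(file) %>% 
  summarize(mean_f0 = mean(na.omit(f0_5)), 
            sd_f0 = sd(na.omit(f0_5)),
            count = length(na.omit(f0_5)),
            se = sd_f0 / sqrt(count))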

(Let’s put aside for the moment that the standard error should really be calculated over a set of independent sample means, such as within-speaker means, rather than over raw observations. In the practice below, you’ll do it the proper way: first calculate the sample means, then get the mean-of-means, sd-of-means, and count-of-means, and from those calculate the standard error. A sketch of this two-step logic follows.)
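
Here’s a minimal sketch of that two-step logic using the mandarin data, with files standing in for speakers just for illustration (in a real analysis you’d group by an actual speaker variable):

# Step 1: one mean per file
file_means <- mandarin %>% 
  group_by(file) %>% 
  summarize(mean_f0 = mean(f0_5, na.rm = TRUE))

# Step 2: treat those per-file means as the observations
file_means %>% 
  summarize(grand_mean = mean(mean_f0),
            sd_of_means = sd(mean_f0),
            count = n(),
            se = sd_of_means / sqrt(count))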

Saving a table

You might want to save a table of values as a .csv or text file for ease of then copying it into a report/paper. You can use the write.table() function to create a space-delimited .txt file, or write.csv() to create a comma-delimited .csv file. Within each of these functions, you first tell it the name of the dataframe you want to save as a table, then within “” quotes you put the path name to where you want to save it, ending in the name you want to give the new file and the .csv or .txt extension it should have.

When using write.table(), it’ll default to writing a space-delimited .txt file – that is, there will be a space between each of the columns. You can also make it a tab-delimited file by adding the argument sep = "\t", which tells R to separate the columns with a tab.

You can also set row.names to FALSE; this is recommended to prevent R from saving the row numbers as an extra column in your file, which can cause issues if you read the data back into R later. Of the following two examples, only the first includes row.names = FALSE; the second will end up with an initial column of row numbers that is in most cases extraneous/unnecessary.

write.table(same_means, "/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/sameness_means.txt", sep = "\t", row.names = FALSE)

write.csv(f0_means, "/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/f0_means.csv")
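
For comparison, here’s the write.csv() call again with row.names = FALSE added, which avoids the extra row-number column:

write.csv(f0_means, "/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/f0_means.csv", row.names = FALSE)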

Practice with datasets

  1. Load in ‘L2_English_Lexical_Decision_Data.csv’ and call it ‘lex’. Note again here that the participant identifier is in the workerID column.
  2. Create another column in lex called ‘gender’. If the sex of the participant is equal to 1, then the new value in the ‘gender’ column should be “m” for male, otherwise the new value should be “f”.
  3. Get a count of how many observations we have from male vs. female participants.
  4. Get a count of how many male vs. female participants are in the sample.
  5. Create a new column called ‘Germanic’. If the domLang is English, German, Norwegian, Danish, or Afrikaans, the new value in the ‘Germanic’ column should be “Germanic”, otherwise it should be “Other”.
  6. Get a count of how many male vs. female participants there are whose dominant language is Germanic vs. other.
  7. Create a dataframe that summarizes the within-participant reaction time (RT) means, medians, standard deviations, and counts of observations. (Hint: in preparation for the next step, it would be wise to include not only workerID but also the Germanic and gender columns in your group_by() function.)
  8. Based on the dataframe created in step 7, create another dataframe that summarizes the mean of means, standard deviations, and standard errors of reaction times for the data, grouped by gender and Germanic vs. other dominant language. (Hint: you will probably need to first create a dataframe with all the measurements except for standard error, and then calculate standard error in a subsequent line of code.)
  9. Save the dataframe created in step 8 to a tab-delimited .txt table.

Disclaimer: Some of these original materials were put together by Eleanor Chodroff and Elisa Passoni for the University of York. Thomas Kettig then inherited and modified them as needed, particularly based on notes by Nathan Sanders from the University of Toronto. The R software and its packages are distributed under the terms of the GNU General Public License, either Version 2 (June 1991) or Version 3 (June 2007); run the command licence() for more information.