First, as always, let’s load in the packages and dataset we’ll be working with for these examples. To start, let’s use the jury dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
jury <- read.csv("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/Jury/jury_data_basic.csv")
Something important we haven’t covered yet is the AND and OR operators. The AND operator (typed as &, called “ampersand”) takes two logical values and returns TRUE only if both values are TRUE. For instance, suppose we have a variable x equal to 12 and another variable y equal to 17. We can then check whether each of these variables is greater than 5 and less than 15:
x <- 12
x > 5 & x < 15
## [1] TRUE
y <- 17
y > 5 & y < 15
## [1] FALSE
The OR operator (typed as |, called “pipe” or “vertical bar”) takes two logical values and returns TRUE if either one (or both) of the values is TRUE. If we replace the ANDs in the previous example with ORs, here’s what we get:
x <- 12
x > 5 | x < 15
## [1] TRUE
y <- 17
y > 5 | y < 15
## [1] TRUE
Since y equals 17, and since 17 is greater than 5, the answer is TRUE to the question “Is y greater than 5 or less than 15?”.
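Note also that & and | are vectorized: given whole vectors rather than single values, they compare element by element, which is what makes them so handy for working with columns of data. A quick sketch with a made-up vector z:
z <- c(3, 8, 12, 20)
z > 5 & z < 15
## [1] FALSE  TRUE  TRUE FALSE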
These operators are very useful when you’re filtering/subsetting your data, too. You can combine subset conditions using AND or OR. For instance, here we can subset out a dataframe of just the observations where the participants rated themselves under 50 on all three of the familiarity scales – that is, they rated themselves under 50 for SSBE familiarity AND under 50 for Middlesbrough familiarity AND under 50 for Newcastle familiarity.
lowfamiliarity <- subset(jury, SSBEFamiliarity < 50 & MiddlesbroughFamiliarity < 50 & NewcastleFamiliarity < 50)
…or maybe we want to make a new column that tells us whether the first sound clip in each pair is from a northern, rather than southern, voice (that is, either Middlesbrough or Newcastle is in the Audio1Accent column).
jury$northern <- jury$Audio1Accent == "Newcastle" | jury$Audio1Accent == "Middlesbrough"
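A handy alternative here is the %in% operator, which checks whether each value on its left appears anywhere in the vector on its right; this sketch does the same thing as the line above:
# %in% returns TRUE wherever Audio1Accent matches any value in the vector
jury$northern <- jury$Audio1Accent %in% c("Newcastle", "Middlesbrough")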
For character vectors, we can get the overall number of cells per category using the xtabs() function. xtabs() tabulates, or counts up, the number of cells with each unique label. Here we can look at the number of rows by Level – that is, how many pairs of voices were compared in each level of this experiment. Note the use of the tilde (~) in this formula. The tilde is a bit idiosyncratic in R, and frequently means “as a function of”, but just go with it for now. Like natural languages, computer languages sometimes have idiosyncratic features.
xtabs(~Level, jury)
## Level
##     1     2 3: evidence 3: expert
## 12040 12040        2384      9656
We can get further breakdowns of the counts using the plus sign. The following example breaks down the number of pairs in each level that were different-speaker pairs vs. same-speaker pairs.
xtabs(~Level + CorrectAnswer, jury)
##              CorrectAnswer
## Level         different same
##   1                7525 4515
##   2                7525 4515
##   3: evidence      1490  894
##   3: expert        6035 3621
# or, if we wanted our table flipped around the other way:
xtabs(~CorrectAnswer + Level, jury)
##              Level
## CorrectAnswer    1    2 3: evidence 3: expert
##   different   7525 7525        1490      6035
##   same        4515 4515         894      3621
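If you want proportions instead of raw counts, you can wrap an xtabs() table in prop.table(), which divides each cell by the table total; a minimal sketch:
# Each cell becomes its share of the total number of observations
prop.table(xtabs(~Level, jury))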
You can also obtain counts (and proportions) of categorical variables using tidyverse code.
counts <- jury %>%
  group_by(Level) %>%
  summarize(count = n())
# This should give you an ERROR because length() requires you to type out an argument, unlike n()
counts <- jury %>%
  group_by(Level) %>%
  summarize(count = length())
# This code should work. Note you can replace the argument inside length() with *any* of the column names (Level, CorrectAnswer, SSBEFamiliarity, Participant, etc.) and it will generate the same by-Level row count.
counts <- jury %>%
  group_by(Level) %>%
  summarize(count = length(Level))
# If you want to group by two variables, you just put them both in group_by(), separated by a comma:
counts <- jury %>%
  group_by(Level, CorrectAnswer) %>%
  summarize(count = n())
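To turn grouped counts into proportions in tidyverse style, you can divide each count by the sum of all the counts; a sketch building on the code above:
counts <- jury %>%
  group_by(Level) %>%
  summarize(count = n()) %>%
  mutate(proportion = count / sum(count))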
If we run the following code, what we see is the number of ratings completed overall for each level.
jury %>%
  group_by(Level) %>%
  summarize(count = n())
## # A tibble: 4 × 2
##   Level       count
##   <chr>       <int>
## 1 1           12040
## 2 2           12040
## 3 3: evidence  2384
## 4 3: expert    9656
Suppose we wanted to find out how many participants completed each level instead of how many observations we have per level (in this case, each participant rated eight pairs of voices per level). If we add “Participant” to the group_by() argument, that doesn’t give us what we want – it’ll just tell us how many ratings were done by each participant for each level:
jury %>%
  group_by(Level, Participant) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'Level'. You can override using the
## `.groups` argument.
## # A tibble: 4,515 × 3
## # Groups:   Level [4]
##    Level Participant count
##    <chr> <chr>       <int>
##  1 1     5422ae1afd      8
##  2 1     5460ef52fd      8
##  3 1     54ad30a4fd      8
##  4 1     55101de0fd      8
##  5 1     55146e64fd      8
##  6 1     5516e759fd      8
##  7 1     556dcc18fd      8
##  8 1     557d8066fd      8
##  9 1     55a29d4dfd      8
## 10 1     55b0d269fd      8
## # ℹ 4,505 more rows
One solution is to instead first collapse the data such that we only have one row per participant per level, which we can then count up. In this case, it’s useful to select just the columns we’re interested in: we can ignore all the other columns for now as we count how many participants did each level. To make a new dataframe with just these two columns, we can use the select() function:
participant_level <- jury %>%
  select(Participant, Level)
If you view the resulting dataframe, you can see that we now have lots of duplicate rows. We don’t want to stop there! We can then use the unique() function to delete all the duplicates, keeping only one copy of each unique row:
participant_level <- jury %>%
  select(Participant, Level) %>%
  unique()
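As an aside, dplyr has its own equivalent of unique() for dataframes, called distinct(); it does the same job here:
participant_level <- jury %>%
  select(Participant, Level) %>%
  distinct()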
We can now add our group_by() and summarize() functions to group by Level and then get the count for how many individual participants did each one:
jury %>%
  select(Participant, Level) %>%
  unique() %>%
  group_by(Level) %>%
  summarize(count = n())
## # A tibble: 4 × 2
##   Level       count
##   <chr>       <int>
## 1 1            1505
## 2 2            1505
## 3 3: evidence   298
## 4 3: expert    1207
# or, if we want to create a dataframe with this table rather than just spitting out the answer in the console...
participant_level <- jury %>%
  select(Participant, Level) %>%
  unique() %>%
  group_by(Level) %>%
  summarize(count = n())
Here we see that all 1,505 participants did levels 1 and 2, but some did a version of level 3 involving expert witness testimony, and some did a version of level 3 involving the introduction of a new piece of evidence.
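A shortcut worth knowing: dplyr’s n_distinct() counts unique values directly inside summarize(), so we can get the same answer without the select() and unique() steps:
# n_distinct() counts the unique participant IDs within each level
jury %>%
  group_by(Level) %>%
  summarize(count = n_distinct(Participant))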
For this example, let’s load in another dataset. Download the mandarin_f0.csv file from eClass and load it into your RStudio environment.
mandarin <- read.csv("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/From UoY QM course/datasets/mandarin_f0.csv")
This is a dataset containing f0 (pitch) measurements. Each sound file (named in the “file” column) contained several words; each vowel was located (named in the “vowel” column, along with the preceding “prec” and following “foll” sounds) and its start and end points were marked; from these start and end points the duration could be calculated (“dur”), and f0 could be measured at 10 time points through the vowel (so “f0_1” is the f0 at the beginning of the vowel, “f0_2” is the f0 at the next time point, all the way to “f0_10” at the end).
Suppose we want to find the mean of all the f0 measurements at timepoint 5 (that is, f0_5), grouped by file, so that we get a sense of the average f0 within each file at about the midpoint of each vowel.
mandarin %>%
  group_by(file) %>%
  summarise(mean_f0 = mean(f0_5))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav      NA
##  2 ALL_005_M_CMN_CMN_HT1.wav      NA
##  3 ALL_005_M_CMN_CMN_HT2.wav      NA
##  4 ALL_005_M_CMN_CMN_LPP.wav      NA
##  5 ALL_005_M_CMN_CMN_NWS.wav      NA
##  6 ALL_012_M_CMN_CMN_DHR.wav      NA
##  7 ALL_012_M_CMN_CMN_HT1.wav      NA
##  8 ALL_012_M_CMN_CMN_HT2.wav      NA
##  9 ALL_012_M_CMN_CMN_LPP.wav      NA
## 10 ALL_012_M_CMN_CMN_NWS.wav      NA
## # ℹ 53 more rows
If you look at the resulting table, everything is NA! That’s not good. The reason is that, by default, mean() and other standard mathematical functions return NA if the vector (column) contains any NA values. You’ll know this has happened to you because the returned value (e.g., for the mean) will be NA. One way to omit NA values is to wrap the vector within the function na.omit():
mandarin %>%
  group_by(file) %>%
  summarise(mean_f0 = mean(na.omit(f0_5)))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav    152.
##  2 ALL_005_M_CMN_CMN_HT1.wav    164.
##  3 ALL_005_M_CMN_CMN_HT2.wav    164.
##  4 ALL_005_M_CMN_CMN_LPP.wav    152.
##  5 ALL_005_M_CMN_CMN_NWS.wav    139.
##  6 ALL_012_M_CMN_CMN_DHR.wav    117.
##  7 ALL_012_M_CMN_CMN_HT1.wav    120.
##  8 ALL_012_M_CMN_CMN_HT2.wav    126.
##  9 ALL_012_M_CMN_CMN_LPP.wav    120.
## 10 ALL_012_M_CMN_CMN_NWS.wav    112.
## # ℹ 53 more rows
Another strategy that does the same thing is using the optional argument “na.rm = TRUE”, where na.rm means “remove the NA values”.
mandarin %>%
  group_by(file) %>%
  summarise(mean_f0 = mean(f0_5, na.rm = TRUE))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav    152.
##  2 ALL_005_M_CMN_CMN_HT1.wav    164.
##  3 ALL_005_M_CMN_CMN_HT2.wav    164.
##  4 ALL_005_M_CMN_CMN_LPP.wav    152.
##  5 ALL_005_M_CMN_CMN_NWS.wav    139.
##  6 ALL_012_M_CMN_CMN_DHR.wav    117.
##  7 ALL_012_M_CMN_CMN_HT1.wav    120.
##  8 ALL_012_M_CMN_CMN_HT2.wav    126.
##  9 ALL_012_M_CMN_CMN_LPP.wav    120.
## 10 ALL_012_M_CMN_CMN_NWS.wav    112.
## # ℹ 53 more rows
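If you want to know how many NAs you’re dealing with before deciding how to handle them, you can combine is.na() with sum(); a quick sketch:
# is.na() returns TRUE for each NA, and sum() counts the TRUEs
sum(is.na(mandarin$f0_5))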
If we don’t want to type na.omit() or na.rm = TRUE over and over again, we can potentially use the drop_na() function. This removes every row that contains any NA value – not just rows with an NA for f0_5, but rows with an NA for any variable. If our data contained rows with a number in the f0_5 column but an NA in the f0_4 column, we probably wouldn’t want to use drop_na() here, because it would remove too many rows.
na_dropped <- mandarin %>%
  drop_na()
nrow(mandarin)
## [1] 27352
nrow(na_dropped)
## [1] 11463
Here you can see that we’ve eliminated over half the rows, because so many of them have one or more NAs somewhere! Supposing that was something we wanted to do, we could have drop_na() as our first step, after which we use the group and summarize functions.
mandarin %>%
  drop_na() %>%
  group_by(file) %>%
  summarise(mean_f0 = mean(f0_5))
## # A tibble: 63 × 2
##    file                      mean_f0
##    <chr>                       <dbl>
##  1 ALL_005_M_CMN_CMN_DHR.wav    160.
##  2 ALL_005_M_CMN_CMN_HT1.wav    164.
##  3 ALL_005_M_CMN_CMN_HT2.wav    169.
##  4 ALL_005_M_CMN_CMN_LPP.wav    159.
##  5 ALL_005_M_CMN_CMN_NWS.wav    147.
##  6 ALL_012_M_CMN_CMN_DHR.wav    119.
##  7 ALL_012_M_CMN_CMN_HT1.wav    119.
##  8 ALL_012_M_CMN_CMN_HT2.wav    126.
##  9 ALL_012_M_CMN_CMN_LPP.wav    122.
## 10 ALL_012_M_CMN_CMN_NWS.wav    115.
## # ℹ 53 more rows
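As an aside, drop_na() can also be given specific column names, in which case it only drops rows with NAs in those particular columns; a sketch, supposing we only care about complete f0_5 values:
# Drop only rows where f0_5 is NA, keeping rows with NAs elsewhere
mandarin %>%
  drop_na(f0_5) %>%
  group_by(file) %>%
  summarise(mean_f0 = mean(f0_5))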
We’ve seen before in Tutorial 2 how to get a dataframe with means, standard deviations, and counts of observations:
same_means <- jury %>%
  group_by(Level, CorrectAnswer) %>%
  summarize(mean_sameness = mean(SameAnswer),
            sd_sameness = sd(SameAnswer),
            count = length(SameAnswer))
## `summarise()` has grouped output by 'Level'. You can override using the
## `.groups` argument.
f0_means <- mandarin %>%
  group_by(file) %>%
  summarize(mean_f0 = mean(na.omit(f0_5)),
            sd_f0 = sd(na.omit(f0_5)),
            count = length(na.omit(f0_5)))
We’ve also seen in our class lecture that the standard error is calculated by taking the standard deviation and dividing by the square root of the number of observations. That means that once we have a dataframe saved with the means, standard deviations, and counts, all we need to do is add a column that does the standard error calculation:
same_means$se <- same_means$sd_sameness / sqrt(same_means$count)
f0_means$se <- f0_means$sd_f0 / sqrt(f0_means$count)
(Let’s put aside for the moment that the standard error should really be calculated over independent observations, such as per-speaker means, rather than over correlated raw observations. In the practice below, you’ll be able to do it the proper way: first calculate the sample means, then take the mean, standard deviation, and count of those means, and from those calculate the standard error.)
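If you prefer to stay within a single pipe, the same standard error calculation can be appended with mutate(); a sketch repeating the jury summary from above:
same_means <- jury %>%
  group_by(Level, CorrectAnswer) %>%
  summarize(mean_sameness = mean(SameAnswer),
            sd_sameness = sd(SameAnswer),
            count = length(SameAnswer)) %>%
  mutate(se = sd_sameness / sqrt(count))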
You might want to save a table of values as a .csv or text file for ease of copying it into a report/paper. You can use the write.table() function to create a space-delimited .txt file, or write.csv() to create a comma-delimited .csv file. Within each of these functions, you first give the name of the dataframe you want to save as a table, then, in quotes, the path to where you want to save it, ending with the name you want to give the new file and the .csv or .txt extension it should have.
When using write.table(), it defaults to writing a space-delimited .txt file – that is, there will be a space between each of the columns. You can instead make it a tab-delimited file by adding the argument sep = "\t", which tells R to separate the columns with a tab.
You can also set row.names to FALSE; this is recommended because it prevents R from saving the row numbers as an extra column in your file, which can cause issues if you read the data back into R later. Comparing the following two examples, you can see that when row.names = FALSE is not included, you end up with an initial column of row numbers that is in most cases extraneous/unnecessary.
write.table(same_means, "/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/sameness_means.txt", sep = "\t", row.names = FALSE)
write.csv(f0_means, "/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/f0_means.csv")
Disclaimer: Some of these original materials were put together by Eleanor Chodroff and Elisa Passoni for the University of York. Thomas Kettig then inherited them and modified them as needed, particularly based on notes by Nathan Sanders from the University of Toronto. The R software and its packages are distributed under the terms of the GNU General Public License, either Version 2 (June 1991) or Version 3 (June 2007); run the command licence() for more information.