LING 3300 - Tutorial 2

Load in the dataset we’ll be working with for these examples

jury <- read.csv("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/Jury/jury_data_basic.csv", header = TRUE)

Remember - here you should change the path between the quotation marks so that it points to where the .csv is on YOUR computer. If you don’t know how to do this easily, Google “how to find the path name of a file” + your operating system (Mac, Windows 10, etc.). This will be a very valuable skill going forward, so master it now!

Also remember that the reason we put header = TRUE is so that R knows that the first line of our csv is the names of our columns, not data itself.
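To see what header = TRUE actually changes, here is a small sketch that reads the same made-up CSV text both ways. The csv_text contents and the object names are invented for illustration; they are not the real jury file.

```r
# Toy CSV text standing in for a real file (contents are made up)
csv_text <- "Participant,SameAnswer
p01,80
p02,15"

# header = TRUE: the first line becomes the column names
with_header <- read.csv(text = csv_text, header = TRUE)
names(with_header)  # "Participant" "SameAnswer"
nrow(with_header)   # 2 rows of data

# header = FALSE: the first line is treated as a row of data
no_header <- read.csv(text = csv_text, header = FALSE)
names(no_header)    # "V1" "V2" (placeholder names made up by R)
nrow(no_header)     # 3 rows, because the names were counted as data
```

read.csv() already assumes header = TRUE if you leave the argument out, so writing it explicitly is mostly a reminder to yourself about what the file looks like.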

Installing packages

In general, it’s useful to know how to install and update packages. Packages are bundles of code that you can import into R. They’re sort of like software applications that you install and then open up when needed. They give you way more coding options and power than just using the stuff that R comes pre-loaded with - imagine buying an iPad and then not downloading any new apps to it! That wouldn’t be very fun.

One package we’ll be using today is called tidyverse. Actually, the tidyverse contains several R packages that are incredibly useful for data analysis and visualization. Think of it like “Microsoft Suite” (which I guess is what we used to call “Microsoft Office”) – it’s a collection of pieces of software to do certain specific things, like Word, PowerPoint, Excel, etc. By installing and loading tidyverse, you’re availing yourself of a bunch of individual specialized packages including ggplot2, dplyr, tidyr, readr, and some others. The tidyverse is also a philosophy for maintaining datasets and general data and code hygiene; for instance, it allows us to construct our code in a different way from the default R language (which we call “Base R”).

If you’re connected to the internet, you can install/update a package with the following code:

install.packages("tidyverse")

You’re not done yet! You now “own” the package but you still need to pull the package “off the shelf” to use the code. You can import the package with the library() function. Yes, you do have to write this line of code every time you restart R and want to access the code again! That means that most of your scripts will begin with two bits of code: reading in the data you want to work with and reading in the packages you need.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Alternatively, you can click the “Packages” tab in the lower righthand corner of RStudio, and check the box for the packages you want to use.

Don’t worry if some warnings come up when you load in packages or run other code. They just contain information like the version of R that they were built in. You usually won’t have to worry about this. “Warnings” are often fine, while “errors” are almost always bad.
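The “Conflicts” lines in the startup message above mean that two packages have a function with the same name, and the most recently loaded one wins. A minimal sketch of what that looks like, using a made-up data frame called toy:

```r
library(dplyr)

# Toy data frame (values are made up)
toy <- data.frame(x = 1:5)

# After library(dplyr), a bare filter() means dplyr's row filter
filter(toy, x > 3)

# The package::function() form says exactly which version you want
rows_kept <- dplyr::filter(toy, x > 3)          # keeps the rows where x is 4 or 5
moving_avg <- stats::filter(1:5, rep(1/3, 3))   # a 3-point moving average, a different job entirely
```

In this course you will almost always want the dplyr version, so the conflict message is usually nothing to act on.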

Introduction to dplyr

One of the most useful packages within the tidyverse is dplyr. If you’ve already loaded in the whole tidyverse using library(tidyverse), then dplyr has been loaded and you’re ready to go! In the future, though, if you’d like, you can load the individual packages within tidyverse instead of everything at once. In the case of dplyr, you’d run:

library(dplyr)

dplyr has a few very useful functions:

filter()
select()
mutate()
group_by()
summarize()

Note that there are also ways to do all of these things in Base R, but dplyr code is sometimes a little more logical and streamlined.
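Since select() and mutate() are not demonstrated elsewhere in this tutorial, here is a brief sketch of what they do. The toy data frame and the SameProportion column name are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the jury data (values are made up)
toy <- data.frame(
  Participant = c("p01", "p01", "p02"),
  Level = c("1", "2", "1"),
  SameAnswer = c(80, 15, 60)
)

# select() keeps only the columns you name
toy_small <- select(toy, Participant, SameAnswer)

# mutate() adds a new column computed from existing ones;
# here a hypothetical SameProportion rescales the 0-100 rating to 0-1
toy_scaled <- mutate(toy, SameProportion = SameAnswer / 100)
```

A handy way to remember the difference from filter(): select() chooses columns, filter() chooses rows, and mutate() adds or changes columns.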

filter()

You can select just a subset of the rows in the dataset using the filter() function. This works really similarly to the Base R subset() function; in fact, in this simple case, they look basically identical. Let’s create some new dataframes (which will show up in our Environment at upper right) with just the jury data from Level 1 (beige and boring) and just the data from Level 2 (after being introduced to the robot jury scenario).

jury_level1 <- filter(jury, Level == "1")
View(jury_level1)

jury_level2 <- subset(jury, Level == "2")
View(jury_level2)
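If you want to convince yourself that the two functions pick out the same rows, you can try them on a tiny made-up data frame (toy below is invented, not the real jury data). The sketch also shows that filter() accepts more than one condition.

```r
library(dplyr)

# Toy stand-in for the jury data (values are made up)
toy <- data.frame(
  Level = c("1", "2", "1", "2"),
  SameAnswer = c(80, 15, 60, 30)
)

level1_dplyr <- filter(toy, Level == "1")   # dplyr
level1_base <- subset(toy, Level == "1")    # Base R

# Both keep the same two rows
identical(level1_dplyr$SameAnswer, level1_base$SameAnswer)

# Several conditions separated by commas must ALL be true
level1_high <- filter(toy, Level == "1", SameAnswer > 70)
```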

The pipe %>%

dplyr lets you use what’s called a pipe, which is written like this: %>% (Since version 4.1, Base R also has its own native pipe, written |>, but we’ll stick with %>% in this course.)

The pipe allows for an assembly line of functions: you start with the original dataset, then apply a function to that dataset. Using %>% allows you to think of your code as a sequence of actions.

For instance, the two sets of code below do the exact same thing:

jury_level1 <- filter(jury, Level == "1")

jury_level1 <- jury %>% 
  filter(Level == "1")

In the first set of code, we type out the “verb” – in this case, filter() – then within the parentheses we define the dataframe of interest as jury, followed by an operation on the variable Level. In the second set, we type out the “topic” – in this case, our dataframe jury – and then after the pipe we tell it what “verb” we want to operate on the data – in this case, filtering based on a particular variable called Level. In both cases, the leftmost part of the code will be the name we want our new dataframe to be called, plus the assignment operator <-.
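The pipe really pays off once there is more than one step. Here is a small sketch comparing nested function calls with the piped version; the numbers and the toy data frame are made up for illustration.

```r
library(dplyr)  # loading dplyr makes %>% available

# Nested: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped: read from top to bottom, one step per line
piped <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

nested == piped  # TRUE: both are 1 + 2 + 3 = 6

# The same idea with a data frame: two filtering steps in a row
toy <- data.frame(
  Level = c("1", "2", "1", "2"),
  SameAnswer = c(80, 15, 60, 30)
)
toy_level1_high <- toy %>%
  filter(Level == "1") %>%
  filter(SameAnswer > 70)
```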

You may find that you sometimes prefer using Base R syntax and functions for some things and prefer to use dplyr’s functions and pipes for other things. It’s fine to mix and match!

Note: %>% should always have a space before it, and should usually be followed by a new line. After the first step, each line should be indented by two spaces. This structure makes it easier to add new steps (or rearrange existing steps) and harder to overlook a step.

Typing %>% over and over can be tedious! Thankfully, RStudio provides a keyboard shortcut for inserting the pipe operator into your R code.

On Mac type shift + command + m.

On Windows type shift + control + m.

It may not seem totally intuitive at first, but this shortcut can be handy once you get used to it.

Getting descriptive statistics: group_by() %>% summarise()

Chaining multiple pipes brings us to one of the most useful sequences of functions: group_by() %>% summarize()

SUPER USEFUL!

Remember the summary() function from the first tutorial? Well, that was useful, but let’s say we wanted to get the mean sameness rating for each of our levels. In Base R, we would have to create four separate subsets of the data, and run the mean() (or summary()) function on all four subsets. That’s effortful. Instead, we can use the group_by() function to create subsets, and then derive the mean using the summarize() function (or summarise(), which also works).

In this example, we’ll store these means in a new dataset called same_means:

same_means <- jury %>% 
  group_by(Level) %>% 
  summarise(mean_sameness = mean(SameAnswer))

View(same_means)

The above code takes the jury dataset, groups by the Level category, then gets the mean of the SameAnswer column and fills that value into a column we’re calling “mean_sameness” in the same_means dataframe. This is great, but wouldn’t it be even better if we could distinguish between the same- and different-speaker pairs, too?

same_means <- jury %>% 
  group_by(Level, CorrectAnswer) %>% 
  summarise(mean_sameness = mean(SameAnswer))

View(same_means)

We could also reverse the order of the grouping; note how the calculations of the means don’t change, but we might prefer either seeing the rows clustered by Level or by CorrectAnswer, depending on what our research questions are.

same_means <- jury %>% 
  group_by(CorrectAnswer, Level) %>% 
  summarise(mean_sameness = mean(SameAnswer))

View(same_means)
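To check your understanding of what group_by() %>% summarise() produces, it can help to run it on a dataset small enough to verify by hand. The toy data below is made up (including the "same" and "different" labels, which may not match how CorrectAnswer is coded in the real file).

```r
library(dplyr)

# Toy stand-in for the jury data (values are made up)
toy <- data.frame(
  Level = c("1", "1", "1", "1", "2", "2", "2", "2"),
  CorrectAnswer = c("same", "same", "different", "different",
                    "same", "same", "different", "different"),
  SameAnswer = c(80, 60, 20, 10, 90, 70, 30, 50)
)

# One row per Level
by_level <- toy %>%
  group_by(Level) %>%
  summarise(mean_sameness = mean(SameAnswer))
by_level  # Level 1: 42.5, Level 2: 60

# One row per combination of Level and CorrectAnswer
by_both <- toy %>%
  group_by(Level, CorrectAnswer) %>%
  summarise(mean_sameness = mean(SameAnswer), .groups = "drop")
by_both   # four rows: 15, 70, 40, 80
```

When you group by more than one column, recent versions of dplyr print a message saying the output is still grouped. Adding .groups = "drop" is optional; it just returns an ungrouped result and silences that message.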

You can have all sorts of functions embedded in the summarise function. These will be useful to find means, medians, standard deviations, etc.

mean()
sd()
median()
max()
min()
length() – this is one way to get the number of tokens

same_means <- jury %>% 
  group_by(Level, CorrectAnswer) %>% 
  summarize(mean_sameness = mean(SameAnswer), 
            sd_sameness = sd(SameAnswer))

same_means <- jury %>% 
  group_by(Level, CorrectAnswer) %>% 
  summarize(mean_sameness = mean(SameAnswer), 
            sd_sameness = sd(SameAnswer),
            median_sameness = median(SameAnswer))

same_means <- jury %>% 
  group_by(Level, CorrectAnswer) %>% 
  summarize(mean_sameness = mean(SameAnswer), 
            sd_sameness = sd(SameAnswer),
            median_sameness = median(SameAnswer), 
            max_sameness = max(SameAnswer), 
            min_sameness = min(SameAnswer), 
            count = length(SameAnswer))

View(same_means)

Note that within the summarize() function, on the left side of the = we specify a new name that we make up for each column in our output, and on the right side of the = we specify the function to run on a column in the input. There’s nothing special about the word “mean_sameness” – we could call this column “meansame” or “Rumpelstiltskin” or anything we wanted. It’s best if we choose a clear, unique, descriptive name that’s short enough to easily type.
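Two related details are worth a quick sketch. First, mean(), sd(), median(), max() and min() all return NA if the column has any missing values, unless you add na.rm = TRUE. Second, dplyr has its own counting helper, n(), which does the same job as length() inside summarise(). The toy data and the output column names below are made up.

```r
library(dplyr)

# Toy data with one missing rating (values are made up)
toy <- data.frame(
  Level = c("1", "1", "2", "2", "2"),
  SameAnswer = c(80, NA, 20, 40, 60)
)

toy_summary <- toy %>%
  group_by(Level) %>%
  summarise(
    mean_naive = mean(SameAnswer),                   # NA for Level 1
    mean_sameness = mean(SameAnswer, na.rm = TRUE),  # ignores the NA
    count = n(),                                     # rows per group, NAs included
    count_non_missing = sum(!is.na(SameAnswer))      # rows with an actual rating
  )
toy_summary
```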

Get unique elements

Use the unique() function to see the individual categories in a column

unique(jury$Participant)
unique(jury$Audio1Accent)

Get the length of unique elements

You can use length(unique()) to get the number of unique categories in a column. The length() function is simply wrapped around the unique() function.

length(unique(jury$Participant))

This will tell us the number of participants who took the experiment. You could alternatively write two lines of code to get this:

unique_items <- unique(jury$Participant)
length(unique_items)

Or even use the pipe %>% :

unique(jury$Participant) %>%
  length()
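Here is the same idea on a short made-up vector, so you can see the difference between the number of rows and the number of unique values. The participant IDs are invented.

```r
# Toy vector of participant IDs, with repeats (values are made up)
participants <- c("p01", "p02", "p01", "p03", "p02")

unique(participants)          # "p01" "p02" "p03"
length(participants)          # 5 observations in total
length(unique(participants))  # 3 different participants

# table() goes one step further and counts how often each value occurs
table(participants)
```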

Simple if-else statements

If-else statements are a staple of almost all programming languages. As their name suggests, these are statements that make a certain change to an element if a condition is met, and otherwise (or else) make a different change. For example, in this experiment, people heard pairs of sound files with voices that might have been same or different; in most cases, the pair of files were voices with the same accent, but sometimes we compared a Middlesbrough voice with a Newcastle voice. We could create a new column in the dataset that tells us whether the first and second audio files in a pair were matched or unmatched for accent.

There are a few ways to write if-else statements in R. One way is using the ifelse() function. ifelse() takes three arguments: the condition that needs to be met (jury$Audio1Accent == jury$Audio2Accent), what to do if that condition is TRUE (indicate “matched”), what to do if the condition is FALSE (indicate “unmatched”). Here we create a new column called accent_match (on the left side of the line) using an if-else statement:

jury$accent_match <- ifelse(jury$Audio1Accent == jury$Audio2Accent, "matched", "unmatched")
View(jury)
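ifelse() checks the condition separately for every row, which is easiest to see on a tiny example. The accent labels below are made up and may not match how the accents are coded in the real jury file.

```r
# Toy stand-in for the two accent columns (values are made up)
toy <- data.frame(
  Audio1Accent = c("Newcastle", "Middlesbrough", "Newcastle"),
  Audio2Accent = c("Newcastle", "Newcastle", "Newcastle")
)

# One TRUE/FALSE test per row, one label per row
toy$accent_match <- ifelse(toy$Audio1Accent == toy$Audio2Accent,
                           "matched", "unmatched")
toy$accent_match  # "matched" "unmatched" "matched"

# A quick check of how many rows ended up in each category
table(toy$accent_match)
```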

Getting help

You can get help using a question mark preceding the function that you want more information on. You can also get help via Google or the Help tab in the lower righthand corner.

?mean
?max
?filter
?subset

Data visualization in R with ggplot2

It’s possible to do a lot of decent data visualisation just using Base R (that is, just the functions that come pre-loaded with R). YaRrr! Chapter 11.1-11.5 demonstrates some of this Base-R plotting. But tidyverse contains a package called ggplot2 (commonly referred to as just “ggplot”) which adds flexibility and has been widely adopted by linguists and other scientists. If you have installed tidyverse and loaded it in using the library() or require() functions, then ggplot2 will automatically be loaded too. You can read more about the background to ggplot2 here.

ggplot code is fairly templatic, and works by creating “layers” to a plot. You will always need the following three components in your ggplot code:

the ggplot() function that tells R you want to create a graph with some data frame
the geom_X() function where X refers to the specific type of plot you want to make (histogram, scatterplot, etc.). We’ll go over some possible geom_X functions below
an aesthetics mapping aes() that tells R how to arrange the variables on the plot
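Put together, the three components look like this. This is only a minimal sketch on a made-up data frame; the real examples with the jury data follow below.

```r
library(ggplot2)

# Toy stand-in for the jury data (values are made up)
toy <- data.frame(
  SimilarityAnswer = c(10, 40, 70, 90),
  SameAnswer = c(5, 20, 60, 95)
)

p <- ggplot(toy) +                                       # 1. ggplot() with the data frame
  geom_point(aes(x = SimilarityAnswer, y = SameAnswer))  # 2. a geom_X(), with 3. an aes() mapping

# Run p (or print(p)) to draw the plot in the Plots tab
```

Storing the plot in an object like p is optional. If you leave off the `p <-` part, RStudio draws the plot straight away, which is what the examples below do.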

Histograms

When you want to plot the distribution of one continuous variable, you can use a histogram. Here, we’ll demonstrate with participants’ self-rating of familiarity with Standard Southern British English. (Let’s set aside debates over whether Likert-style ordinal ratings are truly quantitative, and treat this variable as continuous.)

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity))

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The stat_bin() warning simply tells you that it’s using a default binwidth for grouping the counts together; here it’s chosen to chop up all possible values from 0 to 100 so that they’re in 30 different ‘bins’ each of which is about 3.33 units wide. It’s usually safe to ignore warnings like this.
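If you would rather choose the bins yourself, you can pass binwidth (or bins) to geom_histogram(), which also makes the message go away. A small sketch on made-up familiarity ratings:

```r
library(ggplot2)

# Toy familiarity ratings on a 0-100 scale (values are made up)
toy <- data.frame(SSBEFamiliarity = c(5, 35, 50, 62, 75, 88, 90, 95, 100, 100))

# Each bar now covers 10 units of the scale
p_bins <- ggplot(toy) +
  geom_histogram(aes(x = SSBEFamiliarity), binwidth = 10)

# Alternatively, ask for a number of bins instead of a width:
# geom_histogram(aes(x = SSBEFamiliarity), bins = 20)
```

Whichever binning you pick, every observation is still counted exactly once; only the grouping changes, so it is worth trying a couple of values to see whether the shape of the distribution is stable.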

As we can see, most people rate themselves as very familiar with SSBE.

Density plots

Density plots are very similar to histograms, but instead of dealing with bins of observations and their raw counts, these estimate “kernel density”. No need to know what this is, but basically this is just a smoothed version of a histogram.

ggplot(jury) + 
  geom_density(aes(x = SSBEFamiliarity))

Boxplots

When you want to plot the distribution of one continuous variable, but for separate categories, you can use a boxplot.

ggplot(jury) + 
  geom_boxplot(aes(x = CorrectAnswer, y = SameAnswer))

Here, we can see that listeners’ answers on a 0-100 scale for the question “Are these the same or different people?” differ by whether the pair was actually the same or different; when the correct answer is that they’re different people, respondents give lower ratings for sameness than when the correct answer is that they’re the same person.

Scatterplots

When you have two continuous variables and you want to show the relationship between them, you’ll want a scatterplot. Think carefully about which variable you’ll want on the x-axis, and which one on the y-axis:

ggplot(jury) + 
  geom_point(aes(x = SimilarityAnswer, y = SameAnswer))

Here we’re plotting people’s 0-100 answer to “How similar are these voices” on the x-axis and their 0-100 answer to “Are these the same or different people?” on the y-axis. In RStudio, you can click the “Zoom” button in the Plots tab to see a larger version of the plot. One of the things we see here is that if people give low ratings for similarity, they’re very unlikely to give high ratings for sameness. There seem to be more instances of people giving high ratings for voice similarity but also saying they’re pretty sure that they’re not actually the same person.

Bar graphs

When you want to display the count or proportion of different levels of a categorical variable, you’ll want a bar graph:

ggplot(jury) + 
  geom_bar(aes(x = Level))

We see here that when this experiment was run, we collected equal amounts of data for Level 1 (when it’s boring and beige) and Level 2 (when we first introduce the jury vs. robot scenario). Because we had two versions of Level 3 which we ran on different participants, we have less data for each of those levels. We have a small amount of data for a version of Level 3 where we gave listeners an extra piece of “evidence” in the “case” (showing them a little picture of a DNA strand, footprint, or fingerprint); we have a larger amount of data for the version of Level 3 where we gave listeners access to an “expert witness” who supplied them with “testimony” in favour of the voices belonging to the same or to different people.

Layering

Facetting

Let’s say we want a separate scatterplot looking at the relationship between similarity and sameness rating for each correct answer condition separately. We can use facet_wrap(~CorrectAnswer) to split this up:

ggplot(jury) + 
  geom_point(aes(x = SimilarityAnswer, y = SameAnswer)) + 
  facet_wrap(~CorrectAnswer)

You can combine facet_wrap() with any plot. Here’s an example with a histogram:

ggplot(jury) + 
  geom_histogram(aes(x = SameAnswer)) + 
  facet_wrap(~CorrectAnswer)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

For every five different-speaker sound pairs in this experiment there were just three same-speaker pairs, so the raw numbers for the different-speaker conditions are higher. That being said, it is also apparent here that the same-speaker pairs were trickier for listeners than the different-speaker pairs. The mode of the different-speaker pair distribution is clearly very low; for the same-speaker pairs, the sameness answers seem to be bimodally distributed, showing about the same number of high sameness ratings as low ratings. We can plot these as density plots instead of histograms to verify that the shape of the same-answer distribution is more strongly bimodal than the different-answer distribution.

ggplot(jury) + 
  geom_density(aes(x = SameAnswer)) + 
  facet_wrap(~CorrectAnswer)

Axis labels

It’s important to put informative labels on your x- and y-axes, and your column names might not provide that. You can use the xlab() and ylab() functions to add an axis label.

For example:

ggplot(jury) + 
  geom_density(aes(x = SameAnswer)) + 
  xlab("Sameness rating") + 
  ylab("Density")

Axis limits

Sometimes you want to truncate the displayed data on the axes, or make sure it’s very consistent. You can specify an exact range by using the functions xlim(number1, number2) and/or ylim(number1, number2). Compare the plots below. Perhaps we want to just zoom in on the SSBE familiarity responses above 75, in which case we could set the x-axis limits to be between 75 and 101 (or any maximum more than 100 – setting it to 100 would exclude all the responses that were exactly 100, which wouldn’t be good.) Or we could zoom in on just lower responses and manually select a y-axis range.

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity))

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity)) +
  xlim(75, 101)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 14256 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity)) +
  xlim(-1, 80) +
  ylim(-1, 3000)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 18888 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
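The “Removed ... rows” warnings above are a clue that xlim() and ylim() do more than zoom: values outside the limits are dropped before the bars are calculated. If you only want to change the visible window and keep every data point in the calculation, ggplot2 also provides coord_cartesian(). A sketch on made-up data, not a replacement for the examples above:

```r
library(ggplot2)

# Toy familiarity ratings on a 0-100 scale (values are made up)
toy <- data.frame(SSBEFamiliarity = c(5, 35, 50, 62, 75, 88, 90, 95, 100, 100))

# xlim(): values below 75 are removed before counting (expect warnings when drawn)
p_xlim <- ggplot(toy) +
  geom_histogram(aes(x = SSBEFamiliarity), binwidth = 5) +
  xlim(75, 101)

# coord_cartesian(): all values are counted, only the visible window changes
p_zoom <- ggplot(toy) +
  geom_histogram(aes(x = SSBEFamiliarity), binwidth = 5) +
  coord_cartesian(xlim = c(75, 101))
```

For a histogram the two approaches often look alike, but the difference matters for plots that compute summaries, such as boxplots, where dropping data changes the medians and quartiles that get drawn.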

Colors

There are so many colors in R!

There are confusingly two ggplot arguments that will make use of color:

the color argument
the fill argument

The color argument actually just changes the outline color. The fill argument will refer to the color of a column in e.g., a bar graph or histogram.

We’ll first start with an example where we keep the outline color constant across every variable in the figure. When you want everything to be the same color, you put the color argument outside of the aes() function, but within the geom_X() function:

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity), color = "rosybrown2")

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

We can do the exact same thing with fill. Note the difference:

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity), fill = "rosybrown2")

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

You can also combine the two; order doesn’t matter:

ggplot(jury) + 
  geom_histogram(aes(x = SSBEFamiliarity), color = "black", fill = "rosybrown2")

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The second set of examples will be where the color changes with the level of a categorical variable. This can be a pretty powerful mapping in your figures. For example, we can have boxplots of the sameness ratings plotted on the y-axis, broken up by level on the x-axis, and further use color to split the boxplots by same- and different-speaker pairs. To do this, we put color or fill inside the aes() mapping. When we do this, the color or fill is mapped to some variable. Note that we then lose the immediate ability to specify the color that R uses, but we’ll return to that in a moment. Also important to note is that there are several options that can go inside the aes() mapping besides color or fill including the size of a point, the shape of a point, the line type (dashed or solid), etc. I recommend looking this up online if it interests you.

ggplot(jury) + 
  geom_boxplot(aes(x = Level, y = SameAnswer, fill = CorrectAnswer))

Further examples:

ggplot(jury) + 
  geom_boxplot(aes(x = Level, y = SimilarityAnswer, fill = CorrectAnswer))

ggplot(jury) + 
  geom_boxplot(aes(x = Level, y = SameAnswer, fill = Audio1Accent))
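A common slip is to put a color name inside aes() when you meant it as a fixed color. Inside aes(), ggplot treats "rosybrown2" as if it were a category in the data, so the bars get a default color and a legend appears. A sketch on made-up data showing both versions:

```r
library(ggplot2)

# Toy familiarity ratings on a 0-100 scale (values are made up)
toy <- data.frame(SSBEFamiliarity = c(5, 35, 50, 62, 75, 88, 90, 95, 100, 100))

# Outside aes(): a fixed color, so the bars really are rosybrown2
p_outside <- ggplot(toy) +
  geom_histogram(aes(x = SSBEFamiliarity), fill = "rosybrown2", binwidth = 10)

# Inside aes(): "rosybrown2" is treated as a data value with one category,
# so the bars get a default color and a legend labelled "rosybrown2"
p_inside <- ggplot(toy) +
  geom_histogram(aes(x = SSBEFamiliarity, fill = "rosybrown2"), binwidth = 10)
```

The rule of thumb: column names go inside aes(), literal colors go outside it.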

Change the color of the mapping

You can see how R used some default colors to map CorrectAnswer (or Audio1Accent) to the boxplot fill. You can also overwrite this using the scale_fill_manual() function. Let’s take a look at how this works:

ggplot(jury) + 
  geom_boxplot(aes(x = Level, y = SameAnswer, fill = CorrectAnswer)) +
  scale_fill_manual(values = c("rosybrown2", "lightblue"))
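When the values are given as a plain list like this, ggplot assigns them to the categories in order (usually alphabetical). If you want to be sure which category gets which color, you can name the values instead. The sketch below uses made-up data, and the labels "same" and "different" are an assumption about how CorrectAnswer is coded; check unique(jury$CorrectAnswer) for the real labels.

```r
library(ggplot2)

# Toy stand-in for the jury data (values and labels are made up)
toy <- data.frame(
  Level = rep(c("1", "2"), each = 6),
  CorrectAnswer = rep(c("same", "different"), times = 6),
  SameAnswer = c(80, 20, 70, 10, 90, 30, 85, 25, 60, 40, 75, 15)
)

# Named values: each color is tied to a specific category by name
p_named <- ggplot(toy) +
  geom_boxplot(aes(x = Level, y = SameAnswer, fill = CorrectAnswer)) +
  scale_fill_manual(values = c(same = "lightblue", different = "rosybrown2"))
```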

There are also some automatic color scales in R, and ones that work for colorblindness, etc. Some more info can be found by googling “color palettes” in R, and some good examples are on this page.

Background theme

You can change the “theme” of the R plot, which generally includes the background color, the presence of grid lines, the font size, and a few other things using some theme_X() functions. The default theme with the gray background is theme_gray(); many people like the theme_bw() or theme_classic() backgrounds. There are several more theme_X() options in the ggthemes package if you’d like to install that and experiment. Here you can see what it looks like with the theme_classic() function. You can put the font size in the parentheses if you want, or leave a number out and have it choose a default size. Note also the specification of the binwidth in this histogram, if you wanted to specify that bins should be 10 units wide:

ggplot(jury) + 
  geom_histogram(aes(x = SameAnswer), fill = "rosybrown2", color = "black", binwidth = 10) + 
  xlab("Sameness Response") + 
  theme_classic(20)

Saving images

You can save an image using various file formats. The simplest way to save an image is to click on the Export… button in the Plots panel. You should have the option to save either as a PNG or as a PDF file there. You may need to play around with the sizes and settings to get one that’s a decent resolution for PNGs. The Export option for Save as PDF is a little easier to work with, but note that PDF images often do not transfer well between Mac and Windows computers.

An alternative to clicking around at the bottom right is to use the ggsave() function. You specify the name and location of the new file you want to create, and then the height and width in inches and the resolution you’d like it to be (usually 300 is fine).

ggsave("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/Jury/plot1.png", width = 7, height = 7, dpi = 300)
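By default ggsave() saves whichever plot was displayed most recently. If you want to be explicit about which plot gets saved, you can store the plot in an object and hand it to the plot argument. The sketch below uses made-up data and writes to R's temporary folder (tempdir()) so that it runs on any computer; in your own script you would use a path of your choosing.

```r
library(ggplot2)

# Toy sameness ratings (values are made up)
toy <- data.frame(SameAnswer = c(5, 20, 35, 60, 80, 95))

# Store the plot in an object instead of only displaying it
p <- ggplot(toy) +
  geom_histogram(aes(x = SameAnswer), binwidth = 10)

# file.path() builds the path with the right separators for your system
out_file <- file.path(tempdir(), "plot1.png")

# Save that specific plot: 7 x 7 inches at 300 dpi
ggsave(out_file, plot = p, width = 7, height = 7, dpi = 300)

file.exists(out_file)  # check that the file was written
```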

Practice

Create an R script to save the answers to these questions.

Data wrangling practice

Import ‘L2_English_Lexical_Decision_Data.csv’ into R and call it ‘lex’. This data set contains reaction times (RT) in milliseconds to words and nonwords of English from L2 English speaking participants. More info about the data and project here.
Create a subset of lex using filter() that contains only the data points where the dominant language (lex$domLang) is not English.
Create a variable called ‘langs’ that contains a list of the unique dominant languages in the newly created subset.
Get the number of unique languages in ‘langs’ using code. (Don’t just look at the environment window.)
Get the number of unique participants in the subset. Participant IDs are in the ‘workerID’ column.
From the subset of non-English participants, remove data points that have reaction times below 500 ms and above 2000 ms. This will likely require two steps.
For each participant in this new subset, get the mean reaction time, the standard deviation of the reaction time, and the median reaction time. Store this data in a dataset called ‘subj_data’.
What is the mean, median and range of by-participant means?
What is the mean, median and range of by-participant standard deviations?
What is the mean, median and range of by-participant medians?
For each dominant language in the new subset, get the mean reaction time. Which language has the lowest mean, which has the highest mean?
For each dominant language in the new subset, get the number of unique participants. Hint: in the summarise() function you will need to use length(unique()).

Data visualization practice

Load in ‘L2_English_Lexical_Decision_Data.csv’ again into R and call it ‘lex’, so we’re starting over from the original dataset.
Create a histogram of the reaction times (RT column). It will probably look weird because of the extreme range of RTs.
Get a summary of the RT values, and take a look at the maximum value. Something can’t be right there.
Create another histogram of the reaction times where the x-axis only includes values from 0 to 5000 ms.
Based on the histogram, it looks like values above 2500 seem pretty improbable. Create a subset of lex which only retains reaction times below 2500 ms. Call this lex again.
Create another histogram of reaction times using the new dataset and change the background theme, along with the outline color and fill of the bars.
Let’s say we’re interested in comparing reaction times of L1 English speakers against L2 English speakers. Create a new column called ‘L2’ that indicates whether a speaker’s dominant language is English or not.
Create a boxplot with L2 status on the x-axis and the reaction time on the y-axis.
Let’s also say we might want to look at any potential effects of gender. Create another column in lex called ‘gender’. If the sex of the participant is equal to 1, then the new value in the ‘gender’ column should be “m” for male, otherwise the new value should be “f”.
Recreate the same boxplot in 8, but use the aesthetic mapping to map color to the gender column.
Now recreate the same boxplot, but also provide more descriptive axis labels.
Create a new column called ‘accuracy’. If the column ‘acc’ is equal to 1, then the value in ‘accuracy’ should be equal to “right”, otherwise the value should be equal to “wrong”.
Create a bar graph with accuracy on the x-axis and the fill of the bar mapped to the L2 status. This should create what’s called a “stacked” bar graph, where the number of “right” answers for L1 English speakers is stacked on top of the number of “right” answers for L2 English speakers.
To create a “dodged” bar graph where the bars are next to each other, you’ll need to use the following argument:

geom_bar(aes(x = accuracy, fill = L2), position = "dodge")

Now take the same graph from 14 and change the colors and background theme.