First step: let’s load in the data and required packages. The dataset we’ll use is the standard McMurray & Jongman dataset of fricative and vowel measures.

library(tidyverse)

mcm <- read_csv("/Users/Thomas/Library/CloudStorage/OneDrive-YorkUniversity/LING 3300/Datasets/McMurrayJongmanFricativeAcoustics.csv")

Correlation

Correlations measure the strength of association between two variables. Several types of correlation exist; in this course, we have mainly discussed the most common type, Pearson’s correlation, which measures the degree to which two continuous variables are linearly related to each other. Correlations range from -1 to +1, with 0 indicating no relationship between the two variables, +1 indicating a perfect positive relationship, and -1 a perfect negative relationship.
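To see where the r value comes from, here is a minimal sketch on two toy vectors (numbers chosen arbitrarily, not from the McMurray & Jongman data): Pearson’s r is just the covariance of x and y scaled by the product of their standard deviations, which is exactly what cor() computes by default.

```r
# Toy vectors for illustration only (not from the dataset)
x <- c(2, 4, 6, 8, 10)
y <- c(1.9, 4.2, 5.8, 8.4, 9.7)

# Pearson's r by hand: covariance scaled by the two standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)  # default method is "pearson"

all.equal(r_manual, r_builtin)  # TRUE: same value either way
```

Because these toy points fall nearly on a straight line, both versions come out very close to +1.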

The research hypothesis states that a relationship exists between the two variables (r ≠ 0); the null hypothesis states that no relationship exists between the two variables (r = 0).

As an example, let’s consider some of the vowel measures in this dataset. While the Midterm and Assignment 2 have focused on the measures of fricative acoustics, this dataset also contains measurements of the vowels uttered just after each fricative. Let’s consider the difference between talker mean vowel duration for /a/ and /i/. With the correlation, we can ask whether the relationship between these means is predictable. That is, if a talker’s mean vowel duration for /a/ is long, is that talker’s mean for /i/ also long? This would be a positive correlation.

Let’s first process our data so we have the by-speaker, by-vowel means of the vowel duration.

spkr_vowel_means <- mcm %>% 
  group_by(Talker, Vowel) %>% 
  summarise(mean = mean(dur_v, na.rm = TRUE)) # Remove NAs from consideration this way, or use na.omit() on the data first
## `summarise()` has grouped output by 'Talker'. You can override using the
## `.groups` argument.
spkr_means_ai <- subset(spkr_vowel_means, Vowel == "a" | Vowel == "i") # Just look at /a/ and /i/

spkr_means_ai <- pivot_wider(spkr_means_ai, 
            names_from = Vowel, 
            values_from = mean)

Just from eyeballing this dataframe, it seems like /a/ is pretty consistently longer than /i/. This makes sense based on what we know about vowel-intrinsic duration: pronouncing lower vowels requires your jaw to open wider than for high vowels, so it takes longer for us to produce /a/ than /i/. We can carry out a quick t-test to see if this is indeed true.

t.test(spkr_means_ai$a, spkr_means_ai$i, paired = TRUE)
## 
##  Paired t-test
## 
## data:  spkr_means_ai$a and spkr_means_ai$i
## t = 17.841, df = 19, p-value = 2.512e-13
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  37.31523 47.23393
## sample estimates:
## mean difference 
##        42.27458

Yep, looks like /a/ is on average about 42 ms longer than /i/, and this difference is significant (p < 0.001). Let’s now create a quick visualization to see whether the correlation we suspect is really there.

ggplot(spkr_means_ai, aes(x = a, y = i)) + 
  geom_point()

See how those points seem to fall on a straight line? This suggests to us that the relationship between the mean vowel duration of /a/ and mean vowel duration of /i/ is very strong – it’s there! (What’s happening is probably that people have different baseline speech rates, and those who speak more slowly overall will have longer /a/ and longer /i/, while those who speak more quickly will have both shorter /a/ and shorter /i/.)

So let’s use cor() and/or cor.test() to get a measure of the strength of that relationship:

cor(spkr_means_ai$a, spkr_means_ai$i) # only r value
## [1] 0.9077989
cor.test(~ a + i, spkr_means_ai) # r and p value
## 
##  Pearson's product-moment correlation
## 
## data:  a and i
## t = 9.1831, df = 18, p-value = 3.26e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7776927 0.9633306
## sample estimates:
##       cor 
## 0.9077989

The correlation is very strong (magnitude) and positive (direction) (r = 0.91, p < 0.001). This indicates that as a talker’s mean vowel duration for /a/ increases, so does their mean for /i/, and in a highly predictable manner.

Linear Regression

Simple linear regression

In this example, we’ll predict a speaker’s mean F2 value from their mean F1, ignoring vowel information. We still have to use the speaker means because of the independence-of-observations assumption. From this model, we can obtain three major components:

  • the best-fit line (y = mx + b)
  • whether the effects of m (slope) and b (intercept) are significantly different from 0
  • how much variation in vowel F2 can be accounted for by vowel F1 (R^2)
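Before fitting the real model, it can help to see that lm() really does recover these components. Here is a quick sketch on simulated data (toy numbers, not the formant measures) where we build in a known line, y = 3x + 100, plus noise, and check that the fitted coefficients land near the true values:

```r
# Simulated sanity check: lm() should recover a known slope and intercept
set.seed(1)  # for reproducibility
sim_x <- runif(50, 0, 10)
sim_y <- 3 * sim_x + 100 + rnorm(50, sd = 2)  # true line: y = 3x + 100, plus noise

sim_fit <- lm(sim_y ~ sim_x)
coef(sim_fit)  # intercept near 100, slope near 3
```

The noisier the data, the further the estimates can drift from the true values, which is exactly what the standard errors in the summary quantify.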

First, we’ll get the speaker means and make a plot to see how linear the relationship is between these two variables. We’ll put F2 on the y-axis because that is the variable we are predicting, and we’ll include the best-fit regression line to visualize the regression model in 2D:

spkr_means <- mcm %>% 
  group_by(Talker) %>% 
  summarise(mean_f1 = mean(F1, na.rm = TRUE), 
            mean_f2 = mean(F2, na.rm = TRUE))

ggplot(spkr_means, aes(x = mean_f1, y = mean_f2)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  theme_classic() +
  xlab("speaker mean F1 (Hz)") + 
  ylab("speaker mean F2 (Hz)")

We can see from the line that we should expect a positive effect of F1 on F2 (positive slope). We can also see that the intercept is well above 0; however, this isn’t all that meaningful because F1 and F2 values are always well above 0 anyway. Some people will center each variable on 0, which makes the intercept a little more interpretable. We can use the scale() function, which takes three arguments: the vector to operate on, whether centering should be applied, and whether scaling should be applied. If scale is set to TRUE, it returns a z-score; if it’s set to FALSE, it just subtracts the mean from each point. We’ll set scale to FALSE and just center the data:

spkr_means$centered_f1 <- scale(spkr_means$mean_f1, center = TRUE, scale = FALSE)

ggplot(spkr_means, aes(x = centered_f1, y = mean_f2)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  theme_classic() +
  xlab("centered speaker mean F1 (Hz)") + 
  ylab("speaker mean F2 (Hz)")

From this visualization, we can now see that when the predictor (just centered F1 here) is 0, we get the mean of the F2 dimension (around 1778 Hz). So when the predictors are centered, the intercept is just the mean of the dependent (predicted) variable.
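You can verify this intercept-equals-mean fact on toy data (simulated values, not the formant measures): with a predictor centered on 0, the fitted intercept comes out exactly equal to the mean of the outcome.

```r
# Toy check: centered predictor -> intercept equals mean of the outcome
set.seed(2)
toy_y <- rnorm(30, mean = 500, sd = 50)  # arbitrary outcome values
toy_x <- rnorm(30)                       # arbitrary predictor values

toy_x_c <- scale(toy_x, center = TRUE, scale = FALSE)  # center only, no z-scoring
toy_fit <- lm(toy_y ~ toy_x_c)

coef(toy_fit)[1]  # the intercept...
mean(toy_y)       # ...equals the mean of the outcome
```

This follows from the least-squares formula for the intercept, b = ȳ − m·x̄: when x̄ = 0, the intercept is just ȳ.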

Now let’s actually get the information from the linear regression model:

fit_centered <- lm(mean_f2 ~ centered_f1, spkr_means)
summary(fit_centered)
## 
## Call:
## lm(formula = mean_f2 ~ centered_f1, data = spkr_means)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -254.61  -79.79  -28.12   94.60  235.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1778.0778    25.8224  68.858   <2e-16 ***
## centered_f1    1.2393     0.9602   1.291    0.213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 18 degrees of freedom
## Multiple R-squared:  0.08471,    Adjusted R-squared:  0.03386 
## F-statistic: 1.666 on 1 and 18 DF,  p-value: 0.2131

From this output, the key info is in the Coefficients table. You might also be interested in the Adjusted R-squared value, which ranges from 0 to 1 (technically, it can even dip slightly below 0 for very poor fits). In this case, it’s 0.03, which is ridiculously low: it means that F1 accounts for approximately 3% of the variation in F2. (My hunch is that vowel quality does a much better job of predicting F2 than speaker mean F1 does; the phoneticians should have the intuition that F1 varies independently of F2.)

This leads us to the next critical component of the model: is F1 a significant predictor of F2? Are they significantly related? Note that the Coefficients table has two lines, intercept and centered_f1; in other words, the b and the m in our equation y = mx + b. The value of m or b is in the Estimate column, so here it would be y = 1.24x + 1778.08. In other words, to get a speaker’s mean F2 from their centered mean F1, the best linear prediction is to multiply the F1 value by 1.24 and add 1778.08 Hz. This is cool; unfortunately, the model fit as shown in the adjusted R-squared value is awful, so we won’t be all that accurate.

Next to the Estimate column in the table, we get our standard error, our t-value, etc. These components are familiar from our confidence intervals and t-tests. In fact, each row of the Coefficients table is essentially its own t-test: the t value is the Estimate divided by its Std. Error, testing whether that coefficient differs from 0. The p-value for our critical predictor of F1 is p = 0.21. If our alpha is at 0.05 as is standard, then p = 0.21 is not less than 0.05, so F1 is not significantly related to F2. In other words, the F1 estimate of 1.24 is not significantly different from 0.
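To convince yourself of the t-test connection, here is a small sketch on simulated data (toy values, not the formant measures) showing that the t and p values in the summary can be reproduced by hand from the Estimate, the Std. Error, and the residual degrees of freedom:

```r
# Toy check: each row of the Coefficients table is a t-test on that coefficient
set.seed(3)
d <- data.frame(x = rnorm(25))
d$y <- 2 * d$x + rnorm(25)  # arbitrary relationship, for illustration

fit_toy <- lm(y ~ x, data = d)
tab <- summary(fit_toy)$coefficients

# t = Estimate / Std. Error; p from a t distribution with the residual df
t_manual <- tab["x", "Estimate"] / tab["x", "Std. Error"]
p_manual <- 2 * pt(abs(t_manual), df = fit_toy$df.residual, lower.tail = FALSE)

c(t_manual, tab["x", "t value"])   # same t either way
c(p_manual, tab["x", "Pr(>|t|)"])  # same p either way
```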

(Note that most people don’t bother to adjust for multiple comparisons within a single regression model, even if there are multiple predictors. For more complex models, the functions we’ll be learning in future tutorials usually do all the p-value adjustments for you automatically. As mentioned in class, you’ll also find various opinions in the statistics literature about when to correct your p-value.)

The intercept is significantly different from 0, with p < 0.05, but this is not that interesting. Anyone who knows anything about F2 knows that it will always be considerably higher than 0. Most studies will omit discussion of the intercept for this reason. That said, it’s cool to see that when you center your predictors, the intercept is interpretable as the mean of your dependent variable.

Practice with correlations and regressions

  1. Calculate the by-speaker means of M1 and MaxPF. Using these means, plot the relationship between these two measures of fricative frequency (both in Hz). Are they significantly correlated? How strongly?

  2. Run a simple linear regression predicting a talker’s MaxPF from their M1. Is MaxPF significantly predicted by M1?

  3. What is the multiple R-squared (unadjusted) from the linear regression model in Q2? Take its square root. Compare it to the Pearson’s r found for Q1. Feels good, huh?

  4. Now calculate the by-speaker, by-fricative means of M1 and MaxPF. Using these means, plot the relationship between these two measures of fricative frequency, with the colour of the best-fit regression line differing by fricative. Do some of the fricatives appear to have a stronger correlation between the two measures than others?

  5. Run a simple linear regression predicting a talker’s MaxPF from their M1, but separately for each fricative. For which fricative does a 1 Hz increase in M1 result in the largest predicted increase in MaxPF (that is, which slope is the most highly positive)? For which fricative does M1 best predict MaxPF?