R is a computer programming language that is particularly well-suited for statistical analysis and data visualisation, but it can also handle text processing and phonetic analysis, among other functions. For data science, R is typically considered one of the best programming languages alongside Python. More generally, it is one of the most popular programming languages, along with C, C++, Python, and JavaScript.
RStudio is a user-friendly interface for running the R language. If you click on just the R application, it displays what’s called a “console” that is a very simple interface that allows you to enter code interactively. You write a line, and it prints out the response, very much like a calculator. RStudio also has a console at the bottom left space of your screen. You’ll see that each line begins with a ‘>’ and a blinking cursor; this ‘>’ means you should start entering your code there.
In addition to having a console down below, RStudio also has a display window that takes up the upper left portion of the screen. Here you view objects like your datasets and scripts (which contain saved code). This is one of the most useful aspects of RStudio.
In the upper righthand corner is your environment window. This lists all of the objects you have saved in your workspace. These objects might be things like datasets, general variables, etc.
Finally, in the lower righthand corner is another display window with a few tabs: Files, Plots, Packages, Help, and Viewer. We won’t cover all of these, but as the module progresses, you’ll see that any plot you create will be displayed in this window. If you need to look up how a specific R function works, its manual page will also be displayed here, in the Help tab.
Note that you can customize how this all looks! Is the font size too small, or does the bright white screen bother your eyes? Feel free to customize RStudio to work best for you. Under RStudio > Preferences (on a Mac) or Tools > Global Options (on Windows), you can find the Appearance option to change aspects of the editor window.
Now that you’ve had a very brief orientation, let’s get started! For more detail, I recommend the introductions to R and RStudio in YaRrr! The Pirate’s Guide to R as well as Bodo Winter’s textbook Statistics for Linguists: An Introduction Using R.
3+5
3 + 5
3 +5
3-5
3*2
3/2
3^2
3^0.5
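R follows the mathematical order of operations. It first evaluates anything inside parentheses, then exponents, then multiplication and division (from left to right), and finally addition and subtraction (from left to right):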
3+5*2
(3+5)*2
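To see the order of operations at work with exponents as well, here is a small sketch. The variable names (a, b, neg_first, neg_last) are just made up for illustration, and the arrow <- used to save the results is introduced a little later in this module.

```r
# Exponents are evaluated before multiplication, and multiplication before addition
a <- 2 + 3 * 4^2      # 4^2 first (16), then 3 * 16 (48), then 2 + 48
b <- (2 + 3) * 4^2    # parentheses first (5), then 5 * 16

# The exponent is also evaluated before a leading minus sign
neg_first <- -2^2     # read as -(2^2)
neg_last <- (-2)^2    # parentheses force the minus to apply first

a          # 50
b          # 80
neg_first  # -4
neg_last   # 4
```

When in doubt, adding parentheses never hurts and makes your intention clear to anyone reading the code.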
R has several built-in functions that you can use. Try not to overwrite built-in functions when you start to use variables. A function (the thing just before the parentheses) takes at least one argument, and can frequently also take more than one argument. The following functions demonstrate the use of one argument:
Square root function:
sqrt(4)
Square root expressed with an operator instead of the function (since the square root is the same as raising a number to the power of 1/2):
4^(1/2)
As for logarithmic functions (i.e. the inverse of the exponential): log functions can have different bases. The default log(x) uses base e, the natural log (aka ln); you can also use log10(x), which is log base 10. Related built-in functions include the trigonometric functions sin() and tan() and the exponential function exp().
Right now, you do not need to remember what all of these functions mean. The goal is simply to see how R works, how functions in R work, and how operators in R work. One of the main uses of R is as a very powerful calculator, which is exactly what one needs for statistical analysis.
log(1)
log(2)
log10(2)
sin(2)
tan(2)
exp(2)
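The functions above all take one argument. As a rough sketch of what a function with more than one argument looks like, log() accepts an optional second argument called base, and round() accepts digits. The variable names here are only for illustration.

```r
# log() with a second argument: the base of the logarithm
log2_of_8 <- log(8, base = 2)        # 2^3 = 8, so this is 3
log10_of_100 <- log(100, base = 10)  # same result as log10(100)

# exp() is the inverse of the natural log, so this gets us back to 5
back_to_five <- exp(log(5))

# round() takes the number to round and how many decimal places to keep
rounded <- round(3.14159, digits = 2)

log2_of_8
log10_of_100
back_to_five
rounded
```

Arguments are separated by commas, and naming them (base = 2, digits = 2) makes the code easier to read.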
R can also evaluate comparison statements, using == (equal to), != (not equal to), < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to). These statements will return a value of TRUE or FALSE (also known as a “logical”).
Why’s this useful? When you start extending this to lots of data, you might want to generate truth values on two whole columns of data at once. One practical example is calculating accuracy in a dataset: did a participant’s response (indicated in column 1) match the actual answer (indicated in column 2)?
1 == 2
1 == 1
"linguistics" == "lingiustics"
1 != 2
1 != 1
"linguistics" != "lingiustics"
1 < 2
1 > 2
1 <= 2
2 >= 2
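Here is a minimal sketch of the accuracy idea, using two short made-up vectors in place of real dataset columns. The names response, answer, correct and accuracy are hypothetical, and the c() function used to build the vectors is explained further down.

```r
# Made-up data: what each participant answered vs. the correct answer
response <- c("cat", "dog", "dog", "mouse")
answer   <- c("cat", "dog", "cat", "mouse")

# == compares the two vectors element by element
correct <- response == answer
correct    # TRUE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion correct
accuracy <- mean(correct)
accuracy   # 0.75
```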
Ok, as much as this is fun, R is a programming language and in order to use its full potential (and do interesting things), we need to assign values to objects, i.e. create variables. To create a variable, we give it a name followed by the ‘assign’ operator <-, which is the assignment operator most commonly used in the R community. In RStudio on Windows and Linux, a keyboard shortcut for this operator is Alt + -; on Mac, it’s Option + -. The operator is directional (the arrow points at the variable name), so you can write either 5 + 3 -> x or x <- 5 + 3.
5 + 3
x <- 5 + 3
5 + 3 -> y
= is another assignment operator, but it can be dodgy because it is also used for other things (such as naming arguments inside function calls) and is easily confused with ==, which means ‘equal to’. So let’s stick to <-. Unlike the arrow, = is not directional: the variable name must be on the left.
z = 5 + 3
As a consequence, the statement below won’t work:
5 + 3 = z
Note that you can reuse the variable to update the variable (as in x <- x+2):
x <- x+1
x <- x*100
y <- x*25
vot <- 25
vot
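Bad examples of variable names below. Names cannot contain operators such as *, so vot*2 is read as ‘vot times 2’ rather than as a name, and vot* <- 52 produces an error: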
vot*2
vot* <- 52
Good examples below:
SetA <- 10
setA <- 20
setA <- "cats"
seta <- "cats"
Important: R is case sensitive, so setA and seta will be considered different variables!
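As a quick sketch of what case sensitivity means in practice, the two variables below hold different values even though their names differ only in one capital letter. The exists() check assumes you have not created an object called SETA yourself.

```r
setA <- 20
seta <- "cats"

setA             # 20
seta             # "cats"

# exists() reports whether an object with exactly this name has been created
exists("setA")   # TRUE
exists("SETA")   # FALSE, assuming no object with this exact name exists
```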
Use ls() to list the items that are currently in your environment/workspace. Personally, I don’t use this much, because you can also see this information by just looking in the Environment tab in the upper-right corner of the screen in RStudio.
ls()
To view an object in the display window, you can use the View() function. As always, capitalisation matters! Again, I don’t use this one much because you can also do the same thing by clicking an object in the Environment tab in the upper-right corner of the screen in RStudio.
View(y)
A vector is a one-dimensional list of items of the same type. Vectors are the base structure of R and you can think of them as columns in a spreadsheet. As an example, you can create a range of integers with the colon :
x <- 1:5
x
y <- 20:27
y
A more general sequence of numbers, including non-integers (also called doubles or floats), can be created with seq(). Here we give seq() two arguments: a start value and an end value. You can also specify a third argument, by, indicating the interval between subsequent numbers.
x <- seq(1, 5)
x
x <- seq(1, 5, by=0.1)
You can get the length of a vector (the number of elements inside it) with length():
length(x)
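As a small sketch of how the interval affects the length of the result, the example below builds the same sequence in two ways. The variable names are made up, and length.out is an additional seq() argument (not covered above) that asks for a fixed number of evenly spaced values instead of a fixed interval.

```r
# Steps of 0.5 from 1 to 5
by_half <- seq(1, 5, by = 0.5)

# Nine evenly spaced values from 1 to 5 (which works out to the same steps)
nine_points <- seq(1, 5, length.out = 9)

length(by_half)       # 9
length(nine_points)   # 9

# The colon gives integers, while a sequence with decimal steps gives doubles
typeof(1:5)           # "integer"
typeof(by_half)       # "double"
```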
You can use an index (numerical position) within square brackets to get the value of the cell at that position. A vector is one-dimensional, so there’s just one index inside the brackets. When we start trying to access cells within data frames, which have two dimensions (rows and columns), we’ll start to use two numbers inside the brackets.
x <- 1:50
x[20]
x[25]
You can also access parts of the vector with comparison operators or ranges. Here we’re starting to combine functions:
x[x>25]
x[20:25]
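It can help to see what is happening inside the brackets in x[x>25]. In this sketch (is_big is just an illustrative name), the comparison first produces a vector of TRUE/FALSE values, and the brackets then keep only the elements where the value is TRUE. The & operator, which combines two conditions, is an addition not covered above.

```r
x <- 1:50

# The comparison alone gives one TRUE/FALSE per element of x
is_big <- x > 45
typeof(is_big)        # "logical"

# Using that logical vector as an index keeps only the TRUE positions
x[is_big]             # 46 47 48 49 50

# & requires both conditions to be TRUE
x[x > 20 & x <= 25]   # 21 22 23 24 25

# Since TRUE counts as 1, sum() counts how many elements pass the test
sum(x > 25)           # 25
```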
Before, we created vectors by specifying a range of numbers. We can also create a vector that is an arbitrary list of elements with the c() function. In this case, c stands for combine, concatenate, collect or whatever word makes most sense to you.
y <- c(1, 5, 9, 11, 2)
y
We can also create a vector of strings; strings must be in quotes!
y <- c("cat", "dog", "mouse")
y
x
We can then use this c() function to pick out items 1, 20 and 40 in a vector (an arbitrary set of indices):
x[c(1, 20, 40)]
There are 5 main types of data in R: double (also called numeric), integer, character, logical, and complex.
You can find out which type of data R thinks you’re working with by using the function typeof(). IMPORTANT: R assumes that all elements in a vector are of the same type. You cannot have both a character and a numeric in the same vector. This will become important when we start thinking of individual columns in a dataset as vectors.
x <- 5
typeof(x)
myIntegers <- 1:5
typeof(myIntegers)
myCharacters <- c("cat", "dog", "idunno")
typeof(myCharacters)
myLogicals <- c(TRUE, FALSE)
myLogicals
typeof(myLogicals)
myLogicals <- c("TRUE", "FALSE")
typeof(myLogicals)
test <- 1 == 1
typeof(test)
test2 <- 1+2i
typeof(test2)
What type will the following vector have?
cat_vector <- c("cat", 10.2, 1)
Recall that a vector is a one-dimensional list of elements, like a column in a spreadsheet. The elements in a vector in R must all be of the same type. If there is any ambiguity in the type of element in the vector, R uses the following ranking to ensure uniformity in the data type:
character > complex > numeric (double) > integer > logical
This means that if just one element in the vector is a character, then all the elements will be characters (even if you thought you were entering integers or numerics).
x <- c(5, "cat", 62.1)
typeof(x)
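To make the ranking concrete, here is a sketch that mixes two types at a time and checks which one wins. The L suffix (as in 1L) is a way of writing an integer explicitly; it is not covered above and is only used here so that the integer case can be shown.

```r
typeof(c(TRUE, 1L))       # logical + integer    -> "integer"
typeof(c(1L, 2.5))        # integer + double     -> "double"
typeof(c(2.5, 1+2i))      # double + complex     -> "complex"
typeof(c(1+2i, "cat"))    # complex + character  -> "character"

# With a character present, every element is turned into a string
mixed <- c(TRUE, 2.5, "cat")
mixed                     # "TRUE" "2.5" "cat"
```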
Vectors can be coerced into other data types using the as.X functions: (as.numeric, as.character, as.integer, as.logical)
y <- c(5, "NA", 62.1)
typeof(y)
as.numeric(y)
typeof(y)
y <- as.numeric(y)
y/2
You can check the data type using typeof() or is.X (is.numeric, is.character, etc.):
is.numeric(y)
is.character(y)
typeof(y)
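The other as.X functions work in the same way as as.numeric(). A brief sketch with made-up values (the variable names are only for illustration):

```r
# as.integer() drops the decimal part rather than rounding
whole <- as.integer(62.9)
whole                       # 62

# as.character() turns a number into a string
as_text <- as.character(62.1)
as_text                     # "62.1"

# as.logical() understands "TRUE" and "FALSE"; anything else becomes NA
flags <- as.logical(c("TRUE", "FALSE", "cat"))
flags                       # TRUE FALSE NA

# Logicals become 1 (TRUE) and 0 (FALSE) when coerced to numeric
as.numeric(TRUE)            # 1
```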
A note on NA: NA stands for “Not Available” and marks a missing value; it will be treated as an empty cell. It doesn’t need quotes and it can coexist with numbers. It’s a special value (not a string), so it can sit in a numeric vector and R won’t be annoyed. However, if NA is in quotes, like "NA", it will be treated as a character.
The following vector will be a character vector because of the quotes around NA:
x <- c(100, 200.2, 120.6, "NA")
typeof(x)
We can force that vector to be numeric and NA is readily interpretable as simply an empty cell:
x <- as.numeric(x)
typeof(x)
new_x <- x - 5
new_x
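As a sketch of how NA behaves once it is in a numeric vector: calculations involving NA give NA back, so summary functions need to be told to skip it. The functions is.na() and the na.rm argument are additions not covered above, and the vector name v is made up.

```r
# NA without quotes sits happily in a numeric vector
v <- c(100, 200.2, 120.6, NA)
typeof(v)               # "double"

# is.na() marks which elements are missing
is.na(v)                # FALSE FALSE FALSE TRUE

# Any calculation that touches the NA returns NA
mean(v)                 # NA

# na.rm = TRUE removes the NA before calculating
mean(v, na.rm = TRUE)
```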
Side note: When coercing a character vector to a numeric, sometimes you need to wrap the vector in as.character() before doing the numeric conversion. This arises when the vector is actually a FACTOR in R. You don’t need to know what a factor is yet, but just know that if you try to do numeric type coercion, and R returns a bunch of unexpected 1, 2, 3 etc. to you instead of your original numbers, then you should try wrapping the vector in as.character() first, as in the last two lines below:
z <- c(2, 20, 200)
z <- factor(z)
z
notgood <- as.numeric(z)
notgood
good <- as.numeric(as.character(z))
good
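If you are curious why this happens, the sketch below peeks inside a factor. A factor stores a set of labels (its levels) plus an integer code for each element saying which label it has, and as.numeric() returns those codes. The name f is made up, and levels() is a function not covered above.

```r
f <- factor(c(2, 20, 200))

# The labels are stored as strings
levels(f)                     # "2" "20" "200"

# Each element is stored as a code pointing at one of the labels
as.integer(f)                 # 1 2 3

# Going through the labels recovers the original numbers
as.numeric(as.character(f))   # 2 20 200
```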
There are a few ways to read in a dataset. One is to use the GUI (Graphical User Interface) in RStudio: From the Environment window in RStudio, select Import Dataset –> From Text (base). This will let you navigate to a file and select how you want to name it.
An alternative is to type the code and the path directly. The path to the file goes in quotes. Note that using the Import Dataset option will automatically produce this code for you and run it in the console.
An important note about paths: The path to a file is the computer address of your file. It is important that you get the address right so the computer can locate the file. A useful life tip is to keep your project files and, more generally, your computer very organized so that the paths are easy (or at least intuitive) to type out. It seems that R can sometimes handle spaces in pathnames, but in general be wary of using spaces because they can cause problems. So instead of giving your folders titles like LING 3300 R files, try to get into the habit of making them more like LING3300_R, without spaces.
Important note part two: You may need to change \ symbols to / depending on your computer. This sometimes causes issues on Windows machines in particular, because Windows paths are usually written with \, but inside an R string \ is a special “escape” character. Replacing each \ with / (or doubling it, as in \\) avoids the problem.
Let’s now read the jury_data_basic.csv dataset file into R, and assign it the name jury. You can do this by replacing the path in the following code with your own path to the data, or by finding it through RStudio’s GUI. If you do it through the GUI, make sure you tell it that heading is set to Yes; that way it will know that the first line of the dataset is the column titles.
jury <- read.csv("/Users/Thomas/Documents/York (Canada)/LING 3300/Datasets/Jury/jury_data_basic.csv", header = TRUE)
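If you want to try reading a file without worrying about where the course data lives on your computer, here is a self-contained sketch. Everything in it is made up for illustration (the file name demo_data.csv, the tiny dataset, and the variable names); it writes a small file to R’s temporary folder and reads it straight back in. The functions file.path(), tempdir(), write.csv() and file.exists() are not covered above.

```r
# file.path() joins folder and file names with the right separator
demo_path <- file.path(tempdir(), "demo_data.csv")

# A tiny made-up dataset, saved as a .csv file
demo <- data.frame(Participant = c("p1", "p2"), Score = c(3, 5))
write.csv(demo, demo_path, row.names = FALSE)

# file.exists() is a handy check when R says it cannot find your file
file.exists(demo_path)   # TRUE

# Read it back in, treating the first line as column titles
demo_in <- read.csv(demo_path, header = TRUE)
demo_in
```

If file.exists() returns FALSE for your own path, the problem is in the path (a typo, a missing folder, or a backslash) rather than in read.csv().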
Take a look at the dataset using View(). You can also view the dataset by clicking on it in the RStudio Environment window.
View(jury)
While we won’t use these functions too frequently, you can see the top of the dataset in the console using head() and bottom using tail(). These functions are important if you ever find yourself in a situation where you can’t use RStudio and have to use a basic console.
head(jury)
tail(jury)
Get the dimensions of the dataset using nrow() (number of rows) and ncol() (number of columns). Notice that you can also find this information by glancing at the Environment window.
nrow(jury)
ncol(jury)
Get a summary of all the columns in the dataset using summary(). If a column is numeric, R will tell you the minimum and maximum, the quartiles (a type of quantile which divides the data points into four parts, or quarters), the mean, and the number of NAs, if any.
summary(jury)
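The same inspection functions can be tried on a tiny made-up dataset, which makes it easier to check the output by eye. In this sketch, toy and its columns are hypothetical, data.frame() is a way of building a dataset by hand, and dim() is an extra function that returns rows and columns together.

```r
# A made-up dataset with four rows and two columns
toy <- data.frame(
  Participant = c("p1", "p2", "p3", "p4"),
  Score = c(3, 5, NA, 4)
)

nrow(toy)            # 4
ncol(toy)            # 2
dim(toy)             # 4 2 (rows, then columns)

# head() takes an optional second argument: how many rows to show
head(toy, 2)

# For a numeric column, summary() reports the quartiles, mean and number of NAs
summary(toy$Score)
```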
Remember that to get an element in a specific location from a vector, we could give it the index in square brackets. So if we wanted to get the 3rd element in the vector x, then we would type:
x[3]
We can extend this logic to datasets, but now we are in a two-dimensional space. The dataset has Rows and Columns, so we need to provide the coordinates in square brackets, with Rows before Columns (R before C because we’re using R). When both the row and the column are specified, R will return the individual cell.
jury[3,2]
jury[1000, 10]
We can also extract an entire row by leaving the column specification blank (the location AFTER the comma):
jury[1000,]
myrow <- jury[1000,]
View(myrow)
We can extract a subset of the dataset by specifying a RANGE of rows using the colon:
myrow <- jury[1000:1100,]
View(myrow)
We can also extract an entire column by leaving the row specification blank (the location BEFORE the comma):
mycol <- jury[,10]
View(mycol)
We can get a set of columns using our vector notation c(). We can get columns 1, 5 and 7 by specifying it as follows:
mycol <- jury[,c(1,5,7)]
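Here is the same row-and-column logic on a small made-up dataset, where you can verify each result by looking at the data. The dataset toy and its column names are hypothetical.

```r
toy <- data.frame(
  Participant = c("p1", "p2", "p3"),
  Accent = c("A", "B", "A"),
  Score = c(3, 5, 4)
)

# Row 2, column 3: a single cell
toy[2, 3]            # 5

# Row 2, all columns
toy[2, ]

# All rows, column 3: comes back as a vector
toy[, 3]             # 3 5 4

# Rows 1 to 2 of columns 1 and 3: comes back as a smaller dataset
small <- toy[1:2, c(1, 3)]
small
```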
The dollar sign is incredibly useful when it comes to dealing with columns in a dataset! The dollar sign allows you to name the column that you want to refer to. For example, we can identify and extract the column “SimilarityAnswer” from the jury dataset by calling jury$SimilarityAnswer:
mycol <- jury$SimilarityAnswer
View(mycol)
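As a sketch of how the dollar sign relates to the bracket notation, the three lines below all pull out the same column from a made-up dataset (toy and its columns are hypothetical). Putting a column name in quotes inside the brackets is a variant not covered above.

```r
toy <- data.frame(
  Participant = c("p1", "p2", "p3"),
  Score = c(3, 5, 4)
)

by_dollar   <- toy$Score        # by name, with the dollar sign
by_name     <- toy[, "Score"]   # by name, inside brackets
by_position <- toy[, 2]         # by position

identical(by_dollar, by_name)       # TRUE
identical(by_dollar, by_position)   # TRUE

# Because the result is a vector, it can go straight into a function
mean(toy$Score)                     # 4
```

Referring to columns by name tends to be safer than by position, since the code keeps working even if the column order changes.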
We can also create new columns. For instance, if we wanted to create a column for AjustedSimilarityAnswer that takes the SimilarityAnswer given by the participant and multiplies it by 10, we could do this:
jury$AjustedSimilarityAnswer <- jury$SimilarityAnswer*10
View(jury)
You can delete a column by assigning NULL to it. As with many functions in R, there are multiple ways to do this. Assigning NULL to it is something we can do in “base-R”, or the syntax that R comes along with; I prefer to use a different method for deleting columns, using syntax from the “tidyverse” package. We’ll introduce that next week.
jury$AjustedSimilarityAnswer <- NULL
View(jury)
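To check that a column really has been added or removed without opening the viewer, you can list the column names with names(). A minimal sketch on a made-up dataset (toy, Score and ScoreTimesTen are hypothetical names; names() is not covered above):

```r
toy <- data.frame(Score = c(3, 5, 4))

# Create a new column from an existing one
toy$ScoreTimesTen <- toy$Score * 10
names(toy)             # "Score" "ScoreTimesTen"
toy$ScoreTimesTen      # 30 50 40

# Delete it again by assigning NULL
toy$ScoreTimesTen <- NULL
names(toy)             # "Score"
```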
We can paste strings together using the paste() function. In this example, we’ll create a single column that contains information on what the pair of accents presented to the listener was. To do this, we can paste together the Audio1Accent and Audio2Accent columns with an underscore separating the two elements:
jury$AccentPair <- paste(jury$Audio1Accent, jury$Audio2Accent, sep="_")
View(jury)
The “sep=” argument refers to the character that should separate the two elements being pasted together. Since we have quotes wrapped around an underscore ("_"), the underscore will separate the two elements. If we wanted nothing separating the two elements, then we could have quotes wrapped around nothing:
jury$AccentPair <- paste(jury$Audio1Accent, jury$Audio2Accent, sep="")
View(jury)
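Here is the effect of different sep values on two short made-up vectors, so the results can be seen directly in the console. The vectors first and second are hypothetical stand-ins for the two accent columns, and paste0() is a related function not covered above.

```r
first  <- c("US", "UK")
second <- c("UK", "AU")

# paste() works element by element, just like it does on dataset columns
paste(first, second, sep = "_")   # "US_UK" "UK_AU"
paste(first, second, sep = "")    # "USUK" "UKAU"

# paste0() is a shortcut for sep = ""
paste0(first, second)             # "USUK" "UKAU"

# If sep is left out, the default separator is a single space
paste(first, second)              # "US UK" "UK AU"
```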
Before starting this practice, we’re going to set up a way to save the code we write. To do this, we’re going to create an “R script”. This is a simple text document that contains R code and has a .R extension (like .txt or .docx, etc.).
An easy way to create an R script is through RStudio itself. Select the File menu –> New File –> R Script, and a new R script should pop up in your RStudio View window (upper left). You can now paste your code in this script, and re-run it at a later date. To include non-code comments, use the # symbol.
For example, you might copy out some of the questions and put the answer in code below it, like the following:
# LING 3300 Week 1 Practice
# Thomas Kettig
# January 8, 2026
# Practice with unit variables
# 1. Create a variable called "cats". Assign the value 200 to it.
cats <- 200
# 2. Create a second variable called "dog". Assign the value 100 to it.
There’s the example for setting your R script up, and the answer to the first question.
Disclaimer: Some of these original materials were put together by Eleanor Chodroff and Elisa Passoni for the University of York. Thomas Kettig then inherited them and modified them as needed, particularly based on notes by Nathan Sanders from the University of Toronto. The R software and the packages are distributed under the terms of the GNU General Public License, either Version 2, June 1991 or Version 3, June 2007 (run the command licence() for more information).
Comments
Comments are made with the hash mark (#). The hash mark tells R not to evaluate anything written from the # to the end of the line.
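A short sketch of the three places a comment typically shows up (the variable total is made up for illustration):

```r
# This whole line is a comment, so R skips it
total <- 3 + 5   # a comment can also follow code on the same line

# Putting a # in front of code switches it off without deleting it:
# total <- 0

total            # 8
```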