Introduction to R and RStudio

R is a computer programming language that is particularly well-suited for statistical analysis and data visualisation, but also can handle text processing and phonetic analysis, among other functions. For data science, R is typically considered one of the best programming languages alongside Python; it is one of the most popular programming languages along with C, C++, Python, and JavaScript.

RStudio is a user-friendly interface for running the R language. If you click on just the R application, it displays what’s called a “console” that is a very simple interface that allows you to enter code interactively. You write a line, and it prints out the response, very much like a calculator. RStudio also has a console at the bottom left space of your screen. You’ll see that each line begins with a ‘>’ and a blinking cursor; this ‘>’ means you should start entering your code there.

In addition to having a console down below, RStudio also has a display window that takes up the upper left portion of the screen. Here you view objects like your datasets and scripts (which contain saved code). This is one of the most useful aspects of RStudio.

In the upper righthand corner is your environment window. This lists all of the objects you have saved in your workspace. These objects might be things like datasets, general variables, etc.

Finally, in the lower righthand corner is another display window with a few tabs: Files, Plots, Packages, Help, and Viewer. We won’t cover all of these, but as the module progresses, you’ll see that any plot you create will be displayed in this window. If you need to look up how a specific R function works, the manual page for it will be displayed here in the Help tab.

Note that you can customize how this all looks! Is the font size too small, or does the bright white screen bother your eyes? Feel free to customize RStudio to work best for you. Under RStudio > Preferences (on a Mac) or Tools > Global Options (on Windows), you can find the Appearance option to change aspects of the editor window.

Now that you’ve had a very brief orientation, let’s get started! For more detail, I recommend the introductions to R and RStudio in YaRrr! The Pirate’s Guide to R as well as Bodo Winter’s textbook Statistics for Linguists: An Introduction Using R.

R as a calculator

Mathematical operators

  • + addition, - subtraction, * multiplication, / division, ^ power, () parentheses / bracketing
  • these characters are generally reserved for these mathematical operations
  • spacing generally doesn’t matter, but please keep your script clean - messy scripts are incredibly difficult to read and follow and more likely to contain mistakes
3+5
3 + 5
3                      +5
3-5
3*2
3/2
3^2
3^0.5

R follows the mathematical order of operations. From left to right, it will first evaluate:

  • elements in parentheses, brackets
  • exponents, powers, indices, logs (chained powers are grouped from right to left, so 2^3^2 means 2^(3^2))
  • multiplication/division (from left to right)
  • addition/subtraction (from left to right)
3+5*2
(3+5)*2
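
Two details of the order of operations are easy to get wrong: chained powers and negative signs. The lines below are a small sketch you can paste into the console; the comments describe how R groups each expression.

```r
2^3^2      # grouped as 2^(3^2), so the result is 512
(2^3)^2    # parentheses force the left power first: 64
-2^2       # the power is applied before the minus sign: -4
(-2)^2     # parentheses make the base negative: 4
```

When in doubt, add parentheses; they cost nothing and make the intended order obvious to a reader.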

Mathematical functions

R has several built-in functions that you can use. Try not to overwrite built-in functions when you start to use variables. A function (the thing just before the parentheses) takes at least one argument, and can frequently also take more than one argument. The following functions demonstrate the use of one argument:

Square root function:

sqrt(4)

Square root expressed with an operator instead of the function (since the square root is the same as raising a number to the power of 1/2):

4^(1/2)

As for logarithmic functions (i.e. the inverse of the exponential): log functions can have different bases. The default log(x) uses base e, the natural log (aka ln); you can also specify log10(x) which is log base 10. Related built-in functions include:

  • sine function: sin(x)
  • cosine function: cos(x)
  • tangent function: tan(x)
  • exponent function (e to the power of x; inverse of log(x)): exp(x)

Right now, you do not need to remember what all of these functions mean yourself. The goal is simply to see how R works, how functions in R work, and how operators in R work. One of the main uses of R is as a very powerful calculator, which is exactly what one needs for statistical analysis.

log(1)
log(2)
log10(2)
sin(2)
tan(2)
exp(2)
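
Functions can also take more than one argument. As a small sketch, log() accepts an optional second argument called base, and the last lines show that exp() undoes log() and that functions and operators can be mixed freely.

```r
log(8, base = 2)      # log base 2 of 8
log(100, base = 10)   # the same value as log10(100)
log10(100)
log(exp(2))           # exp() and log() are inverses, so this gives back 2
sqrt(3^2 + 4^2)       # functions and operators can be combined
```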

Comparing numbers and strings

Why’s this useful? When you start extending this to lots of data, you might want to generate truth values on two whole columns of data at once. One practical example is calculating accuracy in a dataset: did a participant’s response (indicated in column 1) match the actual answer (indicated in column 2)?

  • is equal to: ==
  • is not equal to: != (the exclamation point typically means NOT)
  • less than: <
  • greater than: >
  • less than or equal to: <=
  • greater than or equal to: >=

These statements will return a value of TRUE or FALSE (also known as a “logical”).

1 == 2
1 == 1
"linguistics" == "lingiustics"
1 != 2
1 != 1
"linguistics" != "lingiustics"
1 < 2
1 > 2
1 <= 2
2 >= 2

Variables

Ok, as much as this is fun, R is a programming language and in order to use its full potential (and do interesting things), we need to assign values to objects, i.e. create variables. To create a variable, we need to give it a name followed by the ‘assign’ operator <-. This is the most commonly used one in the R community. On Windows and Linux, a keyboard shortcut for this operator is ALT + -; on Mac, it’s Option + -. It is directional, so you can use 5 + 3 -> x or x <- 5 + 3

5 + 3
x <- 5 + 3
5 + 3 -> y

= is another assignment operator (but it can be dodgy as it is also used for other things, such as setting arguments inside functions, and it is easily confused with ==, ‘is equal to’ - so let’s stick to the other one). It is not directional. The variable name must be on the left.

z = 5 + 3

As a consequence, the statement below won’t work:

5 + 3 = z

Note that you can reuse the variable to update the variable (as in x <- x+2):

x <- x+1
x <- x*100
y <- x*25

Constraints on variable names

  • Variable names must begin with a letter and only contain letters, numbers, _ (underscore), or . (period). (A name can technically also begin with a period, but this is best avoided.)
  • Spaces are not allowed
  • To make your life easier, do not name your columns with spaces (R will automatically change spaces into . when you import your file, and you will most likely end up wasting your time trying to figure out why your variable names return errors)
  • To make your life easier, do not use spaces in any variable or filename on the computer
vot <- 25
vot

Bad examples below (a * cannot be part of a name, because R reads it as multiplication):

vot*2
vot* <- 52

Good examples below:

SetA <- 10
setA <- 20
setA <- "cats"
seta <- "cats"

Important: R is case sensitive, so setA and seta will be considered different variables!
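
If you are curious what R does with names that break these rules, one option is the base function make.names(), which turns arbitrary text into a valid name. read.csv() applies the same kind of cleaning to column titles by default. The example names below are invented for illustration.

```r
make.names("reaction time")  # the space becomes a period: "reaction.time"
make.names("2nd_try")        # names cannot start with a number, so an X is added: "X2nd_try"
make.names("vot")            # already valid, so it is returned unchanged
```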

Some useful functions and notations

Comments

Comments are made with a hash mark (#). The hash mark tells R not to evaluate anything written from the # to the end of the line.

# hi, this is more of a comment than a question

Listing objects

Use ls() to list the items that are currently in your environment/workspace. Personally, I don’t use this much, because you can also see this information by just looking in the Environment tab in the upper-right corner of the screen in RStudio.

ls()

Viewing objects

To view an object in the display window, you can use the View() function. As always, capitalisation matters! Again, I don’t use this one much because you can also do the same thing by clicking an object in the Environment tab in the upper-right corner of the screen in RStudio.

View(y)

Vectors

A vector is a one-dimensional list of items of the same type. Vectors are the base structure of R and you can think of them as columns in a spreadsheet. As an example, you can create a range of integers with the colon :

x <- 1:5
x
y <- 20:27
y

A sequence of numbers, including non-integer values (stored as doubles, also called floats), can be created with seq(). The function seq() is usually given at least two arguments: a start value and an end value. You can also specify a third argument, by, indicating the interval between subsequent numbers.

x <- seq(1, 5)
x
x <- seq(1, 5, by=0.1)

You can get the length of a vector (number of elements inside it) with length()

length(x)
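
As a quick sketch of how the seq() arguments interact: the by argument fixes the step size, while length.out (another built-in argument of seq()) fixes how many values you get back. The variable names here are only for illustration.

```r
steps <- seq(1, 5, by = 0.5)         # 1.0, 1.5, 2.0, ... up to 5.0
steps
length(steps)                        # 9 values

evenly <- seq(0, 1, length.out = 5)  # five evenly spaced values from 0 to 1
evenly
length(evenly)                       # 5 values
```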

Accessing parts of a vector

You can use an index (numerical position) within square brackets to get the value of the cell at that position. A vector is one-dimensional, so there’s just one index inside the brackets. When we start trying to access cells within data frames, which have two dimensions (rows and columns), we’ll start to use two numbers inside the brackets.

x <- 1:50
x[20]
x[25]

You can also access parts of the vector with comparison operators or ranges. Here we’re starting to combine functions:

x[x>25]
x[20:25]
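
It can help to look at the two steps separately. The comparison inside the brackets produces a vector of TRUE/FALSE values, and the brackets then keep only the positions that are TRUE. Here is a small sketch with a shorter vector (the name small is just for this example):

```r
small <- 1:6
small > 3           # FALSE FALSE FALSE TRUE TRUE TRUE
small[small > 3]    # keeps only the elements where the comparison is TRUE
sum(small > 3)      # TRUE counts as 1, so this counts how many elements match
```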

Before, we created vectors by specifying a range of numbers. We can also create a vector that is an arbitrary list of elements with the c() function. In this case, c stands for combine, concatenate, collect or whatever word makes most sense to you.

y <- c(1, 5, 9, 11, 2)
y

We can also create a vector of strings; strings must be in quotes!

y <- c("cat", "dog", "mouse")
y
x

We can then use this c() function to pick out items 1, 20 and 40 in a vector (an arbitrary set of indices):

x[c(1, 20, 40)]

Data types in R

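There are 5 main types of data in R:

  • character: text strings, e.g. "cat"
  • complex: complex numbers, e.g. 1+2i
  • numeric (double): numbers that may have decimals, e.g. 10.2
  • integer: whole numbers, e.g. the values in 1:5
  • logical: TRUE or FALSE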

You can find the type of data R thinks you’re working with using the function typeof(). IMPORTANT: R assumes that all elements in a vector are of the same type. You cannot have both a character and a numeric in the same vector. This will become important when we start thinking of individual columns in a dataset as vectors.

x <- 5
typeof(x)
myIntegers <- 1:5
typeof(myIntegers)

myCharacters <- c("cat", "dog", "idunno")
typeof(myCharacters)

myLogicals <- c(TRUE, FALSE)
myLogicals
typeof(myLogicals)

myLogicals <- c("TRUE", "FALSE")
typeof(myLogicals)

test <- 1 == 1
typeof(test)

test2 <- 1+2i
typeof(test2)

What type will the following vector have?

cat_vector <- c("cat", 10.2, 1)

Data types with respect to vectors

Recall that a vector is a one-dimensional list of elements, like a column in a spreadsheet. The elements in a vector in R must all be of the same type. If there is any ambiguity in the type of element in the vector, R uses the following ranking to ensure uniformity in the data type:

character > complex > numeric (double) > integer > logical

This means that if just one element in the vector is a character, then all the elements will be characters (even if you thought you were entering integers or numerics).

x <- c(5, "cat", 62.1)
typeof(x)
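
To see the ranking at work, the sketch below mixes two types at a time and asks for the resulting type. The L suffix (as in 2L) is R’s way of writing an integer; it is used here only to make the integer case explicit.

```r
typeof(c(TRUE, 2L))      # logical + integer   gives "integer"
typeof(c(2L, 3.5))       # integer + double    gives "double"
typeof(c(3.5, 1+2i))     # double + complex    gives "complex"
typeof(c(TRUE, "cat"))   # logical + character gives "character"
c(TRUE, 2L)              # TRUE has been turned into 1
```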

Type coercion

Vectors can be coerced into other data types using the as.X functions: (as.numeric, as.character, as.integer, as.logical)

y <- c(5, "NA", 62.1)
typeof(y)
as.numeric(y)
typeof(y)
y <- as.numeric(y)
y/2

You can check the data type using typeof() or is.X (is.numeric, is.character, etc.):

is.numeric(y)
is.character(y)
typeof(y)
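
If a character vector contains something that does not look like a number, as.numeric() still runs, but it replaces that element with NA and prints a warning. A minimal sketch with made-up values (messy and clean are hypothetical names):

```r
messy <- c("5", "cat", "62.1")
clean <- as.numeric(messy)   # R warns that NAs were introduced by coercion
clean                        # "cat" could not be converted, so it is now NA
is.na(clean)                 # FALSE TRUE FALSE
```

A warning like this is worth reading: it usually means a stray word or typo is hiding somewhere in a column that should be all numbers.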

A note on NA: NA stands for “not available”; it will be treated as an empty cell. It doesn’t need quotes and it can coexist with numbers. It’s a special value (not a string), so it can sit in a numeric vector and R won’t be annoyed. However, if NA is in quotes, like “NA”, it will be treated as a character.

The following vector will be a character vector because of the quotes around NA:

x <- c(100, 200.2, 120.6, "NA")
typeof(x)

We can force that vector to be numeric and NA is readily interpretable as simply an empty cell:

x <- as.numeric(x)
typeof(x)
new_x <- x - 5
new_x
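
Because NA marks a missing value, any arithmetic involving it also gives NA, and summary functions such as mean() return NA unless told to skip missing values. Here is a short sketch using the built-in is.na() function and the na.rm argument (the vector name rt is hypothetical):

```r
rt <- c(100, 200.2, 120.6, NA)
rt - 5                    # the NA stays NA
is.na(rt)                 # TRUE only for the missing element
sum(is.na(rt))            # number of missing values: 1
mean(rt)                  # NA, because one value is missing
mean(rt, na.rm = TRUE)    # mean of the three non-missing values
```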

Side note: When coercing a character vector to a numeric, sometimes you need to wrap the vector in as.character() before doing the numeric conversion. This arises when the vector is actually a FACTOR in R. You don’t need to know what a factor is yet, but just know that if you try to do numeric type coercion, and R returns a bunch of unexpected 1, 2, 3 etc. to you (the factor’s internal codes rather than your original numbers), then you should try the as.numeric(as.character()) line below instead:

z <- c(2, 20, 200)
z <- factor(z)
z
notgood <- as.numeric(z)
notgood

good <- as.numeric(as.character(z))
good

Datasets

Loading in a dataset

There are a few ways to read in a dataset. One is to use the GUI (Graphical User Interface) in RStudio: From the Environment window in RStudio, select Import Dataset –> From Text (base). This will let you navigate to a file and select how you want to name it.

An alternative is to type the code and the path directly. The path to the file goes in quotes. Note that using the Import Dataset option will automatically produce this code for you and run it in the console.

An important note about paths: The path to a file is the computer address of your file. It is important that you get the address right so the computer can locate the file. A useful life tip is to keep your project files, and more generally your computer, very organized so that the paths are easy (or at least intuitive) to type out. R can sometimes handle spaces in pathnames, but in general be wary of using spaces because they can cause problems - so instead of giving your folders titles like LING 3300 R files, try to get into the habit of making them more like LING3300_R without spaces.

Important note part two: You may need to change \ symbols to / in your path. On Windows machines, paths copied from the file explorer use \, but R treats \ inside quotes as a special escape character, so use / (or a doubled \\) instead.

Let’s now read the jury_data_basic.csv dataset file into R, and assign it the name jury. You can do this by replacing the path in the following code with your own path to the data, or by finding it through RStudio’s GUI. If you do it through the GUI, make sure you tell it that heading is set to Yes; that way it will know that the first line of the dataset is the column titles.

jury <- read.csv("/Users/Thomas/Documents/York (Canada)/LING 3300/Datasets/Jury/jury_data_basic.csv", header = TRUE)
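
One way to sidestep slash problems is to let R build the path for you with the base function file.path(), which joins folder names with forward slashes. The folder names below are hypothetical placeholders, so substitute your own. file.exists() then reports whether R can actually find a file at that address.

```r
# Hypothetical folder layout; replace these names with your own folders
data_path <- file.path("LING3300_R", "Datasets", "jury_data_basic.csv")
data_path               # "LING3300_R/Datasets/jury_data_basic.csv"

getwd()                 # the folder that relative paths like this one start from
file.exists(data_path)  # TRUE only if the file is really there
```

If file.exists() returns FALSE, the problem is with the path rather than with read.csv(), which narrows down where to look.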

Viewing the dataset

Take a look at the dataset using View(). You can also view the dataset by clicking on it in the RStudio Environment window.

View(jury)

While we won’t use these functions too frequently, you can see the top of the dataset in the console using head() and bottom using tail(). These functions are important if you ever find yourself in a situation where you can’t use RStudio and have to use a basic console.

head(jury)
tail(jury)

Getting a summary of the dataset

Get the dimensions of the dataset using nrow() (number of rows) and ncol() (number of columns). Notice that you can also find this information by glancing at the Environment window.

nrow(jury)
ncol(jury)

Get a summary of all the columns in the dataset using summary(). If a column is numeric, R will tell you the quartiles (a type of quantile which divides the number of data points into four parts, or quarters), the mean, and the number of NAs, if any.

summary(jury)
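
A few other base functions give a quick overview: dim() returns rows and columns together, names() lists the column titles, and str() shows the type of each column. Since the jury file lives on your own computer, the sketch below builds a tiny made-up data frame (toy, with invented values) using data.frame() so that it runs anywhere.

```r
# A tiny invented dataset, for illustration only
toy <- data.frame(
  participant = c("p1", "p2", "p3"),
  rt = c(512.3, 430.9, NA),
  correct = c(TRUE, FALSE, TRUE)
)

dim(toy)      # 3 rows and 3 columns
names(toy)    # "participant" "rt" "correct"
str(toy)      # the type and first few values of every column
summary(toy)  # the rt column reports one NA
```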

Extracting parts of the dataset

Remember that to get an element in a specific location from a vector, we could give it the index in square brackets. So if we wanted to get the 3rd element in the vector x, then we would type:

x[3]

We can extend this logic to datasets, but now we are in a two-dimensional space. The dataset has Rows and Columns, so we need to provide the coordinates in square brackets, with Rows before Columns (R before C because we’re using R). When both the row and the column are specified, R will return the individual cell.

jury[3,2]
jury[1000, 10]

We can also extract an entire row by leaving the column specification blank (the location AFTER the comma):

jury[1000,]
myrow <- jury[1000,]
View(myrow)

We can extract a subset of the dataset by specifying a RANGE of rows using the colon:

myrow <- jury[1000:1100,]
View(myrow)

We can also extract an entire column by leaving the row specification blank (the location BEFORE the comma):

mycol <- jury[,10]
View(mycol)

We can get a set of columns using our vector notation c(). We can get columns 1, 5 and 7 by specifying it as follows:

mycol <- jury[,c(1,5,7)]
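
The same bracket notation can be tried on a small made-up data frame (toy below is invented for this sketch). Notice that asking for a single column gives back a plain vector, while asking for several columns gives back a smaller data frame.

```r
# A tiny invented dataset, for illustration only
toy <- data.frame(
  word = c("cat", "dog", "mouse"),
  rt = c(512, 431, 650),
  acc = c(1, 0, 1)
)

toy[2, 1]                    # the cell in row 2, column 1
toy[2, ]                     # all of row 2
one_col <- toy[, 2]          # a single column comes back as a vector
two_cols <- toy[, c(1, 3)]   # several columns come back as a data frame
is.data.frame(one_col)       # FALSE
is.data.frame(two_cols)      # TRUE
```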

The dollar sign: $

The dollar sign is incredibly useful when it comes to dealing with columns in a dataset! The dollar sign allows you to name the column that you want to refer to. For example, we can identify and extract the column “SimilarityAnswer” from the jury dataset by calling jury$SimilarityAnswer:

mycol <- jury$SimilarityAnswer
View(mycol)

We can also create new columns. For instance, if we wanted to create a column for AjustedSimilarityAnswer that takes the SimilarityAnswer given by the participant and multiplies it by 10, we could do this:

jury$AjustedSimilarityAnswer <- jury$SimilarityAnswer*10
View(jury)
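
New columns can also hold the result of a comparison, which is the accuracy idea mentioned earlier: did the response match the answer? The data frame and column names below (trials, response, answer, correct) are made up for this sketch and are not part of the jury dataset.

```r
# Invented responses and answers, for illustration only
trials <- data.frame(
  response = c("word", "nonword", "word", "word"),
  answer = c("word", "word", "word", "nonword")
)

# Compare the two columns row by row; each result is TRUE or FALSE
trials$correct <- trials$response == trials$answer
trials

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion correct
mean(trials$correct)
```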

Deleting columns

You can delete a column by assigning NULL to it. As with many functions in R, there are multiple ways to do this. Assigning NULL to it is something we can do in “base-R”, or the syntax that R comes along with; I prefer to use a different method for deleting columns, using syntax from the “tidyverse” package. We’ll introduce that next week.

jury$AjustedSimilarityAnswer <- NULL
View(jury)

Pasting strings together

We can paste strings together using the paste() function. In this example, we’ll create a single column that contains information on what the pair of accents presented to the listener was. To do this, we can paste together the Audio1Accent and Audio2Accent columns with an underscore separating the two elements:

jury$AccentPair <- paste(jury$Audio1Accent, jury$Audio2Accent, sep="_")
View(jury)

The “sep=” argument refers to the character that should separate the two elements being pasted together. Since we have quotes wrapped around an underscore ("_"), the underscore will separate the two elements. If we wanted nothing separating the two elements, then we could have quotes wrapped around nothing:

jury$AccentPair <- paste(jury$Audio1Accent, jury$Audio2Accent, sep="")
View(jury)
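
paste() works element by element on whole vectors, and base R also provides paste0(), which behaves like paste() with sep="". Here is a small sketch with invented accent labels (accent1 and accent2 are hypothetical stand-ins for the two columns):

```r
accent1 <- c("A", "B", "A")
accent2 <- c("B", "B", "C")

paste(accent1, accent2, sep = "_")   # "A_B" "B_B" "A_C"
paste(accent1, accent2)              # the default separator is a space
paste0(accent1, accent2)             # no separator: "AB" "BB" "AC"
```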

Practice

Before starting this practice, we’re going to set up a way to save the code we write. To do this, we’re going to create an “R script”. This is a simple text document that contains R code and has a .R extension (like .txt or .docx, etc.).

An easy way to create an R script is through RStudio itself. Select the File menu –> New File –> R Script, and a new R script should pop up in your RStudio View window (upper left). You can now paste your code in this script, and re-run it at a later date. To include non-code comments, use the # symbol.

For example, you might copy out some of the questions and put the answer in code below it, like the following:

# LING 3300 Week 1 Practice 
# Thomas Kettig
# January 8, 2026

# Practice with unit variables

# 1. Create a variable called "cats". Assign the value 200 to it.
cats <- 200

# 2. Create a second variable called "dog". Assign the value 100 to it.

There’s the example for setting your R script up, and the answer to the first question.

Practice with unit variables

  1. Create a variable called cats. Assign the value 200 to it.
  2. Create a second variable called dog. Assign the value 100 to it.
  3. Update the variable cats by dividing it by 2. Before looking at the value of cats, what do you think it should be?
  4. Hopefully you know whether cats is greater than, equal to, or less than dog. Write a line of code that tests each of these qualities. (There should be three lines of code for your answer.)
  5. Update the variable dog by adding 5 to it.
  6. Write a line of code that tests whether dog is greater than or equal to cats, greater than cats, equal to cats, less than cats, or less than or equal to cats. (There should be five lines of code for your answer.)

Practice with vectors: numerical

  1. Create a vector called myNumbers that contains a sequence of numbers from 0 to 49, increased by 1 (e.g., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …)
  2. Update the vector myNumbers by adding 1 to it.
  3. Update the vector myNumbers by multiplying it by 4.
  4. What is the 21st element of the vector?
  5. What is the 43rd element of the vector?
  6. Create a new vector from myNumbers called mySmallNumbers that contains only the elements that have values less than 20.

Practice with vectors: strings

  1. Create a vector called colors with the values green, blue, red, and yellow in it.
  2. What is the second element in the vector colors?
  3. Get the length of the vector colors. (Before you run the line of code, what do you think the answer should be?)

Practice with data types and type coercion

  1. Create a vector called ‘myNumbers’ and store three numbers in it.
  2. Create a new vector called ‘myCharacters’ that turns the myNumbers vector into characters.
  3. Create a new vector called ‘myNumbers2’ that turns myCharacters back into numbers.
  4. Create a vector called ‘mixed’ that contains a mixture of strings, numbers, and/or logicals.
  5. Figure out what the data type of the ‘mixed’ vector is.

Practice with datasets: basics

  1. Find the Lexical Decision Dataset on eClass and download it. Import ‘L2_English_Lexical_Decision_Data.csv’ into R and call it ‘lex’ for lexical decision. This data set contains reaction times (RT) in milliseconds to words and nonwords of English from L2 English speaking participants. More info about the data and project here.
  2. Get the number of rows in the dataset using the console (don’t just look at the Environment window).
  3. Get the number of columns in the dataset using the console (don’t just look at the Environment window).
  4. Get a summary of the dataset.

Practice with datasets: extracting parts

  1. Create a new variable vector called ‘myRows’ that contains rows 1 to 10 from the new dataset.
  2. Create a new variable vector called ‘myCols’ that contains columns 3 to 5 from the new dataset.
  3. Create a variable vector called ‘accuracy’ that contains the accuracy column in the dataset (‘acc’). Use the dollar sign when referring to this column.

Disclaimer: Some of these original materials were put together by Eleanor Chodroff and Elisa Passoni for the University of York. Thomas Kettig then inherited them and modified them as needed, particularly based on notes by Nathan Sanders from the University of Toronto. The R software and the packages are distributed under the terms of the GNU General Public License, either Version 2, June 1991 or Version 3, June 2007 (run the command licence() for more information).