7 Walkthrough 1: The Education Data Science Pipeline With Online Science Class Data
7.1 Topics Emphasized
For this and the remaining walkthroughs, we refer to the topics emphasized in terms of distinct but related steps involved in the process of data science. In this book, we use the six steps—described in detail in Chapter 3—from Wickham & Grolemund's (2018) depiction of the process.
As mentioned in Chapter 5, the topics emphasized are those that are the particular focus of each chapter; most of the walkthroughs contain some element of all six steps, but each has specific emphases.
For this chapter on the education data science pipeline, those emphases are:
- Tidying data
- Transforming data
7.2 Functions Introduced
- data.frame()
- dplyr::summarize()
- tidyr::pivot_longer() and tidyr::pivot_wider()
- dplyr::left_join(), dplyr::right_join(), dplyr::semi_join(), and dplyr::anti_join()
- lm()
- ggplot2::ggplot()
- apaTables::apa.cor.table()
- sjPlot::tab_model()
7.3 Vocabulary
In this section (here and in the other walkthrough sections), we include key terms that are introduced and used in the chapter.
- data frame
- item
- joins
- keys
- log-trace data
- passed arguments
- reverse scale
- regression
- survey
- tibble
- vectorize
7.4 Introduction to the Walkthroughs
This chapter is the first of eight walkthroughs included in the book. In it, we present one approach to analyzing a specific dataset. In this chapter, the approach is what we refer to as the “education data science pipeline”, or some of the steps taken in many data science projects, including cleaning and tidying data as well as exploring and visualizing or modeling it.
Here, we will be using data from a number of online science classes and will show the process of working with an education dataset from start to finish. While the walkthroughs are very different, the structure and section headings will be consistent throughout them. For example, each walkthrough will begin with a vocabulary section, followed by an introduction to the dataset and an introduction to the question or problem explored in the walkthrough.
We note that this chapter assumes familiarity with the four core concepts that comprise the foundational skills framework: projects, functions, packages, and data. If you would like a refresher about (or an introduction to) any of those, reading Chapter 6 then writing and running some of the code in it may be helpful to you.
7.5 Chapter Overview
In this walkthrough, we explore some of the key steps that are a part of many data science in education projects. In particular, we explore how to process and prepare data. These steps are sometimes referred to as “data wrangling”. To do this, we rely heavily on a set of tools that we use throughout all the walkthroughs—those associated with the {tidyverse}.
The {tidyverse} is a set of packages for data manipulation, exploration, and visualization that uses the design philosophy of “tidy” data (Wickham, 2014). Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. We’ll discuss both the {tidyverse} and tidy data much more throughout the book. For more information, see the Foundational Skills chapters or https://www.tidyverse.org/.
7.5.1 Background
The online science classes we explore in this chapter were designed and taught by instructors through a state-wide online course provider designed to supplement—but not replace—students’ enrollment in their local school. For example, students may have chosen to enroll in an online physics class because one was not offered at their school. The data were originally collected for a research study, which utilized a number of different data sources to understand students’ course-related motivation. These datasets included:
- A self-report survey assessing three aspects of students’ motivation
- Log-trace data, such as data output from the learning management system (LMS)
- Discussion board data
- Academic achievement data
Our high-level purpose for this walkthrough is to conduct an analysis that helps explain students' performance in these online courses. The problem we are facing is a very common one when it comes to data science in education: the data are complex and in need of further processing before we can get to the step of running analyses. We will use this same dataset in the final walkthrough, Walkthrough 8, in Chapter 14, where we will provide more details about the dataset and the context.
To understand students’ performance, we will focus on an LMS variable that indicates the length of time students spent within the course LMS. We will also explore the effects of the type of science course and the section in which students are enrolled on student performance.
First, these different data sources will be described in terms of how they were provided by the school.
7.5.2 Data Sources
7.5.2.1 Data Source #1: Self-Report Survey about Students’ Motivation
The first data source is a self-report survey. This was data collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. A measure is a concept that we try to make concrete and assess using survey questions. The three motivation measures we explore here come from Expectancy-Value Theory, which states that students are motivated to learn when they both believe that they can achieve something (expectancy, also known as “perceived competence”) and believe that the concept they are trying to learn is important (value) (Wigfield & Eccles, 2000). There are multiple types of value, but we explore two of them here: interest and utility value. Utility value is the degree to which a person is able to connect the concept being learned with something they will utilize in their future life. This survey included the following ten items:
- I think this course is an interesting subject. (Interest)
- What I am learning in this class is relevant to my life. (Utility value)
- I consider this topic to be one of my best subjects. (Perceived competence)
- I am not interested in this course. (Interest—reverse coded)
- I think I will like learning about this topic. (Interest)
- I think what we are studying in this course is useful for me to know. (Utility value)
- I don’t feel comfortable when it comes to answering questions in this area. (Perceived competence—reverse coded)
- I think this subject is interesting. (Interest)
- I find the content of this course to be personally meaningful. (Utility value)
- I’ve always wanted to learn more about this subject. (Interest)
7.5.2.2 Data Source #2: Log-Trace Data
Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings (see Chapters 11 and 12). In education, an increasingly common source of log-trace data is that generated from interactions with LMSs and other digital tools (Siemens & Baker, 2012). The data for this walkthrough is a summary type of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are even more complex sources of log-trace data out there (e.g., timestamps associated with when students started and stopped accessing the course).
7.5.2.3 Data Source #3: Academic Achievement and Gradebook Data
This is a common source of data in the education realm: the graded assignments students completed. In this walkthrough, we examine only students’ final grades for the course.
7.5.2.4 Data Source #4: Discussion Board Data
Discussion board data is both rich and unstructured because it is made up of large chunks of text that the students wrote. While discussion board data was collected for this research project, we do not examine it in this walkthrough. More information about analyzing text data can be found in Chapter 11.
7.6 Load Packages
This analysis uses R packages, which are collections of R code that help users code more efficiently, as you will recall from Chapter 1. We load these packages with the function library(). The specific packages we'll use here will help us organize the structure of the data using the {tidyverse} (Wickham, 2022), create formatted tables using {apaTables} (Stanley, 2021) and {sjPlot} (Lüdecke, 2021), and read datasets using {readxl} (Wickham & Bryan, 2022).
Install packages (if necessary)
If you have not installed any of these packages before, you will need to do so before loading them (if you run the code below prior to installing the packages, you should see a message indicating that the package is not available). If you have installed these before, then you can skip this step.
You can install a single package, such as the {tidyverse} package, as follows:
install.packages("tidyverse")
If you must install two or more packages, you can do so in a single call
to the install.packages()
function; the names of the packages must be
provided to the function as follows:
install.packages(c("tidyverse", "apaTables"))
When you're installing a package for the first time (which may be needed for the other walkthrough chapters as well), you will need to take these same steps first. The good news is that you only need to install a package once, after which you can simply load it using library() (as below).
More on the installation of packages is included in the “Packages” section of Chapter 6.
library(tidyverse)
library(apaTables)
library(sjPlot)
library(readxl)
library(dataedu)
7.7 Import Data
This code chunk loads the log-trace data and self-report survey data from the {dataedu} package. Note that we assign a dataset to an object three different times, once for each of the three different datasets. We assign each of the datasets a name using <-.
# Pre-survey for the F15 and S16 semesters
pre_survey <- dataedu::pre_survey

# Gradebook and log-trace data for F15 and S16 semesters
course_data <- dataedu::course_data

# Log-trace data for F15 and S16 semesters - this is for time spent
course_minutes <- dataedu::course_minutes
7.8 View Data
Now that we've successfully loaded all three datasets, we can visually inspect the data by typing the names that we assigned to each dataset. Running each line, one at a time, will show the first several rows of each dataset.
pre_survey
## # A tibble: 1,102 × 12
## opdata_user…¹ opdat…² Q1Mai…³ Q1Mai…⁴ Q1Mai…⁵ Q1Mai…⁶ Q1Mai…⁷ Q1Mai…⁸ Q1Mai…⁹
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 _80624_1 FrScA-… 4 4 4 1 5 4 1
## 2 _80623_1 BioA-S… 4 4 3 2 4 4 2
## 3 _82588_1 OcnA-S… NA NA NA NA NA NA NA
## 4 _80623_1 AnPhA-… 4 3 3 2 3 3 3
## 5 _80624_1 AnPhA-… NA NA NA NA NA NA NA
## 6 _80624_1 AnPhA-… 4 2 2 2 4 4 1
## 7 _80624_1 AnPhA-… NA NA NA NA NA NA NA
## 8 _80624_1 BioA-S… 5 3 3 1 5 4 1
## 9 _80624_1 BioA-T… NA NA NA NA NA NA NA
## 10 _80624_1 PhysA-… 4 4 3 2 4 4 2
## # … with 1,092 more rows, 3 more variables: Q1MaincellgroupRow8 <dbl>,
## # Q1MaincellgroupRow9 <dbl>, Q1MaincellgroupRow10 <dbl>, and abbreviated
## # variable names ¹opdata_username, ²opdata_CourseID, ³Q1MaincellgroupRow1,
## # ⁴Q1MaincellgroupRow2, ⁵Q1MaincellgroupRow3, ⁶Q1MaincellgroupRow4,
## # ⁷Q1MaincellgroupRow5, ⁸Q1MaincellgroupRow6, ⁹Q1MaincellgroupRow7
course_data
## # A tibble: 29,711 × 8
## CourseSectionOrigID Bb_UserPK Gradeb…¹ Grade…² Final…³ Point…⁴ Point…⁵ Gender
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 AnPhA-S116-01 60186 POINTS … <NA> 86.3 5 4.05 F
## 2 AnPhA-S116-01 60186 WORK AT… <NA> 86.3 30 24 F
## 3 AnPhA-S116-01 60186 0.1: Me… <NA> 86.3 105 71.7 F
## 4 AnPhA-S116-01 60186 0.2: In… Hw 86.3 140 141. F
## 5 AnPhA-S116-01 60186 0.3: In… Hw 86.3 5 5 M
## 6 AnPhA-S116-01 60186 1.1: Qu… Qz 86.3 5 4 F
## 7 AnPhA-S116-01 60186 1.2: Qu… Qz 86.3 20 NA F
## 8 AnPhA-S116-01 60186 1.3: Cr… Hw 86.3 50 50 F
## 9 AnPhA-S116-01 60186 1.3: Cr… Hw 86.3 10 NA F
## 10 AnPhA-S116-01 60186 1.4: Ne… Hw 86.3 50 44 F
## # … with 29,701 more rows, and abbreviated variable names ¹Gradebook_Item,
## # ²Grade_Category, ³FinalGradeCEMS, ⁴Points_Possible, ⁵Points_Earned
course_minutes
## # A tibble: 598 × 3
## Bb_UserPK CourseSectionOrigID TimeSpent
## <dbl> <chr> <dbl>
## 1 44638 OcnA-S116-01 1383.
## 2 54346 OcnA-S116-01 1191.
## 3 57981 OcnA-S116-01 3343.
## 4 66740 OcnA-S116-01 965.
## 5 67920 OcnA-S116-01 4095.
## 6 85355 OcnA-S116-01 595.
## 7 85644 OcnA-S116-01 1632.
## 8 86349 OcnA-S116-01 1601.
## 9 86460 OcnA-S116-01 1891.
## 10 87970 OcnA-S116-01 3123.
## # … with 588 more rows
7.9 Process Data
Often, survey data needs to be processed in order to be most useful. We are going to start with one of the three datasets we just saved as objects into our environment: the dataset with the pre-survey items. Here, we process the self-report items into three scales for 1) interest, 2) perceived competence, and 3) utility value. We do this by:
- Renaming the question variables to something more manageable
- Reversing the response scales on questions 4 and 7
- Categorizing each question into a measure
- Computing the mean of each measure
Let’s take these steps in order:
- Take the pre-survey data, and save it as a new object with the same name, pre_survey. Rename the question columns to something much simpler. To do this, we will use the rename() function and will input first the new variable name, then the = sign, and then the old variable name. You'll notice another function at the bottom called mutate_at(). We'll explain that below this code chunk.
pre_survey <-
  pre_survey %>%
  # Rename the questions something easier to work with because R is case sensitive
  # and working with variable names in mixed case is prone to error
  rename(
    q1 = Q1MaincellgroupRow1,
    q2 = Q1MaincellgroupRow2,
    q3 = Q1MaincellgroupRow3,
    q4 = Q1MaincellgroupRow4,
    q5 = Q1MaincellgroupRow5,
    q6 = Q1MaincellgroupRow6,
    q7 = Q1MaincellgroupRow7,
    q8 = Q1MaincellgroupRow8,
    q9 = Q1MaincellgroupRow9,
    q10 = Q1MaincellgroupRow10
  ) %>%
  # Convert all question responses to numeric
  mutate_at(vars(q1:q10), list( ~ as.numeric(.)))
Let's take a moment to discuss the {dplyr} function mutate_at().
mutate_at()
is a version of mutate()
, which changes the values in an
existing column or creates new columns. It’s useful in education
datasets because you’ll often need to transform your data before
analyzing it. In the code chunk we just ran, we used mutate_at()
to
convert the data in all ten variables into a numeric format.
To learn a little more about mutate()
, try the example below, where we
create a new data frame called “df”. A data frame is a two-dimensional
structure that stores tables. The table has a header and data rows and
each cell stores values.
We fill this data frame with two columns: “male” and “female”. Each
column has only one value, and that value is 5. In the second part of
the code, we add a total_students
column by adding the number of
male
students and female
students.
Note that we create the dataset with tibble()
, which is from the
{tibble} package included in the tidyverse. A tibble is a special type
of data frame that makes working with tidy data a little easier. More
information is available in R for Data Science (Wickham & Grolemund, 2018).
# Dataset of students
df <- tibble(
  male = 5,
  female = 5
)

# Use mutate to create a new column called "total_students"
# populate that column with the sum of the "male" and "female" variables
df %>% mutate(total_students = male + female)
## # A tibble: 1 × 3
## male female total_students
## <dbl> <dbl> <dbl>
## 1 5 5 10
To return to our original data, we used mutate_at(), a special version of mutate() which conveniently changes the values of multiple columns. In our dataset pre_survey, we let mutate_at() know we want to change the variables q1 through q10. We do this with the argument vars(q1:q10).
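As an aside, mutate_at() still works, but if you are using {dplyr} version 1.0.0 or later, you may also see its successor, across(), used for the same job. Here is a minimal sketch of the equivalent call (we don't run it as part of this walkthrough):
pre_survey %>%
  # across() applies as.numeric() to every column from q1 through q10,
  # just as mutate_at() with vars(q1:q10) did above
  mutate(across(q1:q10, as.numeric))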
- Next, we’ll reverse the scale of the survey responses on questions 4 and 7 so the responses for all questions can be interpreted in the same way. As you can see from the survey questions we listed earlier in the chapter, the phrasing of questions 4 and 7 is opposite the phrasing of the other questions. Rather than write a lot of code once to reverse the scales for question 4 then writing it again to reverse the scales on question 7, we’ll build a function that does that job for us. Then we’ll use the same function that we wrote to recode both question 4 and question 7. This will result in much less code, and it will make it easier for us to change in the future.
We’ll use the function case_when()
within our function to reverse the
scale of the item responses. case_when()
is useful when you need to
replace the values in a column with other values based on some criteria.
Education datasets use a lot of codes to describe demographics, like
numerical codes for disability categories, race groups, or proficiency
in a test. When you work with codes like this, you’ll often want to
change the codes to values that are easier to understand. For example, a
consultant analyzing how students did on state testing might use
case_when()
to replace proficiency codes like 1, 2, or 3 with more
descriptive words, like “below proficiency”, “proficient”, or
“advanced”.
case_when()
lets you vectorize the rules you want to use to change
values in a column. When a sequence of criteria is vectorized, R will
evaluate a value in a column against each criterion in your
case_when()
sequence. case_when()
is helpful because it does this
without requiring you to write complicated code loops. Instead, you can
systematically evaluate your criteria using code that is compact and
readable (once you understand how all the arguments work).
The left-hand side of each case_when()
argument will be a formula that
returns either a TRUE
or a FALSE
. In the function below, we’ll use
logical operators on the left-hand side of the formula like this:
question == 1 ~ 5
. Here are some logical operators you can use in the
future:
- ==: equal to
- >: greater than
- <: less than
- >=: greater than or equal to
- <=: less than or equal to
- !=: not equal to
- !: not
- &: and
- |: or
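Before we apply case_when() to our survey data, here is a minimal sketch of the recoding idea from the proficiency example above. The data and column names are made up for illustration:
# Made-up data: numeric proficiency codes to be recoded
scores <- tibble(proficiency_code = c(1, 2, 3, 2))

scores %>%
  mutate(proficiency_label = case_when(
    proficiency_code == 1 ~ "below proficiency",
    proficiency_code == 2 ~ "proficient",
    proficiency_code == 3 ~ "advanced",
    TRUE ~ NA_character_  # fall-through for unexpected codes
  ))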
Let’s make this all concrete and use it here in our function that reverses the scale of the survey responses. In the first part of the code chunk below, we will write our function. Note that by running the first bit of code here, we won’t be changing anything in our data. Instead, we are creating a reusable, general-purpose code chunk that we will then apply to the specific survey questions we want to recode.
# This part of the code is where we write the function:
# Function for reversing scales
reverse_scale <- function(question) {
  # Reverses the response scales for consistency
  #   Arguments:
  #     question - survey question
  #   Returns:
  #     a numeric converted response
  # Note: even though 3 is not transformed, case_when expects a match for all
  # possible conditions, so it's best practice to label each possible input
  # and use TRUE ~ as the final statement returning NA for unexpected inputs
  x <- case_when(
    question == 1 ~ 5,
    question == 2 ~ 4,
    question == 3 ~ 3,
    question == 4 ~ 2,
    question == 5 ~ 1,
    TRUE ~ NA_real_
  )
  x
}
# And here's where we use that function to reverse the scales
# We use the pipe operator %>% here
# Reverse scale for questions 4 and 7
pre_survey <-
  pre_survey %>%
  mutate(q4 = reverse_scale(q4),
         q7 = reverse_scale(q7))
You’ll notice that we call the reverse_scale()
function we wrote, but
we call it within the mutate()
function. By doing things this way, we
are overwriting the previous data in the columns for questions 4 and 7
with our newly recoded values for questions 4 and 7.
- Next, we'll use a function called pivot_longer() in order to transform our pre_survey dataset from wide format to long format. That means instead of having 1,102 observations of 12 variables, we will now have 11,020 observations of 4 variables. By using pivot_longer(), we make it so that each question-response pair has its own line in the data. Therefore, since we have ten question variables (columns) in the pre_survey dataset, after we use pivot_longer() we will end up with ten times as many observations (rows) as before. What was previously one row of data now takes up ten rows of data, and we no longer need a separate column for each individual question; pivot_longer() automatically removes those columns after condensing all the data. We'll save this new dataset as an object called measure_mean in order to represent new variables we will create in future data processing steps.
# Pivot the dataset from wide to long format
measure_mean <-
  pre_survey %>%
  # Gather questions and responses
  pivot_longer(cols = q1:q10,
               names_to = "question",
               values_to = "response")
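If the reshaping is hard to picture at this scale, here is a tiny sketch with made-up data showing pivoting in both directions; pivot_wider() (which we will use later in the chapter) reverses pivot_longer():
# Two rows and two question columns...
wide <- tibble(id = c(1, 2), q1 = c(4, 5), q2 = c(3, 2))

# ...become four rows of question-response pairs
long <- wide %>%
  pivot_longer(cols = q1:q2,
               names_to = "question",
               values_to = "response")

# pivot_wider() restores the original wide layout
long %>%
  pivot_wider(names_from = question,
              values_from = response)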
- Next, we'll take our new measure_mean dataset and create a column called measure. We'll fill that column with one of three question categories:
  - int: interest
  - uv: utility value
  - pc: perceived competence
We will use the case_when()
function we learned earlier in order to do
this. When we pivoted from wide format to long format in the prior step,
we ended up with one variable containing all possible question numbers
(q1, q2, etc.). Now, we want to tell R which question numbers correspond
to which categories. To do this, we will supply case_when()
with a
list of all the question numbers that correspond to each category:
interest, utility value, and perceived competence. We’ll introduce a new
operator in order to do this: %in%. Practically, this operator means
that R should look within a list for something. In the code below, we
tell R to use questions 1, 4, 5, 8, and 10 to create the category int
.
We then tell R to use questions 2, 6, and 9 to calculate uv
and to use
questions 3 and 7 to calculate pc
. We indicate the start of each list
with the letter “c” and an open parenthesis, and we indicate the end of
each list with a close parenthesis.
# Add measure variable
measure_mean <- measure_mean %>%
  # Here's where we make the column of question categories called "measure"
  mutate(
    measure = case_when(
      question %in% c("q1", "q4", "q5", "q8", "q10") ~ "int",
      question %in% c("q2", "q6", "q9") ~ "uv",
      question %in% c("q3", "q7") ~ "pc",
      TRUE ~ NA_character_)
  )
- Last, we'll take that same measure_mean dataset and create a new variable called mean_response. Since we are calculating the mean by category, we will need to first group the responses together using a function called group_by(). This function helps us get set up to calculate new columns with grouped data. Next, we'll use the function summarize() to create two new variables: mean_response and percent_NA. We'll find the mean response of each category using the mean() function.
# Compute the mean and percent NA of each measure
measure_mean <- measure_mean %>%
  # First, we group by the new variable "measure"
  group_by(measure) %>%
  # Here's where we compute the mean of the responses
  summarize(
    # Creating a new variable to indicate the mean response for each measure
    mean_response = mean(response, na.rm = TRUE),
    # Creating a new variable to indicate the percent of each measure that
    # had NAs in the response field
    percent_NA = mean(is.na(response))
  )
measure_mean
## # A tibble: 3 × 3
## measure mean_response percent_NA
## <chr> <dbl> <dbl>
## 1 int 4.25 0.178
## 2 pc 3.65 0.178
## 3 uv 3.74 0.178
With that last step, we have finished processing the pre_survey
dataset. We have renamed the relevant variables and computed some means
that we can use later.
7.9.1 Processing the Course Data
When we loaded the data earlier in the chapter, we loaded three datasets: pre_survey, course_data, and course_minutes. Next, we can process the course data that we already loaded into the environment in order to create new variables which we can use in analyses.
Information about the course subject, semester, and section are stored
in a single column, CourseSectionOrigID
. This format of data storage
is not ideal. If we instead give each piece of information its own
column, we’ll have more opportunities for later analysis. We’ll use a
function called separate()
to do this. Below, we will load
course_data
and run the separate()
function to split up the subject,
semester, and section so we can use them later on.
# Split course section into components
course_data <-
  course_data %>%
  # Give course subject, semester, and section their own columns
  separate(
    col = CourseSectionOrigID,
    into = c("subject", "semester", "section"),
    sep = "-",
    remove = FALSE
  )
After running the code chunk above, take a look at the dataset
course_data
to make sure it looks the way you’d expect it to look. In
this case, we are expecting that we will add three new variables, taking
the total number of variables in this dataset from eight to 11. We will
still see the original variable CourseSectionOrigID
in the data as
well.
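If you would like to see separate() in isolation first, here is a quick sketch using a single made-up value in the same subject-semester-section format:
tibble(CourseSectionOrigID = "OcnA-S116-01") %>%
  separate(
    col = CourseSectionOrigID,
    into = c("subject", "semester", "section"),
    sep = "-",
    remove = FALSE
  )
# returns one row with subject "OcnA", semester "S116", and section "01",
# alongside the original CourseSectionOrigID column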
7.9.2 Joining the Data
In this chapter, we are looking at two datasets that are derived from the same courses. In order for these datasets to be most useful to us, we’d like all that data to be in one place.
To join the course data and pre-survey data, we need to create similar keys. Our goal here is to have one variable that matches across both datasets. Once we have that common variable in both datasets, we can merge the datasets on the basis of that variable.
When we look at the course_data
and pre_survey
datasets in our
environment, we see that both have variables for the course and the
student. However, this information is captured in different variable
names in each dataset. Our first goal will be to rename two variables in
each dataset so that they will match. One variable will correspond to
the course, and the other will correspond to the student. We are not
changing anything in the data itself at this step—instead, we are just
cleaning up the column headers so we can look at the data all in one
place.
Let's start with the pre-survey data. We will rename opdata_username and opdata_CourseID to be student_id and course_id, respectively.
Here, we are going to use the same rename()
function we learned
earlier in this chapter.
pre_survey <-
  pre_survey %>%
  rename(student_id = opdata_username,
         course_id = opdata_CourseID)
pre_survey
## # A tibble: 1,102 × 12
## student…¹ cours…² q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 _80624_1 FrScA-… 4 4 4 5 5 4 5 5 5 5
## 2 _80623_1 BioA-S… 4 4 3 4 4 4 4 3 4 2
## 3 _82588_1 OcnA-S… NA NA NA NA NA NA NA NA NA NA
## 4 _80623_1 AnPhA-… 4 3 3 4 3 3 3 4 2 4
## 5 _80624_1 AnPhA-… NA NA NA NA NA NA NA NA NA NA
## 6 _80624_1 AnPhA-… 4 2 2 4 4 4 5 4 4 4
## 7 _80624_1 AnPhA-… NA NA NA NA NA NA NA NA NA NA
## 8 _80624_1 BioA-S… 5 3 3 5 5 4 5 5 3 4
## 9 _80624_1 BioA-T… NA NA NA NA NA NA NA NA NA NA
## 10 _80624_1 PhysA-… 4 4 3 4 4 4 4 4 3 3
## # … with 1,092 more rows, and abbreviated variable names ¹student_id,
## # ²course_id
Those variable names look better now!
When we look at the data more closely, though, we notice that the
student_id
variable has another issue—the variable has some
additional characters before and after the actual ID that we will need
to be able to join this data with the other data sources we have. Why
does this variable have these additional characters? Why is there a “1”
at the end of every five-digit ID number? We are not sure! Sometimes,
educational data from different systems (used for different purposes)
may have additional “meta”-data added on. In any event, here is what the
variables look like before processing:
head(pre_survey$student_id)
## [1] "_80624_1" "_80623_1" "_82588_1" "_80623_1" "_80624_1" "_80624_1"
What we need is the five characters in between the underscore (_) symbols.
One way to do this is to use the str_sub()
function from the {stringr}
package. This function lets us subset “string” variables—variables
that store text data. You can specify the indices of the variables you
want the string to start and end with.
Here, for example, is how we can select only the content starting with
the second character, skipping the first underscore in the process. This
next chunk of code will not change our data but will show you how the
str_sub()
function works by supplying a number-and-underscore
combination to the function that is in the same format as our data.
str_sub("_99888_1", start = 2)
## [1] "99888_1"
We can apply the same thinking to delete characters from the end of a
string. We will use a -
to indicate that we want to start from the
right side of the string of characters. Interestingly, when we specify
the argument end
below, we will tell it the placement of the first
character we want to include. When we type end = -3
, we end up
deleting only the last two characters. Our new rightmost character will
be the final eight.
str_sub("_99888_1", end = -3)
## [1] "_99888"
Putting the pieces together, the following should return what we want. Try running the code below to see if it yields the five-digit ID number we are shooting to extract.
str_sub("_99888_1", start = 2, end = -3)
## [1] "99888"
Note: you may receive a warning telling you that NA
values were
introduced by coercion. This happens when we change data types, and we
will overlook this warning message for the purposes of this
walkthrough.
We can apply this process to our data using mutate(). We convert the string into a number using as.numeric() in the next portion of the code. This step is important so the data can be joined to the other numeric student_id variables (in the other datasets):
# Re-create the variable "student_id" so that it excludes the extraneous characters
pre_survey <- pre_survey %>%
  mutate(student_id = str_sub(student_id, start = 2, end = -3))

# Save the new variable as numeric so that R no longer thinks it is text
pre_survey <- pre_survey %>%
  mutate(student_id = as.numeric(student_id))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
Now that the student_id
and course_id
variables are ready to go in
the pre_survey
dataset, let’s proceed to the course data. Our goal is
to rename the two variables that correspond to the course and the
student so we can match them with the other variables we just created
for the pre-survey data. In the code chunk below, we will rename both
those variables.
course_data <-
  course_data %>%
  rename(student_id = Bb_UserPK,
         course_id = CourseSectionOrigID)
Now we have two variables that are consistent across both datasets: course_id and student_id. We can join the two datasets using the {dplyr} function left_join(), which is named based on the "direction" in which the data is being joined. Note
the order of the data frames passed to our “left” join. Left joins
retain all of the rows in the data frame on the “left” and join every
matching row in the right data frame to it. We will use two variables as
keys for joining the datasets, and we will specify those after the word
by
.
Let's save our joined data as a new object called dat.
dat <-
  left_join(course_data, pre_survey,
            by = c("student_id", "course_id"))
dat
## # A tibble: 40,348 × 21
## course_id subject semes…¹ section stude…² Grade…³ Grade…⁴ Final…⁵ Point…⁶
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 AnPhA S116 01 60186 POINTS… <NA> 86.3 5
## 2 AnPhA-S116-01 AnPhA S116 01 60186 WORK A… <NA> 86.3 30
## 3 AnPhA-S116-01 AnPhA S116 01 60186 0.1: M… <NA> 86.3 105
## 4 AnPhA-S116-01 AnPhA S116 01 60186 0.2: I… Hw 86.3 140
## 5 AnPhA-S116-01 AnPhA S116 01 60186 0.3: I… Hw 86.3 5
## 6 AnPhA-S116-01 AnPhA S116 01 60186 1.1: Q… Qz 86.3 5
## 7 AnPhA-S116-01 AnPhA S116 01 60186 1.2: Q… Qz 86.3 20
## 8 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 50
## 9 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 10
## 10 AnPhA-S116-01 AnPhA S116 01 60186 1.4: N… Hw 86.3 50
## # … with 40,338 more rows, 12 more variables: Points_Earned <dbl>,
## # Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, and abbreviated variable names
## # ¹semester, ²student_id, ³Gradebook_Item, ⁴Grade_Category, ⁵FinalGradeCEMS,
## # ⁶Points_Possible
Let's home in on how this code is structured. After left_join(), we
see course_data
and then pre_survey
. In this case, course_data
is
the “left” data frame (passed as the first argument), while
pre_survey
is the “right” data frame (passed as the second
argument). So, in the above code, what happens? You can run the code
yourself to check.
Our aim with that code is that all of the rows in course_data
are
retained in our new data frame, dat
, with matching rows of
pre_survey
joined to it. An important note is that there are not
multiple matching rows of pre_survey
; otherwise, you would end up with
more rows in dat
than expected. There is a lot packed into this one
function. Joins are extremely powerful—and common—in many data
analysis processing pipelines, both in education and in other fields.
Think of all of the times you have data in more than one data frame, but
you want everything to be in a single data frame! As a result, we think
that joins are well worth investing the time to be able to use.
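To see why matching rows matter, here is a toy sketch with made-up data. When the "right" data frame has two rows with the same key, the matching "left" row is repeated once for each match:
left_df <- tibble(id = c(1, 2), grade = c(90, 80))
right_df <- tibble(id = c(1, 1), note = c("a", "b"))

left_join(left_df, right_df, by = "id")
# id 1 appears twice (one row per match in right_df); id 2 appears once,
# with NA in the note column because it has no match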
With most types of data, left_join()
is helpful for carrying out most
tasks related to joining datasets. However, there are functions for
other types of joins that we want to make sure you know how to use. They
may be less frequently used than left_join()
, but they are still worth
mentioning. Note that for all of these, the “left” data frame is always
the first argument, and the “right” data frame is always the second.
When running the code chunks below, it can be helpful to pay attention
to the number of observations and variables in the datasets before and
after the joining. Eventually, the obscure names of these types of joins
will start to become more intuitive as you use them more often.
7.9.2.1 semi_join()
semi_join()
joins and retains all of the matching rows in the “left”
and “right” data frame. This is useful when you are only interested in
keeping the rows (or cases/observations) that are able to be joined.
semi_join()
will not create duplicate rows of the left data frame,
even when it finds multiple matches on the right data frame. It will
also keep only the columns from the left data frame.
For example, the following returns only the rows that are present in
both course_data
and pre_survey
:
dat_semi <-
  semi_join(course_data,
            pre_survey,
            by = c("student_id", "course_id"))

dat_semi
## # A tibble: 28,655 × 11
## course_id subject semes…¹ section stude…² Grade…³ Grade…⁴ Final…⁵ Point…⁶
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 AnPhA S116 01 60186 POINTS… <NA> 86.3 5
## 2 AnPhA-S116-01 AnPhA S116 01 60186 WORK A… <NA> 86.3 30
## 3 AnPhA-S116-01 AnPhA S116 01 60186 0.1: M… <NA> 86.3 105
## 4 AnPhA-S116-01 AnPhA S116 01 60186 0.2: I… Hw 86.3 140
## 5 AnPhA-S116-01 AnPhA S116 01 60186 0.3: I… Hw 86.3 5
## 6 AnPhA-S116-01 AnPhA S116 01 60186 1.1: Q… Qz 86.3 5
## 7 AnPhA-S116-01 AnPhA S116 01 60186 1.2: Q… Qz 86.3 20
## 8 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 50
## 9 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 10
## 10 AnPhA-S116-01 AnPhA S116 01 60186 1.4: N… Hw 86.3 50
## # … with 28,645 more rows, 2 more variables: Points_Earned <dbl>, Gender <chr>,
## # and abbreviated variable names ¹semester, ²student_id, ³Gradebook_Item,
## # ⁴Grade_Category, ⁵FinalGradeCEMS, ⁶Points_Possible
7.9.2.2 anti_join()
anti_join()
removes all of the rows in the “left” data frame that
can be joined with those in the “right” data frame.
dat_anti <-
  anti_join(course_data,
            pre_survey,
            by = c("student_id", "course_id"))

dat_anti
## # A tibble: 1,056 × 11
## course_id subject semes…¹ section stude…² Grade…³ Grade…⁴ Final…⁵ Point…⁶
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 AnPhA S116 01 85865 POINTS… <NA> 52.2 460
## 2 AnPhA-S116-01 AnPhA S116 01 85865 WORK A… <NA> 52.2 10
## 3 AnPhA-S116-01 AnPhA S116 01 85865 0.1: M… <NA> 52.2 10
## 4 AnPhA-S116-01 AnPhA S116 01 85865 0.2: I… Hw 52.2 249
## 5 AnPhA-S116-01 AnPhA S116 01 85865 0.3: I… Hw 52.2 5
## 6 AnPhA-S116-01 AnPhA S116 01 85865 1.1: Q… Qz 52.2 5
## 7 AnPhA-S116-01 AnPhA S116 01 85865 1.2: Q… Qz 52.2 10
## 8 AnPhA-S116-01 AnPhA S116 01 85865 1.3: C… Hw 52.2 10
## 9 AnPhA-S116-01 AnPhA S116 01 85865 1.3: C… Hw 52.2 10
## 10 AnPhA-S116-01 AnPhA S116 01 85865 1.4: N… Hw 52.2 125
## # … with 1,046 more rows, 2 more variables: Points_Earned <dbl>, Gender <chr>,
## # and abbreviated variable names ¹semester, ²student_id, ³Gradebook_Item,
## # ⁴Grade_Category, ⁵FinalGradeCEMS, ⁶Points_Possible
7.9.2.3 right_join()
Perhaps the least helpful of the three, right_join()
, works the same
as left_join()
, but by retaining all of the rows in the “right” data
frame, and joining matching rows in the “left” data frame (so, the
opposite of left_join()
).
dat_right <-
  right_join(course_data,
             pre_survey,
             by = c("student_id", "course_id"))

dat_right
## # A tibble: 39,593 × 21
## course_id subject semes…¹ section stude…² Grade…³ Grade…⁴ Final…⁵ Point…⁶
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 AnPhA S116 01 60186 POINTS… <NA> 86.3 5
## 2 AnPhA-S116-01 AnPhA S116 01 60186 WORK A… <NA> 86.3 30
## 3 AnPhA-S116-01 AnPhA S116 01 60186 0.1: M… <NA> 86.3 105
## 4 AnPhA-S116-01 AnPhA S116 01 60186 0.2: I… Hw 86.3 140
## 5 AnPhA-S116-01 AnPhA S116 01 60186 0.3: I… Hw 86.3 5
## 6 AnPhA-S116-01 AnPhA S116 01 60186 1.1: Q… Qz 86.3 5
## 7 AnPhA-S116-01 AnPhA S116 01 60186 1.2: Q… Qz 86.3 20
## 8 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 50
## 9 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 10
## 10 AnPhA-S116-01 AnPhA S116 01 60186 1.4: N… Hw 86.3 50
## # … with 39,583 more rows, 12 more variables: Points_Earned <dbl>,
## # Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, and abbreviated variable names
## # ¹semester, ²student_id, ³Gradebook_Item, ⁴Grade_Category, ⁵FinalGradeCEMS,
## # ⁶Points_Possible
If we wanted this to return exactly the same output as left_join()
(and, in doing so, to create a data frame that is identical to the dat
data frame we created above), we could simply switch the order of the
two data frames to be the opposite of those used for the left_join()
above:
dat_right <-
  right_join(pre_survey,
             course_data,
             by = c("student_id", "course_id"))

dat_right
## # A tibble: 40,348 × 21
## student…¹ cours…² q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 2 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 3 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 4 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 5 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 6 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 7 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 8 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 9 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## 10 85791 FrScA-… 3 3 3 3 4 3 3 3 2 2
## # … with 40,338 more rows, 9 more variables: subject <chr>, semester <chr>,
## # section <chr>, Gradebook_Item <chr>, Grade_Category <chr>,
## # FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
## # Gender <chr>, and abbreviated variable names ¹student_id, ²course_id
Now that we’ve gone through the different types of joins available, we
will return to our main focus: joining our course datasets together.
While we didn’t do any data processing steps on it, we still have the
course_minutes
dataset in our environment from when we loaded it there
earlier in the chapter. In the code chunk below, we will rename the
necessary variables in that dataset so that it is ready to merge. Then,
we will merge the course_minutes
dataset and its newly renamed
variables student_id
and course_id
with our dat
dataset.
course_minutes <-
  course_minutes %>%
  rename(student_id = Bb_UserPK,
         course_id = CourseSectionOrigID)

course_minutes <-
  course_minutes %>%
  # Change the data type for student_id in course_minutes so we can match to
  # student_id in dat
  mutate(student_id = as.integer(student_id))

dat <-
  dat %>%
  left_join(course_minutes,
            by = c("student_id", "course_id"))
Note that they’re now combined, even though the course data had many more rows. The pre-survey data has been joined for each student-course combination. We have a pretty large data frame! Let’s take a quick look.
dat
## # A tibble: 40,348 × 22
## course_id subject semes…¹ section stude…² Grade…³ Grade…⁴ Final…⁵ Point…⁶
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 AnPhA S116 01 60186 POINTS… <NA> 86.3 5
## 2 AnPhA-S116-01 AnPhA S116 01 60186 WORK A… <NA> 86.3 30
## 3 AnPhA-S116-01 AnPhA S116 01 60186 0.1: M… <NA> 86.3 105
## 4 AnPhA-S116-01 AnPhA S116 01 60186 0.2: I… Hw 86.3 140
## 5 AnPhA-S116-01 AnPhA S116 01 60186 0.3: I… Hw 86.3 5
## 6 AnPhA-S116-01 AnPhA S116 01 60186 1.1: Q… Qz 86.3 5
## 7 AnPhA-S116-01 AnPhA S116 01 60186 1.2: Q… Qz 86.3 20
## 8 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 50
## 9 AnPhA-S116-01 AnPhA S116 01 60186 1.3: C… Hw 86.3 10
## 10 AnPhA-S116-01 AnPhA S116 01 60186 1.4: N… Hw 86.3 50
## # … with 40,338 more rows, 13 more variables: Points_Earned <dbl>,
## # Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>, and abbreviated
## # variable names ¹semester, ²student_id, ³Gradebook_Item, ⁴Grade_Category,
## # ⁵FinalGradeCEMS, ⁶Points_Possible
It looks like we have 40,348 observations of 22 variables.
7.9.3 Finding Distinct Cases at the Student-Level
If a student was enrolled in two courses, she will have a different final grade for each of those two courses. However, our data in its current form has many rows for each student-course combination (one for every gradebook item). An easy way to visually inspect whether these rows are consistent for the same student and course is to use the glimpse() function. Try it below.
glimpse(dat)
## Rows: 40,348
## Columns: 22
## $ course_id <chr> "AnPhA-S116-01", "AnPhA-S116-01", "AnPhA-S116-01", "An…
## $ subject <chr> "AnPhA", "AnPhA", "AnPhA", "AnPhA", "AnPhA", "AnPhA", …
## $ semester <chr> "S116", "S116", "S116", "S116", "S116", "S116", "S116"…
## $ section <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", …
## $ student_id <dbl> 60186, 60186, 60186, 60186, 60186, 60186, 60186, 60186…
## $ Gradebook_Item <chr> "POINTS EARNED & TOTAL COURSE POINTS", "WORK ATTEMPTED…
## $ Grade_Category <chr> NA, NA, NA, "Hw", "Hw", "Qz", "Qz", "Hw", "Hw", "Hw", …
## $ FinalGradeCEMS <dbl> 86.3, 86.3, 86.3, 86.3, 86.3, 86.3, 86.3, 86.3, 86.3, …
## $ Points_Possible <dbl> 5, 30, 105, 140, 5, 5, 20, 50, 10, 50, 5, 5, 24, 10, 1…
## $ Points_Earned <dbl> 4.05, 24.00, 71.67, 140.97, 5.00, 4.00, NA, 50.00, NA,…
## $ Gender <chr> "F", "F", "F", "F", "M", "F", "F", "F", "F", "F", "M",…
## $ q1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q2 <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ q3 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q4 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q5 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q6 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q7 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q8 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q9 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ q10 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ TimeSpent <dbl> 2087, 2087, 2087, 2087, 2087, 2087, 2087, 2087, 2087, …
You can also use View(dat)
in order to view the data in RStudio’s
viewer.
Visually inspecting the first several rows of data, we see that they all
correspond to the same student for the same course. As we expected, the
FinalGradeCEMS
variable (representing students’ final grade) is also
consistent across these rows.
Since we are not carrying out a finer-grained analysis using the
Gradebook_Item
, these duplicate rows are not necessary. We only want
variables at the student level and not at the level of different
gradebook items. We can extract only the unique student-level data using
the distinct()
function. This function takes as arguments the name of
the data frame and the name of the variables used to determine what
counts as a unique case.
Imagine having a bucket of Halloween candy that has 100 pieces of candy.
You know that these 100 pieces are really just a bunch of duplicate
pieces from a relatively short list of candy brands. distinct()
takes
that bucket and returns one containing a single piece from each distinct
“brand”.
Another thing to note about distinct()
is that it will only return the
variable(s)—it is possible to pass more than one variable to
distinct()
—you used to determine uniqueness unless you include the
argument .keep_all = TRUE
. For the sake of making it simple to view
the output, we will omit this argument for now.
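Here is a toy sketch of that behavior with made-up data:
toy <- tibble(student = c(1, 1, 2), grade = c(90, 90, 80))

# Returns a one-column data frame with the distinct values 1 and 2
distinct(toy, student)

# Returns one full row per student, keeping the grade column as well
distinct(toy, student, .keep_all = TRUE)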
Were we to run distinct(dat, Gradebook_Item)
, what do you think would
be returned? Running the following code returns a one-column data frame
that lists the names of every distinct gradebook item.
distinct(dat, Gradebook_Item)
## # A tibble: 222 × 1
## Gradebook_Item
## <chr>
## 1 POINTS EARNED & TOTAL COURSE POINTS
## 2 WORK ATTEMPTED
## 3 0.1: Message Your Instructor
## 4 0.2: Intro Assignment - Discussion Board
## 5 0.3: Intro Assignment - Submitting Files
## 6 1.1: Quiz
## 7 1.2: Quiz
## 8 1.3: Create a Living Creature
## 9 1.3: Create a Living Creature - Discussion Board
## 10 1.4: Negative Feedback Loop Flowchart
## # … with 212 more rows
You might be wondering whether some gradebook items have the same names
across courses. We can return the unique combination of courses and
gradebook items by simply adding another variable to distinct()
:
distinct(dat, course_id, Gradebook_Item)
## # A tibble: 1,269 × 2
## course_id Gradebook_Item
## <chr> <chr>
## 1 AnPhA-S116-01 POINTS EARNED & TOTAL COURSE POINTS
## 2 AnPhA-S116-01 WORK ATTEMPTED
## 3 AnPhA-S116-01 0.1: Message Your Instructor
## 4 AnPhA-S116-01 0.2: Intro Assignment - Discussion Board
## 5 AnPhA-S116-01 0.3: Intro Assignment - Submitting Files
## 6 AnPhA-S116-01 1.1: Quiz
## 7 AnPhA-S116-01 1.2: Quiz
## 8 AnPhA-S116-01 1.3: Create a Living Creature
## 9 AnPhA-S116-01 1.3: Create a Living Creature - Discussion Board
## 10 AnPhA-S116-01 1.4: Negative Feedback Loop Flowchart
## # … with 1,259 more rows
The data frame we get when we run the code chunk above yields a much longer (more observations) dataset. Thus, it looks like a lot of gradebook items were repeated across courses—likely across the different sections of the same course. If you’d like, you can continue to investigate this: we would be curious to hear what you find if you do!
Next, let’s use a similar process to find the unique values at the
student level. Thus, instead of exploring unique gradebook items, we
will explore unique students (still accounting for the course, as
students could enroll in more than one course). This time, we will add
the .keep_all = TRUE argument.
dat <-
  distinct(dat, course_id, student_id, .keep_all = TRUE)
This is a much smaller data frame—with one row for each student in the
course. Whereas our prior version of the dat
dataset had over 40,000
rows, that prior version is only helpful if we wanted to do an analysis
at the level of specific students’ grades for specific gradebook items.
Our new dataset keeps only the unique combinations of student and
course, leaving us with a more manageable number of observations:
603. Now that our data is ready to go, we can start to ask
some questions about it.
Let’s take one last step. Since we will be using the final grade
variable in many of the figures and analyses that follow, let’s rename
it using the {dplyr} rename()
function to something that is a bit
easier to type and remember than FinalGradeCEMS
:
dat <- rename(dat, final_grade = FinalGradeCEMS)
7.10 Analysis
In this section, we focus on some initial analyses in the form of visualizations and some models. We expand on these in Chapter 13. Before we start visualizing relationships between variables in our survey dataset, let’s introduce {ggplot2}, a visualization package we’ll be using in our walkthroughs.
7.10.1 About {ggplot2}
{ggplot2} is a package we’ll be using a lot for graphing and visualizing our education datasets. It is designed to build graphs layer by layer, where each layer is a building block for your graph. Making graphs in layers is useful because we can think of building up our graphs in separate parts: the data comes first, then the x-axis and y-axis, and finally other components like text labels and graph shapes. When something goes wrong and your {ggplot2} code returns an error, you can learn about what’s happening by removing one layer at a time and running it again until the code works properly. Once you know which line is causing the problem, you can focus on fixing it.
The first two lines of {ggplot2} code look similar for most graphs. The
first line tells R which dataset to graph and which columns the x-axis
and y-axis will represent. The second line tells R which shape to use
when drawing the graph. You can tell R which shape to use in your graphs
with a family of {ggplot2} functions that start with geom_
. {ggplot2}
has many graph shapes you can use, including points, bars, lines, and
boxplots. Here’s a {ggplot2} example using a dataset of school mean test
scores to graph a bar chart:
# Create dataset
students <-
  tibble(
    school_id = c("a", "b", "c"),
    mean_score = c(10, 20, 30)
  )

# Tell R which dataset to plot and which columns the x-axis and y-axis will represent
students %>%
  ggplot(aes(x = school_id, y = mean_score)) +
  # draw the plot
  geom_bar(stat = "identity",
           fill = dataedu_colors("darkblue")) +
  theme_dataedu()

Figure 7.1: Example Plot
The data
argument in the first line tells R we’ll be using the dataset
called students
. The aes
argument tells R we’ll be using values from
the school_id
column for the x-axis and values from the mean_score
column for the y-axis. In the second line, the geom_bar
function tells
R we’ll be drawing the graph using the bar chart format. Each line of
{ggplot2} code is connected by a +
at the end to tell R the next line
of code is an additional {ggplot2} layer to add.
Writing code is like writing essays. There’s a range of acceptable styles and certainly, you can practice unusual ways of writing, but other people will find it harder to understand what you want to say. In this book, you’ll see variations in {ggplot2} style, but all within what we believe is the range of acceptable conventions. Here are some examples:
- Piping data to ggplot() using %>% vs. including it as an argument in ggplot()
- Using ggtitle() for labels vs. using labs()
- Order of ggplot() layers
It’s okay if those terms are new to you. The main point is there are multiple ways to make the plot you want. You’ll see that in this book and in other peoples’ code. As you learn, we encourage you to practice empathy and think about how well your code conveys your ideas to others, including yourself when you look at it many weeks from when you wrote it.
7.10.2 The Relationship between Time Spent on Course and Final Grade
One thing we might be wondering is how time spent on the course is
related to students’ final grade. Let’s make a plot to depict that
relationship. Below, we’ll use geom_point
instead of geom_bar
.
dat %>%
  # aes() tells ggplot2 what variables to map to what feature of a plot
  # Here we map variables to the x- and y-axis
  ggplot(aes(x = TimeSpent, y = final_grade)) +
  # Creates a point with x- and y-axis coordinates specified above
  geom_point(color = dataedu_colors("green")) +
  theme_dataedu() +
  labs(x = "Time Spent",
       y = "Final Grade")

Figure 7.2: Percentage Earned vs. Time Spent
Note: you may receive a warning that reads
Warning message: Removed 5 rows containing missing values (geom_point).
This is due to the NA
values that were introduced through coercion
earlier in this walkthrough and are not a cause for alarm!
There appears to be some relationship. What if we added a line of best
fit—a linear model? The code below is the same plot we just made, but
it includes the addition of another layer called geom_smooth
.
dat %>%
  ggplot(aes(x = TimeSpent, y = final_grade)) +
  # same as above
  geom_point(color = dataedu_colors("green")) +
  # this adds a line of best fit
  # method = "lm" tells ggplot2 to fit the line using linear regression
  geom_smooth(method = "lm") +
  theme_dataedu() +
  labs(x = "Time Spent",
       y = "Final Grade")

Figure 7.3: Adding a Line of Best Fit
Looking at this plot, it appears that the more time students spent on the course, the higher their final grade is.
What is the line doing in the upper right part of the graph? Based upon the trend observable in the data, the line of best fit predicts that students who spend a particular amount of time on the course earn greater than 100 for their final grade! Of course, this is not possible, and it highlights the importance of understanding your data and interpreting lines of best fit (and other, more sophisticated analyses) carefully, keeping that understanding and knowledge in mind as you present and make sense of the results.
7.10.3 Linear Model (Regression)
We can find out exactly what the relationship between these two variables is using a linear model. We discuss linear models in more detail in Chapter 10.
Let’s use this technique to model the relationship between the time
spent on the course and students’ final grades. Here, we predict
final_grade
. The student’s final, reported grade is the dependent, or
y-variable, and so we enter it first, after the lm()
command and
before the tilde (~
) symbol. To the right of the tilde is one
independent variable, TimeSpent
, or the time that students spent on
the course. We also pass, or provide, the data frame, dat
. At this
point, we’re ready to run the model. Let’s run this line of code and
save the results to an object—we chose m_linear
, but any name will
work. We will then run the summary()
function on the output.
m_linear <-
  lm(final_grade ~ TimeSpent, data = dat)

summary(m_linear)
##
## Call:
## lm(formula = final_grade ~ TimeSpent, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.14 -7.80 4.72 14.47 30.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.58e+01 1.49e+00 44.13 <2e-16 ***
## TimeSpent 6.08e-03 6.48e-04 9.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.7 on 571 degrees of freedom
## (30 observations deleted due to missingness)
## Multiple R-squared: 0.134, Adjusted R-squared: 0.132
## F-statistic: 88 on 1 and 571 DF, p-value: <2e-16
Another way that we can generate table output is with a function from
the {sjPlot} package, tab_model()
. When you run this code, you should
see the results pop up in the “Viewer” pane of RStudio. If you haven’t
changed the default settings, this will be in the lower right quadrant
of your screen.
tab_model(m_linear,
title = "Table 7.1")
| final grade | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 65.81 | 62.88 – 68.74 | <0.001 |
| TimeSpent | 0.01 | 0.00 – 0.01 | <0.001 |
| Observations | 573 | | |
| R2 / R2 adjusted | 0.134 / 0.132 | | |
This will work well for R Markdown documents (or simply to interpret the
model in R). If you want to save the model for use in a Word document,
the
{apaTables}
(https://cran.r-project.org/web/packages/apaTables/vignettes/apaTables.html)
package may be helpful. To save a table in Word format, just pass the name of the regression model to the apa.reg.table() function from the {apaTables} package, as we did with the tab_model() function. Then, you can save the output to a Word document simply by adding a filename argument:
apa.reg.table(m_linear, filename = "regression-table-output.doc")
You might be wondering what else the {apaTables} package does. We encourage you to read more about the package here: https://cran.r-project.org/web/packages/apaTables/index.html. The vignette is especially helpful. One function that may be useful for writing manuscripts is the following function for creating correlation tables. This function takes, as an input, a data frame with the variables for which you wish to calculate correlations.
Before we proceed to the next code chunk, let’s talk about some
functions we’ll be using a lot in this book. filter()
, group_by()
,
and summarize()
are functions in the {dplyr} package that you will see
a lot in upcoming chapters. You got a preview of these functions earlier
in this chapter, and now that you’ve seen how they are used, we want to
provide clear definitions for each of these functions.
- filter() removes rows from the dataset that don't match our criteria. Use it for tasks like only keeping records for students in the fifth grade
- group_by() groups records together so you can perform operations on those groups instead of on the entire dataset. Use it for tasks like getting the mean test score of each school instead of a whole school district
- summarize() and summarise() reduce your dataset down to a summary statistic. Use it for tasks like turning a dataset of student test scores into a dataset of grade levels and their mean test score (see the sketch below)
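As a quick illustration before we return to the survey data, here is a minimal sketch with made-up test score data that strings all three functions together:
test_scores <- tibble(
  school = c("a", "a", "b", "b"),
  grade_level = c(5, 6, 5, 5),
  score = c(80, 90, 70, 76)
)

test_scores %>%
  # keep only the fifth-grade records
  filter(grade_level == 5) %>%
  # compute summaries within each school rather than across all schools
  group_by(school) %>%
  # reduce each school's rows to a single mean score
  summarize(mean_score = mean(score))
# school "a" has one fifth-grade score (80); school "b" has two (mean 73)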
Now let’s use these {dplyr} functions on our survey analysis. We will create the same measures (based on the survey items) that we used earlier to understand how they relate to one another.
survey_responses <-
  pre_survey %>%
  # Gather questions and responses
  pivot_longer(cols = q1:q10,
               names_to = "question",
               values_to = "response") %>%
  mutate(
    # Here's where we make the column of question categories
    measure = case_when(
      question %in% c("q1", "q4", "q5", "q8", "q10") ~ "int",
      question %in% c("q2", "q6", "q9") ~ "uv",
      question %in% c("q3", "q7") ~ "pc",
      TRUE ~ NA_character_
    )
  ) %>%
  group_by(student_id, measure) %>%
  # Here's where we compute the mean of the responses
  summarize(
    # Mean response for each measure
    mean_response = mean(response, na.rm = TRUE)
  ) %>%
  # Filter NA (missing) responses
  filter(!is.na(mean_response)) %>%
  pivot_wider(names_from = measure,
              values_from = mean_response)
## `summarise()` has grouped output by 'student_id'. You can override using the
## `.groups` argument.
survey_responses
## # A tibble: 515 × 4
## # Groups: student_id [515]
## student_id int pc uv
## <dbl> <dbl> <dbl> <dbl>
## 1 43146 5 4.5 4.33
## 2 44638 4.2 3.5 4
## 3 47448 5 4 3.67
## 4 47979 5 3.5 5
## 5 48797 3.8 3.5 3.5
## 6 49147 4.25 3.73 3.71
## 7 51943 4.6 4 4
## 8 52326 5 3.5 5
## 9 52446 3 3 3.33
## 10 53248 4 3 3.33
## # … with 505 more rows
Now that we’ve prepared the survey responses, we can use the
apa.cor.table()
function:
survey_responses %>%
  apa.cor.table()
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2 3
## 1. student_id 85966.07 10809.12
##
## 2. int 4.22 0.59 .00
## [-.08, .09]
##
## 3. pc 3.60 0.64 .04 .59**
## [-.05, .13] [.53, .64]
##
## 4. uv 3.71 0.71 .02 .57** .50**
## [-.06, .11] [.51, .62] [.43, .56]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
The time spent variable is on a very large scale (minutes); what if we
transformed it to represent the number of hours that students spent on
the course? Let’s use the mutate()
function we used earlier. We’ll end
the variable name with _hours
, to represent what this variable means.
# creating a new variable for the amount of time spent in hours
dat <-
  dat %>%
  mutate(TimeSpent_hours = TimeSpent / 60)

# the same linear model as above, but with the TimeSpent variable in hours
m_linear_1 <-
  lm(final_grade ~ TimeSpent_hours, data = dat)

# viewing the output of the linear model
tab_model(m_linear_1,
          title = "Table 7.2")
| final grade | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 65.81 | 62.88 – 68.74 | <0.001 |
| TimeSpent hours | 0.36 | 0.29 – 0.44 | <0.001 |
| Observations | 573 | | |
| R2 / R2 adjusted | 0.134 / 0.132 | | |
The scale still does not seem quite right. What if we standardized the variable to have a mean of zero and a standard deviation of one?
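Before running it, note what scale() does with its default arguments: it subtracts the mean and then divides by the standard deviation. Here is a quick sketch of that equivalence on made-up values:
x <- c(10, 20, 30, 40)

scale(x)
# equivalent (apart from the matrix attributes scale() adds) to:
(x - mean(x)) / sd(x)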
# this is to standardize the TimeSpent variable to have a mean of 0 and a standard deviation of 1
dat <-
  dat %>%
  mutate(TimeSpent_std = scale(TimeSpent))

# the same linear model as above, but with the TimeSpent variable standardized
m_linear_2 <-
  lm(final_grade ~ TimeSpent_std, data = dat)

# viewing the output of the linear model
tab_model(m_linear_2,
          title = "Table 7.3")
| final grade | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 76.75 | 75.05 – 78.45 | <0.001 |
| TimeSpent std | 8.24 | 6.51 – 9.96 | <0.001 |
| Observations | 573 | | |
| R2 / R2 adjusted | 0.134 / 0.132 | | |
When we look at this output, it seems to make more sense. However, there is a different interpretation now for the time spent variable: for every one standard deviation increase in the amount of time spent on the course, students’ final grades increased by 8.24, or around eight percentage points.
7.11 Results
Let’s extend our regression model and consider the following to be the final model in this sequence: What other variables may matter? Perhaps there are differences based on the subject of the course. We can add subject as a variable easily, as follows:
# a linear model with the subject added
# independent variables, such as TimeSpent_std and subject, can simply be separated with a plus symbol:
m_linear_3 <-
  lm(final_grade ~ TimeSpent_std + subject, data = dat)
We can use tab_model()
once again to view the results:
tab_model(m_linear_3,
title = "Table 7.4")
| final grade | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 70.19 | 66.76 – 73.61 | <0.001 |
| TimeSpent std | 9.63 | 7.90 – 11.37 | <0.001 |
| subject [BioA] | -1.56 | -8.64 – 5.52 | 0.665 |
| subject [FrScA] | 11.73 | 7.38 – 16.08 | <0.001 |
| subject [OcnA] | 1.10 | -3.96 – 6.16 | 0.670 |
| subject [PhysA] | 16.04 | 10.00 – 22.07 | <0.001 |
| Observations | 573 | | |
| R2 / R2 adjusted | 0.213 / 0.206 | | |
It looks like the subjects FrScA (forensic science) and PhysA (physics) are associated with a higher final grade. This indicates that students in those two classes earned higher grades than students in other science classes in this dataset.
7.12 Conclusion
In this walkthrough, we focused on taking unprocessed or raw data and loading, viewing, and then processing it through a series of steps. The result was a dataset which we could use to create visualizations and a simple (but powerful!) linear model, also known as a regression model. We found that the time that students spent on the course was positively (and statistically significantly) related to students’ final grades and that there appeared to be differences by subject. While we focused on using this model in a traditional, explanatory sense, it could also potentially be used for predictive analytics in that knowing how long a student spent on the course and what subject their course is could be used to estimate what that student’s final grade might be. We focus on the uses of predictive models further in Chapter 14.
In the follow-up to this walkthrough (see Chapter 13), we will focus on visualizing and then modeling the data using an advanced methodological technique, multilevel models, using the data we prepared as a part of the data processing pipeline used in this chapter.