14 Walkthrough 8: Predicting student pass/fail outcomes using supervised machine learning
Abstract
This chapter introduces an increasingly common analytic approach: machine learning. Specifically, it focuses on supervised machine learning, which involves identifying an outcome (or dependent) variable and then training a model to predict that outcome. Some supervised machine learning models are highly complex, while others are simple. To illustrate the concept, this chapter predicts whether students will pass a class using a generalized linear model. The Open University Learning Analytics Dataset (OULAD), a widely used dataset in learning analytics, serves as the example data. The {tidymodels} collection of packages is used to carry out the principal supervised machine learning steps. At the conclusion, ways to build more complex models are discussed.
14.2 Functions introduced
- stats::quantile()
- rsample::initial_split()
- rsample::training()
- rsample::testing()
- recipes::recipe()
- parsnip::logistic_reg()
- parsnip::set_engine()
- parsnip::set_mode()
- workflows::workflow()
- tune::collect_predictions()
- tune::collect_metrics()
14.3 Vocabulary
We introduce these key terms in the chapter:
- supervised machine learning
- training data
- testing data
- logistic regression
- classification
- predictions
- metrics
14.4 Chapter overview
14.4.1 Background
In a face-to-face classroom, teachers rely on cues from students, such as facial expressions and questions, to gauge engagement. In online classrooms, these cues aren’t always available.
For example, in a face-to-face class, teachers can adjust when they notice that students are distracted. Many online educators look for ways to understand and support students online in the same way that face-to-face instructors would. Technology provides new methods of collecting and storing data that can serve as the basis for this kind of student support.
Online Learning Management Systems (LMSs) automatically track student interactions with the system and feed that data back to the course instructor. The collection of this data is often met with mixed reactions from educators. Some are concerned that this kind of data collection is intrusive, but others see a new opportunity to support students in online classrooms. As long as data is collected and used responsibly, this kind of data collection can support student success.
In this walkthrough, you’ll examine the question, How well can we predict students who are at risk of dropping a course? To answer this question, you’ll use typical learning analytics data—student records, assessment outcomes, course outcomes, and measures of students’ interactions with the course.
Here’s the key shift in thinking you’ll make when using a supervised machine learning model: you’ll focus on predicting an outcome, like whether a student passes a course. This is different from explaining how variables influence an outcome, like how course time relates to final grades. You’ll learn to do this with a generalized linear model, specifically logistic regression.
14.4.2 Data sources
We’ll be using a widely-used dataset in the learning analytics field: the Open University Learning Analytics Dataset (OULAD). The OULAD was created by learning analytics researchers at the United Kingdom-based Open University (Kuzilek et al., 2017). It includes data from post-secondary learners’ participation in one of several Massive Open Online Courses. These courses are called modules in the OULAD.
Many students successfully complete these courses, but not all do. This highlights the importance of identifying those who may be at risk.
Our analysis will use three datasets: oulad_students, oulad_assessments, and oulad_interactions_filtered. The oulad_students dataset has undergone minimal preprocessing to streamline the analysis. It combines information from three sources that relate to students and the courses they took: studentInfo, courses, and studentRegistration. The oulad_assessments file provides data on students’ performance on various assessments throughout their courses.
14.4.3 Methods
14.4.3.1 Predictive analytics and supervised machine learning
A buzzword in education software spheres these days is “predictive analytics”. Administrators and educators alike are interested in applying the methods long utilized by marketers and other business professionals to predict what a person will want, need, or do. “Predictive analytics” is a blanket term that describes any statistical approach that yields a prediction. You could ask a predictive model: “What is the likelihood that my cat will sit on my keyboard today?” and, given enough past information about your cat’s computer-sitting behavior, the model could estimate the probability of that computer-sitting happening. Under the hood, some predictive models are not complex. In this chapter, you’ll use one of these simpler predictive models called logistic regression.
There is an adage: “garbage in, garbage out”. This holds true here. If you do not feel confident that the data you collected is accurate, you won’t feel confident in your conclusions, regardless of the model. To collect good data, you must first identify what you want to learn and what information you need to learn it.
Sometimes, people approach analysis from the opposite direction—they look at the data they have and identify questions that could be answered by it. That approach is okay, as long as you acknowledge that the pre-existing dataset may not contain all the information you need to answer your specific questions. You might need to find additional information to add to your dataset to truly answer the questions you have.
At its core, machine learning is the process of training a model to predict accurately on a training dataset (this is the “learning” part of machine learning). Then, this newly trained model is used on new data. At this point, you’ll evaluate how well the model works on data that is not the training data.
Now you’ll dive into the analysis, starting with something you’re familiar with—loading packages.
14.5 Load packages
If you have not installed any of these packages before, do so first using the install.packages() function. For a description of packages and their installation, review the Packages section of the Foundational Skills chapter.
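As a quick sketch (assuming these are the packages used throughout this walkthrough), load them with library():

library(tidyverse)  # Data wrangling and visualization
library(tidymodels) # The supervised machine learning steps
library(dataedu)    # The OULAD datasets and the plotting helpers used in this book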
14.7 Preprocessing and feature engineering
Your goal in this walkthrough is to build a model that predicts whether a student is at risk of dropping out. You will handle some feature engineering directly. Later, you will instead handle feature engineering through something called a “recipe.”
How should you decide when to handle feature engineering directly or through a recipe? As you progress in your practice, you’ll be able to determine if it is more efficient to work outside of a recipe. You’ll also get more proficient at determining risks of working outside a recipe, like introducing “data leakage”, which can bias a model. For now, this will not be an issue for simpler models, like the one you’ll be using in this walkthrough.
14.7.1 Step 1: Creating outcome and predictor variables outside the recipe
To begin, create the outcome variable (pass) and a factor variable for disability using mutate():
students <-
  students %>%
  mutate(pass = case_when(final_result == "Pass" ~ 1,
                          .default = 0)) %>%
  mutate(pass = as.factor(pass),
         disability = as.factor(disability))

You will also summarize assessment data to create a new predictor for students’ performance on assessments submitted early in the course. Specifically, you’ll calculate the mean weighted score of assessments that occurred before the halfway point (the median) of the assessment dates for each module and presentation.
code_module_dates <-
  assessments %>%
  group_by(code_module, code_presentation) %>%
  summarize(quantile_cutoff_date = quantile(date, probs = .5, na.rm = TRUE))

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by code_module and code_presentation.
## ℹ Output is grouped by code_module.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(code_module, code_presentation))` for per-operation
##   grouping (`?dplyr::dplyr_by`) instead.
assessments_joined <-
  assessments %>%
  left_join(code_module_dates) %>%
  filter(date < quantile_cutoff_date) %>%
  mutate(weighted_score = score * weight) %>%
  group_by(id_student) %>%
  summarize(mean_weighted_score = mean(weighted_score, na.rm = TRUE))

## Joining with `by = join_by(code_module, code_presentation)`
Last, you will create a socioeconomic status variable (imd_band), again outside the recipe:
students <-
  students %>%
  mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                "10-20%",
                                                "20-30%",
                                                "30-40%",
                                                "40-50%",
                                                "50-60%",
                                                "60-70%",
                                                "70-80%",
                                                "80-90%",
                                                "90-100%"))) %>%
  mutate(imd_band = as.factor(imd_band))

Next, you’ll load a new file with interactions, or log-trace, data. This is the most granular data in the OULAD. In the OULAD documentation, this is called the virtual learning environment (VLE) data source. It’s a large file—even after taking a few steps to reduce its size. The file was prepared by only including interactions for the first one-third of the course.
Import the data using the {dataedu} package.
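As a minimal sketch, assuming {dataedu} exposes the interactions file under the name listed in the data sources above (oulad_interactions_filtered), and storing it as interactions to match the code that follows:

# Load the filtered interactions (VLE) data from the {dataedu} package
interactions <- dataedu::oulad_interactions_filtered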
You will now explore the dataset to understand it better.
First, count() the activity_type variable and sort the resulting output by frequency.
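Here is one way to do that, using the interactions object from the import step:

# Count how often each activity type appears, sorted from most to least frequent
interactions %>%
  count(activity_type, sort = TRUE)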
## activity_type n
## 1 forumng 1279917
## 2 subpage 1104279
## 3 oucontent 1065736
## 4 homepage 832424
## 5 resource 436704
## 6 quiz 398966
## 7 url 232573
## 8 ouwiki 66413
## 9 page 33539
## 10 oucollaborate 25861
## 11 externalquiz 18171
## 12 questionnaire 16528
## 13 ouelluminate 13829
## 14 glossary 9630
## 15 dualpane 7306
## 16 htmlactivity 6562
## 17 dataplus 311
## 18 sharedsubpage 103
## 19 repeatactivity 6
You can see that students engage in a wide range of activity types. You may wish to explore this data in other ways, even beyond what you do for this exercise.
Think about how you would create a feature with sum_click. Think back to our discussion earlier; we have options for working with such time series data. Perhaps the simplest is to count the clicks.
Please summarize the number of clicks for each student, specific to a single course. This means you will need to group your data by id_student, code_module, and code_presentation before creating a summary variable.
interactions_summarized <-
  interactions %>%
  group_by(id_student, code_module, code_presentation) %>%
  summarize(sum_clicks = sum(sum_click))

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by id_student, code_module, and
##   code_presentation.
## ℹ Output is grouped by id_student and code_module.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(id_student, code_module, code_presentation))` for
##   per-operation grouping (`?dplyr::dplyr_by`) instead.
## # A tibble: 29,160 × 4
## # Groups: id_student, code_module [28,192]
## id_student code_module code_presentation sum_clicks
## <int> <chr> <chr> <int>
## 1 6516 AAA 2014J 999
## 2 8462 DDD 2013J 516
## 3 8462 DDD 2014J 10
## 4 11391 AAA 2013J 528
## 5 23629 BBB 2013B 84
## 6 23698 CCC 2014J 503
## 7 23798 BBB 2013J 277
## 8 24186 GGG 2014B 118
## 9 24213 DDD 2014B 642
## 10 24391 GGG 2013J 424
## # ℹ 29,150 more rows
How many times did students click? Create a histogram to see. Please use {ggplot2} and geom_histogram() to visualize the distribution of the sum_clicks variable you just created.
interactions_summarized %>%
  ggplot(aes(x = sum_clicks)) +
  geom_histogram(fill = dataedu_colors("darkblue"),
                 color = "black") +
  theme_dataedu()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is a good start — you’ve created the first feature based on the log-trace data, sum_clicks. What are some other features you can add? A benefit of using the summarize() function in R is that you can create multiple summary statistics at once.
Try calculating the standard deviation and mean of the number of clicks. Do this by copying the code you wrote above into the code chunk below and then add these two additional summary statistics.
interactions_summarized <-
  interactions %>%
  group_by(id_student, code_module, code_presentation) %>%
  summarize(sum_clicks = sum(sum_click),
            sd_clicks = sd(sum_click),
            mean_clicks = mean(sum_click))

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by id_student, code_module, and
##   code_presentation.
## ℹ Output is grouped by id_student and code_module.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(id_student, code_module, code_presentation))` for
##   per-operation grouping (`?dplyr::dplyr_by`) instead.
Now join all of the data we’ll use for our modeling: students, assessments_joined, and interactions_summarized.
Use left_join() twice, assigning the resulting output the name students_and_interactions.
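Here is a sketch of those joins, assuming left_join() is allowed to detect the shared key columns (it will print a message naming them):

students_and_interactions <-
  students %>%
  # Add each student's mean weighted score on early assessments
  left_join(assessments_joined) %>%
  # Add each student's click summaries for the course they took
  left_join(interactions_summarized)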
This is a lot of joining. Sometimes, the hardest part of complex analyses lies in the preparation of data.
14.7.2 Step 2: Splitting the data
As suggested above, a key step in supervised machine learning is splitting data into “training” and “testing” datasets. You’ll use the training dataset to train the model, and then you’ll use the testing dataset to evaluate the model’s performance on data it has not seen before, using the predictor variables to predict the outcome. The outcome in this walkthrough is whether students pass the course in question.
We’ll split the dataset into training and testing sets using an 80-20 split. Generally, a split like this is appropriate with a larger dataset; for smaller datasets, something closer to a 50-50 split may be more appropriate. We’ll also conduct a stratified sample using the outcome variable — here, pass; this is generally a good practice (Boehmke & Greenwell, 2019).
set.seed(2025) # As this step involves a random sample, setting the seed ensures the same result for pedagogical purposes

# Specify the proportion for the split
train_test_split <-
  initial_split(students_and_interactions, prop = 0.8, strata = "pass")

# Create the training data
data_train <-
  training(train_test_split)

# Create the testing data
data_test <-
  testing(train_test_split)

14.7.3 Step 3: Creating a recipe for selected preprocessing steps
This is the recipe step mentioned earlier in the walkthrough. You’ll do two things here.
First, you’ll specify which predictor variables predict the outcome. For those familiar with the lm() function in R, this behaves similarly; the outcome goes on the left side of the ~, and predictors go on the right. Note that you’ll be using the training data for this.
Second, you’ll use step_() functions, which are used for preprocessing. These are described in the comments below.
my_rec <-
  recipe(pass ~ disability + imd_band + mean_weighted_score +
           num_of_prev_attempts + gender + region + highest_education +
           sum_clicks + sd_clicks + mean_clicks,
         data = data_train) %>%
  # This step is to impute missing values for numeric variables
  step_impute_mean(mean_weighted_score, sum_clicks, sd_clicks, mean_clicks) %>%
  # This step is to impute missing values for categorical/factor variables
  step_impute_mode(imd_band) %>%
  # Center and scale these variables
  step_center(mean_weighted_score, num_of_prev_attempts) %>%
  step_scale(mean_weighted_score, num_of_prev_attempts) %>%
  # Dummy code all categorical/factor predictors
  step_dummy(all_nominal_predictors(), -all_outcomes())

Inspect the recipe to verify the steps you have specified.
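You can do this by printing the recipe object, that is, by evaluating its name:

my_rec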
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 10
##
## ── Operations
## • Mean imputation for: mean_weighted_score, sum_clicks, sd_clicks, ...
## • Mode imputation for: imd_band
## • Centering for: mean_weighted_score num_of_prev_attempts
## • Scaling for: mean_weighted_score num_of_prev_attempts
## • Dummy variables from: all_nominal_predictors() -all_outcomes()
14.7.4 Step 4: Specifying the model and workflow
Next, specify a logistic regression model and bundle the recipe and model into a workflow.
This step has a lot of pieces, but they are fairly boilerplate. First, specify the model:
my_mod <-
  logistic_reg() %>% # Specifies the type of model
  set_engine("glm") %>% # Specifies the package we use to estimate the model
  set_mode("classification") # Specifies that we are predicting a dichotomous, categorical (factor) outcome

Next, specify the workflow, which will stitch the recipe and model together.
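Here is a minimal sketch of that step, assuming the workflow object is named my_wf (a name reused in the fitting step below):

my_wf <-
  workflow() %>% # Creates an empty workflow
  add_recipe(my_rec) %>% # Adds the preprocessing recipe
  add_model(my_mod) # Adds the logistic regression model specification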
You’re almost there!
14.7.5 Step 5: Fitting the model
Now you’ll fit the model to the training data. First, you’ll need to specify which metrics you want to calculate. These are statistics that will help you understand how good the model is at making predictions. Do this with metric_set(), specifying the familiar accuracy, precision, and recall.
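A sketch of that step, assuming the metric set is stored as class_metrics; accuracy(), precision(), and recall() come from {yardstick}, which loads with {tidymodels}:

# Bundle the metrics used to evaluate the model
class_metrics <- metric_set(accuracy, precision, recall)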
Finally, you’ll fit the model. Do this by calling the last_fit() function on the workflow and the split specification of the data. You’ll also specify which metrics to use.
Note that while there are other fitting functions available, you’re using last_fit() to do two steps at once: fitting the model to the training data, then using the test data to evaluate the model’s performance.
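A sketch of the fitting step, assuming the object names introduced above (my_wf for the workflow and class_metrics for the metric set) and storing the result as final_fit:

final_fit <-
  last_fit(my_wf,
           train_test_split, # The split object, so last_fit() knows which rows are training and which are testing
           metrics = class_metrics)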
Other times, you may use cross-validation, where you split the training data many times and fit the model to each of those splits, examining the performance of the model only after you have trained your last model. For more about this technique, consider reading Chapter 2 of Boehmke & Greenwell (2019).
14.7.6 Step 6: Evaluating model performance
Finally, evaluate the model’s performance using the test set. The collect_metrics() function from {tune} makes this easy.
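Assuming the fitted object from last_fit() was stored as final_fit, pass it to collect_metrics():

final_fit %>%
  collect_metrics()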
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.640 pre0_mod0_post0
## 2 precision binary 0.647 pre0_mod0_post0
## 3 recall binary 0.926 pre0_mod0_post0
How did the model do? Focus on accuracy for now: you can see the model correctly predicted whether students passed around 64% of the time.
But the model seemed to perform differently for students who passed versus those who did not. You can see this by looking at the precision and recall metrics. The recall value of .926 tells us that, of the students who actually passed the course, the model correctly identified about 93% of them. The precision tells us that when the model predicted a student passed the course, it was correct about 65% of the time. In other words, the model tended to make false positive predictions.
Herein lies the value of metrics other than accuracy: they can help you understand how the model is performing for different outcomes: false positives or false negatives may matter more or less depending on the context. Your knowledge as the analyst and researcher is critical in determining whether the model is “good enough” for your purposes.
On that note, how could you improve the model? One affordance of {tidymodels} is that you can easily switch the model and engine. Try swapping in a random forest or a boosted tree model, as in the sketch below, and see how the predictive performance changes. How much better did the model do with these?
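Here is a sketch of what those swaps could look like; only the model specification changes, while the recipe and workflow steps stay the same. This assumes the {ranger} and {xgboost} packages are installed for the engines named here.

# A random forest specification
my_rf_mod <-
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# A boosted tree specification
my_boost_mod <-
  boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification")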
14.8 Conclusion
Though this is a relatively simple model, many of the ideas explored in this chapter will prove useful for other machine learning methods. The goal is for you to finish this walkthrough with the confidence to explore using machine learning to answer a question or to solve a problem of your own in the areas of teaching, learning, and educational systems.
In this chapter, we introduced general machine learning ideas in the context of predicting students’ passing a course. Like many of the topics in this book, there is much more to discover on the topic. We encourage you to consult the books and resources in the Learning More chapter for more about machine learning methods.