15 Introducing data science tools to your education job

Abstract This chapter explores using the tools of data science in an education job. Learning techniques like cleaning, visualizing, and modeling data is the first part of integrating data science tools into an education job. The next part is learning to apply these skills in practical day-to-day work. Using examples in R, this chapter explains how statistical programming increases the speed and scale of data work. Practicing these skills in day-to-day work will empower staff to get the most out of their training and develop a meaningful role as a data scientist in education. The chapter also explores complementary skills: empathy, collaboration, and adopting an entrepreneurial mindset.

15.1 Chapter overview

The purpose of this section is to explore taking newfound data science skills into your workplace. In particular, you’ll learn to find practical ways to use your skills, encourage your coworkers to be better users of data, and develop meaningful analytic routines. Whether you are a consultant working with a community college, an administrator leading teachers at a high school, or a university department chair, you’ll eventually begin turning what you’ve learned in this book into meaningful activities.

We’ll discuss this topic in two parts: bringing your organization the gift of speed and scale, and the importance of connecting well with others. Then we’ll close this chapter by discussing ways that K-12 teachers in particular can engage a work culture that is new to using data in decision making.

15.2 The gift of speed and scale

The power of doing data analysis with a programming language like R comes from two features: (1) a massive boost in the speed of your work and (2) a massive boost in the volume of data you analyze. Here are approaches to introducing data science to your job that focus on practical applications of speed and scale.

15.2.1 Working with data faster

Data analysts who use an efficient analytic process understand their clients’ questions faster because they rapidly cycle through analysis and discussion. As a result, they quickly accumulate skill and experience because each project includes many cycles of data analysis.

Lots of data scientists have built and shared analytic routines. Roger Peng and Elizabeth Matsui discuss epicycles of analysis in their book The Art of Data Science. In their book R for Data Science, Garrett Grolemund and Hadley Wickham demonstrate a routine for data exploration.

What’s important is to pick a routine that keeps you focused on repeatable steps. When the problem space is unclear, as it often is in education data analysis, the path from research question to analysis is full of detours and distractions. Having a routine that points you to the next immediate analytic step helps you start quickly. And many quick starts result in a lot of analysis.

Moving through data analysis systematically and rapidly has other benefits. It fuels the creativity required to understand problems in education and the imaginative solutions required to address them. Quickly analyzing data keeps the analytic momentum going at the speed needed to engage organic exploration of the problem.

Imagine an education consultant working with a school district to measure the effect of a new math intervention. The superintendent wants to compare quiz scores across schools. If the consultant can rapidly and repeatedly produce reports for discussion, conversations can begin and continue with even more questions.

Quickly summarizing a teacher’s latest batch of quiz scores and then returning for discussion should feel like a fast-paced and inspiring conversation with a collaborator. Contrast this with sluggish correspondence resulting from slow analysis. In these cases, the teacher’s research question might’ve changed by the time the consultant returns with the data summary.

You have an opportunity to provide speedy data analyses that spark more, and more important, analytic questions. Thinking again about the consultant analyzing quiz scores, it’s not hard to imagine new questions arising from rapid cycles of analysis:

  • How big was the effect of the new intervention, if any?
  • Are there similar effects across student groups?
  • Are there similar effects across grade levels?

The trick here is to use statistics, programming, and content knowledge to raise and answer the right questions quickly so the process feels like a conversation. When there’s too much time between analytic questions and their answers, educators lose the momentum required to follow the logical and exploratory path towards understanding student success.

15.2.1.1 Example: Preparing Quiz Data to Compute Average Scores

Go back to the imaginary consultant tasked with computing the average quiz scores. Imagine the school district uses an online quiz system and each teacher’s quiz export looks like this:

library(tidyverse)
set.seed(2020)

quizzes_1 <- tibble(
    teacher_id = 1, 
    student_id = c(1:3), 
    quiz_1 = sample(c(0:100), 3, replace = TRUE), 
    quiz_2 = sample(c(0:100), 3, replace = TRUE), 
    quiz_3 = sample(c(0:100), 3, replace = TRUE)
)

quizzes_1
## # A tibble: 3 × 5
##   teacher_id student_id quiz_1 quiz_2 quiz_3
##        <dbl>      <int>  <int>  <int>  <int>
## 1          1          1     27     87     35
## 2          1          2     86     64     41
## 3          1          3     21     16     69

Spreadsheets can help you compute mean quiz scores or mean student scores quickly. But what if you’d like to do that for not one, but several teachers? First, tidy the data. This will prepare the data nicely for computing a variety of scores, including the mean. Start by using pivot_longer() to separate each student’s quiz number and its score:

quizzes_1 %>% 
    pivot_longer(cols = quiz_1:quiz_3, names_to = "quiz_number", values_to = "score")
## # A tibble: 9 × 4
##   teacher_id student_id quiz_number score
##        <dbl>      <int> <chr>       <int>
## 1          1          1 quiz_1         27
## 2          1          1 quiz_2         87
## 3          1          1 quiz_3         35
## 4          1          2 quiz_1         86
## 5          1          2 quiz_2         64
## 6          1          2 quiz_3         41
## 7          1          3 quiz_1         21
## 8          1          3 quiz_2         16
## 9          1          3 quiz_3         69

In the original dataset, each row represents a unique combination of teacher and student. After using pivot_longer(), each row now represents a unique combination of teacher, student, and quiz number.

Data scientists describe this transformation as going from “wide” to “narrow.” That’s because of the change in the dataset’s width. The benefit comes from the ability to group values by any of the columns. For example, here is how you’d group the scores by student_id and compute the mean:

quizzes_1 %>% 
    pivot_longer(cols = quiz_1:quiz_3, names_to = "quiz_number", values_to = "score") %>%
    group_by(student_id) %>% 
    summarise(quiz_mean = mean(score))
## # A tibble: 3 × 2
##   student_id quiz_mean
##        <int>     <dbl>
## 1          1      49.7
## 2          2      63.7
## 3          3      35.3
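The same pivoted structure answers other questions too. Grouping by quiz_number instead of student_id produces each quiz’s class average. Here is a sketch using a fixed copy of the export above (the printed values from earlier), so the results don’t depend on set.seed():

```r
library(tidyverse)

# A fixed copy of the example quiz export, so the results are reproducible
quiz_export <- tibble(
    teacher_id = 1,
    student_id = 1:3,
    quiz_1 = c(27, 86, 21),
    quiz_2 = c(87, 64, 16),
    quiz_3 = c(35, 41, 69)
)

# Grouping by quiz_number instead of student_id gives each quiz's class mean
quiz_means <- quiz_export %>%
    pivot_longer(cols = quiz_1:quiz_3, names_to = "quiz_number", values_to = "score") %>%
    group_by(quiz_number) %>%
    summarise(quiz_mean = mean(score))

quiz_means
```

Any other column in the narrow dataset, such as teacher_id, could serve as the grouping variable in the same way.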

You can use this technique to compute the mean quiz score across multiple teachers. Start by creating two more datasets. To make the example more authentic, add a column to show if a student received an instructional intervention.

# Add intervention column to first dataset 
quizzes_1 <- quizzes_1 %>% 
    mutate(intervention = sample(c(0, 1), 3, replace = TRUE))

# Second imaginary dataset
quizzes_2 <- tibble(
    teacher_id = 2, 
    student_id = c(4:6), 
    quiz_1 = sample(c(0:100), 3, replace = TRUE), 
    quiz_2 = sample(c(0:100), 3, replace = TRUE), 
    quiz_3 = sample(c(0:100), 3, replace = TRUE), 
    intervention = sample(c(0, 1), 3, replace = TRUE)
)

# Third imaginary dataset
quizzes_3 <- tibble(
    teacher_id = 3, 
    student_id = c(7:9), 
    quiz_1 = sample(c(0:100), 3, replace = TRUE), 
    quiz_2 = sample(c(0:100), 3, replace = TRUE), 
    quiz_3 = sample(c(0:100), 3, replace = TRUE), 
    intervention = sample(c(0, 1), 3, replace = TRUE)
)

Use this method to compute the mean quiz score for each student:

  1. Combine all datasets into one: Use bind_rows() to combine all three quiz exports into one dataset. This can be done because each teacher’s export uses the same number of columns and column names

  2. Reuse the code from the first dataset on the new combined dataset: Paste the code used in the first example into the script so it cleans and computes the mean quiz score for each student

# Use `bind_rows` to combine the three quiz exports into one big dataset
all_quizzes <- bind_rows(quizzes_1, quizzes_2, quizzes_3) 

Note there are now nine rows, one for each student across all three teacher quiz exports:

all_quizzes
## # A tibble: 9 × 6
##   teacher_id student_id quiz_1 quiz_2 quiz_3 intervention
##        <dbl>      <int>  <int>  <int>  <int>        <dbl>
## 1          1          1     27     87     35            0
## 2          1          2     86     64     41            0
## 3          1          3     21     16     69            1
## 4          2          4     55     79      2            1
## 5          2          5     71     28     65            1
## 6          2          6     41     97     92            1
## 7          3          7     77     47      6            1
## 8          3          8     77     46     83            0
## 9          3          9     75     77     17            1

Use the same approach you used when doing this for one teacher:

# Reuse the code from the first dataset on the new bigger dataset
all_quizzes %>% 
    # Clean with pivot_longer
    pivot_longer(cols = quiz_1:quiz_3, names_to = "quiz_number", values_to = "score") %>%
    # Compute the mean of each student
    group_by(student_id, intervention) %>% 
    summarise(quiz_mean = mean(score))
## `summarise()` has grouped output by 'student_id'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 3
## # Groups:   student_id [9]
##   student_id intervention quiz_mean
##        <int>        <dbl>     <dbl>
## 1          1            0      49.7
## 2          2            0      63.7
## 3          3            1      35.3
## 4          4            1      45.3
## 5          5            1      54.7
## 6          6            1      76.7
## 7          7            1      43.3
## 8          8            0      68.7
## 9          9            1      56.3

Note that the hypothetical consultant is thinking ahead by including the intervention column. By doing so, she’s preserved the possibility of exploring possible score differences between the students who had the intervention and the students who did not. Thinking ahead in this way builds conversation starters into your collaborations. If you can anticipate your client’s questions, you’ll be able to respond faster if and when these questions arise.
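To act on that anticipation, comparing the two intervention groups is one more grouped summary away. The sketch below uses a hypothetical, fixed stand-in for the combined long-format quiz data so the group means are reproducible:

```r
library(tidyverse)

# Hypothetical long-format quiz data: two students without the intervention,
# two students with it, three quiz scores each
quiz_long <- tibble(
    student_id   = rep(1:4, each = 3),
    intervention = rep(c(0, 0, 1, 1), each = 3),
    score        = c(50, 60, 70, 40, 50, 60, 60, 70, 80, 70, 80, 90)
)

# Mean quiz score for each intervention group
intervention_means <- quiz_long %>%
    group_by(intervention) %>%
    summarise(quiz_mean = mean(score), n_students = n_distinct(student_id))

intervention_means
```

A difference in group means like this is only a starting point for discussion, not evidence of an effect on its own, but it is exactly the kind of quick summary that keeps the conversation moving.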

Comparing R with spreadsheets, there may be little difference in the time it takes to do this for three quiz exports. But the gains in speed when computing over many datasets—say, 30 quiz exports—are truly useful.

15.2.1.2 Summary

Learning how to rapidly answer analytic questions is not a silver bullet (but really, what is?). But it can put you on the path to creative solutions. It works something like this:

  1. Rapidly answering analytic questions empowers you to help more people
  2. Helping more people creates more chances to practice your analytic skills
  3. Helping more people also creates more chances for clients to see the value of data science
  4. These two things—increased practice and a growing sense of value—nurture confidence
  5. Confidence encourages experimentation and creativity when developing solutions for students

Here are more ways to get faster:

  • Notice when you use the same or similar chunks of code to do repetitive tasks. Store that code in a script so you can use it again
  • Keep a notebook of questions your clients ask you. Review them to develop an instinct for what teachers and administrators need. Anticipate these needs in your analysis and code
  • Where possible, avoid writing separate scripts for each dataset. Instead, learn to use functions and packages like {purrr} to write reusable operations for many datasets
  • Avoid the perfection trap when writing code. Instead, write first draft scripts and show the output to clients. They can give feedback early that informs your work going forward
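As one illustration of the {purrr} tip above, a single cleaning function can be mapped over any number of exports. The function name and the two small exports below are hypothetical, but the pattern is general:

```r
library(tidyverse)

# A hypothetical cleaning step you find yourself repeating across exports
compute_student_means <- function(quiz_data) {
    quiz_data %>%
        pivot_longer(cols = starts_with("quiz_"),
                     names_to = "quiz_number", values_to = "score") %>%
        group_by(student_id) %>%
        summarise(quiz_mean = mean(score), .groups = "drop")
}

# Two small example exports with the same column structure
export_a <- tibble(student_id = 1:2, quiz_1 = c(80, 60), quiz_2 = c(90, 70))
export_b <- tibble(student_id = 3:4, quiz_1 = c(50, 100), quiz_2 = c(70, 90))

# map_dfr() runs the same function over every export and row-binds the results
all_means <- map_dfr(list(export_a, export_b), compute_student_means)

all_means
```

Once the cleaning logic lives in a function, going from 2 exports to 30 is a matter of lengthening the list, not rewriting the script.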

15.2.2 Working with more data

When data analysts create more data for their clients, they give them more opportunities to learn what helps their students. In this regard, R is helpful because functions, loops, and scripts work on many datasets in the same time it takes to work on one.

For example, imagine a teacher whose students have an average quiz score of 75%. This is helpful to know because it shows her how close her students are to some pre-determined standard, say 95%. But that data alone doesn’t tell the teacher how similar or different her class average is from other classrooms. For that, you need context.

Now imagine applying the same script used to compute this teacher’s average quiz score to every classroom in the school. And imagine the school average quiz score was 77%. From this, the teacher learns her class average is similar to other classrooms. This is important nuance for developing her improvement strategy.
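A sketch of that comparison, with made-up scores chosen so that classroom A mirrors the teacher’s 75% average and the school lands at 77%:

```r
library(tidyverse)

# Hypothetical quiz scores; classroom A is the teacher's class
school_quizzes <- tibble(
    classroom = rep(c("A", "B", "C"), each = 2),
    score     = c(70, 80, 75, 85, 72, 80)
)

# One grouped summary covers every classroom in the school at once
class_means <- school_quizzes %>%
    group_by(classroom) %>%
    summarise(class_mean = mean(score))

# The school-wide average gives the teacher her context
school_mean <- mean(school_quizzes$score)

class_means
school_mean
```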

This example shows how working with scripts supercharges your analysis. Using spreadsheets to work with large datasets, say 10,000 rows, requires you to interact directly with the user interface. That includes lots of scrolling, highlighting, and clicking that isn’t saved for future use. In contrast, using a programming language with large datasets empowers you to issue complex and repeatable instructions to act on the data.

15.2.2.1 Example: Replacing Many Student Names with Numerical IDs

Say, for example, an elementary school administrator wants to replace each student’s name in a classroom dataset with a numerical ID. Doing data entry in a spreadsheet for one classroom is feasible. Doing it for all classrooms in a school requires a different approach. Instead of entering unique IDs into a spreadsheet by hand, the administrator can write an R script that executes these steps:

  1. Use read_csv() to read every classroom’s student list into R
  2. Use bind_rows() to combine the separate lists into one long list
  3. Use mutate() to replace student names with a randomized and unique numerical ID
  4. Use split() to separate the data into classrooms again
  5. Use {purrr} and write_csv() to create and rename individual spreadsheets before sending back to teachers

With some initial investment in coding time at the start, the administrator will have a script they can reuse in the future to do the task again.
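The core of those steps might be sketched like this. The rosters below are hypothetical stand-ins for the read_csv() imports in step 1, and the per-classroom write_csv() step is shown as a comment so the sketch doesn’t create files:

```r
library(tidyverse)

# Hypothetical rosters standing in for read_csv() imports of each classroom list
room_101 <- tibble(classroom = "101", student_name = c("Avery", "Blake"))
room_102 <- tibble(classroom = "102", student_name = c("Casey", "Drew"))

set.seed(1)

all_students <- bind_rows(room_101, room_102) %>%
    # Replace names with randomized, unique numerical IDs
    mutate(student_id = sample(1000:9999, n())) %>%
    select(-student_name)

# Separate the data into classrooms again
by_classroom <- split(all_students, all_students$classroom)

# Write one file per classroom (commented out here):
# iwalk(by_classroom, ~ write_csv(.x, paste0("classroom_", .y, ".csv")))
```

In a real script the administrator would also save the name-to-ID lookup table somewhere secure, so IDs can be matched back to students when needed.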

15.3 Other ways to reimagine the scale of your work

15.3.1 Reflect on your current scale, then push to the next level

It’s easy to forget about reflecting on your analysis routines. This is especially true if you’ve been using the same routines for a long time. The analytic questions we ask, the datasets we use, and the scale of the questions become automatic because, for the most part, they’ve delivered results. When you introduce data science and R into your workflow, you introduce a chance to regularly explore scaling your analysis.

For example, when a client or coworker has an analytic question, consider the following:

  1. What level is this question about? Is it an analytic question about students, classrooms, schools, districts, regions, the whole state, or the whole country?
  2. What can we learn by answering at that level, but also at the next level?

If a teacher asks you to analyze the attendance pattern of one student, see what you learn by comparing it to the attendance pattern of the whole classroom or the whole school. If a superintendent of a school district asks you to analyze the behavior referrals of a school, analyze the behavior referrals of every school in the district. Remember that once you write code for one dataset, you can use it for many.

15.3.2 Look for lots of similarly structured data

Be on the lookout for datasets that have the same column structure, then write code to act on multiple datasets at once.

Education data systems tend to generate standardized data tables. The result is many datasets that contain different data, but with a standardized column structure. This uniformity creates the perfect condition for R scripts to predictably and repeatedly act on these datasets.

Imagine a student information system that exports a list of students, their teacher, their grade level, and days attended. If a school administrator downloads this list each week to a folder on their laptop, they’d have many uniformly structured datasets.

When you start seeing situations like this as an opportunity for analysis, you position yourself to transform data at scale.
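Under the assumption of a folder of weekly CSV exports like the one just described, the pattern might look like the sketch below. Two in-memory tibbles stand in for the files so the example is self-contained; the commented lines show how the real files would come in:

```r
library(tidyverse)

# In practice the weekly exports would come from something like:
# files <- list.files("attendance_exports", pattern = "\\.csv$", full.names = TRUE)
# attendance <- map_dfr(files, read_csv)
week_1 <- tibble(student_id = 1:3, teacher = "Lee", grade = 4, days_attended = c(5, 4, 5))
week_2 <- tibble(student_id = 1:3, teacher = "Lee", grade = 4, days_attended = c(4, 5, 3))

# Identical column structures mean one bind_rows() call stacks every export
attendance <- bind_rows(list(week_1 = week_1, week_2 = week_2), .id = "week")

# Total days attended per student across all exports
attendance_totals <- attendance %>%
    group_by(student_id) %>%
    summarise(total_days = sum(days_attended))

attendance_totals
```

The same script keeps working as the folder grows: week 30 is processed exactly like week 1.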

15.3.3 Cleaning data

Cleaning data is the perfect opportunity to work with data at scale. Educators want to look at data with tools like Excel, but very often the data aren’t structured for that use case.

Offer to clean a dataset so your clients can explore data in their tool of choice. As you demonstrate how quickly you can clean large volumes of data, you accomplish two things. First, you improve your skills through repetition. And second, you expand your client’s expectations for what’s possible with you on their team.

15.4 Solving problems together

Steven Spielberg said,

When I was a kid, there was no collaboration; it’s you with a camera bossing your friend around. But as an adult, filmmaking is all about appreciating the talents of the people you surround yourself with and knowing you could never have made any of these films by yourself.

Data science techniques are a powerful addition to an organization’s problem-solving capacity. But when you’re the only person who codes, it’s easy to forget how important it is to collaborate. Here are some things to think about as you introduce data science to your education job.

15.4.1 Data science in education and empathy

In 1990, Elizabeth Newton, then a graduate student at Stanford University, asked research subjects to tap out well-known songs with their fingers. Then she asked them to estimate how many listeners would recognize the songs (Newton, 1991; Heath & Heath, 2006). She found that the tappers overestimated every time.

When you know a subject well, you tend to forget what it’s like to not know. Here are suggestions for helping your coworkers feel included as they join you on your data science journey.

First, listen to your coworkers as they work with data. Learn about their current thinking as they use reports, tables, and graphs. This will help you understand the problems and solutions they focus on.

Second, ask them if you can “borrow the problem.” “Borrowing a problem” is not solving it for them, it’s using data science to get them unstuck so they can continue. For example, if a client is struggling to make a plot, offer to clean and summarize the dataset before they try again.

Third, if your first attempt at borrowing the problem didn’t help, listen and learn more. Doing data science together is a conversation. Ask them how it went after you helped. Listen, understand, and try again. After a few rounds of this, you may find your coworkers willing to try new methods for advancing their goals.

15.4.2 Create a daily practice commitment that answers someone else’s question

In his book Feck Perfuction, designer James Victore (2019) writes,

Success goes to those who keep moving, to those who can practice, make mistakes, fail, and still progress. It all adds up. Like exercise for muscles, the more you learn, the more you develop, and the stronger your skills become. (p. 31)

Doing data science, like all skills, needs repetition and mistakes as fuel for learning. But what happens if you are the first person in your organization to do data science? When you have no data science mentors, analytic routines, or examples of past practice, it can feel aimless. The antidote to that aimlessness is to practice daily.

Commit to writing code every day. Even a simple three-line script will add to your growing programming instincts. Train your ears to be radar for data projects that are usually done in a spreadsheet. Then take them on and do them in R. Need the average amount of time a student with disabilities spends in special instruction? Try it in R. Need to rename the columns in a student quiz dataset? Try it in R.

15.4.3 Build your network

Participating in personal and professional networks is important for surviving, thriving, and innovating. Connecting with an education data science network is easier if your organization has an analytics department, but you’ll have to get creative if you’re the lone data scientist.

Naturally, you might seek out programmers and statisticians first. But consider also that data science in education is not just about statistics. In the broader view it’s about evolving the whole approach to analytics. When viewed that way, members of a network broaden beyond just programmers and statisticians. It grows to include administrators and staff, graduate students fascinated with unique research methodologies, and designers who create interesting approaches to measurement.

Networks for growing data science in education are not limited to the workplace. There are plenty of online and real-life chances to participate in a network that are just as rewarding as the networks you find at the office. Here are some ideas:

  • Social media networks
  • Local coding communities
  • Data Science or Education Conferences
  • Online forums

15.5 For K-12 teachers

So far the discussion has been from the data scientist’s point of view. What if you’re interested in analytics but not in programming and statistics? Teachers in elementary and high schools are faced with a mind-boggling amount of student data. A study by the Data Quality Campaign (2018) estimated that “95 percent of teachers use a combination of academic data (test scores, graduation rates, etc.) and nonacademic data (attendance, classroom behavior, etc.) to understand their students’ performance”. In the same study, 57 percent of teachers said a lack of time was a barrier to using the data they have. Data literacy is also increasingly important within teacher preparation programs (Mandinach & Gummer, 2013).

Yet the majority of teachers aren’t interested in learning a programming language and statistical methods, and the time required for professional development is scarce (Datnow & Hubbard, 2015). After all, most teachers chose their profession because they love teaching, not because they enjoy cleaning datasets and building statistical models. But leaving them out feels like a glaring omission in a field where perhaps the most important output is student learning.

If you happen to be a K-12 teacher who wants to use programming and statistics more, you will find the approaches in this book useful. But if not, there is still much to explore that will help you grow your analytic skills.

For example, you can explore an often overlooked element of data analysis: asking the correct question. Chapter three of The Art of Data Science (Peng & Matsui, 2015) provides a useful process for getting better at asking data questions.

Given how often we experience data through visualizations, it is important to learn how to create and use them. Chapter one of the book Data Visualization: A Practical Introduction (Healy, 2019) explores this topic using excellent examples and writing.

For practical applications of a data-informed approach, Learning to Improve: How America’s Schools Can Get Better at Getting Better (Bryk et al., 2015) offers a thorough explanation of the improvement science process. The book is filled with examples of how data is used to understand problems and trial solutions.

The final recommendation for K-12 teachers is this: partner with someone who can help you answer your analytic questions quickly and at scale. You have the professional experience to ask the important questions. Inviting someone to collaborate with you and measure the success of your ideas can be a rewarding partnership.