16 Teaching data science

Abstract This chapter explores strategies for teaching data science to others. While our focus in this book (and the focus of many doing data science in education) is on the use of data science methods to ask and answer questions and identify and solve problems relating to teaching, learning, and educational systems, how data science is taught and learned is also an important consideration. This is particularly true for those tasked with teaching others, whether in formal settings (such as in classes or workshops) or in those that are informal (such as when providing just-in-time help to a colleague or peer). This chapter first draws attention to the pedagogical principles that undergirded this book, and then describes strategies for teaching data science as well as some general pedagogical strategies that have relevance to those teaching data science. The chapter concludes with a call for those teaching data science to carve out a distinctive field of their own.

16.1 Chapter overview

This book focuses on the application of data science to education: that is, on how to apply data science to questions of teaching, learning, and educational systems. The previous chapters have addressed this topic through narrative and walkthroughs for common questions (or problems) and the types of data encountered in education. In this way, much of the book has focused on applying data science methods. However, for a book on data science in education, it is important not only to discuss the application of data science methods, but also to consider what we know about how to teach data science. In recognition of these dual meanings of data science in education, we’ve referred to the application of data science methods as “data science in education”, and the teaching and learning of data science as “data science for education” (Joshua M. Rosenberg et al., 2020).

Naturally, educators who do data science are positioned well to try to teach others how to do data science. In addition, we expect readers of this book—many of whom will also be involved in education—will be interested in teaching others about data science.

This chapter is organized around three topics:

  1. The pedagogical principles this book is based upon
  2. Strategies for teaching data science
  3. General strategies related to teaching and learning

16.2 The pedagogical principles this book is based upon

As the authors of a book about data science in education—and readers of books that taught us about data science—we considered what would make it effective for our readers when we set out to write it. The result of this process was a pedagogical framework that consists of four principles: problem-based learning, differentiation, building mental models, and working in the open. We consider each of these in turn.

16.2.1 Problem-based learning

Problem-based learning (PBL) is a method of instruction that asks learners to apply their skills and knowledge to solve a real-world challenge. We applied this principle to the design of this book by including walkthroughs for common data science and education questions. This is especially important in data science because we do not have all of the right answers in this text. Moreover, there is no single right statistical model or algorithm, way to write code, or set of software tools to use.

Thus, the text features walkthroughs that reflect the types of challenges that educational data scientists may encounter in the course of their work. All of the data (as well as the code) is available, and readers may choose to approach the analysis of the data used in each walkthrough differently. Moreover, the walkthroughs are structured so that readers return to some of the analytic challenges, but with different aims, over the course of the book. Walkthrough 1/Chapter 7, Walkthrough 7/Chapter 13, and Walkthrough 8/Chapter 14 all use the same dataset on online science learning, but Walkthrough 7 expands on Walkthrough 1 by focusing on modeling the effects of courses, and Walkthrough 8 pursues a predictive, rather than explanatory, goal through the use of machine learning. Other challenges, such as processing and preparing data, are introduced in the first walkthrough and, reflecting their importance and ubiquity, returned to in each of the subsequent chapters.

16.2.2 Differentiation

Differentiation is a method for providing multiple pathways for learners to engage with, understand, and ultimately apply new content and new skills. To differentiate this text, we first created personas of the common groups of readers we expected to read this book (see Wilson (2009) for an example of this approach).

The objective was to write in a way that helped readers see themselves in the scenarios. The personas were a way to imagine our audience and guide who we interviewed to prepare for the writing. The interviews equipped us to go beyond what we imagined the needs of our readers were and to include their voices in the way we presented the content.

We then aimed to differentiate the book by recognizing and providing background knowledge (either explicitly or through references to other resources) and recommendations for where to begin based on prior expertise. We also provided screenshots, particularly in Chapter 5/Getting Started with R and RStudio and Chapter 6/Foundational Skills, that are annotated and reflective of the content in the text to help show readers how to use what they are reading about.

Lastly, we considered inclusivity and accessibility when differentiating this book. For inclusivity, we considered who makes up the audience for this text and how a broader view of who participates in data science informs the types of challenges, topics, and data that we included. For accessibility, we considered both the technical side (how a wide audience of readers is able to access and use the book) and how the content is written to build on the unique assets that those in education bring.

16.2.3 Working in the open

We started writing this book in the open, on GitHub. This allowed us to share the book as it developed. Writing the book in the open also allowed others from the wider educational data science and data science community to contribute. These contributions included writing sections of the book in which contributors had specific expertise, asking clarifying questions, and even creating a logo for the book, which informed our choice of a color palette. We decided to write this book in the open after witnessing the success of other books on data science, such as Wickham’s (2019) Advanced R (https://adv-r.hadley.nz/).

16.2.4 Building mental models

In the foundational skills chapter, Chapter 6, we introduced the foundational skills framework. The purpose of this framework was to emphasize four core concepts (projects, functions, packages, and data) that are relevant to and used in nearly all data science projects. We chose to introduce this general framework before the walkthroughs, which introduce specific techniques, in part to help readers build a “mental model” of data science: an understanding of how data science tools and techniques work at a level deeper than particular functions or individual lines of code (see Krist et al.’s (2019) framework for the development of mental models and this type of deeper understanding). Understanding both how R works as a programming language (what R code is) and how R and RStudio work as software programs can make it easier to troubleshoot the (inevitable!) issues and identify possible solutions in the course of working on educational data science projects.

16.2.5 Universal design

In our original proposal for this book (see R. A. Estrellado et al. (2019)), we noted that Universal Design (McTighe & Willis, 2019; Wiggins et al., 2005) was a part of our pedagogical framework. As we worked toward completing the book, we recognized that we did not fully meet the aims we had laid out. Here is what we wrote in the proposal:

Universal Design is a series of principles which guide the creation of spaces that are inclusive and accessible to individuals from all walks of life regardless of age, size, ability, or disability. While traditionally applied to physical spaces, we have extended these principles to the creation of a data science text in such a way that the text and accompanying materials will be designed for individuals from all walks of life, regardless of educational level, background, ability, or disability. Many of the seven guiding principles of Universal Design are readily transferable to the creation of a text, such as equitable use, flexibility in use (aided in large part through differentiation), simple and intuitive use, perceptible information, and tolerance for error.

While we did not adequately address these in the book, they remain important to us, and we hope to address them in a future edition of the book.

16.4 Strategies for teaching data science

You may be interested in teaching others data science. You may be doing this informally (such as by teaching a colleague in your school district or organization), in a formal environment (such as a class on data science for educational data scientists or analysts), or in a setting somewhere in between (such as a workshop). There is some research on teaching data science, as well as practical advice from experienced instructors, that can inform these efforts.

16.4.1 Provide a home base for learners to access resources (and to learn more)

Learning strategies, along with other important factors (such as learners’ motivations and having a supportive atmosphere), can make a difference for learners. Especially when it comes to learning to do data science, there are many tools and resources to keep track of, such as:

  • How to download and install R
  • How to download and install RStudio
  • How to install packages
  • How to access resources related to the workshop or course (or simply other resources you wish to share)
  • How to contact the instructor
  • How to get help and learn more

Having a “home base” where you can remind learners to look first for resources reduces how much they need to remember about where these tools and resources can be found. One way to do this is through a personal website. Another is through GitHub Pages. For some organizations, a learning management system (such as Desire2Learn, Blackboard, Moodle, or Canvas) can be helpful, especially if your learners are accustomed to using it.
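As one illustration of what a home base might contain, the short script below is a minimal sketch of a setup check that learners could run before a workshop or class. The package list is hypothetical and is not taken from this book; substitute whatever your own course uses.

```r
# Hypothetical list of packages for a workshop; replace with your own.
workshop_packages <- c("stats", "utils", "tidyverse")

# Report the version of R that is running.
cat("R version:", R.version.string, "\n")

# requireNamespace() returns TRUE when a package is installed and can be loaded.
is_installed <- vapply(
  workshop_packages,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

# List any packages that still need to be installed.
missing_packages <- workshop_packages[!is_installed]
if (length(missing_packages) > 0) {
  cat("Still to install:", paste(missing_packages, collapse = ", "), "\n")
} else {
  cat("All workshop packages are installed.\n")
}
```

A script like this only reports what is missing. Leaving the install step to learners keeps them at the keyboard and keeps the check safe to run more than once.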

16.4.2 When it comes to writing code, think early and often

It is important to get learners to start writing code early and often. It can be tempting to teach classes or workshops that front-load content about data science and using R. While this information is important, it can mean that those you are teaching do not have the chance to do the things they want to do, including installing R (and RStudio) and beginning to run analyses. Because of this, we recommend starting with strategies that lower the barrier to writing code for learners. Ways to do this include:

  • Using RStudio Cloud
  • Providing an R Markdown document for learners to work through
  • Providing a dataset and ideas for how to begin exploring it

While these strategies are especially helpful for courses or workshops, they can be translated to teaching and learning R in tutoring (or “one-on-one”) opportunities for learners. In these cases, being able to work through and modify an existing analysis (perhaps in RStudio Cloud) is a way to quickly begin running analyses, and to use the analysis as a template for analyses associated with other projects. Also, having a dataset associated with a project or analysis, and a real need to analyze it using R, can be an outstanding way for an individual to learn to use R.
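To make the idea of providing a dataset and ideas for how to begin exploring it more concrete, here is a minimal sketch of a starter script. It assumes the built-in iris data as a stand-in for whatever dataset you would actually share. The four "ideas" are illustrative suggestions, not a prescribed sequence.

```r
# iris ships with R, so no download or import step is needed.
# It stands in here for whatever dataset you share with learners.

# Idea 1: look at the structure (column names and types).
str(iris)

# Idea 2: summarize each column.
summary(iris)

# Idea 3: count the observations in each group.
species_counts <- table(iris$Species)
print(species_counts)

# Idea 4: compare groups on one variable.
mean_petal_length <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
print(mean_petal_length)
```

Learners can modify a script like this one line at a time (for example, by swapping in a different column), which gives them working code to start from rather than a blank file.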

16.4.3 Don’t touch that keyboard!

Resist helping learners to the point of hindering their learning. Wilson (2009) writes about the way in which those teaching others about R—or to program, in general—can find it easier to correct errors in learners’ work. But, by fixing errors, you may cause learners to feel that they are not capable of carrying out all of the steps needed in an analysis on their own.

This strategy relates to a broader issue as well: problems with writing code that runs correctly (e.g., with the correct capitalization and syntax) can be minor to those with programming experience but can be major barriers to using R independently for those new to it. For example, knowing where arguments to functions belong and how to separate them, how to use brackets in functions or loops, and when it is necessary to use an assignment operator can be completely new to beginners. Doing these steps for learners may hinder their capability later, when they may have fewer resources available to help them than when you are teaching them. Consider taking the additional time needed to help learners navigate minor issues and errors in their code: it can pay off in increased motivation on their part in the long term.
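The small example below gathers the kinds of details mentioned above (assignment, arguments, capitalization, and brackets) in one place. It is only a sketch with made-up values, intended as something an instructor might walk through with a beginner rather than type for them.

```r
# Assignment: store values under a name with the assignment operator <-.
# (The scores are made-up values for illustration.)
scores <- c(82, 91, 77, NA)

# Arguments go inside the parentheses, separated by commas.
# na.rm = TRUE is a named argument that tells mean() to skip missing values.
average_score <- mean(scores, na.rm = TRUE)
print(average_score)

# R is case sensitive: `Scores` and `scores` are different names.
exists("scores")

# Square brackets select elements; curly braces group the body of a loop.
for (score in scores[1:2]) {
  print(score)
}
```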

16.4.4 Anticipate issues (and sacrifice accuracy for clarity)

Don’t worry about being perfectly accurate early on, especially if doing so would lead to learners who are less interested in the topic you are teaching. Beginners often ask questions whose full answers involve details that would not help them yet. It can be valuable not only to anticipate these questions, but also to have answers ready that provide clarity rather than confusion.

For example, there are complicated issues at the heart of why data that is built in to packages or to R (such as the iris dataset) appears in the environment after it is first used in an R session (see the section on “promises” in Wickham (2019)). Similarly, there are complicated issues pertaining to how functions are evaluated that explain why the names of packages installed via install.packages() must be quoted (whereas the column names passed as arguments to other functions, such as dplyr::select(), do not need to be quoted).
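The contrast between quoted and unquoted names can be shown to learners without explaining the evaluation rules behind it. The sketch below uses base R's subset() in place of dplyr::select() so that it runs without extra packages. That substitution is ours, and the install.packages() call is commented out because it needs an internet connection.

```r
# install.packages() expects the package name as a quoted character string.
# (Not run here, because it needs an internet connection.)
# install.packages("dplyr")

# Some functions accept bare (unquoted) column names.
# Base R's subset() is used here; dplyr::select(iris, Species) is similar.
species_bare <- subset(iris, select = Species)

# Other ways of selecting a column need the name as a quoted string.
species_quoted <- iris["Species"]

# Both approaches pick out the same column.
names(species_bare)
names(species_quoted)
```

A simple rule of thumb, such as "package names go in quotes when installing", is usually enough for beginners. The fuller explanation can wait until learners ask for it.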

16.4.5 Start lessons or activities with visualizing data

There are examples from the data science book by Wickham & Grolemund (2018) and from past research (e.g., Lehrer & Schauble (2015)) that suggest that starting with visualizing data can be beneficial in terms of learners’ ability to work with data. Wickham & Grolemund (2018) write that they begin their book, R for Data Science, with a chapter on visualization, because doing so allows learners to create something they can share immediately, whereas tasks such as loading data can be rife with issues and do not immediately give learners a product they can share. Lehrer et al. (2007) show how providing students with an opportunity to invent statistics by displaying the data in new ways led to productive critique among fifth- and sixth-grade students and their teacher.
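As a rough illustration of how little code a first visualization can require, the sketch below draws two plots from the built-in iris data, so no data loading is needed. Base R graphics are used here only to keep the example free of additional packages. The choice of dataset and plots is an assumption for illustration.

```r
# A first plot learners can make and share right away,
# using the built-in iris data so that no data loading is required.
petal_hist <- hist(
  iris$Petal.Length,
  main = "Petal length in the iris data",
  xlab = "Petal length (cm)",
  col = "lightblue"
)

# A second view: the same variable split by group.
boxplot(
  Petal.Length ~ Species,
  data = iris,
  main = "Petal length by species",
  ylab = "Petal length (cm)"
)
```

Asking learners what they notice in the two plots, and which display they find more useful, can open the kind of critique that Lehrer et al. (2007) describe.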

16.4.6 Consider representation and inclusion in the data and examples you use

One way to think about data is that it is objective and free of decisions about what to value or prioritize. Another is to consider working with data as a value-laden process, from deciding what question to ask (and what data to collect) to interpreting findings with attention to how others will make sense of them (e.g., O’Neil’s (2016) Weapons of Math Destruction, and Lehrer et al.’s (2007) description of data modeling). From this broader view, choosing representative data is a choice, like others, that teachers can make. For example, instructors can choose data that directs attention to issues (equity-related issues in education, for example) that they believe would be valuable for students to analyze.

It is important to consider and question what data is collected and why, even with variables that we consider to be objective. For example, some variables are constructed to be dichotomous (e.g., gender) or categorical (e.g., race), but the data that is collected is based on decisions by the observer and may not be inherently objective.

This broader consideration of data is also important when it comes to which data is used for teaching and learning. For example, if a dataset only includes names of individuals from a majority racial or ethnic group, some learners may perceive the content being taught to be designed for others. While we may think that such issues are better left up to those we are teaching to decide on themselves, setting the precedent in classes, courses, and other contexts in which data science is taught can be important for how learners collect and use data in the future.

16.4.7 Draw on other resources

We touched on a few strategies for teaching data science. Other resources go into more depth on this topic from different perspectives, such as the following:

  • GAISE Guidelines (https://www.amstat.org/asa/education/Guidelines-for-Assessment-and-Instruction-in-Statistics-Education-Reports.aspx): guidelines for teaching statistics
  • Data Science for Undergraduates (https://www.nap.edu/catalog/25104/data-science-for-undergraduates-opportunities-and-options): a report on undergraduate data science education
  • RStudio Education (https://education.rstudio.com/)

There are also a number of data science-related curricula (for the K-12 level) which may be helpful:

Last, there are also books that emphasize the importance, for teachers, of understanding every one of their students. These books include Paris & Alim (2017) and Kozol (2012), and will likely be valuable for teachers of data science who wish to understand and honor the diversity of their students. Moore Jr et al. (2017) and Emdin (2016) may be helpful for data science educators who aim to be aware and intentional about teaching students from backgrounds other than their own.

16.6 Summary

Data science educators do not need to reinvent the wheel when it comes to teaching about data science. Insights from other, related educational domains (such as statistics education and computer science education) may prove helpful to those seeking to teach data science to others, whether in a one-on-one setting, a workshop, or a formal class. In this chapter, we sought to describe both the pedagogical principles for this book and some strategies for teaching data science. As scholarship and practice on teaching and learning data science continue to develop, we hope that those teaching (and producing scholarship about) data science not only draw upon the findings of those in other domains but also carve out a domain of their own: one with findings that may have implications for how statistics, computer science, or even subject matters such as science and mathematics are learned.