4 Special Considerations

Data science in education is a new domain. It presents opportunities, like those discussed in the previous chapter, but also some challenges. These challenges vary a lot: we consider doing data science in education to include not only accessing, processing, and modeling data, but also social and cultural factors, like the training and support that these data scientists have available to them. Some of these challenges are common to all domains in which data science is carried out, while others are very particular to the field of education.

4.1 Things to Consider when Doing Data Science in any Domain

4.1.1 Learning to Code

Data scientists everywhere combine content knowledge, programming, and statistics to solve problems. Many people are not experts in all three areas when they begin their data science journeys. If you work in education and the piece of the data science trifecta that you are missing is programming, you are not alone. Programming is powerful but challenging, and many of us in education do not have prior experience with it. Learning to code can seem like a daunting task, but we don’t want you to feel paralyzed. To help those who are writing their first few lines of code, we wrote this book for learners without a computer science background or even any informal training related to coding. The great thing about entering a field as flexible as data science is that you are joining a vast crowd of self-taught individuals. You will find a very supportive online community to help as you learn through this book.

4.1.2 Addressing Ambiguity: A Reproducible Approach

One way to address the concerns of those who feel wary of data collection and reporting is to build analytic processes that are transparent. Specifically, it is helpful if the data scientist is open about what data is collected, how it is sourced, how it is analyzed, and how it is considered alongside other data when used in decision-making conversations. This transparency can be achieved through a number of activities, including having regular conversations about analytic methods, providing written reports describing data collection, and receiving input about analytic goals from staff members.

One such process for achieving openness in data collection and analysis is called reproducible research. The concept of reproducible work (Wikipedia, 2020) is the idea that a completed analysis should come with all the necessary materials, including a description of methodology and programming code, needed for someone else to run the analysis and achieve the same results. If school staff are apprehensive about how school data is collected and used, a more transparent method showing how the data was transformed into insights could help put their minds at ease.

A reproducible approach can be especially beneficial in transition periods. If a data-science-in-school advocate leaves their original position, they would leave behind not just descriptions of the analyses that they did, but also the specific files needed to run the analyses again. The new individual who takes their place will be able to seamlessly transition into the new role. If asked to run “the same report I always got from your predecessor,” the new person will understand immediately what files were needed to create that original report and will be able to request all necessary data to generate a new version of the report.

To implement a reproducible approach in your organization, you can start by keeping all files related to each project you do in their own folders. As you create reports, keeping notes in the files will help you to easily generate similar reports in the future. Many data scientists find that even though changing administration might mean changing requests, having careful documentation of past processes allows for more efficiency in the way they use data to answer those requests.
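
The folder-per-project habit described above can be sketched in a few lines of code. This is a minimal illustration in Python using only the standard library, not a prescribed layout: the subfolder names, the project name, and the contents of the notes file are all hypothetical and should be adapted to your own organization.

```python
from pathlib import Path
import tempfile

# Hypothetical folder layout; rename the subfolders to suit your organization.
SUBFOLDERS = ["data", "scripts", "output"]


def create_project(root: Path, name: str, notes: str) -> Path:
    """Create one self-contained folder for a project, with a notes file."""
    project = root / name
    for sub in SUBFOLDERS:
        # exist_ok=True makes the function safe to run more than once.
        (project / sub).mkdir(parents=True, exist_ok=True)
    # The notes file records what the report is and how to regenerate it.
    (project / "README.txt").write_text(notes, encoding="utf-8")
    return project


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(
        Path(tmp),
        "attendance-report",  # hypothetical project name
        "Data: export from the student information system.\n"
        "Steps: run the scripts in scripts/ in numbered order.\n",
    )
    created = sorted(p.name for p in project.iterdir())
    print(created)
```

Keeping the notes file inside the project folder means that whoever inherits the work finds the instructions next to the files they describe.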

4.2 Things to Consider when Doing Data Science in Education

4.2.1 Addressing Organizational Resistance: A Self-Driven Analytic Approach

One consideration when adopting data science in education is that often, there is a lack of precedent for applying data science techniques. For example, it is not common for a teacher to be conducting regression analyses on student data. However, it’s not necessary to wait for a district-wide or state-wide initiative to begin to implement the techniques you will learn in this book.

An organization should encourage its staff to do their own data analyses, primarily for the purpose of testing their own hypotheses. In a school, for example, a teacher might wonder about student learning in their classroom and might want to use data to directly guide decisions about how they deliver instruction. There are at least two benefits to this approach. First, staff begin to realize the value of doing data analysis as an ongoing inquiry into their outcomes, instead of a special event once a year ahead of school board presentations. Second, and more important for reducing apprehension around data analysis in schools, data analysis as a process becomes demystified for school staff. When school staff collect and analyze their own data, they know exactly how it is collected and exactly how it is analyzed. The long-term effect of this self-driven analytic approach might be more openness to analysis, whether it is self-driven or conducted by the school district.

Building and establishing a data governance system that advocates for an open and transparent analytic process is difficult and long-term work, but the likely result will be less apprehension about how data is used and more channels for school staff to participate in the analysis. Here are more practical steps a school district can take towards building a more open approach to analysis:

  • Make technical write-ups of data analyses available so interested parties can learn more about how data was collected and analyzed
  • Make datasets available to staff within the organization, to the extent that privacy laws and policies allow
  • Establish an expectation that analysts present their work in a way that is accessible to many levels of data experience
  • Hold regular forums to discuss how the organization collects and uses data

By adopting a self-driven analytic approach, individuals can help their education organization to embrace the potential of utilizing data to anticipate and possibly forestall problems in the future.

4.2.2 Lack of Processes and Guidelines

One challenge is the ambiguity around the process and practice of doing data science in the education context specifically. While there is a body of past research on students’ work with data (see (???) for a review), there is limited information from case- or design-based research on how others in education (teachers, administrators, and data scientists) use data in their work. In other words, we do not have a good idea of what best practices in our field are. This challenge is reflected in part in the variability in the job titles of those who work with data: some are data analysts, some are research associates, and the list continues. However, as data science in education emerges as a field, some school districts are now hiring for data scientist positions. Even so, there is no organizing body that brings all these people together. There is a multitude of discipline-specific (e.g., science teaching) or department-specific (e.g., institutional research) conferences, but there are no overarching norms shared by all those who work with data in education.

Education is a field that is rich with data: survey, assessment, written, and policy and evaluation data, and more. Nevertheless, there is often a lack of consensus on processes and procedures for educators and data scientists to share data and the results of data analysis with each other. Academic and research settings can sometimes lead to silos of information. A group of researchers at one university might do a survey, and another group doing similar work might not hear about the results until the study is published years later. Sometimes, the second group of researchers may never become aware that the survey had even happened. The good news is that many education organizations are curious about and passionate about supporting student success. It is likely that even if many separate data collection efforts are being implemented (rather than one unified strategy), you will not be dealing with the problem of “I don’t have enough data to analyze.” As a pioneer for data science in your organization, you can help to clarify these redundant processes and can offer your skills to help make sense of the wealth of information already being gathered.

4.2.3 Limited Training and Educational Opportunities

Data science in education is new. At present, there are limited opportunities for those working in education to build their capabilities in data science (though this is changing to an extent; see Anderson and colleagues’ work to create a data science in education certificate program at the University of Oregon and Baker’s educational data mining Massive Open Online Course offered through Coursera). Many data scientists in education have been trained in fields other than statistics, business analytics, or research. Moreover, the training that data scientists in education receive in particular tools and approaches is highly varied. However, we believe this diversity of training creates unique opportunities for the development of data science in education as a field. Indeed, educators’ diverse training and backgrounds position them to tackle challenges in education creatively.

4.2.5 Advancing Equity

Data science can be used to inform decisions that reduce inequities in the education system. However, data science can also be a tool that exacerbates the marginalization of the students we want to serve. One example is an algorithm that is not transparent, is implemented poorly, and prompts people to make decisions that have adverse effects.

For data scientists in education, it is crucial that before beginning an analysis, we fully understand how we and our organizations define equity. Additionally, we should formulate clear equity goals and consider the ways we will continuously check against our biases. After defining equity and our equity goals, we can then work to ensure that our data science life-cycle reflects what we are trying to learn (and that we incorporate these learnings).

Thoughtful decisions during project design, data collection, analysis, and the presentation of results can increase the data’s ability to move an organization towards its equity goals. For example, if an organization hopes to decrease the opportunity gap between students affected by poverty and students not affected by poverty, then it is important that they (1) define what ‘affected by poverty’ means, (2) identify the type of project design that will help them understand if they are moving towards their goals, and (3) determine whether their data collection allows them to disaggregate these demographics (see Walkthrough 03). We can then make sure the analyses take these disaggregations into account. The final report should be conscientious of any potential blind spots we may have about the results, as all data is biased and can only ever tell a partial story.
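
To make the idea of disaggregation concrete, here is a minimal sketch in Python using only the standard library. The records are made up for illustration, and the `frl` field (eligibility for free or reduced-price lunch) is only one hypothetical way an organization might define "affected by poverty"; your own definition may differ.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only; "frl" (free or reduced-price lunch
# eligibility) is one hypothetical way to define "affected by poverty".
students = [
    {"id": 1, "frl": True, "score": 71},
    {"id": 2, "frl": False, "score": 84},
    {"id": 3, "frl": True, "score": 65},
    {"id": 4, "frl": False, "score": 90},
    {"id": 5, "frl": True, "score": 77},
]


def mean_by_group(records, group_key, value_key):
    """Return the mean of value_key separately for each level of group_key."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row[value_key])
    return {level: mean(values) for level, values in groups.items()}


# The overall mean alone says nothing about differences between groups.
overall = mean(s["score"] for s in students)

# Disaggregating by the poverty indicator makes the gap visible.
by_frl = mean_by_group(students, "frl", "score")
gap = by_frl[False] - by_frl[True]

print(f"overall mean: {overall:.1f}")
print(f"mean by group: {by_frl}")
print(f"gap: {gap}")
```

In this toy data the overall mean is 77.4, while the two groups average 71 and 87. That 16-point difference would stay hidden if the data collection did not record the grouping variable in the first place.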

Thoughtful and deliberate data science can help us understand what to do so our students reach their highest potential. Data science can make us more efficient in our tasks. It can increase transparency about what we are doing to help our students. It can also help monitor how we are progressing. However, we must continuously inspect our processes and work to make sure we do not do unintentional harm.

R and RStudio, both freely available and open, also serve to increase equity in data science. Unlike proprietary tools, they are accessible to anybody with a computer and an internet connection. The code behind the packages is available online, opening up the “black box” of research. If code is submitted alongside analyses and reports, we can see what decisions were made to produce the analysis and rerun the analysis ourselves. Using R can enable more audiences to learn, understand, and reuse analyses.

4.2.6 The Complex Nature of Education Data

Education data are complex to collect and to analyze. Education data are often hierarchical, in that data at multiple “levels” are collected. These levels include classrooms, schools, districts, states, and countries - quite the hierarchy! Additionally, education datasets often require linking with other datasets. For example, when data is collected on students at the school level, it might be important to also know about the training of the teachers in the school. Contextual data about the funding provided by the community in terms of per-pupil spending would be helpful to merge with data about the academic outcomes of students in that school district. The complexity does not end when the data are collected and merged with other relevant information: education data are not simple.
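
Linking a student-level dataset to a school-level one can be sketched as a simple join on a shared identifier. The example below is a minimal illustration in Python using only the standard library; the records and every field name (`school_id`, `per_pupil_spending`, and so on) are hypothetical.

```python
# Made-up records for illustration; all field names are hypothetical.
students = [
    {"student_id": 1, "school_id": "A", "score": 78},
    {"student_id": 2, "school_id": "A", "score": 85},
    {"student_id": 3, "school_id": "B", "score": 69},
    {"student_id": 4, "school_id": "C", "score": 91},
]
schools = [
    {"school_id": "A", "per_pupil_spending": 11200},
    {"school_id": "B", "per_pupil_spending": 9800},
]


def left_join(left, right, key):
    """Attach the matching right-hand row to each left-hand row.

    Rows on the left with no match are kept, with None for the right-hand
    fields, so that unmatched students are visible rather than dropped.
    """
    lookup = {row[key]: row for row in right}
    right_fields = [f for f in right[0] if f != key]
    joined = []
    for row in left:
        match = lookup.get(row[key])
        extra = {f: (match[f] if match else None) for f in right_fields}
        joined.append({**row, **extra})
    return joined


linked = left_join(students, schools, "school_id")
unmatched = [r["student_id"] for r in linked if r["per_pupil_spending"] is None]

for row in linked:
    print(row)
print("students with no school record:", unmatched)
```

Keeping unmatched rows is a deliberate choice in this sketch. A join that silently drops students whose school is missing from the second dataset can bias every result that follows.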

Education data includes characteristics and behaviors of students, teachers, and other individuals. Often, the variables gathered are numeric, but just as often, they are not. Many are categorical. A categorical variable is a descriptive type of variable with multiple levels for which the levels do not signify quantity but instead signify groups, such as sex or grade level. It is not quite right to interpret these data as numeric. Education data can also involve open-ended responses that are stored as string variables (a type of variable used to store text), or even recordings that consist of audio and video data. All of these types of data present challenges to the data scientist in education. As with the diversity of training for data scientists, however, the complexity of education data also presents opportunities for educators to creatively approach their tasks. There are specific techniques to efficiently handle each type of data listed above, and we will explore some of those techniques in this book.
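
The difference between numeric, categorical, and string variables shows up in which operations make sense for each. The following is a minimal sketch in Python using only the standard library; the survey rows and field names are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Made-up survey rows for illustration; the field names are hypothetical.
responses = [
    {"grade_level": "9", "hours_homework": 4.0, "comment": "More lab time, please."},
    {"grade_level": "10", "hours_homework": 6.5, "comment": ""},
    {"grade_level": "9", "hours_homework": 3.5, "comment": "I liked the group projects."},
    {"grade_level": "12", "hours_homework": 8.0, "comment": "Too much homework."},
]

# Numeric variable: arithmetic such as a mean is meaningful.
mean_hours = mean(r["hours_homework"] for r in responses)

# Categorical variable: the levels are group labels, so count them rather
# than average them. Storing them as text guards against accidental arithmetic.
grade_counts = Counter(r["grade_level"] for r in responses)

# String variable: free text needs its own handling; here, simply a word
# count for each non-empty response.
word_counts = [len(r["comment"].split()) for r in responses if r["comment"]]

print(mean_hours)
print(grade_counts)
print(word_counts)
```

Averaging `grade_level` would run without error if the values were stored as numbers, which is why it helps to decide each variable's type before analysis begins.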

If you are faced with a large and complicated dataset, you might begin by asking yourself what you are curious about and carving out just a couple of variables that you can use to answer your question. Your colleague might be interested in an entirely different question, and might consider different variables from the same dataset in their analysis. The complexity of education data need not discourage educators from pursuing their interests.
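
Carving out a couple of variables can be as simple as reading a file and keeping only the columns your question needs. This is a minimal sketch in Python using only the standard library. A small in-memory table stands in for a large file, and the column names and the example question are hypothetical.

```python
import csv
import io

# A small made-up file stands in for a large dataset; in practice you would
# open a real file instead. Column names are hypothetical.
raw = io.StringIO(
    "student_id,school,grade_level,attendance_rate,math_score,reading_score\n"
    "1,A,9,0.96,78,81\n"
    "2,A,10,0.88,85,79\n"
    "3,B,9,0.91,69,74\n"
)

# The question "is attendance related to math scores?" needs only two columns.
keep = ["attendance_rate", "math_score"]

# Read each row as a dictionary and retain only the chosen columns,
# converting the text values to numbers along the way.
subset = [
    {name: float(row[name]) for name in keep}
    for row in csv.DictReader(raw)
]
print(subset)
```

A colleague with a different question would change only the `keep` list and leave the original file untouched.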

4.2.7 Analytic Considerations

Due to the particular nature of education data, analysis can be difficult, too. The data is often not ready to be used: it may be in a format that is difficult to open without specialized software, or it may need to be “cleaned” before it is usable. In data science, “cleaning” or “processing” data refers to reorganizing or restructuring the dataset to make it easier to analyze. This process would be analogous to the steps you would take if you received an Excel spreadsheet but found that the columns were in an order that didn’t make sense to you and that there were some duplicate columns. The process you’d go through to reorganize the data to make it logical is data cleaning. Because of the different types of data, data scientists must often use a variety of analytic approaches, such as multi-level models, models for longitudinal data, or even models and analytic approaches for text data. In later chapters of this book, you will learn more specifics about building models.
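
The spreadsheet analogy above, with duplicate columns and an unhelpful column order, can be sketched directly. This is a minimal illustration in Python using only the standard library, with a made-up table. It treats a column as a duplicate when its values exactly repeat an earlier column, which is a simplifying assumption: in a small table, two genuinely different variables could coincide by chance.

```python
# A made-up table for illustration: a header row followed by data rows.
# "math_score_copy" repeats "math_score", and the ID column is last.
header = ["math_score", "name", "math_score_copy", "student_id"]
rows = [
    [78, "Ana", 78, 1],
    [85, "Ben", 85, 2],
    [69, "Cal", 69, 3],
]


def drop_duplicate_columns(header, rows):
    """Drop any column whose values exactly repeat an earlier column."""
    seen = set()
    keep = []
    for i in range(len(header)):
        values = tuple(row[i] for row in rows)
        if values not in seen:
            seen.add(values)
            keep.append(i)
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]


def reorder_columns(header, rows, order):
    """Rearrange columns to follow the names listed in order."""
    idx = [header.index(name) for name in order]
    return [header[i] for i in idx], [[row[i] for i in idx] for row in rows]


header, rows = drop_duplicate_columns(header, rows)
header, rows = reorder_columns(header, rows, ["student_id", "name", "math_score"])

print(header)
for row in rows:
    print(row)
```

Writing the cleaning steps as code rather than rearranging the spreadsheet by hand means the same steps can be rerun when next year's file arrives in the same messy layout.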

4.3 Conclusion

While there are many challenges to working with education data, there are many opportunities as well. Once they unlock the power of data science to reveal insights about their organizational context (their students, their policies, etc.), many practitioners in education will become more interested in gathering more data and continuing on this path. Data science becomes a useful tool to help connect with the purpose of your job. Once you begin to rely on data science, it can be hard to stop! As a data scientist in education, remember that you are more closely acquainted with your context than any outside analyst could ever be. This affords you the unique opportunity to become the data and analysis guru in your area.

In summary, educators who want to evolve their data analysis processes into something practical and meaningful to student progress will need to address some unique challenges in order to help all stakeholders understand the benefits of the questions being answered with data. That hard work will pay off.