Data science in education presents many opportunities, like those discussed in Chapter 3, but also many challenges. These are varied, and while some are common to all domains in which data science is carried out, others are very particular to the field of education. For example, data science in education includes not only accessing, processing, and modeling data, but also social and cultural factors, like the training and support that educational data scientists have available to them.
Because data science in education is relatively new, it’s understandable that school staff may be wary of how data is collected and analyzed. It’s common for them to question how data is used, particularly if it is used to describe and evaluate staff and student performance. One of the biggest challenges arises when individuals are concerned that they are being evaluated by unclear or unfair metrics. Often, “data-driven” efforts mean different things to administrators and educators. To an administrator, a data-driven effort might be an endeavor to better understand the strengths and weaknesses of pre-existing systems, with an eye to eventually proposing new, more efficient systems. To an educator, a data-driven effort might feel like an approach that masks the individuality of students by reducing them to numbers. Neither perspective is exactly correct. Maximizing efficiency and attending to students’ individual needs should certainly be goals of educators and educational administrators, but data science is more than either view suggests: it is a versatile tool that can be leveraged to help answer a variety of meaningful questions. This chapter will present some thoughts to consider when adopting data science in educational contexts.
Data scientists everywhere are combining content knowledge, programming, and statistics to solve problems. However, many people are not experts in all three areas when they begin their data science journeys, and you are not alone if your programming skills are lacking. Learning to code can seem like a daunting task, but we don’t want you to feel paralyzed. We wrote this book for R learners without a computer science background or even any informal coding training. The great thing about entering a field as flexible as data science is that you are joining a vast crowd of self-taught individuals, and you will find that there is a very supportive online community to help you.
Educators often feel wary of data science processes because of their ambiguity. One way to address this concern is to build analytic processes that are transparent. Specifically, it is helpful if the data scientist in education is open about what data is collected, how it is collected, how it is analyzed, and how it is considered alongside other data when used in decision-making conversations. This transparency can be achieved through many activities, including having regular conversations about analytic methods, providing written reports describing data collection, and receiving input about analytic goals from staff members.
One such process for achieving openness in data collection and analysis is called “reproducible research”. Reproducible work (Wikipedia, 2020) means that a completed analysis comes with all the necessary materials, including a description of methodology and programming code, needed for someone else to run the analysis and achieve the same results.
A reproducible approach can be especially beneficial in transition periods. If a data-science-in-school advocate leaves their original position, they would leave behind not just descriptions of the analyses that they did, but also the specific files needed to run the analyses again. The new individual who takes their place will be able to seamlessly transition into the new role. If asked to run “the same report I always got from your predecessor”, the new person will understand immediately what files were needed to create that original report and will be able to request all necessary data to generate a new version of the report.
To implement a reproducible approach in your organization, you can start by keeping all files related to each project in that project’s own folder. As you create reports from the data, keeping notes in the files will help you easily generate similar reports in the future. Many educators find that even though a change in administration might mean a change in requests, careful documentation of past processes lets them use data to answer those requests more efficiently.
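One way to picture the one-folder-per-project habit is the minimal R sketch below. The project name, the subfolder names, and the contents of the notes file are all hypothetical, and the folder is created in a temporary directory so the example is safe to run; in practice you would choose a location and layout that suit your organization.

```r
# Hypothetical project name and layout; adjust to your own conventions.
# tempdir() is used only so this example does not touch your real files.
project_dir <- file.path(tempdir(), "attendance-report")
subfolders <- c("data", "scripts", "output")

# Create one folder for the project, with a subfolder for each kind of file.
for (folder in subfolders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Keep notes next to the files so that a successor can rerun the report.
notes <- c(
  "Project: attendance report (hypothetical example)",
  "Data: place the raw export in the data/ folder",
  "To reproduce: run the scripts in scripts/ in numbered order"
)
writeLines(notes, file.path(project_dir, "README.txt"))

# List what was created.
list.files(project_dir)
```

Someone who inherits this folder can read the notes file first and see where the data, the code, and the finished reports live.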
One consideration when adopting data science strategies in educational contexts is that, in some environments, there is no precedent for a data science approach. It is not common, for example, for a teacher to be conducting regression analyses on data. However, it’s not necessary to wait for a district-wide or state-wide initiative to begin to implement the techniques you will learn in this book.
An organization should encourage its staff to do their own data analyses, primarily to test their own hypotheses. In a school, for example, a teacher might wonder about student learning in their classroom and might want to use data to guide decisions about how they deliver instruction. There are at least two benefits to this approach. First, staff begin to realize the value of doing data analysis as an ongoing inquiry into their outcomes, instead of a special event once a year ahead of school board presentations. Second, and more importantly for reducing apprehension around data analysis in schools, the process of data analysis becomes demystified for school staff. When school staff collect and analyze their own data, they know exactly how it is collected and exactly how it is analyzed. The long-term effect of this self-driven analytic approach might be more openness to analysis, whether it is self-driven or conducted by the school district.
Building and establishing a data governance system that advocates for an open and transparent analytic process is difficult and long-term work, but the likely result will be less apprehension about how data is used and more ways for school staff to participate in the analysis. Here are some practical steps a school district can take towards building a more open approach to analysis:
- Make technical write-ups of data analyses available so interested parties can learn more about how data was collected and analyzed
- Make datasets available to staff within the organization, to the extent that privacy laws and policies allow
- Establish an expectation that analysts present their work in a way that is accessible to many levels of data experience
- Hold regular forums to discuss how the organization collects and uses data
By adopting a self-driven analytic approach, individuals can help their education organization embrace the potential of using data to anticipate and possibly forestall problems in the future.
Educators have concerns about the ambiguity of data science processes in part because we do not yet have a good idea of the best practices in our field. While there is a body of past research on students’ work with data (see Lee & Wilkerson (2018) for a review), there is limited information from case- or design-based research on how others in education (teachers, administrators, and data scientists) use data in their work. This challenge is reflected in part in the variability in the job titles of those who work with data: some are data analysts, some are research associates, and the list continues. However, as educational data science emerges as a field, some school districts are now hiring for data scientist positions. Even so, there is no organizing body that brings all these people together. There is a multitude of discipline-specific (e.g., science teaching) or department-specific (e.g., institutional research) conferences, but there are no overarching norms shared by all those who work with data in education.
Education is a field that is rich with data: survey, assessment, written, and policy and evaluation data, and more. Nevertheless, there often is no consensus on processes and procedures for educators and data scientists to share data and the results of data analysis with each other. Academic and research settings can sometimes lead to silos of information. A group of researchers at one university could do a survey, and another group doing similar work may not see the results until the study is published, years later. Sometimes, the second group never even becomes aware of the survey. The good news is that many education organizations are both curious and passionate about supporting student success. Even if many separate data collection efforts are being implemented (rather than one unified strategy), you will likely not be dealing with the problem of “I don’t have enough data to analyze”. As a pioneer for data science in your organization, you can help to streamline these redundant processes and can offer your skills to help make sense of the wealth of information already being gathered.
Right now, there are limited opportunities for those working in education to build their capabilities in educational data science (though this is changing; see Anderson and colleagues’ work to create an educational data science certificate program at the University of Oregon and Baker’s educational data mining Massive Open Online Course offered through Coursera). Many educational data scientists have been trained in fields other than statistics, business analytics, or research. Moreover, the particular tools and approaches in which educational data scientists have been trained are highly varied. However, this diversity of training and background positions educators to tackle educational challenges creatively.
Data science can be used to inform decisions that reduce inequities in the education system. However, it can also be used to exacerbate the marginalization of students we want to serve. An example is an algorithm that is not transparent, that is implemented poorly, and that prompts people to make decisions that have adverse effects.
As data scientists in education, we must fully understand how our organization defines equity before beginning an analysis. Additionally, we should formulate clear equity goals and consider the ways we will continuously check our biases. After defining equity and our equity goals, we can work to ensure that our data science life-cycle reflects what we are trying to learn.
Thoughtful decisions during project design and during data collection, analysis, and presentation can increase the data’s ability to move an organization towards its equity goals. For example, if an organization hopes to decrease the opportunity gap between students affected by poverty and students not affected by poverty, then it is important that it (1) define what “affected by poverty” means, (2) identify the type of project design that will help it understand whether it is moving towards its goals, and (3) determine whether its data collection allows it to disaggregate results by these demographics (see Walkthrough 3). The organization can then make sure the analyses take these disaggregations into account. The final report should acknowledge any potential blind spots we may have about the results, as all data is biased and can only ever tell a partial story.
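To make the idea of disaggregation concrete, here is a minimal base R sketch. The data are made up purely for illustration, and the column names (`poverty_status`, `score`) are assumptions rather than fields from any real dataset; how “affected by poverty” is actually defined remains a decision for the organization.

```r
# Made-up data for illustration only; column names are hypothetical.
students <- data.frame(
  student_id = 1:8,
  poverty_status = c("affected", "affected", "not affected", "not affected",
                     "affected", "not affected", "affected", "not affected"),
  score = c(72, 65, 88, 79, 70, 91, 68, 85)
)

# A single overall mean hides any difference between the groups.
overall_mean <- mean(students$score)

# Disaggregate: compute the mean score and the group size for each group.
group_means <- tapply(students$score, students$poverty_status, mean)
group_sizes <- table(students$poverty_status)

by_group <- data.frame(
  poverty_status = names(group_means),
  mean_score = as.vector(group_means),
  n = as.vector(group_sizes[names(group_means)])
)

overall_mean
by_group
```

Reporting the group sizes alongside the means helps readers judge how much weight a difference between small groups can bear.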
R and RStudio, both freely available and open, also serve to increase equity in data science. As opposed to proprietary tools, they are accessible to anybody with a computer and internet. The code behind the packages is available online, opening up the “black box” of research. If code is submitted alongside analyses and reports, we can see what decisions were made to produce the analysis and rerun it ourselves. Using R can enable more audiences to learn, understand, and reuse analyses.
Thoughtful and deliberate data science can help us understand what to do so our students reach their highest potential. Data science can make us more efficient in our tasks. It can increase transparency about what we are doing to help our students. It can also help monitor how we are progressing. However, we must continuously inspect our processes and work to make sure we do not do unintentional harm.
Education data are difficult to collect and to analyze. They are often hierarchical, in that data are collected at multiple “levels”. These levels include classrooms, schools, districts, states, and countries: quite the hierarchy! Additionally, an education dataset often requires linking with other datasets. For example, when data are collected on students at the school level, it might be important to also know about the preparation of the teachers in the school. Contextual data about the funding provided by the community in terms of per-pupil spending would be helpful to merge with data about the educational outcomes of students in that school district. The complexity does not end when the data are collected and merged with other relevant information: education data are not simple.
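As a rough illustration of linking datasets collected at different levels, the base R sketch below merges made-up student-level records with made-up school-level funding records. The column names and values are hypothetical, and `merge()` is only one of several ways to join tables in R.

```r
# Made-up student-level data; column names are hypothetical.
students <- data.frame(
  student_id = 1:5,
  school_id = c("A", "A", "B", "B", "C"),
  score = c(81, 74, 90, 67, 78)
)

# Made-up school-level data. School "C" is deliberately missing.
schools <- data.frame(
  school_id = c("A", "B"),
  per_pupil_spending = c(11200, 9800)
)

# Left join: keep every student, even when the school has no funding record.
linked <- merge(students, schools, by = "school_id", all.x = TRUE)

# Check which students could not be matched to a school record.
unmatched <- linked[is.na(linked$per_pupil_spending), ]

linked
unmatched
```

Checking for unmatched rows after a merge is a small habit that catches missing or misspelled identifiers before they affect an analysis.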
Often, the variables gathered in education are numeric, but just as often they are not. Education data involves characteristics of students, teachers, and other individuals that are categorical. A categorical variable is a descriptive type of variable with multiple levels for which the levels do not signify quantity but instead signify groups, such as sex or grade level. It is not quite right to interpret these data as numeric. Additionally, education data can involve open-ended responses that are stored as string variables (a type of variable used to store text), or even recordings that consist of audio and video data. All these types of data present challenges to the data scientist in education. As with the diversity of training for educational data scientists, though, the complexity of educational data also presents opportunities for educators to creatively approach their tasks. There are specific techniques to efficiently handle each type of data listed above, and we will explore some of those techniques in this book.
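The short base R sketch below shows, under made-up data, why it matters how a categorical variable is stored. The grade codes and survey responses are invented for illustration; `factor()` is base R’s way of marking a variable as categorical.

```r
# Grade levels stored as numeric codes (made-up values).
grade_code <- c(9, 10, 10, 11, 12, 9)

# R will happily average the codes, but an "average grade level"
# is rarely a meaningful quantity.
mean(grade_code)

# Storing the variable as a factor tells R that the values are groups.
grade <- factor(
  grade_code,
  levels = c(9, 10, 11, 12),
  labels = c("Grade 9", "Grade 10", "Grade 11", "Grade 12")
)

# Counting students per group is the natural summary for a factor.
grade_counts <- table(grade)
grade_counts

# Open-ended responses are stored as character strings (made-up examples).
responses <- c("I liked the labs", "More group work, please")
is.character(responses)
```

Once a variable is a factor, summaries such as counts per group are the natural way to describe it.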
The complexity of education data need not discourage educators from pursuing their interests. If you are faced with a large and complicated dataset, you might begin by asking yourself what you are curious about and then carving out just a couple of variables that you can use to answer your question. Your colleague might be interested in an entirely different question and might consider different variables from the same dataset in their analysis.
There are many ethical and legal concerns in working responsibly with education data. At the K–12 level, most datasets require safeguards because youth are a protected population. There might be physical limitations to the places from which a data scientist in education could access confidential data, and there might be limitations on the ways that results of a data analysis can be shared with others within the organization. A closely related issue concerns the aims of education within predetermined constraints. Those working in education often seek to improve it and often work to do so with a scarcity of school and community resources. These ethical, legal, and even values-related concerns may become amplified as the role of data in education increases. They should be carefully considered and emphasized from the outset by those involved in educational data science. If you feel resistance in your organization as you begin to adopt the principles you learn in this book, you might begin by offering to analyze “de-identified” or “anonymous” data. In this way, you show your administration what is possible and foster additional buy-in further down the road.
Analyzing education data can be difficult, too. The data is often not ready to be used: it may be in a format that is difficult to open without specialized software, or it may need to be “cleaned” before it is usable. In data science, “cleaning” or “processing” data refers to reorganizing or restructuring the dataset to make it easier to analyze. This process would be analogous to the steps you would take if you received an Excel spreadsheet but found that the columns were in an order that didn’t make sense to you and that there were some duplicate columns. The process you’d go through to reorganize the data to make it logical is data cleaning. Because of the different types of data, the data scientist in education must often use a variety of analytic approaches, such as multilevel models, models for longitudinal data, or even models and analytic approaches for text data. In later chapters of this book, you will learn more specifics about building models.
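The spreadsheet analogy above can be sketched in a few lines of base R. The data frame below is made up, its column names are hypothetical, and real cleaning usually involves more steps than these two; the sketch only drops a column that exactly duplicates another and then puts the remaining columns in a more logical order.

```r
# Made-up "raw" data: columns in an awkward order, with one duplicate column.
raw <- data.frame(
  score = c(88, 92, 75),
  name = c("Ana", "Ben", "Caz"),
  score_copy = c(88, 92, 75),
  student_id = c(101, 102, 103)
)

# Drop any column whose contents exactly duplicate an earlier column.
# as.list() lets duplicated() compare whole columns to one another.
deduped <- raw[, !duplicated(as.list(raw)), drop = FALSE]

# Reorder the remaining columns so identifiers come first.
cleaned <- deduped[, c("student_id", "name", "score")]

cleaned
```

Note that this check only finds columns whose values are identical; columns that are near-copies (for example, the same scores with different rounding) would need a closer look.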
While there are many challenges to working with education data, there are many opportunities as well. Once they unlock the power of data science to reveal insights about their organizational context (their students, their teaching, etc.), many educators will become more interested in gathering data and continuing on this path. Data science becomes a useful tool to help connect with the purpose of your job. Once you begin to rely on data science, it can be hard to stop! As an educational professional, remember that you are more closely acquainted with your context than any outside analyst could ever be. This affords you the unique opportunity to become the data and analysis guru in your area.
In summary, educators who want to evolve their data analysis processes into something practical and meaningful to student progress will need to address some unique challenges so that all stakeholders understand the benefits of answering questions with data. That hard work will pay off.