9 Walkthrough 3: Using School-Level Aggregate Data to Illuminate Educational Inequities

9.1 Topics Covered

  • Cleaning data
  • Tidying data
  • Exploring data
  • Visualizing data

9.2 Functions Introduced

  • race_pdf()

9.3 Vocabulary

  • aggregate data
  • disaggregated data
  • data frame
  • Free/Reduced Price Lunch (FRPL)
  • histogram
  • lists
  • subgroup
  • trim
  • weighted average

9.4 Chapter Overview

Data scientists working in education don’t always have access to student level data, so knowing how to model aggregate datasets is very valuable. This chapter explores what aggregate data is, and how to access, clean, and explore it.

9.4.1 Background

A common situation encountered when searching for education data, particularly by analysts who are not directly working with schools or districts, is the prevalence of publicly available, aggregate data. Aggregate data refers to numerical information (or non-numerical information, such as the names of districts or schools) that has the following characteristics:

  1. collected from multiple sources and/or on multiple measures, variables, or individuals and
  2. compiled into data summaries or summary reports, typically for the purposes of public reporting or statistical analysis (Schools, n.d.)

Examples of publicly available aggregate data include school-level graduation rates, state test proficiency scores by grade and subject, or mean survey responses. In this walkthrough, we explore the role of aggregate data, with a focus on educational equity.

Aggregate data is essential both for accountability purposes and for providing useful information about schools and districts to those who are monitoring them. For example, district administrators might aggregate row-level (also known as individual-level or student-level) enrollment reports over time. This allows them to see how many students enroll in each school, in the district overall, and any grade-level variation. Depending on their state, the district administrator might submit these aggregate data to their state education agency (SEA) for reporting purposes. These datasets might be posted on the state’s department of education website for anyone to download and use.

Federal and international education datasets provide additional information. In the US, some federal datasets aim to consolidate important metrics from all states. This can be useful because each state has its own repository of data and to go through each state website to download a particular metric is a significant effort. The federal government also funds assessments and surveys which are disseminated to the public. However, federal datasets often have more stringent data requirements than the states, so the datasets may be less usable.

For data scientists in education, these reports and datasets can be analyzed to answer questions related to their field of interest. However, doing so is not always straightforward. Publicly available, aggregate datasets are large and often suppressed to protect privacy. Sometimes they are already a couple of years old by the time they’re released. Because of their coarseness, they can be difficult to interpret and use. Generally, aggregate data is used to surface broader trends and patterns in education as opposed to diagnosing underlying issues or making causal statements. It is very important that we consider the limitations of aggregate data first before analyzing it.

Analysis of aggregate data can help us identify patterns that may not have previously been known. When we have gained new insight, we can create research questions, craft hypotheses around our findings, and make recommendations on how to improve for the future.

We want to take time to explore aggregate data since it’s so common in education but can also be challenging to meaningfully used. This chapter and the following chapter provide two different examples of cleaning an aggregate dataset and of using aggregate datasets to compare student experiences. In this chapter, we’ll focus on educational equity by identifying and comparing patterns in student demographic groups. In the next chapter, we’ll compare student counts longitudinally (or over time) in different states.

What is the Difference Between Aggregate and Student-Level Data?

Let’s dig a little deeper into the differences between aggregate and student-level data. Publicly available data - like the data we’ll use in this walkthrough - is a summary of student-level data. That means that student-level data is totaled to protect the identities of students before making the data publicly available. We can use R to demonstrate this concept.

Here are rows in a student-level dataset:

## # A tibble: 10 x 3
##    student school test_score
##    <chr>   <chr>       <int>
##  1 a       k              84
##  2 b       l              64
##  3 c       m              61
##  4 d       n              46
##  5 e       o              42
##  6 f       k              76
##  7 g       l              79
##  8 h       m              94
##  9 i       n              56
## 10 j       o              96

Aggregate data totals up a variable - the variable test_score in this case - to “hide” the student-level information. The rows of the resulting dataset represent a group. The group in our example is the school variable:

## # A tibble: 5 x 2
##   school mean_score
##   <chr>       <dbl>
## 1 k            25.5
## 2 l            30.5
## 3 m            48  
## 4 n            65.5
## 5 o            15

Notice here that this dataset no longer identifies individual students.

Disaggregating Aggregated Data

Aggregated data can tell us many things, but in order for us to better examine subgroups (groups that share similar characterstics), we must have data disaggregated by the subgroups we hope to analyze. This data is still aggregated from row-level data but provides information on smaller components than the grand total (???). Common disaggregations for students include gender, race/ethnicity, socioeconomic status, English learner designation, and whether they are served under the Individuals with Disabilities Education Act (IDEA) (???).

Disaggregating Data and Equity

Disaggregated data is essential to monitor equity in educational resources and outcomes. If only aggregate data is provided, we are unable to distinguish how different groups of students are doing and what support they need. With disaggregated data, we can identify where solutions are needed to solve disparities in opportunity, resources, and treatment.

It is important to define what equity means to your team so you know whether you are meeting your equity goals.

9.5 Data Sources

There are many publicly available aggregate datasets related to education. On the international level, perhaps the most well-known is PISA:

On the federal level, well-known examples include:

  • Civil Rights Data Collection (CRDC) (https://www2.ed.gov/about/offices/list/ocr/data.html), which reports many different variables on educational program and services disaggregated by race/ethnicity, sex, limited English proficiency, and disability. These data are school-level.

  • Common Core of Data (CCD) (https://www2.ed.gov/about/offices/list/ocr/data.html), which is the U.S. Department of Education’s primary database on public elementary and secondary education.

  • EdFacts (https://www2.ed.gov/about/inits/ed/edfacts/data-files/index.html), which includes state assessments and adjusted cohort graduation rates. These data are school- and district-level.

  • Integrated Postsecondary Education Data System (IPEDS) (https://nces.ed.gov/ipeds/), which is the U.S. Department of Education’s primary database on postsecondary education.

  • National Assessment for Educational Progress (NAEP) Data (https://nces.ed.gov/nationsreportcard/researchcenter/datatools.aspx), which is an assessment of educational progress in the United States. Often called the “nation’s report card,” the NAEP reading and mathematics assessments are administered to a representative sample of fourth- and eighth-grade students in each state every two years.

At the state and district levels, two examples include:

  • California Department of Education (https://www.cde.ca.gov/ds/), which is the state department of education website. It includes both downloadable CSV files and “Data Quest”, which lets you query the data online.

  • Minneapolis Public Schools (https://mpls.k12.mn.us/reports_and_data), which is a district-level website with datasets beyond those listed in the state website.

Selecting Data

For the purposes of this walkthrough, we will be looking at a particular school district’s data; in the next, we will “zoom out” to look across states in the United States.

The district we focus on here reports their student demographics in a robust, complete way. Not only do they report the percentage of students in a subgroup, but they also include the number of students in each subgroup. This allows a deep look into their individual school demographics. Their reporting of the composition of their schools provides an excellent opportunity to explore inequities in a system.

9.5.1 Methods

In this chapter, we will walk through how running analyses on data from a single district can help education data practitioners to understand and describe the landscape of needs and opportunities present there. As opposed to causal analyses, which aim to assess the root cause of an phenomenon or the effects of an intervention, we use descriptive analysis on an aggregate dataset to find out whether there is a phenomenon present, what it is, and what may be worth trying to address through future supports, reforms, or interventions (???).

9.6 Load Packages

As usual, we begin our code by calling the packages we will use.

ROpenSci created the {tabulizer} (https://github.com/ropensci/tabulizer) package (Leeper, 2018) which provides R bindings to the Tabula java library, which can be used to computationally extract tables from PDF documents. {rJava} (Urbanek, 2019) is a required package to load {tabulizer}. Unfortunately, installing {rJava} can be very tedious.

If you find yourself unable to install {rJava}, or would like to go straight to the data processing, you can skip the steps requiring {tabulizer}. We provide the raw and processed data in the {dataedu} package below.

9.7 Import Data

We have three options of getting the data:

  1. We can use {tabulizer}, which pulls the PDF data into lists using extract_tables().
  2. We can get the data from the book’s Github repository(https://github.com/data-edu/data-science-in-education/tree/master/data/agg_data). If you set up the folders in your working directory in the same way they are in the book, where there’s a folder called data, then a folder called agg_data inside of data, then the file race_pdf.Rds in agg_data, then you can run the code below and load the data using here(). Otherwise, you will have to change the file path inside of here() to match where the data is stored on your working directory.
  3. Finally, you can get the data from the {dataedu} package.

We then transform the list to a data frame by first making the matrix version of the PDF’s into a tibble by using map(as_tibble()). Then, we use the map_df() function then turns these tibbles into a single data frame. The slice() inside of map_df() removes unnecessary rows from the PDF. Finally, we create readable column names using set_names() (otherwise, they look like ...1, ...2, etc.).

For the Race/Ethnicity table, we want the totals for each district school as we won’t be looking at grade-level variation. When analyzing the PDF, we see the school totals have “Total” in school_name.

We clean up this dataset by:

  1. Removing unnecessary or blank columns using select(). Negative selections means those columns will be removed.
  2. Removing all Grand Total rows (otherwise they’ll show up in our data when we just want district-level data) using filter(). We keep schools that have “Total” in the name but remove any rows that are Grand Total.
  3. Then we trim white space from strings using trimws().
  4. The data in the percentage columns are provided with a percentage sign. This means percentage was read in as a character. We will have to remove all of the non-numeric characters to be able to do math with these columns (for example, to add them together). Also, we want to divide the numbers by 100 so they are in decimal format.

Now, we will import the Free Reduced Price Lunch (FRPL) PDF’s.

FRPL stands for Free/Reduced Price Lunch and is often used as a proxy for poverty (???). Students from a household with an income up to 185 percent of the poverty threshold are eligible for free or reduced price lunch. (Sidenote: definitions are very important for disaggregated data. FRPL is used because it’s ubiquitous but there is debate as to whether it actually reflects the level of poverty among students.)

Similar to the Race/Ethnicity PDF, we take the matrix output, turn it into tibbles, then create a signle dataframe. There are rows that we don’t need from each page, which we remove using slice(). Then, we create column names that can be easily understood.

To clean the dataset up further, we remove the rows that are blank. When looking at the PDF, we notice that there are aggregations inserted into the table that are not district-level. For example, the report includes ELM K_08, presumably to aggregate FRPL numbers up to the K-8 level. Although this is useful data, we don’t need it for this district-level analysis. There are different ways we can remove these rows but we will just filter them out by using ! before the variable name.

Because we want to look at race/ethnicity data in conjunction with free/reduced price lunch percentage, we join the two datasets by the name of the school. We want our student counts and percentages to be numeric, so apply as.numeric to multiple columns using mutate_at().

Did you notice? The total number of students from the Race/Ethnicity table does not match the total number of students from the FRPL table, even though they’re referring to the same districts in the same year. Why? Perhaps the two datasets were created by different people, who used different rules when aggregating the dataset. Perhaps the counts were taken at different times of the year, and students may have moved around in the meantime. We don’t know but it does require us to make strategic decisions about which data we consider the ‘truth’ for our analysis.

Now we move on to the fun part of creating new columns based on the merged dataset using mutate().

  1. We want to calculate, for each race, the number of students in ‘high poverty’ schools. This is defined by NCES as schools that are over 75% FRPL (Education Statistics U.S. Department of Education, 2019). When a school is over 75% FRPL, we count the number of students for that particular race under the variable [racename]_povnum.
  2. The {janitor} package has a handy adorn_totals() function that sums columns for you. This is important because we want a weighted average of students in each category, so we need the total number of students in each group.
  3. We create the weighted average of the percentage of each race by dividing the number of students by race by the total number of students.
  4. To get FRPL percentage for all schools, we have to recalculate frpl_pct (otherwise, it would not be a weighted average).
  5. To calculate the percentage of students by race who are in high poverty schools, we must divide the number of students in high poverty schools by the total number of students in that race.

To facilitate the creation of plots later on, we also put this data in tidy format using pivot_longer().

Running the code above, particularly the download of the PDFs, takes a lot of time. We’ve saved copies of the merged and tidy data in the book’s Github repository and {dataedu}. To access them, you can run the code below.

9.8 View Data

9.8.1 Discovering Distributions

What do the racial demographics in this district look like? A barplot can quickly visualize the different proportion of subgroups.

Percentage of Population by Subgroup

Figure 9.1: Percentage of Population by Subgroup

When we look at these data, the district looks very diverse. Almost 40% of students are Black and around 36% are White. Note that this matches the percentages provided in the original PDF’s. This shows our calculations above were accurate. Hooray!

frpl_pct is the percentage of the students in the district that are eligible for FRPL.

## # A tibble: 1 x 3
##   school_name category value
##   <chr>       <chr>    <dbl>
## 1 Total       frpl_pct 0.569

56.9% of the students are eligible for FRPL, compared to the U.S. average of 52.1% (???). This also matches the PDF’s. Great!

Now, we dig deeper to see if there is more to the story.

9.8.2 Analyzing Spread

Another view of the data is visualizing the distribution of students with different demographics across schools. Here is a histogram for the percentage of White students within the schools for which we have data.

Count of Schools by White Population

Figure 9.2: Count of Schools by White Population

26 of the 74 (35%) of schools have between 0-10% White students. This implies that even though the school district may be diverse, the demographics are not evenly distributed across the schools. More than half of schools enroll fewer than 30% of White students even though White students make up 35% of the district student population.

The school race demographics are not representative of the district populations but does that hold for socioeconomic status as well?

9.9 Analysis

9.9.1 Creating Categories

High-poverty schools are defined as public schools where more than 75% of the students are eligible for FRPL. According to NCES, 24% of public school students attended high-poverty schools (Education Statistics U.S. Department of Education, 2019). However, different subgroups are overrepresented and underrepresented within the high poverty schools. Is this the case for this district?

Distribution of Subgroups in High Poverty Schools

Figure 9.3: Distribution of Subgroups in High Poverty Schools

8% of White students attend high poverty schools, compared to 43% of Black students, 39% of Hispanic students, 28% of Asian students, and 45% of Native American students. We can conclude that non-White students are disproportionally attending high poverty schools.

9.9.2 Reveal Relationships

Let’s explore what happens when we correlate race and FRPL percentage by school.

FRPL Percentage vs. White Percentage

Figure 9.4: FRPL Percentage vs. White Percentage

Similar to the result in Creating Categories, there is a strong negative correlation between FRPL percentage and the percentage of White students in a school. That is, high poverty schools appear to have a lower percentage of White students and low poverty schools have a higher percentage of White students.

9.10 Results

Because of the disaggregated data this district provides, we can go deeper than the average of demographics across the district and see what it looks like on the school level. These distinct but closely related views demonstrate that:

  1. There exists a distribution of race/ethnicity within schools that are not representative of the district.
  2. Students of color are overrepresented in high poverty schools.
  3. There is a negative relationship between the percentage of White students in a school and the percentage of students eligible for FRPL.

9.11 Conclusion

This analysis, like all analyses, does not occur in a vacuum. According to the Urban Institute, the disproportionate percentage of students of color attending high poverty schools “is a defining feature of almost all Midwestern and northeastern metropolitan school systems” (???). Among other issues, “high poverty schools tend to lack the educational resources - like highly qualified and experienced teachers, low student-teacher ratios, college prerequisite and advanced placement courses, and extracurricular activities - available in low-poverty schools.” This has a huge impact on these students and their futures.

In addition, research shows that racial and socioeconomic diversity in schools can provide students with a range of cognitive and social benefits. Therefore, the deep segregation that exists in the district can have adverse effects on students.

As a data scientist in education, we can use these data to showcase the inequity in a system and suggest interventions for what we can do to improve the situation in the district. In addition, we can advocate for more datasets such as these, which allow us to dig deep. In the next chapter, we discuss aggregate data further, focusing on how we can use them to understand changes over time.