20 Appendices
This chapter includes three appendices:
- Appendix A: Importing data (associated with Chapter 6)
- Appendix B: Social network influence and selection models (associated with Chapter 12)
- Appendix C: Colophon
20.1 Appendix A: Importing data
This Appendix is provided to serve as a non-exhaustive resource for importing data of different file types into R; it extends some of the techniques introduced in the foundational skills chapter, Chapter 6. We note that while the bulk of the data that we use in this book is available through the {dataedu} package, there are cases where you will be importing a .csv file or scraping data from the web.
20.1.1 Using functions to import data
You might expect an Excel file to be the first type of data that we load, but there is a format that you can open and edit in Excel and that is even easier to move between Excel and R. This format is also supported by SPSS, other statistical software (like Mplus), and other programming languages, like Python. That format is .csv, or a comma-separated values file.
The .csv file is useful because you can open it with Excel and save Excel files as .csv files.
A .csv file contains rows of a spreadsheet with the columns separated by commas, so you can also view it in a text editor, like TextEdit for Macintosh.
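To make the structure described above concrete, here is a minimal sketch using base R only. The column names and values are made up for illustration; they are not from the book's data.

```r
# Three lines of plain text: a header row, then one row per record,
# with the columns separated by commas (hypothetical values)
csv_text <- c("student_id,score",
              "1,90",
              "2,85")

# Write the text to a temporary .csv file, then read it back as a data frame
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)
scores <- read.csv(csv_path)

scores
```

Opening the file at csv_path in a text editor would show exactly the three lines stored in csv_text.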
Not surprisingly, Google Sheets easily converts .csv files into a Sheet, and also easily saves Sheets as .csv files. However, we would be remiss if we didn’t point out that there is a package, {googlesheets4}, which can be used to read a Google Sheet directly into R.
For these reasons, we start with—and emphasize—reading .csv files. To get there, we will download a file from the internet.
20.1.2 Saving a file from the internet
You’ll need to copy this URL:
https://goo.gl/bUeMhV
Here’s what it resolves to (it’s a .csv file):
https://raw.githubusercontent.com/data-edu/data-science-in-education/master/data/pisaUSA15/stu-quest.csv
The next chunk of code downloads the file into a folder named data inside your working directory, so that you can read it into R in the next step.
As a note: there are also ways to read the file directly from the web into R.
You could also do manually what the code does: open the file in your browser and save it to your computer (you should be able to right-click or control-click the page to save it as a text file with a .csv extension).
student_responses_url <-
    "https://goo.gl/bUeMhV"
# the "data" folder must already exist inside your working directory
student_responses_file_name <-
    paste0(getwd(), "/data/student-responses-data.csv")
download.file(
    url = student_responses_url,
    destfile = student_responses_file_name)

It may take a few seconds to download as it’s around 20 MB.
The process above involves many core data science ideas and ideas from programming/coding. We will walk through them step-by-step.
- The character string "https://goo.gl/bUeMhV" is being saved to an object called student_responses_url.
- We concatenate your working directory file path to the desired file name for the .csv using a function called paste0. This is stored in another object called student_responses_file_name. This creates a file name with a file path that points to the data folder inside your working directory, which is where the file will be saved.
- In short, the download.file() function needs to know
  - where the file is coming from (which you tell it through the url argument) and
  - where the file will be saved (which you tell it through the destfile argument).
The student_responses_url object is passed to the url argument of the function called download.file(). The student_responses_file_name object is passed to the destfile argument.
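The path-building step can be sketched on its own, without downloading anything. This is a minimal illustration, not part of the original code: project_dir is a hypothetical stand-in for getwd() so that the sketch does not touch your real project, and file.path() is shown as a base R alternative to paste0().

```r
# Hypothetical stand-in for getwd(), so nothing is written to your project
project_dir <- tempdir()

# download.file() does not create missing folders, so make sure that the
# "data" folder exists before using it as part of destfile
data_dir <- file.path(project_dir, "data")
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# Two ways to build the same destination path
via_paste0 <- paste0(project_dir, "/data/student-responses-data.csv")
via_file_path <- file.path(project_dir, "data", "student-responses-data.csv")

identical(via_paste0, via_file_path)
```

If download.file() reports that it cannot open the destination file, a missing data folder is one likely cause.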
Understanding how R is working in these terms can be helpful for troubleshooting and reaching out for help. It also helps you to use functions that you have never used before.
Now, in RStudio, you should see the downloaded file in the Files tab, inside the data folder. If you created a project with RStudio, that folder is in your project directory; if not, it is in whatever your working directory is set to. If the file is there, great. If things are not working, consider downloading the file manually and then moving it into the data folder in the directory of the R Project you created.
20.1.3 Loading a .csv file
Okay, we’re ready to go.
The easiest way to read a .csv file is with the function read_csv() from the package {readr}, which is contained within the tidyverse.
Let’s load the tidyverse library:
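A minimal version of that call, including the trailing comment discussed in the next paragraph, might look like this. It assumes the tidyverse has already been installed with install.packages("tidyverse").

```r
library(tidyverse) # so tidyverse packages can be used for analysis
```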
You may have noticed the hash symbol after the code that says library(tidyverse). It reads # so tidyverse packages can be used for analysis. That is a comment: everything after the hash symbol on that line is not run, while the code before it runs normally.
After loading the tidyverse packages, we can now load a file. We are going to call the data student_responses:
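A sketch of this step follows. It assumes the file was saved to the data folder of your working directory, as in the download step above. The file.exists() check is an addition for illustration, so that a failed download produces a clear message instead of an error.

```r
library(readr)

# Path used in the download step; adjust it if you saved the file elsewhere
student_responses_path <- "./data/student-responses-data.csv"

# Read the file only if it is where we expect it to be
if (file.exists(student_responses_path)) {
  student_responses <- read_csv(student_responses_path)
} else {
  message("File not found: ", student_responses_path)
}
```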
Since we loaded the data, we now want to look at it. We can type its name in the function glimpse() to print some information on the dataset (this code is not run here).
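To show what glimpse() does without needing the 20 MB download, here is a small sketch on a made-up data frame. The name toy_responses and its values are hypothetical; with the real data you would call glimpse(student_responses) instead.

```r
library(dplyr)

# A tiny, made-up data frame standing in for student_responses
toy_responses <- data.frame(
  student_id = c(1, 2, 3),
  score = c(90, 85, 77)
)

# Prints the number of rows and columns, then one line per variable
# with its type and first few values
glimpse(toy_responses)
```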
If you ran that code, you would see that student_responses is a very big data frame (with a lot of variables with confusing names, to boot)!
Great job loading a file and printing it! We are now well on our way to carrying out analysis of our data.
20.1.4 Saving files
We just practiced loading a file into R from an external data source. Just as often, you might need to save a file out of R for use in other software.
Using our data frame student_responses, we can save it as a .csv with the following function. The first argument, student_responses, is the name of the object that you want to save. The second argument, student-responses.csv, is what you want to call the saved dataset.
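A minimal sketch of that call, using write_csv() from {readr}. To keep the example self-contained, it saves a small made-up data frame (toy_responses, a hypothetical name) to a temporary folder. With the real data, the call would be write_csv(student_responses, "student-responses.csv").

```r
library(readr)

# A tiny, made-up data frame standing in for student_responses
toy_responses <- data.frame(
  student_id = c(1, 2, 3),
  score = c(90, 85, 77)
)

# First argument: the object to save; second argument: where to save it
out_path <- file.path(tempdir(), "student-responses.csv")
write_csv(toy_responses, out_path)
```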
That will save a .csv file entitled student-responses.csv in the working directory. If you want to save it to another directory, add the file path to the file name, e.g., path/to/student-responses.csv. To save a file for SPSS, load the {haven} package and use write_sav(). The tidyverse does not include a function for writing Excel files (other packages, such as {writexl}, do), but you can save a .csv and open it directly in Excel.
20.1.5 Loading Excel files
If you want to load data from an Excel workbook, you might be thinking that you can open the file in Excel and then save it as a .csv. This is generally a good idea. At the same time, sometimes you may need to directly read a file from Excel. Note that, when possible, we recommend the use of .csv files. They work well across platforms and software (i.e., even if you need to load the file with some other software, such as Python).
The package for loading Excel files, {readxl}, is installed along with the tidyverse, but it is not one of the core packages that library(tidyverse) attaches, so we have to load it ourselves using library(readxl). If it is not yet on your computer, install it first using install.packages() (remember, we only need to do this once). The command to install {readxl} is commented out below so that the computer will not automatically run that line. It is here just as a reminder that the package needs to be installed on your computer before you use it for the first time.
Once we have installed {readxl}, we have to load it (just like tidyverse):
We can then use the function read_excel() in the same way as read_csv(), where “path/to/file.xlsx” is where an Excel file you want to load is located:
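The install, load, and read steps described above can be sketched together. So that the example runs without a file of your own, it reads one of the example workbooks that ships with {readxl}, located with readxl_example(). With your own data, you would pass "path/to/file.xlsx" to read_excel() instead.

```r
# install.packages("readxl") # run once if {readxl} is not yet installed
library(readxl)

# Path to an example workbook bundled with {readxl}
example_path <- readxl_example("datasets.xlsx")

# read_excel() reads the first sheet by default
my_data <- read_excel(example_path)
```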
Of course, if you were to run this, you can replace my_data with a name you like. Generally, it’s best to use short and easy-to-type names for data as you will be typing and using it a lot.
Note that one easy way to find the path to a file is to use the “Import Dataset” menu. It is in the Environment window of RStudio. Click on that menu bar option, select the option corresponding to the type of file you are trying to load (e.g., “From Excel”), and then click the “Browse” button beside the File/URL field. Once you click on the button, RStudio will automatically generate the file path—and the code to read the file too—for you. You can copy this code or click Import to load the data.
20.1.6 Loading SAV files
The same considerations that apply to reading Excel files apply to reading SAV files (from SPSS).
You can also read a .csv file directly into SPSS. Because of this and because of the benefits of using CSVs (they are simple files that work across platforms and software), we recommend using CSVs when possible.
To load an SPSS file, first, install the {haven} package.
Then, load the data by using the function read_sav():
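Here is a minimal sketch of both steps. Because no .sav file accompanies this appendix, the example first writes a small made-up data frame with write_sav() and then reads it back with read_sav(). The names toy_responses and sav_path are hypothetical.

```r
# install.packages("haven") # run once if {haven} is not yet installed
library(haven)

# A tiny, made-up data frame to save in SPSS format
toy_responses <- data.frame(
  student_id = c(1, 2, 3),
  score = c(90, 85, 77)
)

# Write a temporary .sav file, then read it back into R
sav_path <- tempfile(fileext = ".sav")
write_sav(toy_responses, sav_path)
my_data <- read_sav(sav_path)
```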
20.1.7 Google Sheets
Finally, it can sometimes be useful to load a file directly from Google Sheets, and this can be done using the {googlesheets4} package.
When you run the command below, a link to authenticate with your Google account will open in your browser.
You can then use the read_sheet() function to work with your data frame. We provide a brief example below; the package’s documentation provides more details.
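A sketch of these two steps follows. The value of sheet_url is a placeholder, not a real Sheet; replace it with the URL of a Sheet you have access to. Because authentication opens a browser window, the sketch only attempts to read the Sheet in an interactive session.

```r
library(googlesheets4)

# Hypothetical placeholder: replace with the URL of your own Google Sheet
sheet_url <- "https://docs.google.com/spreadsheets/d/YOUR-SHEET-ID"

# read_sheet() prompts you to authenticate with Google the first time it is
# used, so only run it when someone is there to respond to the prompt
if (interactive()) {
  df <- read_sheet(sheet_url)
}
```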
20.2 Appendix B: Social network influence and selection models
Behind the social network visualizations explored in the chapter on social network analysis, Chapter 12, there are also statistical models that can be used to further understand relationships in a network.
One way to consider these models and methods is by considering selection and influence, two processes at play in our relationships. These two processes are commonly the focus of statistical analyses of networks. Selection and influence do not operate independently: they affect each other reciprocally (Xu, Frank, and Penuel 2018). Let’s define these two processes:
- Selection: the process of choosing relationships
- Influence: the process of how our social relationships affect behavior
While these processes are complex, it is possible to study them using data about people’s relationships and behavior. Happily, the use of these methods has expanded along with R. In fact, long-standing R packages have become some of the best tools for studying social networks. Additionally, while there are many nuances to studying selection and influence, these are models that can be carried out with relatively simple modeling techniques like linear regression.
After getting familiar with using edgelists and visualizations in the chapter on social network analysis, Chapter 12, a good next step is learning about selection and influence. Let’s look at some examples:
20.2.1 An example of influence
First, let’s look at an example of influence. To do so, let’s create three different data frames. These will include:
- An edgelist data frame that contains the nominator and nominee for a relationship. For example, if Stefanie says that José is her friend, then Stefanie is the nominator and José the nominee. Data frames like this can also contain an optional variable indicating the weight, or strength, of their relation
- Data frames indicating the values of some behavior—an outcome—at two different time points
In this example, we’ll create example data we can use to explore questions about influence.
Let’s take a look at our three datasets:
- data1: an edgelist that contains a nominator, nominee, and strength of the relation
- data2: a dataset that contains the nominee and the values of some behavior at the first time point
- data3: a dataset that contains a nominator and the value of some behavior at the second time point
Note that we will find each nominator’s outcome at time 2 later on. Here’s how we can make these example datasets:
data1 <-
    data.frame(
        nominator = c(2, 1, 3, 1, 2, 6, 3, 5, 6, 4, 3, 4),
        nominee = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6),
        relate = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
    )
data2 <-
    data.frame(nominee = c(1, 2, 3, 4, 5, 6),
               yvar1 = c(2.4, 2.6, 1.1, -0.5, -3, -1))
data3 <-
    data.frame(nominator = c(1, 2, 3, 4, 5, 6),
               yvar2 = c(2, 2, 1, -0.5, -2, -0.5))
20.2.2 Joining the data
Next, we’ll join the data into one data frame. This step can be time-consuming for large network datasets, but it’s important for the visualizations and analysis that follow. The more time you can invest into preparing the data properly, the more confidence you’ll have that your resulting analysis is based on a deeper understanding of the data.
# dplyr provides left_join(), mutate(), rename(), and the %>% pipe
library(dplyr)
data <-
    left_join(data1, data2, by = "nominee")
data <-
    data %>%
    # this makes merging later easier
    mutate(nominee = as.character(nominee))
# calculate indegree in tempdata and merge with data
tempdata <- data.frame(table(data$nominee))
tempdata <-
    tempdata %>%
    rename(
        # rename the column "Var1" to "nominee"
        "nominee" = "Var1",
        # rename the column "Freq" to "indegree"
        "indegree" = "Freq"
    ) %>%
    # makes nominee a character data type, instead of a factor, which can cause problems
    mutate(nominee = as.character(nominee))
data <-
    left_join(data, tempdata, by = "nominee")
20.2.2.1 Calculating an exposure term
Next we’ll create an exposure term. This is the key step that makes this linear regression model special. The idea is that the exposure term “captures” how your interactions with someone over the first and second time points impact an outcome. The model describes a change in this outcome because it takes the first and second time points into account.
# Calculating exposure
data <-
    data %>%
    mutate(exposure = relate * yvar1)
# Calculating mean exposure
mean_exposure <-
    data %>%
    group_by(nominator) %>%
    summarize(exposure_mean = mean(exposure))

The data frame mean_exposure contains the mean of the outcome (in this case, yvar1) for all of the individuals the nominator had a relation with.
Let’s process the data more so we can add the variables exposure_mean, yvar1, and yvar2.
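One way this processing step might look is sketched below. The block re-creates the example data so that it runs on its own, and the names time1 and final_data are assumptions introduced for illustration. The key idea is that data2 is keyed by nominee, so its key is renamed to nominator before merging.

```r
library(dplyr)

# Example data re-created from the steps above so this block runs on its own
data1 <- data.frame(
  nominator = c(2, 1, 3, 1, 2, 6, 3, 5, 6, 4, 3, 4),
  nominee = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6),
  relate = 1
)
data2 <- data.frame(nominee = c(1, 2, 3, 4, 5, 6),
                    yvar1 = c(2.4, 2.6, 1.1, -0.5, -3, -1))
data3 <- data.frame(nominator = c(1, 2, 3, 4, 5, 6),
                    yvar2 = c(2, 2, 1, -0.5, -2, -0.5))

# Mean exposure per nominator, calculated as in the previous step
mean_exposure <-
  left_join(data1, data2, by = "nominee") %>%
  mutate(exposure = relate * yvar1) %>%
  group_by(nominator) %>%
  summarize(exposure_mean = mean(exposure))

# data2 is keyed by nominee; rename the key so it lines up with nominator
time1 <- rename(data2, nominator = nominee)

# One row per individual: mean exposure, outcome at time 1, outcome at time 2
final_data <-
  mean_exposure %>%
  left_join(time1, by = "nominator") %>%
  left_join(data3, by = "nominator")

final_data
```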
20.2.2.2 Regression (linear model)
Calculating the exposure term is the most distinctive and important step in carrying out influence models. Now, we can use a linear model to find out how much relations, as captured by the exposure term, affect some outcome. While this code is not run here, you could run the code in this appendix to see the results. You could also see how the results change when the exposure term is calculated differently, such as by taking the sum, instead of the mean, of each individual’s exposures.
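A sketch of such a model is shown below. It assumes that the outcome at the second time point (yvar2) is regressed on the outcome at the first time point (yvar1) and the mean exposure. The formula and the names final_data and model1 are assumptions for illustration. The values are those produced by the steps above, re-entered here so that the block runs on its own.

```r
# One row per individual; exposure_mean is the mean yvar1 of the people
# each nominator named (values carried over from the earlier steps)
final_data <- data.frame(
  nominator = c(1, 2, 3, 4, 5, 6),
  exposure_mean = c(1.85, 1.75, 1.1 / 3, -2, -0.5, 0.3),
  yvar1 = c(2.4, 2.6, 1.1, -0.5, -3, -1),
  yvar2 = c(2, 2, 1, -0.5, -2, -0.5)
)

# Outcome at time 2, controlling for the outcome at time 1; the coefficient
# on exposure_mean is the estimate of influence
model1 <- lm(yvar2 ~ yvar1 + exposure_mean, data = final_data)

summary(model1)
```

With only six individuals, the estimates are for illustration only. Replacing mean() with sum() when calculating exposure would change exposure_mean and, with it, the estimated coefficient.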
So, the influence model is used to study a key process for social network analysis. It’s useful because it’s one way you can quantify the network effect. This is a metric that is not always considered in education, but we hope to see more of it (Frank 2009). It also helps that it can be done with a relatively straightforward regression model.
20.2.3 An example of selection
Let’s look at selection models next. Information from selection models can be useful to a wide audience—administrators, teachers, and students—because it describes how members of a network choose who to interact with. Here, we briefly describe a few possible approaches for using a selection model to learn more about a social network.
In the last section we used a linear regression model. In this example we’ll use a logistic regression model. Logistic regressions model outcomes that are either a 0 or a 1. Thus, the most straightforward way to use a selection model is to use a logistic regression where all of the relations (note the relate variable in data1 above) are indicated with a 1.
But here is the important and challenging step: all of the possible relations between members of a network that were not observed must also appear in the edgelist, indicated with a 0. Recall that an edgelist is the preferred data structure for carrying out this analysis. This step requires that we prepare the data by lengthening and widening it.
Once all of the relations are given a value of either a 1 or a 0, then a logistic regression can be used. Imagine that we are interested in whether individuals from the same group are more or less likely to interact than those from different groups. To answer this question, one could create a new variable called same and then fit the model using code (which is not run, but is included as an example of the code for this kind of selection model) like this:
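A minimal sketch of this kind of selection model follows. It reuses the example edgelist from the influence section. The group memberships in groups are made up for illustration, and the names edgelist1 and m_selection are hypothetical. Treat it as one possible way to prepare the data, not as the definitive approach.

```r
library(dplyr)

# Observed relations (each indicated with a 1)
data1 <- data.frame(
  nominator = c(2, 1, 3, 1, 2, 6, 3, 5, 6, 4, 3, 4),
  nominee = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6),
  relate = 1
)

# Hypothetical group membership for the six individuals (an assumption)
groups <- data.frame(id = 1:6,
                     group = c("a", "a", "a", "b", "b", "b"))

# Every possible ordered pair of individuals, excluding self-nominations
all_pairs <-
  expand.grid(nominator = 1:6, nominee = 1:6) %>%
  filter(nominator != nominee)

edgelist1 <-
  all_pairs %>%
  # pairs that were not observed get an NA here, which we recode to 0
  left_join(data1, by = c("nominator", "nominee")) %>%
  mutate(relate = ifelse(is.na(relate), 0, relate)) %>%
  # attach the group of the nominator and of the nominee
  left_join(groups, by = c("nominator" = "id")) %>%
  rename(nominator_group = group) %>%
  left_join(groups, by = c("nominee" = "id")) %>%
  rename(nominee_group = group) %>%
  # 1 if both individuals are in the same group, 0 otherwise
  mutate(same = as.integer(nominator_group == nominee_group))

# Logistic regression: is a relation more likely within the same group?
m_selection <- glm(relate ~ same, family = binomial, data = edgelist1)

summary(m_selection)
```

A positive coefficient on same would suggest that individuals in the same group are more likely to have a relation than individuals in different groups.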
While this is a straightforward way to carry out a selection model, there are some limitations. Most notably, it doesn’t account for the number of nominations an individual sends. Not considering this may mean that other effects, like the one associated with being from the same group, are not estimated accurately. Some R packages aim to address this by considering other variables like relationship weights. Here are some examples:
- The {amen} package can be used for relations that are not only 1s and 0s (as in a logistic regression) but also for relations that are normally distributed
- The Exponential Random Graph Model, or {ergm} R package, makes it easy to use these kinds of selection models. {ergm} is itself a part of a powerful and often-used collection of packages for social network analysis, {statnet}
These packages are examples of the richness R packages can bring to using social network analysis models and methods. As developments in social network analysis methods continue, more cutting-edge techniques and R packages will be available.
20.3 Appendix C: Colophon
This book was written with bookdown (Xie 2016) in RStudio (RStudio Team 2015). The website (https://datascienceineducation.com) is hosted with Netlify (https://www.netlify.com/).
This version of the book was built with:
## R version 4.6.0 (2026-04-24)
## Platform: aarch64-apple-darwin23
## Running under: macOS Sequoia 15.7.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: UTC
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
## [5] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
## [9] ggplot2_4.0.3 tidyverse_2.0.0 png_0.1-9
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.6.0 tidyselect_1.2.1
## [5] jquerylib_0.1.4 scales_1.4.0 yaml_2.3.12 fastmap_1.2.0
## [9] R6_2.6.1 generics_0.1.4 knitr_1.51 bookdown_0.46
## [13] tzdb_0.5.0 bslib_0.10.0 pillar_1.11.1 RColorBrewer_1.1-3
## [17] rlang_1.2.0 stringi_1.8.7 cachem_1.1.0 xfun_0.57
## [21] sass_0.4.10 S7_0.2.2 otel_0.2.0 timechange_0.4.0
## [25] cli_3.6.6 withr_3.0.2 magrittr_2.0.5 digest_0.6.39
## [29] grid_4.6.0 hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.3
## [33] evaluate_1.0.5 glue_1.8.1 farver_2.1.2 rmarkdown_2.31
## [37] tools_4.6.0 pkgconfig_2.0.3 htmltools_0.5.9