12 Walkthrough 6: Exploring Relationships Using Social Network Analysis With Social Media Data

12.1 Vocabulary

  • Application Programming Interface (API)
  • edgelist
  • edge
  • influence model
  • regex
  • selection model
  • social network analysis
  • sociogram
  • vertex

12.2 Chapter Overview

This chapter builds on [Walkthrough 5/Chapter 11](#c11), where we worked with #tidytuesday data. In the previous chapter, we focused on using text analysis to understand the content of tweets. In this chapter, we focus on the interactions between #tidytuesday participants using social network analysis techniques. As in the previous chapter, we’ve also included an appendix (Appendix C) that introduces some social network-related ideas for further exploration.

12.2.1 Background

There are a few reasons to be interested in social media. For example, if you work in a school district, you may want to know who is interacting with the content you share. If you are a researcher, you may want to investigate what teachers, administrators, and others do through state-based hashtags (e.g., Rosenberg et al., 2016). Social media-based data also provide new contexts for learning to take place, like in professional learning networks (Trust et al., 2016).

In the past, if a teacher wanted advice about how to plan a unit or to design a lesson, they would turn to a trusted peer in their building or district (Spillane et al., 2012). Today they are as likely to turn to someone in a social media network. Social media interactions like the ones tagged with the #tidytuesday hashtag are increasingly common in education. Using data science tools to learn from these interactions is valuable for improving the student experience.

12.2.2 Packages, Data Sources and Import, and Methods

In this chapter, we access data using the {rtweet} package (Kearney, 2016). Through {rtweet} and a Twitter account, it is easy to access data from Twitter. We will load the {tidyverse} and {rtweet} packages to get started.

We will also load other packages that we will be using in this analysis, including two packages related to social network analysis, {tidygraph} and {ggraph} (Pedersen, 2019), as well as one that will help us to handle non-anonymized names carefully (Betebenner, 2019).
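As a sketch of this setup step, the packages named in this chapter could be loaded as follows. We assume here that the Betebenner (2019) citation refers to the {randomNames} package; adjust the list if your setup differs.

```r
library(tidyverse)   # data manipulation and plotting
library(rtweet)      # access to Twitter data
library(tidygraph)   # tidy data structures for networks
library(ggraph)      # network visualization
library(randomNames) # assumption: the package cited as Betebenner (2019)
```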

Here is an example of searching the most recent 1,000 tweets that include the hashtag #rstats. When you run this code, you will be prompted to authenticate your access via Twitter.
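A minimal sketch of such a search, assuming the search_tweets() function from {rtweet}. The object names (query, n_tweets, rstats_tweets) are hypothetical, and the call is wrapped in interactive() only so that the script does not try to authenticate when it is run non-interactively.

```r
# Search term and number of tweets to request
query <- "#rstats"
n_tweets <- 1000

# search_tweets() contacts Twitter and prompts for authentication,
# so this sketch only runs it in an interactive session
if (interactive()) {
  rstats_tweets <- rtweet::search_tweets(q = query, n = n_tweets)
}
```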

As described in the previous chapter, you can easily change the search term to other hashtags or terms. For example, to search for #tidytuesday tweets, we can replace #rstats with #tidytuesday:
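Only the query string needs to change. As before, this is a sketch that assumes search_tweets() from {rtweet}; tidytuesday_tweets is a hypothetical object name, and the interactive() guard is there only to avoid an authentication prompt in non-interactive runs.

```r
# Same search as before, with a different hashtag
query <- "#tidytuesday"

if (interactive()) {
  tidytuesday_tweets <- rtweet::search_tweets(q = query, n = 1000)
}
```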

12.3 View Data

We can see that there are many rows for the data:

## [1] 4418
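A row count like the one above can be obtained with nrow(). The sketch below uses a small made-up tibble, toy_tweets, in place of the real #tidytuesday data; its column names (screen_name and text) are assumptions for illustration.

```r
library(tibble)

# Hypothetical stand-in for the tweet data
toy_tweets <- tibble(
  screen_name = c("user_a", "user_b", "user_c"),
  text = c(
    "Thanks @user_b and @user_c for the help! #tidytuesday",
    "Nice plot, @user_a #tidytuesday",
    "My first #tidytuesday submission"
  )
)

# Number of rows (here, tweets) in the data
n_rows <- nrow(toy_tweets)
n_rows
```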

12.4 Process Data

Network data requires some processing before it can be used in subsequent analyses. The network dataset needs a way to identify each participant’s role in the interaction. We need to answer questions like: Did someone reach out to another for help? Was someone contacted by another for help? We can process the data by creating an “edgelist”. An edgelist is a dataset where each row is a unique interaction between two parties.

An edgelist looks like the following, where the sender (sometimes called the “nominator”) column identifies who is initiating the interaction and the receiver (sometimes called the “nominee”) column identifies who is receiving the interaction:

## # A tibble: 12 x 2
##    sender              receiver            
##    <chr>               <chr>               
##  1 al-Bangura, Waseema al-Jabbour, Hamda   
##  2 Taylor, Morgan      Gardner, Latasha    
##  3 Taylor, Morgan      Bekerman, Hannah    
##  4 Duran, Austin       Gardner, Latasha    
##  5 Duran, Austin       al-Jabbour, Hamda   
##  6 Duran, Austin       Ramirez, Joshua     
##  7 Garcia, Kaitlyn     Bekerman, Hannah    
##  8 Garcia, Kaitlyn     Oviedo, Maximilliano
##  9 Garcia, Kaitlyn     Ramirez, Joshua     
## 10 Ramos, Jeremiah     Larkin, Stephanie   
## 11 Payne, Brantley     Bekerman, Hannah    
## 12 Payne, Brantley     Larkin, Stephanie

In this edgelist, the sender column might identify someone who nominates another (the receiver) as someone they go to for help. The sender might also identify someone who interacts with the receiver in other ways, like “liking” or “mentioning” their tweets. In the following steps, we will work to create an edgelist from the data from #tidytuesday on Twitter.
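To make the structure concrete, here is a sketch that types in the first four rows of the example edgelist above by hand and then asks a simple question of it. The object names are hypothetical.

```r
library(tibble)
library(dplyr)

# First rows of the example edgelist shown above
example_edgelist <- tribble(
  ~sender,               ~receiver,
  "al-Bangura, Waseema", "al-Jabbour, Hamda",
  "Taylor, Morgan",      "Gardner, Latasha",
  "Taylor, Morgan",      "Bekerman, Hannah",
  "Duran, Austin",       "Gardner, Latasha"
)

# Which receivers were nominated by more than one sender?
popular_receivers <- example_edgelist %>%
  count(receiver) %>%
  filter(n > 1)

popular_receivers
```

Because each row is one interaction, questions about who sends or receives the most become simple counting operations.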

12.4.1 Extracting Mentions

Let’s extract the mentions. There is a lot going on in the code below; let’s break it down line-by-line, starting with mutate():

  • mutate(all_mentions = str_extract_all(text, regex)): this line uses a regex, or regular expression, to identify all of the usernames in the tweet (note: the regex comes from this Stack Overflow page: https://stackoverflow.com/questions/18164839/get-twitter-username-with-regex-in-r)
  • unnest(all_mentions): this line uses a {tidyr} function, unnest(), to move every mention to its own row, while keeping all of the other information the same (see more about unnest() here: https://tidyr.tidyverse.org/reference/unnest.html)

Now let’s use these functions to extract the mentions from the dataset. Here’s how all the code looks in action:
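A minimal sketch of these two steps on a made-up tibble, toy_tweets (hypothetical data and column names). The pattern below is a simplified stand-in for illustration, not the Stack Overflow regex itself.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for the tweet data
toy_tweets <- tibble(
  screen_name = c("user_a", "user_b", "user_c"),
  text = c(
    "Thanks @user_b and @user_c for the help!",
    "Nice plot, @user_a",
    "My first submission"
  )
)

# Simplified pattern: "@" followed by letters, digits, or underscores
regex <- "@[A-Za-z0-9_]+"

toy_mentions <- toy_tweets %>%
  # str_extract_all() returns a list-column holding every match in each tweet
  mutate(all_mentions = str_extract_all(text, regex)) %>%
  # unnest() gives each mention its own row; tweets without mentions are dropped
  unnest(all_mentions)

toy_mentions
```

The first tweet mentions two users, so it appears twice in the result, once per mention.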

Let’s put these into their own data frame, called mentions.
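One way this step might look, sketched on made-up data that already has one mention per row. We assume the tweet author is stored in a column called screen_name, which we rename to sender.

```r
library(tibble)
library(dplyr)

# Hypothetical data after the mentions have been extracted and unnested
toy_unnested <- tibble(
  screen_name = c("user_a", "user_a", "user_b"),
  text = c(
    "Thanks @user_b and @user_c for the help!",
    "Thanks @user_b and @user_c for the help!",
    "Nice plot, @user_a"
  ),
  all_mentions = c("@user_b", "@user_c", "@user_a")
)

# Keep only the author (renamed to sender) and the mention
mentions <- toy_unnested %>%
  select(sender = screen_name, all_mentions)

mentions
```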

12.4.2 Putting the Edgelist Together

Recall that an edgelist is a data structure that has columns for the “sender” and “receiver” of interactions. Someone “sends” the mention to someone who is mentioned, who can be considered to “receive” it. To make the edgelist, we’ll need to clean up the mentions a little by removing the “@” symbol. Let’s look at our data as it is now.

## # A tibble: 2,447 x 2
##    sender  all_mentions    
##    <chr>   <chr>           
##  1 cizzart @eldestapeweb   
##  2 cizzart @INDECArgentina 
##  3 cizzart @ENACOMArgentina
##  4 cizzart @tribunalelecmns
##  5 cizzart @CamaraElectoral
##  6 cizzart @INDECArgentina 
##  7 cizzart @tribunalelecmns
##  8 cizzart @CamaraElectoral
##  9 cizzart @AgroMnes       
## 10 cizzart @AgroindustriaAR
## # … with 2,437 more rows

Let’s remove that “@” symbol from the columns we created and save the results to a new tibble, edgelist.
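A sketch of this cleaning step, using the first three rows printed above as input. Here we use str_remove() from {stringr} to drop a leading “@”; other approaches, such as str_sub(), would work as well.

```r
library(tibble)
library(dplyr)
library(stringr)

# First rows of the mentions data shown above
mentions <- tibble(
  sender = c("cizzart", "cizzart", "cizzart"),
  all_mentions = c("@eldestapeweb", "@INDECArgentina", "@ENACOMArgentina")
)

edgelist <- mentions %>%
  # remove the "@" at the start of each mention
  mutate(all_mentions = str_remove(all_mentions, "^@")) %>%
  # rename all_mentions to receiver
  select(sender, receiver = all_mentions)

edgelist
```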

12.5 Analysis and Results

Now that we have our edgelist, let’s plot the network. We’ll use the {tidygraph} and {ggraph} packages to visualize the data.

12.5.1 Plotting the Network

Large networks like this one can be hard to work with because of their size. We can get around that problem by including only some individuals. Let’s explore how many interactions each individual in the network sent by using count():

## # A tibble: 618 x 2
##    sender            n
##    <chr>         <int>
##  1 thomas_mock     347
##  2 R4DScommunity    78
##  3 WireMonkey       52
##  4 CedScherer       41
##  5 allison_horst    37
##  6 mjhendrickson    34
##  7 kigtembu         27
##  8 WeAreRLadies     25
##  9 PBecciu          23
## 10 sil_aarts        23
## # … with 608 more rows
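A table like the one above could be produced along these lines. This sketch uses a small made-up edgelist, and interactions_sent is a hypothetical object name.

```r
library(tibble)
library(dplyr)

# Hypothetical edgelist
edgelist <- tribble(
  ~sender,  ~receiver,
  "user_a", "user_b",
  "user_a", "user_c",
  "user_b", "user_c",
  "user_c", "user_a",
  "user_d", "user_c"
)

interactions_sent <- edgelist %>%
  # count how many rows (interactions) each sender has, most active first
  count(sender, sort = TRUE)

interactions_sent
```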

618 senders of interactions is a lot! What if we focused on only those who sent more than one interaction?

That leaves us with only 349, which will be much easier to work with.
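The narrowing step can be sketched with filter() on the count column. The counts below are made up for illustration, so the number of rows kept differs from the 349 reported above.

```r
library(tibble)
library(dplyr)

# Hypothetical counts of interactions sent
interactions_sent <- tibble(
  sender = c("user_a", "user_b", "user_c", "user_d"),
  n = c(3L, 2L, 1L, 1L)
)

# Keep only those who sent more than one interaction
frequent_senders <- interactions_sent %>%
  filter(n > 1)

nrow(frequent_senders)
```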

We now need to filter the edgelist to only include these 349 individuals. The following code uses the filter() function combined with the %in% operator to do this:
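A minimal sketch of that filtering step on a made-up edgelist. We assume here that both the sender and the receiver must be among the retained individuals; if only senders should be restricted, drop the second condition.

```r
library(tibble)
library(dplyr)

# Hypothetical edgelist
edgelist <- tribble(
  ~sender,  ~receiver,
  "user_a", "user_b",
  "user_a", "user_c",
  "user_b", "user_a",
  "user_b", "user_a",
  "user_c", "user_a",
  "user_d", "user_a"
)

# Individuals who sent more than one interaction (user_a and user_b)
interactions_sent <- edgelist %>%
  count(sender) %>%
  filter(n > 1)

edgelist_filtered <- edgelist %>%
  # keep rows whose sender AND receiver are both in interactions_sent
  filter(sender %in% interactions_sent$sender,
         receiver %in% interactions_sent$sender)

edgelist_filtered
```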

We’ll use the as_tbl_graph() function, which identifies the first column as the “sender” and the second as the “receiver.” Let’s look at the object it creates:

## # A tbl_graph: 267 nodes and 975 edges
## #
## # A directed multigraph with 7 components
## #
## # Node Data: 267 x 1 (active)
##   name           
##   <chr>          
## 1 dgwinfred      
## 2 datawookie     
## 3 jvaghela4      
## 4 FournierJohanie
## 5 JonTheGeek     
## 6 jakekaupp      
## # … with 261 more rows
## #
## # Edge Data: 975 x 2
##    from    to
##   <int> <int>
## 1     1    32
## 2     1    36
## 3     2   120
## # … with 972 more rows
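An object like the one printed above can be created by passing an edgelist to as_tbl_graph() from {tidygraph}. The sketch below uses a small made-up edgelist, so its node and edge counts differ from the output above.

```r
library(tibble)
library(tidygraph)

# Hypothetical edgelist: first column is the sender, second the receiver
edgelist <- tribble(
  ~sender,  ~receiver,
  "user_a", "user_b",
  "user_a", "user_c",
  "user_b", "user_c",
  "user_c", "user_a",
  "user_d", "user_c"
)

# Build a directed graph with one node per unique name and one edge per row
g <- as_tbl_graph(edgelist)

g
```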

We can see that the network now has 267 individuals, all of whom sent more than one interaction.

Next, we’ll use the ggraph() function:
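A sketch of a basic network plot on a small made-up graph. The "kk" layout and the transparency value are illustrative choices, and base_family = "sans" is set only to avoid depending on a font that may not be installed.

```r
library(tibble)
library(tidygraph)
library(ggraph)

# Hypothetical edgelist
edgelist <- tribble(
  ~sender,  ~receiver,
  "user_a", "user_b",
  "user_a", "user_c",
  "user_b", "user_c",
  "user_c", "user_a",
  "user_d", "user_c"
)

g <- as_tbl_graph(edgelist)

p <- ggraph(g, layout = "kk") +
  # draw the edges (links), partially transparent
  geom_edge_link(alpha = 0.2) +
  # draw the nodes (points)
  geom_node_point() +
  # a ggplot2 theme suited to network graphs
  theme_graph(base_family = "sans")
```

Printing p draws the sociogram; see ?ggraph for other available layouts.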

Figure 12.1: Network Graph

Finally, let’s size the points based on a measure of centrality. A common way to do this is to measure how influential an individual may be based on the interactions observed.
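As a sketch, the example below sizes nodes by in-degree (the number of interactions received), computed with centrality_degree() from {tidygraph}. In-degree is used here only as a simple stand-in; {tidygraph} provides other centrality_*() functions that may better capture influence.

```r
library(tibble)
library(dplyr)
library(tidygraph)
library(ggraph)

# Hypothetical edgelist
edgelist <- tribble(
  ~sender,  ~receiver,
  "user_a", "user_b",
  "user_a", "user_c",
  "user_b", "user_c",
  "user_c", "user_a",
  "user_d", "user_c"
)

g_central <- as_tbl_graph(edgelist) %>%
  # add a node-level column: how many interactions each node received
  mutate(centrality = centrality_degree(mode = "in"))

p <- ggraph(g_central, layout = "kk") +
  geom_edge_link(alpha = 0.2) +
  # map point size to the centrality measure
  geom_node_point(aes(size = centrality)) +
  theme_graph(base_family = "sans")
```

In this toy graph, user_c receives interactions from three others and is therefore drawn largest when p is printed.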

Figure 12.2: Network Graph with Centrality

There is much more you can do with {ggraph} (and {tidygraph}); check out the {ggraph} tutorial here: https://ggraph.data-imaginist.com/

12.6 Conclusion

In this chapter, we used social media data from the #tidytuesday hashtag to prepare and visualize social network data. This is a powerful technique that can reveal who is interacting with whom and, in some cases, can suggest why.

Behind these visualizations there are also statistical models and methods that help to further understand relationships in a network.

One way to consider these models and methods is by considering selection and influence, two processes at play in our relationships. These two processes are commonly the focus of statistical analyses of networks. Selection and influence do not interact independently: they affect each other reciprocally (Xu, Frank, & Penuel, 2018). Let’s define these two processes:

  • Selection: the process of choosing relationships
  • Influence: the process of how our social relationships affect behavior

While these processes are complex, it is possible to study them using data about people’s relationships and behavior. Happily, the use of these methods has expanded along with R. In fact, long-standing R packages have become some of the best tools for studying social networks. Additionally, while there are many nuances to studying selection and influence, these are models that can be carried out with relatively simple modeling techniques like linear regression. We describe these in Appendix C, as they do not use the #tidytuesday dataset and are likely to be of interest to readers after mastering the preparation and visualization of network data.