
fergusonproject's Introduction

                                                   Abstract

Twitter has frequently served as a platform for political and social discussion, playing a notable role in organizing protests, serving as a form of civil disobedience, and providing a means of expressing outrage over social events. We seek to further research the aspects of Twitter conversations (specifically their use of hashtags) that allow them to remain connected and bring together individuals and groups with differing opinions. Our hope is that, by finding these characteristics of Twitter hashtags, we can further elucidate methods of sustaining debate and discussion around political and social movements, as well as of bridging various communities of thought. We focus specifically on the Twitter conversations related to the August 2014 shooting of 18-year-old Michael Brown, which have been centered on the hashtag #ferguson. We generate both undirected graphs and multigraphs to represent this Twitter conversation using co-occurring hashtags. We run two experiments on our graphs to investigate how connectivity is affected by node influence, edge presence and frequency, and node presence and frequency.

                                              Problem Definition

Beginning with a database of 1.3 million tweets surrounding the #ferguson hashtag, we examine each tweet and extract the hashtags from its text. We use the term "co-occurrence" to refer to a pair of hashtags that both appear in a given tweet. While processing our data, we also keep track of the number of times each co-occurrence, defined by a pair of hashtags, occurs. By rendering a graph with hashtags as nodes, linked by a weighted (or multigraph) edge representing the rate of co-occurrence, we are able to look for further insight into the following questions:
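The extraction and counting step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the regular expression, the lowercasing of tags, the choice to count a hashtag only once per tweet, and the names `extract_hashtags` and `count_cooccurrences` are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the set of distinct, lowercased hashtags in a tweet's text."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def count_cooccurrences(tweets):
    """Count per-hashtag frequency and per-pair co-occurrence over tweets."""
    tag_counts = Counter()
    pair_counts = Counter()
    for text in tweets:
        # Sorting gives each unordered pair one canonical (a, b) key.
        tags = sorted(extract_hashtags(text))
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))
    return tag_counts, pair_counts


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "Justice now #Ferguson #MikeBrown",
    "#ferguson #mikebrown #handsupdontshoot",
    "Thinking of #ferguson tonight",
]
tag_counts, pair_counts = count_cooccurrences(sample_tweets)
```

Here `pair_counts` maps each unordered hashtag pair to the number of tweets in which both tags appear, which is the quantity used as the edge weight.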

  1. Does the ‘strength of weak ties’ concept apply for Twitter hashtags? In other words, are tweets sharing less frequently co-occurring, or more frequently co-occurring, hashtags keeping widespread conversations like #ferguson connected?
  2. What role do the most-popular hashtags, such as #ferguson, play in keeping a conversation going? In other words, if it were not for these extremely common hashtags ‘anchoring’ a conversation, would the related tweets no longer be connected, or might they be connected with a more diverse collection of hashtags?

We want to consider the role that hashtags play in keeping a conversation connected and widespread, rather than a single hashtag’s lifetime in the network. These questions connect two areas of research. The first examines what makes hashtags or blog posts viral, persistent, influential, and sticky (i.e. they catch on quickly after relatively few impressions). The second concerns the global efficiency principle (the weight of a link should correlate with the number of shortest paths passing through it), the strength of weak ties, and the (alternative) dyadic hypothesis (the strength of a link is independent of its network context) [4]. Prior work on this data set has indicated that there are clear clusters or sub-conversations linked to the #ferguson hashtag [6].

                                                  Data Model

We rendered our data as a network with the underlying format (hashtag1, hashtag2, co-occurrence-count), where each hashtag is a node. Therefore, each node id is a hashtag id, and the edge weight is the co-occurrence-count. Given this underlying representation, we represent the co-occurrence network in the following way:

Undirected Weighted Graph Network: An edge between two nodes represents that there is at least one co-occurrence of the two hashtags represented by the two nodes, and the edge has an attribute that is the count of tweets in which those hashtags co-occurred. Hashtags in the dataset that never co-occurred with another hashtag are represented as zero-degree nodes.

We also considered a version that excluded zero-degree nodes. We did not find significant differences between the two in terms of global attributes of the graph (except, of course, the number of zero-degree nodes), so we used the representation that includes zero-degree nodes, since we are studying the role hashtags play in connectivity both on their own and in co-occurring pairs.

In order to understand the role that co-occurring hashtags and individual hashtags play in the graph’s connectivity, we conducted two different experiments to measure connectivity as a function of particular removed edges and nodes:
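As a rough sketch of this representation, the (hashtag1, hashtag2, co-occurrence-count) triples can be loaded into a weighted adjacency map, with hashtags that never co-occur kept as zero-degree nodes. The function name `build_graph`, the `all_hashtags` parameter, and the sample counts are hypothetical; the source does not say which graph library or data structure was used.

```python
def build_graph(triples, all_hashtags=()):
    """Build an undirected weighted graph as {node: {neighbor: weight}}.

    all_hashtags lists every hashtag seen in the dataset, so hashtags
    that never co-occur still appear as zero-degree nodes.
    """
    graph = {tag: {} for tag in all_hashtags}
    for tag1, tag2, count in triples:
        # Store the weight in both directions because the graph is undirected.
        graph.setdefault(tag1, {})[tag2] = count
        graph.setdefault(tag2, {})[tag1] = count
    return graph


# Hypothetical triples and hashtag list, for illustration only.
triples = [("ferguson", "mikebrown", 2), ("ferguson", "handsupdontshoot", 1)]
graph = build_graph(
    triples,
    all_hashtags=["ferguson", "mikebrown", "handsupdontshoot", "yikes"],
)
zero_degree = [tag for tag, nbrs in graph.items() if not nbrs]
```

Dropping the `all_hashtags` argument gives the alternative version of the graph that excludes zero-degree nodes.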

Experiment 1 (edge removal): We removed edges in increasing order of their weight (which is their relative co-occurrence rate). We defined a timestep to be the removal of 1/100th of edges in the original graph. At each timestep, we measured the estimated diameter, number of connected components, and the fraction of the largest connected component of the remaining graph. We repeated this process two additional times. In the second occurrence of this experiment, we removed edges in decreasing order of their weight and in the final occurrence, we removed edges in random order.
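A minimal sketch of the edge-removal procedure, under stated assumptions: edges are given as (u, v, weight) tuples, a timestep removes a batch of ceil(|E| / steps) edges, and only the number of connected components and the largest-component fraction are recorded (the estimated diameter is omitted to keep the example short). The names `components` and `edge_removal` are illustrative and not taken from the project.

```python
import math
import random


def components(nodes, edges):
    """Return the connected components (as sets) of an undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = {start}, [start]
        while stack:
            for nbr in adj[stack.pop()]:
                if nbr not in seen:
                    seen.add(nbr)
                    comp.add(nbr)
                    stack.append(nbr)
        comps.append(comp)
    return comps


def edge_removal(nodes, weighted_edges, order="increasing", steps=100, seed=0):
    """Remove edges in batches; record (num_components, largest_fraction)."""
    edges = list(weighted_edges)
    if order == "random":
        random.Random(seed).shuffle(edges)
    else:
        edges.sort(key=lambda e: e[2], reverse=(order == "decreasing"))
    batch = max(1, math.ceil(len(edges) / steps))
    history = []
    while edges:
        edges = edges[batch:]  # one timestep: drop the next batch of edges
        comps = components(nodes, [(u, v) for u, v, _ in edges])
        history.append((len(comps), max(len(c) for c in comps) / len(nodes)))
    return history


# Hypothetical toy graph: a path a-b-c-d with weights 5, 1, 3.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b", 5), ("b", "c", 1), ("c", "d", 3)]
increasing = edge_removal(toy_nodes, toy_edges, order="increasing", steps=3)
decreasing = edge_removal(toy_nodes, toy_edges, order="decreasing", steps=3)
```

On the toy path, removing the lightest edge first splits the graph in half immediately, while removing the heaviest edge first only detaches one endpoint.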

Experiment 2 (node removal): We removed nodes in increasing order of their frequency in the original dataset. We defined a timestep to be the removal of 1/100th of nodes in the original graph. At each timestep, we measured the estimated diameter, number of connected components, and the fraction of the largest connected component of the remaining graph. We repeated this process two additional times. In the second occurrence of this experiment, we removed nodes in decreasing order of their frequency and in the final occurrence, we removed nodes in random order.
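The node-removal procedure can be sketched in the same spirit. The assumptions here are that the graph is an adjacency map of sets, that `frequency` maps each hashtag to its count in the original dataset, that a timestep removes ceil(|V| / steps) nodes, and that the largest-component fraction is taken relative to the nodes still remaining. The estimated diameter is again left out, and the function names and toy frequencies are hypothetical.

```python
import math
import random


def component_sizes(adj):
    """Return the sizes of the connected components of {node: set(neighbors)}."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        sizes.append(size)
    return sizes


def node_removal(adj, frequency, order="decreasing", steps=100, seed=0):
    """Remove nodes in batches; record (num_components, largest_fraction)."""
    adj = {n: set(nbrs) for n, nbrs in adj.items()}  # work on a copy
    nodes = list(adj)
    if order == "random":
        random.Random(seed).shuffle(nodes)
    else:
        nodes.sort(key=lambda n: frequency[n], reverse=(order == "decreasing"))
    batch = max(1, math.ceil(len(nodes) / steps))
    history = []
    for i in range(0, len(nodes), batch):
        for node in nodes[i:i + batch]:
            # Delete the node and every edge that touches it.
            for nbr in adj.pop(node):
                adj[nbr].discard(node)
        if adj:  # nothing to measure once the graph is empty
            sizes = component_sizes(adj)
            history.append((len(sizes), max(sizes) / len(adj)))
    return history


# Hypothetical star graph with 'ferguson' as the hub.
star = {
    "ferguson": {"mikebrown", "kiev", "drones"},
    "mikebrown": {"ferguson"},
    "kiev": {"ferguson"},
    "drones": {"ferguson"},
}
freq = {"ferguson": 100, "mikebrown": 50, "kiev": 5, "drones": 3}
hub_first = node_removal(star, freq, order="decreasing", steps=4)
hub_last = node_removal(star, freq, order="increasing", steps=4)
```

In this toy star, removing the hub first leaves three isolated nodes after a single timestep, whereas removing the least frequent nodes first keeps the remaining graph in one component throughout.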

                                              Results and Findings 
                                              
          Graph attributes

We first report several statistics of our graph (focusing on the version that includes zero-degree nodes), which give context to our later results and analysis. The 90% effective diameter was 2.7, the degree distribution appears to follow a power law (see Figure 1), and the fraction of triads that are closed is very small: 0.1%. These statistics, combined with the trends found in our experiments described below, suggest that the co-occurrence graph resembles a preferential-attachment model, where for any given hashtag, a tweeter is most likely to pair it with a hashtag that is already paired with many others. The intuition here is that a tweeter who wants to maximize the influence of a tweet will include a hashtag that many tweeters pay attention to, and more frequent hashtags are more likely to have many tweeters’ attention. Entries with high influence scores are depicted in the next section about experiment findings.
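For readers who want to reproduce statistics of this kind, the following sketch computes a degree distribution and a closed-triad fraction on an adjacency map. It assumes one common definition, which the source does not spell out: a triad is a connected triple of nodes, and the fraction is closed triads (triangles) divided by closed plus open triads. The function names and the toy graph are illustrative.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adj):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(nbrs) for nbrs in adj.values())


def closed_triad_fraction(adj):
    """Fraction of connected triples whose three nodes form a triangle."""
    closed_centered = 0  # each triangle is seen once from each of its 3 nodes
    open_triads = 0      # each open triad is seen once, from its center node
    for nbrs in adj.values():
        for u, v in combinations(nbrs, 2):
            if v in adj[u]:
                closed_centered += 1
            else:
                open_triads += 1
    triangles = closed_centered // 3
    total = triangles + open_triads
    return triangles / total if total else 0.0


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
dist = degree_distribution(toy)
fraction = closed_triad_fraction(toy)
```

A hub-dominated graph produces many open triads centered on its hubs and few triangles, which is the pattern a very small closed-triad fraction points to.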

                                        Experiment 1: Edge Removal

Removing edges in decreasing order of weight produced a connectivity pattern almost identical to that of random removal. In other words, removing connections between pairs of hashtags with high co-occurrence before those with low co-occurrence did not change the connectivity pattern of the graph compared with random removal. Removing edges in increasing order of co-occurrence weight produced a pattern different from that of removing edges randomly or in decreasing order. This suggests that relatively infrequently co-occurring hashtags may play a partial role in graph connectivity. Removing connections between pairs of hashtags with low co-occurrence before those with higher co-occurrence caused the disintegration of the graph to accelerate about halfway through the decay process. We randomly sampled 10 pairs of hashtags whose link was removed at each timestep. Pairs whose link was removed early were relatively peripheral to the discussion, for example: (autism, dontblameautism), (drones, fisa), (fluiddymamics, yikes), (assaultrifle, jihad), (icebucketchallenge, ukraine), (ferguson, ramproud), (comcast, nsa), (darfur, thailand).

Figure 1: The degree distribution of the co-occurrence graph appears to follow a power law.

Pairs removed in the middle, causing the accelerated decay in the "increasing weight" removal order, either relate #ferguson to other discussions, such as (kiev, ferguson), (drones, revolution), (womensequalityday, michaelbrown), (mikebrown, neonazizionst); or represent less popular hashtags within the #ferguson conversation, such as the pair (ripmikebrown, stopdontshoot), which is similar to the more popular (mikebrown, handsupdontshoot) pairing, or pairs including lesser-known names of victims of police violence like Kajieme Powell, John Crawford, and Ezell Ford. Tag pairs removed at the very end are pairs of well-known hashtags used to refer to the movement against police violence in black communities, like (blacklivesmatter, every28hours) and (handsupdontshoot, ineedanswers). This suggests that, at some scale, relatively rare pairs of hashtags connect relatively unrelated conversations, like "kiev," "womensequalityday," and "neonazizionst," that would otherwise never be connected. However, the effect is not strong, so there are clearly other factors at play in graph connectivity, suggesting that the strength of weak ties hypothesis applies to hashtag co-occurrence much less than it applies to human communication networks.

                                         Experiment 2: Node Removal

Removal in decreasing order of frequency:

Removing nodes in decreasing order of frequency (in green) meant that the most common hashtags, such as #ferguson, were removed first. The behavior here suggests that hashtag co-occurrence follows a preferential-attachment pattern, with hub nodes (rather than triangles) keeping the graph connected. By examining the green lines in Figure 2, we can see the graph breaking into smaller, disconnected subgraphs much more quickly than with increasing-frequency node removal, random node removal, or any of the edge removal strategies. This suggests that the most frequent hashtags keep the original graph connected and act as hub nodes.

With decreasing removal order, the size of the maximum connected component indicates that the graph is broken into very small components after about 12 timesteps (green line in the 3rd plot). The diameter plot gives another perspective on how rapidly the graph falls apart with the decreasing removal pattern: as the size of the maximum connected component falls steadily in the first 10 timesteps, the diameter steadily increases before crashing around the time all connected components become very small. The number of connected components rises at the beginning of the process, as the discussion breaks into sub-discussions, and then falls as lower-frequency nodes that form small connected components are removed. High-frequency hashtags are of course more likely to co-occur and therefore to be in larger connected components, while low-frequency nodes are more likely to start off in small connected components (including those of size 1).

These measurements indicate that a small number of relatively significant sub-discussions become disconnected from each other with the removal of the most frequent hashtags. Those sub-discussions seem to be internally anchored by ‘minor hubs’ with mid-level frequencies; after those are removed, the diameter drops as well because the remaining nodes are in very small connected components.

This experiment strongly suggests a preferential-attachment structure with a few classes of ‘hubs,’ consistent with the lack of closed triads and the power-law distribution of degrees.


Figure 2: This plot displays the measurements of diameter, number of connected components (SCCs), and the size of the largest connected component relative to the graph at that time, taken at each timestep in edge removal. The red line represents removal in order of increasing weight, the green decreasing, and the blue random.

Removal in increasing order of frequency:

Removing nodes in this order meant that the size of the largest connected component only increased, and the number of connected components only decreased. This suggests a high correlation between a node’s frequency in the original dataset of tweets and its centrality in the co-occurrence graph. Removing in this order was equivalent to repeatedly removing nodes with the lowest centrality at each timestep. The fact that the diameter remained fairly constant until the very end of the decay process is consistent with this pattern.

Removal in random order:

The steady diameter, similar to that of the increasing-order node removal strategy, suggests that the structure of the graph is similar at many scales, except where the drop-off in max-SCC size indicates that the graph quickly ‘fell apart’ at the loss of a key hub node (likely #ferguson).

                          Influence Scores for Measuring Hashtag/Co-Occurrence Rank

In addition to measuring the decomposition of the graph, we wanted to keep track of the actual types of data being removed at each timestep. We did this to connect the experimental results with useful real-world application. For instance, in running the node analysis experiment and removing higher-frequency hashtags, we see a dramatic decay in the graph early on (Figure 3, Green Line). At first glance, one might suppose that the graph is breaking down rapidly because the most commonly recurring hashtags are removed. This is not the case; it happens as a result of the most "influential" nodes being removed, as this section will make clear.

First, let’s make the distinction that being the highest-weighted (most frequently occurring) hashtag or co-occurrence does not imply being the most influential. Assume we have a cluster of N Twitter users and the most heavily weighted hashtag or co-occurrence appears N times, but each time it appears only to the same 1 person (without loss of generality, assume N >> 1). In the same network, let’s say we have a hashtag or co-occurrence that appears once but is seen by N users. The most influential data point is the one that appeared to N users, according to the preferential attachment theory from Cunha et al. (referenced in section 3) and the influence maximization algorithm from Agarwal et al. [1][2].

The notion of an Influence Score came from Agarwal et al. in their attempt to quantify the influence of a blog post. Their algorithm took an additive approach, allowing multiple parameters to contribute to an influence score, such as the length of a post, the number of inlinks, the number of outlinks, and the ‘goodness’ of the post [1]. In quantifying the influence of hashtags and co-occurrences, we take a similar approach. In our case, let’s define the Influence Score, I, of a hashtag or co-occurrence, a, as:

$$I(a) = \frac{1}{n} \sum_{t=1}^{n} \frac{\alpha_t}{\beta_t}$$

where t = a single tweet that includes a, n = the total number of tweets that include a, α_t = the number of retweets for tweet t, and β_t = the number of followers of the profile that produced tweet t. Note that α_t/β_t is a partial influence score, and the influence score is the average of all of the partial influence scores.

The ratio α/β helps to keep the score standard over varying sample spaces. For instance, if hashtag a appeared in a tweet that had 40 retweets and was shared out to 100 users, it should have the same score as if the hashtag appeared in a tweet with 4 retweets that reached only 10 users. The idea is to measure the impact of the hashtag given the number of followers. The same applies to co-occurrences.

One edge case arises if β = 0 and α > 0, in which case we set the partial influence score equal to α + 1. The logic is as follows: say we have N retweets and 1 follower (assuming the follower retweets once and N >> 1). The partial influence score would be N. Now say we still have N retweets but 0 followers; then the partial influence score is N + 1. We say the second partial influence score is higher because, theoretically, the number of retweets with respect to the number of followers is higher (which is the quantity we want to measure).

We take a partial influence score for each unique tweet in which a hashtag or co-occurrence appears. To get the total influence score for a hashtag or co-occurrence, one must compute the average of its partial influence scores. Another subtlety to note is that, for a unique tweet, the same partial influence score is assigned to every individual hashtag and to every combination of multiple hashtags (co-occurrences) in that tweet. Thus, the hashtags or co-occurrences that are most influential are the ones with the highest average influence scores. Note that this score does not depend strictly on the frequency of a hashtag or co-occurrence.
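The score described above can be sketched in a few lines. Two details are assumptions rather than part of the source: a tweet with zero retweets and zero followers is given a partial score of 0 (the text only covers β = 0 with α > 0), and a hashtag with no tweets gets a score of 0. The function names and the sample numbers are illustrative.

```python
def partial_influence(retweets, followers):
    """Partial influence score alpha_t / beta_t for a single tweet."""
    if followers == 0:
        # Rule from the text: beta = 0 and alpha > 0 gives alpha + 1.
        # Assumption: alpha = 0 and beta = 0 gives 0 (not covered by the text).
        return retweets + 1 if retweets > 0 else 0.0
    return retweets / followers


def influence_score(tweets):
    """Average partial influence over (retweets, followers) pairs.

    tweets holds one pair for every tweet containing the hashtag
    or co-occurrence being scored.
    """
    if not tweets:
        return 0.0  # assumption: no tweets means no measurable influence
    return sum(partial_influence(a, b) for a, b in tweets) / len(tweets)


# The worked example from the text: 40 retweets over 100 followers should
# score the same as 4 retweets over 10 followers.
same_ratio = partial_influence(40, 100) == partial_influence(4, 10)

# Hypothetical hashtag seen in two tweets, one from a zero-follower account.
score = influence_score([(40, 100), (5, 0)])
```

Because every hashtag and co-occurrence in a tweet receives that tweet's partial score, the same `(retweets, followers)` pair would be appended to the list for each of them before averaging.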

Contributors: christky, devneyhamilton, omosola
