Introduction


Idea


“Characterize social networks of communities based on sentiment analysis of tweets from their core and peripheral members.”


While brainstorming for an idea, we asked ourselves: who is using Twitter? We felt that in Germany it is mostly used by smaller communities such as musicians, politicians or journalists. Those communities usually have two kinds of members: the ones actively participating, e.g. professional musicians, and the more peripheral members, e.g. the fans. Twitter is a medium that brings those two groups together. We thought it might be interesting to characterize the social network of a community based on the tweets of each of the two groups and to compare the results.


Two members of our team are German hip-hop enthusiasts, and German hip-hop is known from the media as a culture in which disputes are carried out openly, among other places on Twitter. The idea was to collect tweets about hip-hop artists written by the artists themselves and by their fans, run a sentiment analysis on those tweets, and create a social network graph based on the results.


The two main questions are:

  1. Based on the tweets of hip-hop artists, can we tell who is friend or foe with whom?
  2. Does the social graph based only on tweets of the fans translate to the graph we generate with the core members' data?



Approach


To be able to compare the core members’ data (VIP) with the peripheral members’ data (plebs), we developed two different approaches to plot a social network graph.


Based on VIP data from hip-hop artists:
For each VIP we collected data concerning other VIPs of the community. We analyzed the sentiment of each tweet and normalized its parameters (follow, mention, reply, and retweet) as explained below to calculate a friendship-score for each VIP-to-VIP-relationship and compiled them in a VIPxVIP-relationship matrix.


Based on pleb data from hip-hop fans:
For each VIP we collected tweets about them via the Twitter Search API. As we did with the tweets of the VIPs, we analyzed the sentiment and looked up which of the VIPs the tweet’s author is following. For each of the VIPs followed by the author, we assigned a friendship-score towards the mentioned VIP and compiled the scores in another VIPxVIP-relationship matrix.
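
The pleb-based matrix described above could be accumulated roughly as follows. This is a minimal sketch, not the project's actual code: the function name `pleb_relationship_matrix`, the tuple layout of `tweets`, and the `follows` dictionary are assumptions made for illustration, and the sentiment value is treated as a single ready-made number.

```python
def pleb_relationship_matrix(vips, tweets, follows):
    """Accumulate a VIPxVIP matrix from fan tweets.

    vips:    list of VIP account names (defines row/column order)
    tweets:  iterable of (author, mentioned_vip, sentiment) tuples
    follows: dict mapping a fan (author) to the set of VIPs they follow
    """
    index = {vip: n for n, vip in enumerate(vips)}
    matrix = [[0.0] * len(vips) for _ in vips]
    for author, mentioned, sentiment in tweets:
        if mentioned not in index:
            continue
        k = index[mentioned]
        # Every VIP i followed by the author gets a score towards VIP k.
        for followed in follows.get(author, set()):
            if followed not in index or followed == mentioned:
                continue
            matrix[index[followed]][k] += sentiment
    return matrix


# Hypothetical toy data: three VIPs and two fans.
vips = ["A", "B", "C"]
tweets = [("fan1", "B", 3.0), ("fan2", "B", 1.0), ("fan2", "C", 2.0)]
follows = {"fan1": {"A", "B"}, "fan2": {"A"}}
matrix = pleb_relationship_matrix(vips, tweets, follows)
```

In this toy data both fans follow A, so all scores end up in A's row: 4.0 towards B and 2.0 towards C.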


We then used t-SNE, an algorithm for dimensionality reduction, in combination with d3.js to visualize the social network graph, using the relationship matrices as input.


[Figure: approach]


Collecting Data


Input


The only manual step is to collect a list of Twitter accounts of the core members of a given community. We thank @hiphopDE for following almost all German hip-hop artists, so we could simply use their friends list as a starting point.



Crawling


To gather all the data we needed for our project, we wrote a custom crawler for the Twitter API. Among the available options, we decided to use Java in combination with the Twitter4J API client as the basis for our crawler.


Our initial plan was to crawl all the VIPs first, clean unnecessary information out of the database, then crawl the fans of the VIPs, and finish with another cleaning step. After doing this for a few weeks, we ended up with many different database versions and copies, and nobody really knew which version contained the latest data.


We therefore changed the crawling process completely, and this worked out well. First, we no longer crawled VIPs and fans separately. Second, we built an executable jar file that takes a list of Twitter accounts (VIPs) as input and crawls their data into a database. The crawler processes one VIP after another, each together with all of their fans.
We distributed the jar file to everyone in our team, split the list of all VIPs into parts, and then everyone crawled their part of the list simultaneously (see picture) into their own local copy of the database. Afterwards, we wrote an SQLite script to merge the different databases back into one database that contains the data from the whole crawling process.


[Figure: crawling process]
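
Merging several local SQLite databases into one can be done with `ATTACH DATABASE` and `INSERT OR IGNORE`. The sketch below drives this from Python's `sqlite3` module instead of a plain SQL script. The two-table schema and its column names are assumptions for illustration, since the real columns are not listed here.

```python
import os
import sqlite3
import tempfile

TABLES = ["vip", "vipTweets"]  # subset of the project's tables, for brevity


def create_schema(conn):
    # Hypothetical minimal schema; the real column layout is an assumption.
    conn.execute("CREATE TABLE IF NOT EXISTS vip (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vipTweets "
        "(id INTEGER PRIMARY KEY, vipId INTEGER, text TEXT)"
    )


def merge_databases(target_path, part_paths):
    """Copy all rows from each partial database into the target database."""
    target = sqlite3.connect(target_path)
    create_schema(target)
    for path in part_paths:
        target.execute("ATTACH DATABASE ? AS part", (path,))
        for table in TABLES:
            # INSERT OR IGNORE skips rows whose primary key already exists.
            target.execute(f"INSERT OR IGNORE INTO {table} SELECT * FROM part.{table}")
        target.commit()
        target.execute("DETACH DATABASE part")
    return target


def make_part(path, vips, tweets):
    """Create a small partial database, standing in for one crawl result."""
    conn = sqlite3.connect(path)
    create_schema(conn)
    conn.executemany("INSERT INTO vip VALUES (?, ?)", vips)
    conn.executemany("INSERT INTO vipTweets VALUES (?, ?, ?)", tweets)
    conn.commit()
    conn.close()


workdir = tempfile.mkdtemp()
part_a = os.path.join(workdir, "part_a.db")
part_b = os.path.join(workdir, "part_b.db")
make_part(part_a, [(1, "vip_one")], [(10, 1, "hello")])
make_part(part_b, [(1, "vip_one"), (2, "vip_two")], [(20, 2, "world")])

merged = merge_databases(os.path.join(workdir, "merged.db"), [part_a, part_b])
vip_count = merged.execute("SELECT COUNT(*) FROM vip").fetchone()[0]
tweet_count = merged.execute("SELECT COUNT(*) FROM vipTweets").fetchone()[0]
```

Rows that appear in more than one partial database (here `vip_one`) are kept only once because of the primary key.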


Database


As mentioned before, we store the data from the crawling process in an SQLite database. The diagram below shows that our database contains two types of data. The VIP data is spread across the tables vip, vipFriends, vipTweets and vipTweetmentions. The data for the plebs is stored in the tables plebTweets, plebTweetmentions and plebFriends. Because of the large number of tweets we crawled, queries that aggregate data were quite slow. To solve this problem, we sped up our queries by adding several indexes to the database.

[Figure: database diagram]
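
As an illustration of how an index changes an aggregating query, the following sketch creates a simplified `plebTweets` table in memory and compares the query plan before and after adding an index. The columns `vipId` and `sentiment` and the index name are assumptions, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the plebTweets table; column names are assumed.
conn.execute(
    "CREATE TABLE plebTweets (id INTEGER PRIMARY KEY, vipId INTEGER, sentiment INTEGER)"
)
conn.executemany(
    "INSERT INTO plebTweets (vipId, sentiment) VALUES (?, ?)",
    [(n % 50, n % 5) for n in range(10000)],
)

query = "SELECT SUM(sentiment) FROM plebTweets WHERE vipId = ?"

# Without an index SQLite has to scan the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, (7,)).fetchall()

conn.execute("CREATE INDEX idx_plebTweets_vipId ON plebTweets (vipId)")

# With the index SQLite can look up the matching rows directly.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, (7,)).fetchall()

total = conn.execute(query, (7,)).fetchone()[0]
```

The last column of each `EXPLAIN QUERY PLAN` row holds a readable description of the plan, which shows whether the index is used.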

Sentiment Analysis


To analyse the sentiment of a tweet, we looked into multiple options and decided on SentiStrength in the end, a tool to analyse the sentiment of a given text. According to its website: “SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:
-1 (not negative) to -5 (extremely negative)
+1 (not positive) to +5 (extremely positive)
Why does it use two scores? Because research from psychology has revealed that we process positive and negative sentiment in parallel - hence mixed emotions.”
SentiStrength can be used on short messages like tweets, has emoji/emoticon tables, and is easily extensible with specific slang words.


For example, the text 'I love you but hate the current political climate.' has positive strength 3 and negative strength -4.
We used the German library provided by Hannes Pirker, Interaction Technologies Group at the Austrian Research Institute for Artificial Intelligence (OFAI).
As we get two sentiment scores for each tweet, we decided not to offset them against each other but to generate two maps at the end, to see which artists are closest to and furthest from each other.
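
Keeping the two sentiment strengths separate, rather than offsetting them, can be sketched like this. The function name `split_sentiment` and the input layout are hypothetical, and summing the absolute value of the negative strength is an assumption about how the negative channel is accumulated.

```python
def split_sentiment(scored_tweets):
    """Accumulate positive and negative strengths in two separate maps.

    scored_tweets: iterable of (source, target, positive, negative) tuples,
    where positive is in 1..5 and negative is in -1..-5.
    """
    positive, negative = {}, {}
    for source, target, pos, neg in scored_tweets:
        key = (source, target)
        positive[key] = positive.get(key, 0) + pos
        # Store the negative strength as a magnitude; it is never
        # subtracted from the positive strength.
        negative[key] = negative.get(key, 0) + abs(neg)
    return positive, negative


# Hypothetical scores: the first tweet uses the example strengths 3 and -4.
scored = [("A", "B", 3, -4), ("A", "B", 1, -1)]
positive, negative = split_sentiment(scored)
```

Each of the two maps can then feed its own relationship matrix, one for the positive map and one for the negative map.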





Calculating Relationship-Matrix


Relationship-Matrix


A relationship matrix is a VIPxVIP matrix in which each row contains the friendship-scores of a VIP i towards every other VIP k (k≠i). The index s denotes the sentiscore used and can be positive or negative. The friendship-score is calculated from the following parameters derived from Twitter data:


VIP
VIP i follows another VIP k:


VIP i retweets another VIP k:


VIP i replies to another VIP k:


VIP i mentions another VIP k:


Pleb
For each pair of VIPs i, k (k≠i) that a pleb follows, we calculate a ‘following’ parameter.
For each VIP i followed by a pleb who mentions VIP k, we calculate a ‘mention’ parameter.


Normalization of the Friendship-Parameter


To normalize the VIP data, we needed to address the issue that some VIPs tweet more often than others. Therefore, we divided the ‘tweet’-sentiscores for each parameter by the sum of the ‘tweet’-sentiscores of all tweets by the given VIP. For the ‘following’-score, we divided by the total number of VIPs followed by the given VIP.
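
One possible reading of this normalization, as a minimal sketch: the function name, the `(kind, target)` keys, and the idea of passing in a precomputed total are assumptions for illustration, not the project's actual implementation.

```python
def normalize_vip_parameters(tweet_scores, total_score, followed):
    """Normalize the parameters of a single VIP i.

    tweet_scores: dict mapping (kind, target_vip) to the summed sentiscore,
                  with kind in {"retweet", "reply", "mention"}
    total_score:  sum of the sentiscores of all tweets by VIP i
    followed:     set of VIPs that VIP i follows
    """
    normalized = {}
    for (kind, target), score in tweet_scores.items():
        # Tweet-based parameters are divided by the VIP's total sentiscore.
        normalized[(kind, target)] = score / total_score if total_score else 0.0
    for target in followed:
        # The following parameter is divided by the number of followed VIPs.
        normalized[("follow", target)] = 1.0 / len(followed)
    return normalized


# Hypothetical numbers for one VIP.
scores = {("mention", "B"): 6, ("retweet", "C"): 2}
params = normalize_vip_parameters(scores, total_score=20, followed={"B", "C", "D", "E"})
```

With these numbers the mention of B becomes 0.3, the retweet of C becomes 0.1, and each followed VIP receives 0.25.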


VIP i follows another VIP k:


VIP i retweets another VIP k:


VIP i replies to another VIP k:


VIP i mentions another VIP k:



To normalize the pleb data, we needed to address the issue that some VIPs are tweeted about more often than others. Therefore, we divided the ‘tweet’-sentiscores for each parameter by the total sum of the ‘tweet’-sentiscores of all tweets by all plebs about the given VIP. For the ‘following’-score, we divided by the total number of VIPs followed by the given pleb, analogous to the ‘following’-score of VIPs.
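
The pleb normalization can be sketched in the same spirit. The function names and the tuple layout below are hypothetical, and the sentiscores are assumed to be available as plain numbers per tweet.

```python
def normalize_pleb_mentions(tweets):
    """Divide each tweet's sentiscore by the total for the mentioned VIP.

    tweets: list of (pleb, mentioned_vip, sentiscore) tuples
    """
    totals = {}
    for _, vip, score in tweets:
        totals[vip] = totals.get(vip, 0) + score
    return [
        (pleb, vip, score / totals[vip])
        for pleb, vip, score in tweets
        if totals[vip]  # skip VIPs whose total is zero
    ]


def pleb_follow_score(followed_vips):
    """Following parameter of one pleb: one share per followed VIP."""
    return 1.0 / len(followed_vips) if followed_vips else 0.0


# Hypothetical toy data.
tweets = [("p1", "K", 3), ("p2", "K", 1), ("p1", "J", 2)]
normalized = normalize_pleb_mentions(tweets)
follow_share = pleb_follow_score({"A", "B"})
```

A VIP who is tweeted about very often therefore contributes no more total weight than a VIP who is mentioned only a few times.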


For each VIP i,k (k≠i) that pleb p follows:


Pleb mentions VIP k and follows VIP i:



Calculation of the Friendship-Score


After normalization, each parameter has a value between 0 and 1. We added weights to improve the results, since following someone should be worth more than just mentioning them in a tweet.
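
A weighted combination of the normalized parameters might look like the sketch below. The concrete weight values, the position of retweets and replies in the ordering, and the use of a plain weighted sum are all assumptions; the only constraint taken from the text is that following weighs more than mentioning.

```python
# Assumed weights: only "follow" > "mention" is taken from the text.
WEIGHTS = {"follow": 4.0, "retweet": 3.0, "reply": 2.0, "mention": 1.0}


def friendship_score(parameters, weights=WEIGHTS):
    """Weighted sum of normalized parameters for one VIP-to-VIP relationship.

    parameters: dict mapping a parameter kind to its normalized value (0..1)
    """
    return sum(weights[kind] * value for kind, value in parameters.items())


# Hypothetical normalized parameters of VIP i towards VIP k.
score = friendship_score({"follow": 0.25, "mention": 0.3})
```

Here the result is 4.0 * 0.25 + 1.0 * 0.3 = 1.3.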


VIP
positive:


negative:


Pleb
Sum up the values for all plebs:


negative:


We used the relative order as follows:






Visualization


t-SNE


t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Laurens van der Maaten and Geoffrey Hinton [http://lvdmaaten.github.io/tsne/]. We generated four relationship matrices for our HipHop data set, which differ in two respects: whether the data comes from VIPs or plebs, and whether it is calculated from negative or positive tweets. These matrices are given as raw input data to the JavaScript implementation of t-SNE provided by Andrej Karpathy.


Andrej Karpathy's implementation offers a function step which, when executed, improves upon the current result. The result is a two-dimensional array containing the x and y coordinates of each object (here: each VIP). Points that are close together signify similarity, and distant points signify dissimilarity. We map these coordinates to the size of the visualization and render them on the website with D3. Each step is executed with a short delay, so that the transitions show how the result improves.
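
Mapping the embedding coordinates to the size of the visualization amounts to a min-max rescaling. The website itself does this in JavaScript; the sketch below shows the same idea in Python, with the function name, canvas size, and margin chosen purely for illustration.

```python
def scale_to_canvas(points, width, height, margin=20):
    """Rescale 2D points so they fill a width x height canvas with a margin."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Fall back to 1.0 to avoid division by zero when all points coincide.
    span_x = (max_x - min_x) or 1.0
    span_y = (max_y - min_y) or 1.0
    return [
        (
            margin + (x - min_x) / span_x * (width - 2 * margin),
            margin + (y - min_y) / span_y * (height - 2 * margin),
        )
        for x, y in points
    ]


# Hypothetical embedding output for three VIPs.
embedding = [(-2.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
pixels = scale_to_canvas(embedding, width=840, height=440)
```

Repeating this rescaling after every optimization step keeps all points inside the visible area while they move.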



D3


D3 stands for Data-Driven Documents. It is a JavaScript library used to produce dynamic, interactive data visualizations in web browsers using HTML, SVG, and CSS. With D3 you can bind arbitrary data to the DOM (Document Object Model) and then transform the document based on this data.



Website


[Figure: website]

On our website the user can choose between two data sets: HipHop and Politics [1]. For each data set the user is given two tables with four possible visualizations: VIP BFF, VIP BeeF, Pleb BFF, and Pleb BeeF [2]. The different visualizations are explained in a tooltip on the upper left of the website [3]. It is possible to switch between circle and Twitter profile picture as representations of the VIPs [4].




Results


BFF-Map VIPs


This BFF (“Best Friends Forever”) map is calculated based only on the data we extracted from the Twitter accounts of our VIPs. It is designed to visualize positive relationships. The nearer two points are to each other, the closer their estimated friendship.


[Figure: VIP BFF]

  1. Allesodernix - Label
    Allesodernix is the Twitter account of a German hip-hop label. It is surrounded by its contracted artists, which makes sense, as artists signed to the same label usually have a good relationship with each other.

  2. Marsimoto - Marteria
    Marsimoto is a character played by the artist Marteria. As they are the same person, it makes sense that both Twitter accounts lie almost on top of each other.

  3. RAF Camora, Chakuza, Joshimixu, (Pedaz)
    RAF, Chakuza and Joshimixu are friends in real life and even made a collaboration LP together.

  4. Die Orsons - Tua, Bartek, Maeckes, Kaas
    Die Orsons is a hip-hop group with Tua, Bartek, Maeckes and Kaas as its members. They are obviously friends, so all members are close to each other.



BFF-Map Plebs


This BFF (“Best Friends Forever”) map is calculated based only on the data we extracted from the Twitter accounts of the plebs (fans). It is designed to visualize positive relationships. The nearer two points are to each other, the closer their estimated friendship based on the Twitter analysis of their fans.


[Figure: Pleb BFF]

  1. Die Orsons - Tua, Bartek, Maeckes, Kaas
    The Orsons can be found again. This makes sense as fans of the band usually like its members as well.

  2. Trailerpark
    Trailerpark is another label where the members are close to each other.

  3. Popular German hip-hop
    This cluster reflects a genre, the new era of popular German hip-hop, rather than a circle of friends. Although some of the artists don’t share a friendly connection (anymore), they seem to be liked by the same fanbase anyway (e.g. Alligatoah and Cro). Prinz Pi, KIZ, Casper, Marteria and Kraftklub are good friends, and this seems to translate to their fanbase.

  4. Marteria - Marsimoto
    Marsimoto's music is quite different from Marteria's. The two accounts are still close, but they no longer lie directly on top of each other. This suggests that not all fans of one are also fans of the other, although many are fans of both.



BeeF-Maps


The BeeF-Map is the counterpart of the BFF map. It is designed to visualize antipathy amongst the VIPs: VIPs that don’t like each other should be shown close together. Unfortunately, this approach doesn’t work very well. We found that the VIPs are spread quite randomly in both the BeeF-Map based on VIP data and the BeeF-Map based on pleb data. We have multiple explanations for this.
Firstly, the SentiStrength tool we use is sometimes not very accurate.
It isn’t trained on hip-hop slang (for time reasons), and it recognizes neither irony nor quoted lyrics. Secondly, most of the time antipathy is expressed by simply ignoring each other, which means there are not that many hateful tweets between VIPs that dislike each other. This makes it hard for us to measure antipathy.
Therefore, we assume this approach fails due to a lack of data and the misinterpretation of sentiment in tweets.



Conclusion


Based on the tweets of hip-hop artists: can we tell who is friend or foe of whom?

Tweets between hip-hop artists are a good indicator of proximity on a personal and professional level. Artists who work together or are friends are drawn very close to each other and form clusters as expected. However, as explained above, further research is needed to recognize antipathy properly: in the BeeF-Maps, the artists are spread more or less randomly.


Does the social graph based only on tweets of the fans translate to the graph we generate with the core members data?

The social graph based on tweets from the fans can be seen as a classification into different genres of hip-hop. As we have seen, band members still appear very close to each other in the graph, so to a certain extent it matches the social graph based on the artists' tweets. In the end, however, the strongest characteristic of the clusters in this graph is the type of music people like, even if the people playing it do not like each other very much. In short, music is more important than personal quarrels.




Bonus: German Politicians


As input we took the list of German politicians (MdB, members of the Bundestag) and added the color of each politician's party. As shown in the picture below, the visualization clusters them according to their respective parties.


[Figure: Politicians]



Future Possibilities


  1. DataScience 2016 Crossover
    Crawl data of 200-300 US politicians and their respective democratic/republican-scores from the POC Analyzer project. Color them in their respective parties’ colors and display the democratic/republican-scores next to their names. Inspect whether republicans who are closer to democrats in the social graph also tend to have higher democratic-scores (and vice versa).

  2. SentiStrength - Slang Dictionary
    Spend some time experimenting with a slang dictionary extension for SentiStrength and see whether the BeeF-map improves with better sentiscores.

  3. Pre Election - Post Election
    Repeatedly crawl tweets from politicians, especially before and after elections. In the best case, the government changes after the election. This could generate interesting insights into how politicians change their attitude towards other parties when they are in power compared to when they are not.


Code

https://github.com/MichaelCheck/CrawlerDataSience2016