For this assignment you will perform several analyses of a Twitter stream that I recorded during a presidential debate. Each analysis should result in a graphical presentation of the results. The specific tasks you should perform are:
1) Create a nicely labeled vertical bar graph of one of the following statistics: most retweeted tweet, most quoted tweet, or most active tweeter. For the statistic you choose, determine the top 10 entries in the ranking and plot the number of retweets, number of quotes, or number of tweets, respectively, as a vertical bar of the appropriate height for each entry.
2) Create a nicely labeled histogram concerning information about the tweeters. For each tweeter in the stream, determine either the age of the account or the number of followers of the account. Then plot a histogram of either the account age or number of followers.
3) Create two nicely labeled scatter plots, one in 2D and one in 3D. Using the account age and number of followers information gathered in part (2), first create a 2D scatter plot of either account age vs. number of followers, or account age vs. number of tweets in the stream. Add a special annotation of the account name for the accounts with the minimum and maximum values being plotted versus age. Next create a 3D scatter plot, where the three dimensions are now account age, number of followers and number of tweets in the stream. Add some interesting annotation based upon looking at these results.
4) Create a nicely labeled plot containing two histograms. The values to histogram are determined from the sentiments of the tweets in the stream. The tweet sentiment, a value between -1 and 1, can be determined by creating a TextBlob object using the textblob module. The data for one histogram comes from the positive sentiment tweet values, the data for the other from the absolute values of the negative sentiment tweets. You can present these two histograms either as a transparent overlay of the two data sets, or as stacked histograms.
5) Create a nicely labeled horizontal bar graph of one of the following: either the number of original tweets vs. retweeted tweets in the stream, or the quantity of positive and negative sentiment tweets in the stream.
6) Create a nicely labeled line plot of the tweeting rate of the stream over the time interval spent recording the tweets.
7) Create a nicely labeled animation that depicts the change in quantity of certain keywords found in the tweets during the debate. For one quantity, count how many tweets contain either of the words Hillary or Clinton; for the other, count the tweets containing either of the words Donald or Trump. The animation should be in the form of line plots showing how these two quantities change over the time interval of the debate. The two graphs should move from the left to the right of the plot as you simulate time passing during the debate.
8) Extra credit: Determine how to extract all of the tweets that are part of a conversation thread between a set of accounts in the stream. Determine the longest such thread over the course of the debate and create a display about it. You should graph a scatter/line plot that depicts which account tweeted when in this thread during the time of the debate. Essentially plot tweeting account vs. debate time, doing something to easily distinguish the different accounts in the plot (color, marker, etc.).
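As a starting point for tasks (6) and (7), the sketch below bins tweets by minute, optionally keeping only those that mention given keywords. It is a minimal illustration and not a required approach. It assumes each tweet dictionary has a `created_at` field in the classic Twitter API timestamp format and a `text` field; check both against the actual data. The names `TIME_FORMAT`, `minute_of`, and `tweets_per_minute` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumption: timestamps look like "Wed Oct 10 20:19:24 +0000 2018"
# (the classic Twitter API format). Verify this against the data file.
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def minute_of(tweet):
    """Return the tweet's timestamp truncated to the start of its minute."""
    stamp = datetime.strptime(tweet["created_at"], TIME_FORMAT)
    return stamp.replace(second=0, microsecond=0)


def tweets_per_minute(tweets, keywords=None):
    """Count tweets in each one-minute bin.

    If keywords is given (lowercase strings), only tweets whose text
    contains at least one of them are counted. This is a plain substring
    match, so "trump" would also match "trumpet".
    Returns a list of (minute, count) pairs sorted by time.
    """
    counts = Counter()
    for tweet in tweets:
        if keywords is not None:
            text = tweet.get("text", "").lower()
            if not any(word in text for word in keywords):
                continue
        counts[minute_of(tweet)] += 1
    return sorted(counts.items())
```

Calling `tweets_per_minute(tweets)` gives the overall tweeting rate, while `tweets_per_minute(tweets, ("hillary", "clinton"))` and `tweets_per_minute(tweets, ("donald", "trump"))` give the two keyword series; the resulting pairs can be passed to whatever plotting library you use.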
The data for the debate is given in a large data file (280MB) named tweetsaplenty.json stored on OneDrive and accessed through this link: https://unhnewhaven-my.sharepoint.com/:u:/g/personal/deggert_newhaven_edu/ETLHKhXAv7JNuNvKwNl3P3ABGS4KYrmZP0HeaIqjAAW-fg?e=4%3a2OZ9os&at=9. It contains a set of 22,911 tweets (there is a 22,912th record, but it is incomplete), each stored one per line in JSON format. In order to properly read in the data, you should open the file, read each line and strip off the end of line character, then use the loads() function in the json module twice to extract each tweet from a line into a dictionary.
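The reading procedure described above might look like the following sketch. It assumes that each line holds a JSON-encoded string that itself contains the tweet's JSON, which is why `loads()` is applied twice, and that the incomplete final record fails to decode and can simply be skipped. The function name `load_tweets` is hypothetical.

```python
import json


def load_tweets(path):
    """Read one tweet per line and return a list of tweet dictionaries."""
    tweets = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")  # strip the end-of-line character
            if not line:
                continue
            try:
                # Assumption: the first loads() yields a JSON string,
                # the second turns that string into a dictionary.
                tweet = json.loads(json.loads(line))
            except (json.JSONDecodeError, TypeError):
                # Skip records that cannot be decoded, such as the
                # incomplete final record.
                continue
            tweets.append(tweet)
    return tweets
```

With the file downloaded into the working directory, `tweets = load_tweets("tweetsaplenty.json")` should return a list whose length you can compare against the expected 22,911 tweets.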
Properly comment your code and submit it along with the graphs that your program produces from the debate data file.

