GSU Text Analytics Project


Turn in a Jupyter Notebook that performs the steps listed below. Do not turn in this document. Comment your code fully and remove any extraneous code. Your homework should be the final “production-ready” version of your code, with all false starts and experimentation removed. Make it easy for me to follow what you did. I will penalize sloppy or unclear code that I cannot follow.

Include answers to anything marked “question” below as comments in the code in your Jupyter Notebook. Anything marked “output” below should be generated inside your Jupyter Notebook when I run your code.

1. Read in and subset the Enron email corpus: n = 100; random_state = 100.

2. Extract the month and body of the emails.
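As a rough illustration of step 2, the sketch below parses one raw message with Python's standard `email` module. It assumes each email is stored as a single raw string with a standard `Date` header; the function name `extract_month_and_body` is hypothetical and not part of the assignment.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def extract_month_and_body(raw_message):
    """Return (month, body) for one raw email string.

    month is an int from 1 to 12, or None when the Date header
    is missing or cannot be parsed.
    """
    msg = message_from_string(raw_message)

    month = None
    date_header = msg.get("Date")
    if date_header:
        try:
            month = parsedate_to_datetime(date_header).month
        except (TypeError, ValueError):
            # Older Python versions raise TypeError, newer ones ValueError
            month = None

    if msg.is_multipart():
        # Assumption: only the plain-text parts are wanted as the body
        body = "".join(
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        )
    else:
        body = msg.get_payload()

    return month, body
```

Returning `None` for an unparseable date keeps missing values visible, so they can be counted later rather than silently dropped.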

3. Questions:

a. How many emails have missing data in the ‘month’ column?

b. How many emails were dated the month of December?

c. How many emails were dated the month of June?
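One possible way to answer the three questions, sketched with pandas. It assumes the months live in a DataFrame column named `month`, with months coded 1 to 12 and missing values stored as NaN; the helper name `month_counts` is hypothetical.

```python
import pandas as pd


def month_counts(df, month_col="month"):
    """Return (number of missing months, emails per month) for a DataFrame."""
    n_missing = int(df[month_col].isna().sum())

    # value_counts() ignores NaN by default; sort_index() orders the months
    per_month = df[month_col].value_counts().sort_index()

    # A column containing NaN is stored as float, so cast the labels back to int
    per_month.index = per_month.index.astype(int)
    return n_missing, per_month
```

With this shape, `per_month.get(12, 0)` and `per_month.get(6, 0)` would give the December and June counts, and the same Series could feed the bar plot in step 4 through `per_month.plot(kind="bar")`.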

4. Output: Generate a bar plot showing the number of emails per month.

5. Tokenize using NLTK.

6. Perform the following cleaning steps:

a. Lowercase

b. Drop NLTK or spaCy stopwords (explain your choice in comments in the code)

c. Drop punctuation and numbers.
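The three cleaning steps can be sketched as a single pass over a token list. This is a minimal version that takes the stopword set as a parameter, so it works with either library's list; the name `clean_tokens` is hypothetical.

```python
def clean_tokens(tokens, stopwords):
    """Lowercase tokens, then drop stopwords, punctuation, and numbers."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in stopwords:
            continue
        # Keep purely alphabetic tokens only. This drops punctuation and
        # digits, and also mixed tokens such as "2nd" or "n't".
        if not token.isalpha():
            continue
        cleaned.append(token)
    return cleaned


# Example with a tiny hand-written stopword set
print(clean_tokens(["The", "Meeting", "is", "at", "10", ":", "30", "!"],
                   {"the", "is", "at"}))
```

In the notebook, the tokens would come from `nltk.word_tokenize` and the stopword set could be built with `set(nltk.corpus.stopwords.words("english"))`, both of which need the corresponding NLTK data downloaded first. Note that the `isalpha()` filter is stricter than the assignment requires, since it also removes hyphenated words and contractions.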

7. Drop any additional characters or words at your discretion, but explain your choices in comments in the code.

8. Output: Generate a next-word predictor based on bigrams within sentences. Include your most interesting result in comments in the code.
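A bigram next-word predictor can be as simple as a table of follower counts. The sketch below assumes the input is a list of sentences, each already tokenized into a list of words, so that no bigram spans a sentence boundary; `build_bigram_model` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in sentences:
        # Pair each token with its successor inside the same sentence
        for current_word, next_word in zip(tokens, tokens[1:]):
            model[current_word][next_word] += 1
    return model


def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of word."""
    if word not in model:
        return []
    return [follower for follower, _ in model[word].most_common(k)]


sentences = [
    ["please", "call", "me"],
    ["please", "call", "back"],
    ["please", "call", "me"],
    ["please", "review"],
]
model = build_bigram_model(sentences)
print(predict_next(model, "call", k=1))
```

Ranking by raw counts is the simplest choice; dividing each count by the total for that word would turn the ranking into conditional probabilities without changing the order.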

9. Output: Create two horizontal bar plots showing the top 30 nouns in emails dated December (all years) versus June (all years). Compare and interpret in comments in the code.

10. Output: Create two TF-IDF word clouds (max 750 words) showing TF-IDF-weighted words in emails dated December (all years) versus June (all years). Compare and interpret in comments in the code.
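For step 10, the word weights can be computed before any plotting. The sketch below is one possible TF-IDF variant, not the only one: it assumes each email is a tokenized document, uses term frequency normalized by document length and a smoothed inverse document frequency, and sums the scores over all emails in a month. The function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Sum TF-IDF scores per word over a list of tokenized documents."""
    n_docs = len(documents)

    # Document frequency: in how many documents each word appears
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = Counter()
    for tokens in documents:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumption; other formulas are equally valid)
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            weights[word] += tf * idf
    return dict(weights)
```

The resulting dictionary could then be passed to the third-party `wordcloud` package, for example `WordCloud(max_words=750).generate_from_frequencies(weights)`, once for the December emails and once for the June emails.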

# Load the full Enron email corpus from Kaggle: https://www.kaggle.com/wcukierski/enron-email-data
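A minimal sketch of the loading and subsetting in step 1, assuming the Kaggle download has been saved locally as a CSV file. The file name `emails.csv`, the `file` and `message` column names, and the helper name `load_subset` are assumptions here, so check them against the actual download.

```python
import pandas as pd


def load_subset(csv_path, n=100, random_state=100):
    """Read the corpus CSV and return a reproducible random sample of n rows."""
    emails = pd.read_csv(csv_path)
    # A fixed random_state makes sample() return the same rows on every run
    return emails.sample(n=n, random_state=random_state).reset_index(drop=True)


# Usage (path is an assumption):
# subset = load_subset("emails.csv")
```

Fixing `random_state` matters for grading, because the answers to the month questions depend on exactly which 100 emails are drawn.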
