Natural Language Processing
Q1 Review the Python script in Q1 Folder – NLTK_Text_Analysis.py
Use the text below to apply the same process:
Text = """Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."""
a. Text Analysis Operations using NLTK
b. Tokenization
c. Stopwords removal
d. Lexicon Normalization such as Stemming and Lemmatization
e. POS Tagging
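Steps b and c can be sketched with the standard library alone. This is a rough stand-in, not the method used in NLTK_Text_Analysis.py: the regular expression and the small STOPWORDS set are assumptions made for illustration, whereas the course script presumably relies on NLTK's own tokenizers and English stopword list. Stemming, lemmatization, and POS tagging (steps d and e) need NLTK itself and are not covered here.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword set, not NLTK's English list.
STOPWORDS = {"is", "of", "the", "its", "can", "be", "to", "in", "it", "a",
             "where", "each", "has", "which", "one", "back", "between"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphenated words stay whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopwords]

text = "Backgammon is one of the oldest known board games."
tokens = tokenize(text)
filtered = remove_stopwords(tokens)

print(tokens)
print(filtered)
print(Counter(filtered).most_common(3))  # simple frequency distribution
```

Swapping in NLTK's tokenizer and stopword corpus changes only the two helper functions; the rest of the flow stays the same.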
Q2 Analyze the customer reviews in the file Restaurant_Reviews.tsv
a. Explain each step for the following text clean-up commands
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

review = dataset['Review'][0]
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
review = ' '.join(review)
b. What is the classification question?
c. The example uses the Naïve Bayes classifier to classify the sentiments. Calculate the confusion matrix and the accuracy:
TP = # True Positives,
TN = # True Negatives,
FP = # False Positives,
FN = # False Negatives,
Accuracy = (TP + TN) / (TP + TN + FP + FN)
d. Apply the logistic regression classifier to the problem and recalculate the quantities from "c", i.e. TP, TN, FP, FN, and Accuracy.
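The confusion-matrix counts and the accuracy formula from part c can be sketched in plain Python as follows. The function names and the label lists are hypothetical; the labels are made up for illustration and are not results from Restaurant_Reviews.tsv. The same helper can be applied to the predictions of either classifier (Naïve Bayes in part c, logistic regression in part d).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical labels: 1 = liked, 0 = not liked.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))
```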
Q3 NLTK Corpus on Movie Reviews
Q3a Use the following reference to perform sentiment analysis on movie reviews with "Q3 Movie Reviews.py"
https://www.nltk.org/book/ch06.html
Q3b – Explain how the Bag of Words model helps in sentiment analysis
http://blog.chapagain.com.np/python-nltk-sentiment…
Summarize the entire code in the NLTKMovieReview.py file as part of the solution
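As a starting point for Q3b, the Bag of Words idea can be sketched in a few lines: each document is reduced to word counts, word order is discarded, and the counts become a fixed-length feature vector that a classifier can consume. The two short reviews and the helper names below are hypothetical and are not taken from NLTKMovieReview.py or the NLTK movie review corpus.

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts, ignoring word order."""
    return Counter(text.lower().split())

def to_vector(bag, vocabulary):
    """Turn a bag into a count vector aligned with the shared vocabulary."""
    return [bag[word] for word in vocabulary]

# Hypothetical mini-corpus of two reviews.
docs = ["a great great movie", "a dull movie"]
bags = [bag_of_words(doc) for doc in docs]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(bag, vocabulary) for bag in bags]

print(vocabulary)
print(vectors)
```

Words such as "great" and "dull" end up in separate vector positions, which is what lets a classifier associate them with positive or negative labels.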
Q4 Twitter Analysis sentiment140
Perform a Twitter sentiment analysis. Background:
– Twitter users interact by retweeting and responding.
– Twitter employs a message size restriction of 280 characters or less, which forces users to stay focused on the message they wish to disseminate.
– Twitter data is well suited to the Machine Learning (ML) task of sentiment analysis.
– Sentiment analysis falls under Natural Language Processing (NLP).
– The sentiment140 dataset is made up of about 1.6 million random tweets with corresponding binary labels: 0 for negative sentiment and 1 for positive sentiment.
https://towardsdatascience.com/the-real-world-as-s…
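Before any classifier is trained, tweets usually need some normalization. The sketch below shows one possible clean-up pass; the specific rules (dropping URLs, @mentions, the "#" sign, and non-letters) are assumptions chosen for illustration and are not necessarily the steps used in the referenced article. The sample tweet is made up.

```python
import re

MAX_TWEET_LENGTH = 280  # size restriction stated in the assignment

def clean_tweet(tweet):
    """Strip URLs, @mentions, '#' signs and non-letters, then lowercase."""
    tweet = re.sub(r"https?://\S+", " ", tweet)   # remove links
    tweet = re.sub(r"@\w+", " ", tweet)           # remove user mentions
    tweet = tweet.replace("#", " ")               # keep the hashtag word itself
    tweet = re.sub(r"[^a-zA-Z\s]", " ", tweet)    # drop digits and punctuation
    return " ".join(tweet.lower().split())        # collapse whitespace

# Hypothetical tweet, not taken from sentiment140.
sample = "@bob Loving the new phone!!! #happy http://example.com/x"
assert len(sample) <= MAX_TWEET_LENGTH
print(clean_tweet(sample))
```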
Q5 Analyze Clothing Reviews
https://www.kaggle.com/nicapotato/womens-ecommerce…
This dataset comes from a women’s clothing e-commerce site and revolves around the reviews written by customers. It includes 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the variables:
Class Name: Categorical name of the product class name
Perform
a. Text extraction & creating a corpus
b. Text Pre-processing
c. Create the DTM & TDM from the corpus
d. Exploratory text analysis
e. Feature extraction by removing sparsity
f. Build the classification models and compare Logistic Regression to Random Forest classification
https://medium.com/analytics-vidhya/customer-revie…
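Steps c and e can be illustrated with a small pure-Python sketch: a document-term matrix (DTM) has one row per document and one column per term, the term-document matrix (TDM) is its transpose, and removing sparsity means dropping terms that occur in too few documents. The helper names, the 0.5 threshold, and the three-line corpus are all hypothetical; they are not taken from the Kaggle dataset or the referenced article.

```python
from collections import Counter

def build_dtm(corpus):
    """Return (vocabulary, dtm) where dtm[i][j] counts term j in document i."""
    bags = [Counter(doc.lower().split()) for doc in corpus]
    vocabulary = sorted(set().union(*bags))
    dtm = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, dtm

def transpose(matrix):
    """Turn a DTM into a TDM (terms as rows), or back again."""
    return [list(row) for row in zip(*matrix)]

def remove_sparse_terms(vocabulary, dtm, min_doc_fraction):
    """Keep only terms that occur in at least min_doc_fraction of the documents."""
    n_docs = len(dtm)
    keep = [j for j in range(len(vocabulary))
            if sum(1 for row in dtm if row[j] > 0) / n_docs >= min_doc_fraction]
    return [vocabulary[j] for j in keep], [[row[j] for j in keep] for row in dtm]

# Hypothetical reviews for illustration.
corpus = ["soft dress fits well", "dress runs small", "love this soft dress"]
vocabulary, dtm = build_dtm(corpus)
tdm = transpose(dtm)
dense_vocab, dense_dtm = remove_sparse_terms(vocabulary, dtm, min_doc_fraction=0.5)

print(vocabulary)
print(dense_vocab, dense_dtm)
```

The reduced matrix is what would be passed on as features to the classification models in step f.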
Q6 Sentiment analysis on Trump and Hillary tweets (Optional)
https://www.kaggle.com/pavanraj159/sentiment-analy…
a. Import Data and Data manipulation
b. Percentage of retweets
c. Languages used in tweets
d. Original authors of retweets
e. Tweets by month
f. Import positive and negative word dictionaries
g. Scoring tweets & distribution
h. Sentiment distribution of tweets and correlation matrix
i. Popular hashtags & wordcloud – hashtags
j. Popular Twitter account references & wordcloud
k. Popular words in tweets, wordcloud – popular words
l. Popular positive and negative words used by Trump
m. Hashtag references by Twitter accounts & Account references by hashtag
n. Positive & Negative word references by Trump
o. Sentiment of tweets by hour of day
p. Favorite and retweets by sentiment, Average retweets and favorites by sentiment
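Steps f and g amount to dictionary-based scoring: each tweet's score is the number of positive words minus the number of negative words, and the scores are then tallied into a distribution. The sketch below assumes this simple counting rule; the miniature word sets and the tweets are invented for illustration and stand in for the full dictionaries and the Kaggle data.

```python
import re
from collections import Counter

# Hypothetical miniature lexicons; the assignment imports full dictionaries.
POSITIVE_WORDS = {"great", "win", "good", "love"}
NEGATIVE_WORDS = {"bad", "lose", "sad", "terrible"}

def score_tweet(tweet):
    """Return (# positive words) minus (# negative words) in the tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives

def sentiment_label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up tweets, not rows from the Kaggle data.
tweets = [
    "Great rally, we will win!",
    "Terrible deal, very sad",
    "Speaking at noon today",
]
scores = [score_tweet(tweet) for tweet in tweets]
distribution = Counter(sentiment_label(score) for score in scores)

print(scores)
print(distribution)
```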
Q7 Sentiment Analysis of Movie Reviews – Dataset available on Kaggle (Optional)
https://www.kaggle.com/c/sentiment-analysis-on-mov…
The dataset has four columns: PhraseId, SentenceId, Phrase, and Sentiment. The data has 5 sentiment labels: 0 – negative, 1 – somewhat negative, 2 – neutral, 3 – somewhat positive, 4 – positive.
Perform Naïve Bayes Classification using scikit-learn.
Q8 Analyze State of the Union Address (Optional)
https://www.kaggle.com/rtatman/state-of-the-union-…
The State of the Union is an annual address by the President of the United States before a joint session of Congress. In it, the President reviews the previous year and lays out his legislative agenda for the coming year. This dataset contains the full text of the State of the Union address from 1989 (Reagan) to 2017 (Trump).
a. Topic modelling: Which topics have become more popular over time? Which have become less popular?
b. Sentiment analysis: Are there differences in tone between different Presidents? Presidents from different parties?
HW11.docx
Q2 Restaurant Reviews.zip
Q1 NLP Basics.zip