We will work with the “yelp100.csv” dataset, which consists of 100,000 Yelp
reviews. Each review is one observation in the dataset. The following items are
recorded:
• stars: the rating (stars) given to the restaurant, on a scale from 1 (poor) to 5 (excellent).
• date: date that the review was posted.
• text: text of the review.
• useful, funny, cool: the number of other Yelp users that classified the review as “useful,”
“funny,” or “cool.”
1. (4pts) For simplicity, in this question only, work with 10,000 observations from the dataset rather
than the whole dataset. Try to predict the rating (stars) of the review from all other information
present in the dataset. Between SVM and random forest, recommend one predictor, and explain
why you chose it.
Part of the challenge for this question is that the dataset is too large for the computer to
process in a reasonable amount of time, so you will need to make some simplifications. Make sure
to explain your process carefully and thoroughly. Another tip is to first run your code on a small
portion of the data to make sure that it works before running it on the whole dataset, which may
take hours depending on the speed of your computer.
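One possible way to set up the comparison in question 1 is sketched below. It is only an illustration under several assumptions that are not part of the assignment: it uses Python with scikit-learn, it represents each review by TF-IDF weights over a capped vocabulary (one way of making the "simplification" mentioned above), it uses a linear SVM rather than a kernel SVM for speed, and it scores both models by cross-validated accuracy. The helper names `subsample` and `compare_models` are hypothetical. The sketch uses the review text only; the date and the useful, funny and cool counts would have to be appended as extra feature columns to use all information in the dataset.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def subsample(texts, stars, n, seed=0):
    """Draw n reviews at random (hypothetical helper, e.g. n = 10000)."""
    idx = random.Random(seed).sample(range(len(texts)), n)
    return [texts[i] for i in idx], [stars[i] for i in idx]


def compare_models(texts, stars, max_features=2000, folds=3, seed=0):
    """Mean cross-validated accuracy of a linear SVM and a random forest.

    max_features caps the vocabulary so the feature matrix stays small;
    this cap is an assumed simplification, not a prescribed value.
    """
    models = {
        "svm": LinearSVC(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100,
                                                random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        # The vectorizer is refit inside each fold, so no test text leaks
        # into the vocabulary or the IDF weights.
        pipeline = make_pipeline(TfidfVectorizer(max_features=max_features),
                                 model)
        scores[name] = cross_val_score(pipeline, texts, stars, cv=folds).mean()
    return scores
```

Calling `compare_models(*subsample(texts, stars, 500))` first, and only then raising the sample size to 10,000, follows the tip above about testing on a small portion of the data. Accuracy is only one criterion; training time and interpretability may also matter for the recommendation.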
2. (6pts) Create your own lexicon of positive and negative words used in Yelp reviews. Specifically,
which words are associated with positive ratings? Which words are associated with negative
ratings?
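A minimal sketch of one way to approach question 2 follows. It rests on assumptions that the assignment does not specify: reviews with 4 or 5 stars are treated as positive and reviews with 1 or 2 stars as negative (3-star reviews are ignored), each word is counted at most once per review, and words are ranked by a smoothed log ratio of how often they appear in positive versus negative reviews. The function names and the thresholds `min_count` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def build_lexicon(texts, stars, min_count=2, top_k=10):
    """Return (positive_words, negative_words) ranked by log ratio."""
    pos_counts, neg_counts = Counter(), Counter()
    n_pos = n_neg = 0
    for text, rating in zip(texts, stars):
        if rating >= 4:        # assumed definition of a positive review
            pos_counts.update(tokenize(text))
            n_pos += 1
        elif rating <= 2:      # assumed definition of a negative review
            neg_counts.update(tokenize(text))
            n_neg += 1

    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        # Skip rare words, whose ratios are too noisy to trust.
        if pos_counts[word] + neg_counts[word] < min_count:
            continue
        # Add-one smoothing avoids division by zero and log(0).
        p_pos = (pos_counts[word] + 1) / (n_pos + 2)
        p_neg = (neg_counts[word] + 1) / (n_neg + 2)
        scores[word] = math.log(p_pos / p_neg)

    ranked = sorted(scores, key=scores.get, reverse=True)
    positive = [w for w in ranked if scores[w] > 0][:top_k]
    negative = [w for w in reversed(ranked) if scores[w] < 0][:top_k]
    return positive, negative
```

Words that appear equally often in both groups, such as "the", receive a score near zero and end up in neither list. On the real data, a larger `min_count` would likely be needed to keep misspellings and one-off words out of the lexicon.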

