1) Answer the following to the best of your ability: (10 points)
a) Define Corpus:
b) How might you build a corpus for the following problem: "I want to be able to learn the characteristics of a politician’s language"?
2) a) Briefly describe 4 difficulties with identifying word boundaries algorithmically. (8 points)
i)
ii)
iii)
iv)
b) What are the possible differences between the following two implementations of a word identifier? (5 points)
tokens = nltk.word_tokenize(sentence)
tokens = sentence.split(" ")
c) Why do we use the term ‘tokens’ instead of ‘words’? (5 points)
3) Given the sentence “The Cat in the Hat”: (12 points)
a) List the Uni-grams
b) List the Bi-grams
c) List the Tri-grams
4) Answer the following about predictive models: (10 points)
a) What is a backoff model?
b) Give an example of how backoff may help your model.
5) Why do we need sent_tokenize_list = sent_tokenize(text) in NLTK instead of just breaking sentences apart by punctuation? (5 points)
6) Briefly explain Transformation-Based Tagging and how it differs from N-gram tagging for Part-of-Speech tagging. (8 points)
7) Answer the following: (12 points)
a) What is a False Negative?
b) What is a True Positive?
c) When should Accuracy be used as a metric?
d) What is the difference between Precision and Recall? When would you use them?
8) Fill in the 3 empty boxes for a typical machine learning cycle: (9 points)
9) What are two differences between testing on your training data and testing on your test data? (4 points)
10) Explain (or draw) k-fold validation when k = 5. (6 points)
11) Show 3 examples where a Named Entity System may get confused by ambiguity. (6 points)

