Natural Language Processing

Daniel Kevins

Q1: Define a tokenize function

which does the following in sequence:

takes a string as an input

converts the string into lowercase

segments the lowercased string into tokens. A token is defined as follows:

Each token has at least two characters.

The first/last character can only be a letter (i.e. a-z) or a number (0-9)

In the middle, there are 0 or more characters, which can only be letters (a-z), numbers (0-9), hyphens (“-”), underscores (“_”), dots (“.”), or “@” symbols.

lemmatizes all tokens using WordNetLemmatizer

removes stop words from the tokens (use English stop words list from NLTK)

generates a token frequency dictionary, where each unique token is a key and the frequency of the token is the value (hint: you can use nltk.FreqDist to create it)

returns the token frequency dictionary as the output
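The token rule and the frequency count can be sketched with the standard library alone. This is a minimal sketch, not a reference solution: the `lemmatize` and `stop_words` parameters are assumptions added so the example runs without NLTK data. In the assignment itself you would pass `WordNetLemmatizer().lemmatize` and `set(stopwords.words("english"))` from NLTK, and could build the result with `nltk.FreqDist` instead of `Counter`.

```python
import re
from collections import Counter

# At least two characters: a letter or digit first and last, with letters,
# digits, "-", "_", "." or "@" allowed in between.
TOKEN_PATTERN = re.compile(r"[a-z0-9][a-z0-9_.@-]*[a-z0-9]")


def tokenize(text, lemmatize=None, stop_words=frozenset()):
    """Return a {token: frequency} dictionary for one string.

    `lemmatize` and `stop_words` are hypothetical hooks for this sketch;
    plug in the NLTK lemmatizer and stop word list for the real task.
    """
    tokens = TOKEN_PATTERN.findall(text.lower())
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    # Stop words are removed after lemmatization, as the steps require.
    tokens = [t for t in tokens if t not in stop_words]
    return dict(Counter(tokens))


freq = tokenize(
    "E-mail me at John_Doe@example.com. A b-2 test, a TEST!",
    stop_words={"me", "at"},
)
print(freq)
```

Single-character words such as “a” never match, because the pattern requires a distinct first and last character.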

Q2: Find duplicate questions by similarity

The data file ‘qa.txt’ is provided for this question. The dataset has two columns, question and answer, as shown in the screenshot below. Here we use only the “question” column.

Define a function find_similar_doc as follows:

takes two inputs: a list of documents as strings (i.e. docs), and the index of a selected document as an integer (i.e. doc_id).

uses the “tokenize” function defined in Q1 to tokenize each document

generates a tf_idf matrix from the tokens (hint: refer to the tf_idf function defined in Section 7.5 of the lecture notes)

calculates the pairwise cosine distance between documents using the tf_idf matrix (cosine similarity is 1 minus cosine distance)

for the selected document, finds the index of the most similar document (but not the selected document itself!) by the cosine similarity score

returns the index of the most similar document and the similarity score
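The steps above might look like the following dependency-free sketch. Several parts are assumptions rather than the course's method: `simple_tokenize` is a hypothetical stand-in for the Q1 function (regex only, no lemmatization or stop word removal), and the tf-idf weighting is a common smoothed variant, not necessarily the one in Section 7.5 of the lecture notes. Rows are scaled to unit length so that a dot product gives the cosine similarity directly.

```python
import math
import re
from collections import Counter


def simple_tokenize(doc):
    # Hypothetical stand-in for the Q1 tokenize function (regex only).
    return Counter(re.findall(r"[a-z0-9][a-z0-9_.@-]*[a-z0-9]", doc.lower()))


def tf_idf(freqs):
    # Assumed weighting: count * smoothed idf, then L2-normalize each row.
    n = len(freqs)
    df = Counter(t for f in freqs for t in f)  # document frequency per token
    rows = []
    for f in freqs:
        row = {t: c * (math.log((n + 1) / (df[t] + 1)) + 1) for t, c in f.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows


def cosine_similarity(u, v):
    # Rows are unit length, so the dot product equals the cosine similarity.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def find_similar_doc(docs, doc_id):
    rows = tf_idf([simple_tokenize(d) for d in docs])
    best_id, best_sim = None, -1.0
    for i, row in enumerate(rows):
        if i == doc_id:
            continue  # never return the selected document itself
        sim = cosine_similarity(rows[doc_id], row)
        if sim > best_sim:
            best_id, best_sim = i, sim
    return best_id, best_sim


docs = [
    "How do I reset my password?",
    "What is the refund policy?",
    "How can I reset my account password?",
]
best_id, best_sim = find_similar_doc(docs, 0)
print(best_id, round(best_sim, 3))
```

With the real data you would read the question column of ‘qa.txt’ into `docs` and call the function with doc_id = 15 and doc_id = 51.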

Test your function with two selected questions, 15 and 51, i.e., doc_id = 15 and doc_id = 51.

Check the most similar questions discovered for each of them

Do you think this function can successfully find duplicate questions? Why does it work or not work? Write down your analysis in a document and upload it to Canvas along with your code.

Q3: Retrieve relevant answers to questions by similarity

Each row in “qa.txt” defines a question and its corresponding answer. Now assume we do not know answers to these questions. Let’s design an algorithm to retrieve the most relevant answer to each question.

1. Define another function match_question_answer as follows:

takes two inputs: a list of questions as strings (i.e. questions), and a list of answers as strings (i.e. answers).

uses the “tokenize” function defined in Q1 to tokenize each document

generates a tf_idf matrix from the tokens (hint: refer to the tf_idf function defined in Section 7.5 of the lecture notes)

calculates the cosine distance between every question and every answer using the tf_idf matrix (hint: you can use the scipy.spatial.distance.cdist function)

for each question q, identifies the answer which is the most similar to q as the most relevant answer (denoted as a)

returns a list of tuples each with 3 elements, (index of q, index of a, similarity score), for every question q in the dataset.
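One way to read these steps is sketched below in plain Python. The helper names `token_counts` and `unit_tf_idf` are hypothetical, the tokenizer is a regex-only stand-in for the Q1 function, and the smoothed tf-idf weighting is an assumption. The design choice worth noting is that questions and answers are weighted together, so both sides share one vocabulary and one set of idf values. The sketch computes similarity directly; with `scipy.spatial.distance.cdist(..., metric="cosine")` you would get distances and take 1 minus each value.

```python
import math
import re
from collections import Counter


def token_counts(text):
    # Hypothetical stand-in for the Q1 tokenize function (regex only).
    return Counter(re.findall(r"[a-z0-9][a-z0-9_.@-]*[a-z0-9]", text.lower()))


def unit_tf_idf(count_dicts):
    # Assumed weighting: count * smoothed idf, rows scaled to unit length.
    n = len(count_dicts)
    df = Counter(t for c in count_dicts for t in c)
    rows = []
    for c in count_dicts:
        row = {t: k * (math.log((n + 1) / (df[t] + 1)) + 1) for t, k in c.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows


def match_question_answer(questions, answers):
    # Weight questions and answers together so they share one vocabulary.
    rows = unit_tf_idf([token_counts(d) for d in questions + answers])
    q_rows, a_rows = rows[: len(questions)], rows[len(questions):]
    result = []
    for qi, q in enumerate(q_rows):
        # Dot product of unit-length rows is the cosine similarity.
        sims = [sum(w * a.get(t, 0.0) for t, w in q.items()) for a in a_rows]
        ai = max(range(len(sims)), key=sims.__getitem__)
        result.append((qi, ai, sims[ai]))
    return result


questions = ["How do I reset my password?", "What is your refund policy?"]
answers = [
    "Our refund policy allows returns within 30 days.",
    "Click the reset link to choose a new password.",
]
pairs = match_question_answer(questions, answers)
print(pairs)
```

The toy answers are deliberately listed in the opposite order from the questions, so a correct match here pairs question 0 with answer 1 and question 1 with answer 0.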

2. Define a function evaluate to evaluate the performance of retrieval as follows:

takes the returned list from match_question_answer function as an input

sets a minimum similarity threshold (denoted as min_sim), and selects entries from the list with similarity >= the threshold (denoted as matching_pairs)

calculates two metrics for the selected matching_pairs:

recall: the percentage of questions with matching answers, i.e.

len(matching_pairs)/len(questions)

precision: the percentage of questions in matching_pairs that are indeed matched with their corresponding answers as indicated in the dataset.

Vary the similarity threshold from 0 to 0.6 in steps of 0.05, calculate the recall and precision in each round, and plot a chart with two lines, with recall and precision on the Y axis and the threshold on the X axis.
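The two metrics and the threshold sweep can be sketched as follows. This is an illustration under stated assumptions: a pair counts as correct when the question index equals the answer index (each row of the dataset holds a question and its own answer), the list contains one tuple per question so its length stands in for len(questions), precision is reported as 0.0 when no pair passes the threshold, and `sweep` is a hypothetical helper name. The similarity scores in the example are made up for the demonstration, and plotting is left out.

```python
def evaluate(pairs, min_sim):
    """Return (recall, precision) for pairs of (q index, a index, similarity)."""
    matching_pairs = [p for p in pairs if p[2] >= min_sim]
    # One tuple per question, so len(pairs) equals the number of questions.
    recall = len(matching_pairs) / len(pairs) if pairs else 0.0
    # Assumption: answer i is the correct answer for question i.
    correct = sum(1 for q, a, _ in matching_pairs if q == a)
    # Assumption: precision is 0.0 when nothing passes the threshold.
    precision = correct / len(matching_pairs) if matching_pairs else 0.0
    return recall, precision


def sweep(pairs, start=0.0, stop=0.6, step=0.05):
    """Evaluate at start, start + step, ..., stop (inclusive)."""
    rounds = round((stop - start) / step)
    curve = []
    for i in range(rounds + 1):
        # Rounding avoids thresholds such as 0.15000000000000002.
        threshold = round(start + i * step, 2)
        recall, precision = evaluate(pairs, threshold)
        curve.append((threshold, recall, precision))
    return curve


# Made-up scores for illustration only.
pairs = [(0, 0, 0.50), (1, 1, 0.30), (2, 0, 0.10), (3, 3, 0.02)]
curve = sweep(pairs)
for threshold, recall, precision in curve:
    print(threshold, round(recall, 2), round(precision, 2))
```

The three columns of `curve` can be passed to a plotting library, with the threshold on the X axis and one line each for recall and precision.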

As the threshold increases, how do precision and recall change? What would be a good similarity threshold for retrieving the most relevant answers to these questions? Write down your analysis in a document and upload it to Canvas along with your code.
