Python program about LDA Parameter Tuning


Q1: LDA Parameter Tuning

Define a function tune_lda() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json

Fits LDA models (from the gensim package) using documents from train_file with different parameter values:

Number of topics (K) from 2 to 6

Topic distribution prior (i.e., α): ‘symmetric’ (a fixed, equal prior for every topic), ‘asymmetric’ (a fixed prior that decreases with the topic index), and ‘auto’ (the prior is learned from the corpus)

With all parameter combinations (5 values of K × 3 values of α), you’ll train 15 LDA models in total. When fitting each model, set the maximum number of iterations to 40 so that the model has a chance to converge. Note that training all models may take a few minutes. A rough sketch of this training loop is shown below.
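As a rough illustration (not the required solution), the snippets that follow sketch how the body of tune_lda(train_file, test_file) might look with gensim. This first part assumes text_train.json is a JSON array of raw document strings; the load_docs() helper and the whitespace tokenization are placeholders for whatever preprocessing your data actually needs.

import json
from gensim import corpora
from gensim.models import LdaModel

def load_docs(path):
    # Assumption: the file holds a JSON array of raw document strings.
    with open(path) as f:
        return [doc.split() for doc in json.load(f)]

train_docs = load_docs(train_file)
dictionary = corpora.Dictionary(train_docs)
train_corpus = [dictionary.doc2bow(doc) for doc in train_docs]

models = {}
for alpha in ('symmetric', 'asymmetric', 'auto'):
    for k in range(2, 7):                 # K = 2, 3, 4, 5, 6
        models[(alpha, k)] = LdaModel(
            corpus=train_corpus,
            id2word=dictionary,
            num_topics=k,
            alpha=alpha,
            iterations=40,                # maximum number of iterations
            random_state=0,               # optional, for reproducibility
        )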

For each model, calculate topic coherence using the ‘u_mass’ formula. The details of coherence can be found at https://radimrehurek.com/gensim/models/coherencemo… Read the paper referenced there to make sure you understand the meaning of topic coherence. Note that ‘c_v’, rather than ‘u_mass’, is the generally recommended coherence measure; for simplicity, use ‘u_mass’ here, but if you can figure out how to use ‘c_v’, that’s even better.
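Continuing the sketch, the ‘u_mass’ coherence of each fitted model can be computed with gensim’s CoherenceModel, which for ‘u_mass’ only needs a bag-of-words corpus. Scoring the models on the held-out documents from test_file is one reasonable reading of the spec; scoring on the training corpus would also be defensible.

from gensim.models import CoherenceModel

test_docs = load_docs(test_file)          # same placeholder loader as above
test_corpus = [dictionary.doc2bow(doc) for doc in test_docs]

coherence = {}
for (alpha, k), lda in models.items():
    cm = CoherenceModel(model=lda, corpus=test_corpus,
                        dictionary=dictionary, coherence='u_mass')
    coherence[(alpha, k)] = cm.get_coherence()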

Create a plot showing how topic coherence changes as K increases under different α values (i.e., one line per α value).
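A minimal matplotlib version of that plot, using the coherence dictionary from the previous sketch, might look like this:

import matplotlib.pyplot as plt

ks = list(range(2, 7))
for alpha in ('symmetric', 'asymmetric', 'auto'):
    plt.plot(ks, [coherence[(alpha, k)] for k in ks],
             marker='o', label='alpha = ' + alpha)
plt.xlabel('Number of topics K')
plt.ylabel('u_mass topic coherence')
plt.legend()
plt.show()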

Based on the plot, determine the best K and α values in terms of topic coherence.

This function does not return a value. Write a document that reports:

the best parameter combination in terms of topic coherence

whether you think topic coherence is a good metric for choosing K
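If you also want to read the best combination off the results programmatically (u_mass values are typically negative, and values closer to zero indicate better coherence), a final step inside tune_lda() and an example call might look like:

# Report the best setting found (higher u_mass coherence is better).
best_alpha, best_k = max(coherence, key=coherence.get)
print('Best combination: alpha=%s, K=%d (u_mass coherence = %.3f)'
      % (best_alpha, best_k, coherence[(best_alpha, best_k)]))

# Example call with the file paths from the assignment:
tune_lda('text_train.json', 'text_test.json')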
