Using the data set here: https://www.kaggle.com/tobycrabtree/nfl-scores-and-betting-data, complete the following tasks in either R or Python.
Our experimental goal:
Feature Selection:
1) Run randomForest (from the randomForest package), then run varImp and varImpPlot.
1.1) Using the top 10 features from varImp, fit a classification model: glm (family = binomial), vglm (family = multinomial), or multinom from the nnet package.
1.2) Using any classification performance metric, compute the performance with those top 10 features. You may use forward or backward selection. These two concepts have not been covered in class; I expect you to read up on them.
2) Tabulate the null deviance and residual deviance from the GLM, and use the set of features that reduces the residual deviance the most.
3) Compare the best set of features as determined in step 1.1 with those determined by residual deviance.
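A minimal sketch of steps 1 through 2 in Python (the assignment allows R or Python; scikit-learn's feature_importances_ plays the role of varImp/varImpPlot, and LogisticRegression stands in for glm with family = binomial). The Kaggle NFL data is not bundled here, so a synthetic stand-in from make_classification is used; the deviances are computed by hand from the log-loss, matching what R's glm would report.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# Hypothetical stand-in for the NFL betting features (the real data would be
# loaded from the Kaggle CSV instead).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)
cols = [f"f{i}" for i in range(20)]
X = pd.DataFrame(X, columns=cols)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: random-forest variable importance (analogue of varImp/varImpPlot).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
imp = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
top10 = imp.index[:10].tolist()

# Step 1.1: logistic regression on the top 10 features.
glm = LogisticRegression(max_iter=1000).fit(X_tr[top10], y_tr)

# Step 1.2: performance of the top-10 model on held-out data.
acc = accuracy_score(y_te, glm.predict(X_te[top10]))

# Step 2: null vs residual deviance. Residual deviance = 2 * summed log-loss of
# the fitted model; null deviance uses the intercept-only prediction (class rate).
p_hat = glm.predict_proba(X_tr[top10])[:, 1]
resid_dev = 2 * log_loss(y_tr, p_hat, normalize=False)
p_null = np.full(len(y_tr), y_tr.mean())
null_dev = 2 * log_loss(y_tr, p_null, normalize=False)

print("top 10 features:", top10)
print(f"test accuracy: {acc:.3f}")
print(f"null deviance: {null_dev:.1f}, residual deviance: {resid_dev:.1f}")
```

Competing feature sets (e.g. from forward/backward selection) can be compared by refitting and tabulating the residual deviance for each, as step 2 asks.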
————— Performance Metrics —————
For the three algorithms (logistic regression, naiveBayes, kNN):
1) For each classifier, estimate AUC, accuracy, precision, sensitivity, specificity, and recall over the training data during the learning phase and over the test data during the generalization phase.
2) Construct a confusion matrix and plot the ROC curve for both the learning phase and the generalization phase.
3) Estimate the variance using 30%, 60%, and 100% of the data.
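A hedged sketch of the metrics section for one classifier (logistic regression stands in; naiveBayes and kNN plug into report() the same way), again on a synthetic stand-in for the Kaggle data. It computes the requested metrics on both the training split (learning phase) and the test split (generalization phase), extracts the confusion matrix and ROC points, and estimates metric variance over repeated 30%/60%/100% subsamples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                             recall_score, confusion_matrix, roc_curve)

# Hypothetical stand-in for the NFL data.
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

def report(Xs, ys):
    """AUC, accuracy, precision, sensitivity (= recall), specificity, confusion matrix."""
    pred, score = clf.predict(Xs), clf.predict_proba(Xs)[:, 1]
    tn, fp, fn, tp = confusion_matrix(ys, pred).ravel()
    return {"auc": roc_auc_score(ys, score),
            "accuracy": accuracy_score(ys, pred),
            "precision": precision_score(ys, pred),
            "sensitivity": recall_score(ys, pred),  # recall and sensitivity coincide
            "specificity": tn / (tn + fp),
            "confusion": (tn, fp, fn, tp)}

train_m, test_m = report(X_tr, y_tr), report(X_te, y_te)
print("learning phase:      ", train_m)
print("generalization phase:", test_m)

# ROC points for plotting (e.g. with matplotlib's plot(fpr, tpr)).
fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])

# Step 3: variance of a metric vs. the fraction of training data used.
# At 100% every draw is the full set, so the variance collapses to 0.
rng = np.random.default_rng(1)
for frac in (0.3, 0.6, 1.0):
    accs = []
    for _ in range(20):  # repeated subsampling without replacement
        idx = rng.choice(len(X_tr), int(round(frac * len(X_tr))), replace=False)
        m = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        accs.append(accuracy_score(y_te, m.predict(X_te)))
    print(f"{int(frac * 100):3d}% of data: accuracy variance = {np.var(accs):.5f}")
```

For naiveBayes and kNN, swap in sklearn.naive_bayes.GaussianNB and sklearn.neighbors.KNeighborsClassifier and rerun the same report.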

