Directions
Complete all parts and questions below. Some starter code is supplied, along with hints throughout the instructions. If you cannot complete #12, the remainder of the assignment can still be completed without the LASSO-based model for partial credit.
Part Zero
- Download hw6_starter.Rmd and hw6_spotify.RData and save them in the same folder.
- Knit hw6_starter.Rmd; note that the first knit will take a few minutes (the results are cached for later knits).
- Edit hw6_starter.Rmd to answer the following questions…
Part 1 – One-Factor Inference (8pts)
1. Build side-by-side boxplots with overlaid means comparing the tempo of songs across the recorded key signatures (key_mode in the data); a code sketch follows this part’s questions. Make sure the plot is properly labeled and titled. Does the distribution of tempo appear different for these key signatures? (3pts)
2. Perform a one-way ANOVA test to compare the mean tempo as a function of key_mode. Make sure to properly state the outcome of the results; no need to check assumptions or perform follow-up multiple comparisons. (2pts)
3. Does the ANOVA result agree with your analysis of the boxplots? Discuss. (1pt)
4. Read sections 1 and 2 of the journal article “The p-value you can’t buy” (pdf). Based on that article and the analysis in #2 and #3 above, describe/discuss the limitations of using an ANOVA in this setting. (2pts)
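The sketch below shows one possible approach to #1 and #2. It is only a sketch: it assumes ggplot2 is loaded by the starter file and that tempo and key_mode are named as described above; the titles and labels are placeholders you should adapt.

```r
library(ggplot2)

# Side-by-side boxplots of tempo by key_mode, with each group's mean overlaid
ggplot(music_for_training, aes(x = key_mode, y = tempo)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  labs(title = "Tempo by Key Signature",
       x = "Key signature (key_mode)",
       y = "Tempo (beats per minute)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# One-way ANOVA of mean tempo as a function of key_mode
tempo_aov <- aov(tempo ~ key_mode, data = music_for_training)
summary(tempo_aov)
```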
Part 2 – EDA of Popularity (8pts)
5. Construct and describe a histogram (with binwidth=1) of the popularity scores for all observations in music_for_training; a sketch follows this part’s questions. (3pts)
6. Calculate the mean, standard deviation, and 5-number summary for the popularity scores in music_for_training and display them in a well-constructed table. (3pts)
7. Describe the overall shape and behavior of the popularity scores and what implications this may have for linear regression modeling. (2pts)
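A minimal sketch for #5 and #6, assuming ggplot2, dplyr, and knitr are available; the summary column names here are my own choices.

```r
library(ggplot2)
library(dplyr)
library(knitr)

# Histogram of popularity with one-unit bins
ggplot(music_for_training, aes(x = popularity)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of Popularity Scores",
       x = "Popularity score", y = "Count")

# Mean, standard deviation, and five-number summary in one table
music_for_training %>%
  summarise(Mean   = mean(popularity),
            SD     = sd(popularity),
            Min    = min(popularity),
            Q1     = quantile(popularity, 0.25),
            Median = median(popularity),
            Q3     = quantile(popularity, 0.75),
            Max    = max(popularity)) %>%
  kable(digits = 2, caption = "Popularity scores in music_for_training")
```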
Part 3 – Model Fitting and Assessment (14pts)
8. The provided Markdown file includes a model pre-built using the music_for_training data (full_model); do not edit it. Look at the residual plots provided for the full_model and describe any concerns you may have, then discuss possible transformations that may be helpful (note: you do NOT need to build a Box-Cox plot). (2pts)
9. Fit a modified full_model where the response variable has been transformed by adding 1 and taking the cube root, that is, (popularity+1)^(1/3); a sketch follows this part’s questions. No residual analysis is needed. (2pts)
10. Look at the summary output from this model. What do you notice about the overall F-test and the marginal t-tests? Do you think these results are particularly meaningful given the large sample size? Reference Part 1 of this assignment in your discussion. (2pts)
11. Perform backward stepwise variable selection on the model from #9 (the model with a cube-root response). What modifications does it suggest? Is this surprising given the summary output and discussion in #10? (2pts)
12. Your instructors ran a LASSO regression (see Section 9.3 of the textbook for an introduction, if interested) and it suggests the following changes:
- Remove the instrumentalness variable
- Combine the key_mode “G major” and “A major” levels and call the new level “AG major” (hint: use case_when())
- Create a new binary variable indicating whether or not timing_signature is 3 (hint: use ifelse())
Build a new model based on these changes and provide the summary() output; a sketch follows this part’s questions. What do you notice compared to your other models? (4pts)
13. Build a table that reports the adjusted R-squared, AIC, and BIC for the full_model, the transformed-response model from #9, and the LASSO-variable-selection model built in #12. (1pt)
14. You should note that the AIC and BIC values for the transformed models are substantially smaller than those of the full_model. Discuss why it is unfair to compare the AIC and BIC of the models with a transformed response (cube root of the response) to the model without the transformation. Hint: it has to do with the RSS; see Module 09 and Section 6.4.2 of the text. (2pts)
15. Based on the two models with a transformed response, which model appears to be the best fit? Justify with a brief discussion. (2pts)
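For #9 and #11, one route is to reuse full_model’s right-hand side via update() and then let step() perform backward selection. This is only a sketch; it assumes full_model was fit with lm() in the starter file, and the object names are my own.

```r
# Refit full_model with the cube root of (popularity + 1) as the response;
# the "." on the right-hand side keeps the original predictors unchanged
transformed_model <- update(full_model, I((popularity + 1)^(1/3)) ~ .)
summary(transformed_model)

# Backward stepwise selection (AIC-based by default) on the transformed model
backward_model <- step(transformed_model, direction = "backward", trace = FALSE)
summary(backward_model)
```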
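For #12 and #13, a hedged sketch of the suggested mutations and the comparison table is below. The names music_lasso, time_sig_3, and lasso_model are my own, and the update() call assumes instrumentalness and timing_signature appear as plain main-effect terms in the transformed model from #9.

```r
library(dplyr)
library(knitr)

# Apply the LASSO-suggested changes to the training data
music_lasso <- music_for_training %>%
  mutate(key_mode = case_when(
           key_mode %in% c("G major", "A major") ~ "AG major",
           TRUE ~ as.character(key_mode)),
         time_sig_3 = ifelse(timing_signature == 3, 1, 0))

# Drop instrumentalness, swap timing_signature for the binary indicator, refit
lasso_model <- update(transformed_model,
                      . ~ . - instrumentalness - timing_signature + time_sig_3,
                      data = music_lasso)
summary(lasso_model)

# Adjusted R-squared, AIC, and BIC for the three models (#13)
data.frame(Model  = c("full_model", "transformed_model", "lasso_model"),
           Adj_R2 = c(summary(full_model)$adj.r.squared,
                      summary(transformed_model)$adj.r.squared,
                      summary(lasso_model)$adj.r.squared),
           AIC    = c(AIC(full_model), AIC(transformed_model), AIC(lasso_model)),
           BIC    = c(BIC(full_model), BIC(transformed_model), BIC(lasso_model))) %>%
  kable(digits = 3)
```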
Part 4 – Model Validation and Out-of-Sample Prediction (10pts)
16. Use the 3 models from Part 3 (the full_model, the transformed-response model, and the LASSO-based model you built in #12) to predict the popularity scores for the 10,000 songs in the music_for_testing data. Note: you will need to perform the same mutations on music_for_testing as you did on music_for_training in #12. (3pts)
17. Calculate the root mean squared error (RMSE) for each of the three models (make sure to “un”-transform the response); a sketch follows this part’s questions. Which model appears best at predicting popularity scores? (2pts)
18. Compare/contrast the RMSE values in #17 with the standard deviation calculated in #6 and the residual standard error of the modified full_model fit in #9. What does this imply about the ability of these models to predict popularity scores? (1pt)
19. Use the best model from #17 to predict the popularity scores for the 5,000 songs in the music_to_predict data. Determine which 10 tracks have the highest predicted popularity scores; a sketch follows this part’s questions. (2pts)
20. Using the best model from #17, explore the distribution of your predicted popularity scores. Discuss why the predicted popularity scores behave as they do. Do these predictions appear surprising given the distribution of the observed popularity scores? (2pts)
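A sketch of the out-of-sample predictions and RMSE calculations for #16 and #17. It assumes the model objects sketched above exist; rmse() and music_test_lasso are my own names, and cubing then subtracting 1 undoes the (popularity + 1)^(1/3) transformation.

```r
library(dplyr)

# music_for_testing needs the same mutations used for the LASSO-based model
music_test_lasso <- music_for_testing %>%
  mutate(key_mode = case_when(
           key_mode %in% c("G major", "A major") ~ "AG major",
           TRUE ~ as.character(key_mode)),
         time_sig_3 = ifelse(timing_signature == 3, 1, 0))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# full_model predicts popularity directly
pred_full  <- predict(full_model, newdata = music_for_testing)

# the transformed models predict (popularity + 1)^(1/3), so cube and subtract 1
pred_trans <- predict(transformed_model, newdata = music_for_testing)^3 - 1
pred_lasso <- predict(lasso_model, newdata = music_test_lasso)^3 - 1

c(full_model        = rmse(music_for_testing$popularity, pred_full),
  transformed_model = rmse(music_for_testing$popularity, pred_trans),
  lasso_model       = rmse(music_for_testing$popularity, pred_lasso))
```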
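For #19 and #20, a sketch that assumes the LASSO-based model came out best in #17 (substitute whichever model wins for you); music_predict_lasso is my own name.

```r
library(dplyr)
library(ggplot2)

# Apply the same mutations to the prediction set
music_predict_lasso <- music_to_predict %>%
  mutate(key_mode = case_when(
           key_mode %in% c("G major", "A major") ~ "AG major",
           TRUE ~ as.character(key_mode)),
         time_sig_3 = ifelse(timing_signature == 3, 1, 0))

# Predict on the popularity scale by undoing the cube-root transformation
music_predict_lasso$predicted_popularity <-
  predict(lasso_model, newdata = music_predict_lasso)^3 - 1

# Ten tracks with the highest predicted popularity (#19)
music_predict_lasso %>%
  slice_max(predicted_popularity, n = 10)

# Distribution of the predicted scores (#20)
ggplot(music_predict_lasso, aes(x = predicted_popularity)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Predicted Popularity Scores (music_to_predict)",
       x = "Predicted popularity", y = "Count")
```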

