The main purpose of this assignment is for you to develop a case study and a write-up that tells a compelling business data-analytics story using the PPDAC framework. It should present a data analysis that provides a clearly explained answer to a question to help inform and improve business decision-making. You should therefore choose a question that is (a) decision-relevant; and (b) answerable using available data. Examples of possible case study topics include the following:
·Identify what independent variables an outcome of interest depends on (e.g., a financial impact or cost-effectiveness measure, service quality rating, employee engagement, process metric, or other KPI).
·Test null hypotheses that specified outcome variable(s) are independent of one or more independent variables.
· Quantify how well the values of some variables can be predicted from the values of others (e.g., using the R² of a regression model).
·Compare different predictive models (e.g., Multiple Linear Regression (MLR) using lm() vs. CART using rpart() vs. random forest or other machine-learning (ML) models) to determine which provides the most accurate predictions for an outcome of interest.
· Use Markov models, Poisson regression, or other appropriate probability models to predict how different members of a population receiving nudges (e.g., to enroll in a program, change a behavior, adopt a new product or service, remember to make payments, etc.) will respond over time.
·Use forecasting models to predict diffusion of a new technology or product over time, and to predict how different introduction and marketing strategies may affect adoption rates.
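As one illustration of the model-comparison topic above, the following is a minimal sketch, not a prescribed method. It assumes the built-in mtcars data, mpg as the outcome, wt and hp as predictors, and a 70/30 random split; none of these choices come from the assignment.

```r
# Sketch: compare holdout accuracy of a linear model and a CART tree.
# Assumptions (illustrative only): mtcars data, mpg outcome, wt and hp
# predictors, 70/30 train/test split.
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
n <- nrow(mtcars)
train_rows <- sample(n, size = round(0.7 * n))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

fit_lm   <- lm(mpg ~ wt + hp, data = train)
fit_cart <- rpart(mpg ~ wt + hp, data = train, method = "anova")

# Root-mean-square error on the held-out rows
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

results <- c(
  lm   = rmse(test$mpg, predict(fit_lm, newdata = test)),
  cart = rmse(test$mpg, predict(fit_cart, newdata = test))
)
r_squared <- summary(fit_lm)$r.squared  # in-sample R^2 of the linear model

print(results)
print(r_squared)
```

With a data set this small, a single split is noisy; repeating the split or using cross-validation would give a more stable comparison.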
You should formulate an interesting question that can be addressed through analysis of one or more data sets (the first P step in PPDAC).
·Your problem statement should explain the underlying question or issue and why it matters (what is at stake, what decisions need to be made that the data analysis will inform).
·State how the question will be addressed by testing one or more null hypotheses or by estimating one or more quantities from data (covered in planning, the second P in the PPDAC cycle).
o Common null hypotheses to test are that
  - 2 quantities are independent (e.g., that purchase probability does not depend on customer age; or that sales volume does not depend on advertising expenditure);
  - 2 or more quantities have the same distribution (e.g., that men and women have the same admission probability to a program); or
  - The empirical distribution of a quantity matches a specified theoretical distribution or has specified properties (e.g., that sales volumes are normally distributed, or that the mean change in customer satisfaction from before to after an information campaign is 0).
  - If the null hypothesis of independence is rejected, then it remains to describe how one variable depends on others.
o Common quantities to estimate are regression slope coefficients (in SLR or MLR), differences in means across 2 groups, changes in means for individuals exposed to some intervention, conditional probabilities or conditional expected values of a response variable given individual attributes (as in regression, CART tree, and random forest analyses), and linear or ordinal correlations between 2 variables.
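The null hypotheses and quantities listed above might be examined along the following lines. This is a hedged sketch using base R only; the mtcars data and the specific variables (am, vs, mpg, wt) are assumptions chosen for illustration.

```r
# Sketch: one test of independence, one difference in means, one slope.
# Assumed data: built-in mtcars (am: 0 = automatic, 1 = manual).

# Null hypothesis: two categorical variables are independent.
tab <- table(transmission = mtcars$am, engine = mtcars$vs)
indep_test <- fisher.test(tab)  # exact test, suitable for small counts

# Null hypothesis: two groups have the same mean (Welch two-sample t-test).
# The confidence interval is for mean(automatic) - mean(manual).
means_test <- t.test(mpg ~ am, data = mtcars)

# Quantity to estimate: a regression slope with its 95% confidence interval.
slope_fit <- lm(mpg ~ wt, data = mtcars)
slope_ci  <- confint(slope_fit)["wt", ]

print(indep_test$p.value)
print(means_test$conf.int)
print(slope_ci)
```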
This assignment should focus on questions that can be addressed using readily available datasets, such as the business-related ones we examine in class, using analytics techniques that we have studied or that you research for the project. You may choose any of the techniques we have studied (or others), but your Analysis section (A step in PPDAC) should explain why you chose the analysis methods you did, what assumptions the hypothesis-testing and estimation methods you applied make, how (and whether) those assumptions were tested and validated with data, and the results of the validation tests.
·For a selection of data sets available to analyze, use library(MASS), library(car), library(forecast), and library(diffusion) to load these packages (after installing them with install.packages(), e.g., install.packages("car"), if needed). Then use data() to view the datasets available to you in all loaded packages, as well as the built-in datasets in R.
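A short sketch of browsing and loading package datasets follows. It uses only MASS, which ships with standard R installations; the choice of the Boston dataset is an assumption made for illustration.

```r
# Sketch: list the datasets in a package and load one of them.
library(MASS)

# For packages that are not installed yet, a guarded install could look like:
# if (!requireNamespace("car", quietly = TRUE)) install.packages("car")

# data(package = ...) returns a catalogue; $results is a character matrix.
mass_sets <- data(package = "MASS")$results[, c("Item", "Title")]
print(head(mass_sets))

# Load one dataset by name and inspect its variables and types.
data("Boston", package = "MASS")
str(Boston)
```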
The main goal is to demonstrate an application of the PPDAC cycle in which an interesting decision-relevant question is answered using data. Your write-up should clearly describe the following:
1.Problem: The question or problem addressed. (Why does it matter? Why do we care? How do business decisions or outcomes depend on the answers? What null hypothesis or hypotheses did you test? What quantities did you estimate? Why?)
2.Plan: Your plan for addressing the question or problem with data (e.g., what variables are relevant and available? Are other relevant variables missing? Are the variables you have sufficient to answer the question, perhaps with the help of assumptions?)
3.Data: Data characteristics, e.g., the number and types of variables used, their meanings and relationships to each other and to the problem being addressed, the number of observations or cases examined, the sampling plan used to collect the data (if any), treatment of missing data (if any), and limitations of data collection and coding. (You may discuss multicollinearity and variable selection and rescaling or recoding of variables, if any, in either the Data section or the Analyses section. These steps may not be needed, but please discuss them if you use them.)
4.Analyses: e.g., selection of variables to focus on and models to describe the relationships among them; probability model parameter estimation, confidence intervals, and hypothesis testing with bootstrap, permutation test, or parametric modeling methods; exploration and visualization using summary statistics, distributions, box plots, scatter plots, CART trees, random forest and other machine learning (ML) partial dependence plots (PDPs), ICE plots and clusters, or time series; estimation and hypothesis testing using confidence intervals; prediction using probability distributions, simple and multiple linear regression models, or non-parametric alternatives such as PDPs. Analyses sections may show results of both exploratory analyses (e.g., importance plots, CART trees) and confirmatory analyses (e.g., results of quantitative hypothesis tests or estimates of measures of association).
5.Conclusions: The answer to the question(s) in the problem statement should be clearly presented, along with any caveats about assumptions and limitations of the analysis, any limitations of the data, and what additional information might be useful (if any) if the answer is not definitive based on the available data and the analyses you have done.
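The bootstrap and permutation methods mentioned under Analyses can be sketched in base R as follows. This is one simple variant (a percentile bootstrap interval and a two-sided label-shuffling test), and the data, grouping variable, and 5000 resamples are assumptions for illustration.

```r
# Sketch: bootstrap confidence interval and permutation test for a
# difference in means. Assumed data: mtcars, mpg by transmission type.
set.seed(42)
manual    <- mtcars$mpg[mtcars$am == 1]
automatic <- mtcars$mpg[mtcars$am == 0]
observed_diff <- mean(manual) - mean(automatic)

# Bootstrap: resample each group with replacement, recompute the difference.
boot_diffs <- replicate(5000, {
  mean(sample(manual, replace = TRUE)) - mean(sample(automatic, replace = TRUE))
})
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))  # percentile 95% interval

# Permutation test: shuffle group labels to simulate the null hypothesis
# that mpg does not depend on transmission type.
perm_diffs <- replicate(5000, {
  shuffled <- sample(mtcars$am)
  mean(mtcars$mpg[shuffled == 1]) - mean(mtcars$mpg[shuffled == 0])
})
# Two-sided p-value; the +1 terms count the observed labeling itself.
perm_p <- (sum(abs(perm_diffs) >= abs(observed_diff)) + 1) /
  (length(perm_diffs) + 1)

print(observed_diff)
print(boot_ci)
print(perm_p)
```

Neither method assumes normality, which reduces the number of assumptions that must be checked and reported.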
Additional examples of the kinds of business problems that you can use to illustrate the PPDAC cycle for this assignment include the following:
·Estimate how sales (or enrollment, subscription, etc.) volume varies with advertising expenditures in different media, for different subsets of customers (e.g., based on age, sex, and other attributes);
·Estimate how the proportion of satisfied customers depends on different customer attributes and experiences, using survey data. Is the proportion significantly different before vs. after a change in procedures?
·What are the key drivers of customer satisfaction? (For these, you should be able to confidently reject the null hypothesis that customer satisfaction does not depend on their values.) How strongly do they predict customer satisfaction?
·What factors help to predict employee attrition, and how is attrition related to employee performance? (Googling “employee satisfaction data set download” leads to https://www.analyticsinhr.com/blog/hr-data-sets-pe…and other data sets.)
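For the before-vs.-after satisfaction question above, a two-sample test of proportions is one possible approach. The counts below are hypothetical, invented purely for illustration, and the sketch assumes that different customers were surveyed before and after the change (if the same customers were surveyed twice, a paired method such as mcnemar.test() would be more appropriate).

```r
# Sketch: compare the proportion of satisfied customers before vs. after
# a change in procedures. All counts are hypothetical.
satisfied <- c(before = 120, after = 150)
surveyed  <- c(before = 200, after = 200)

# Null hypothesis: the two proportions are equal.
prop_result <- prop.test(x = satisfied, n = surveyed)

print(prop_result$estimate)  # sample proportions
print(prop_result$conf.int)  # 95% CI for (before - after)
print(prop_result$p.value)   # report the exact p-value
```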
A suggested template for writing up case studies is as follows:
1.INTRODUCTION (P and P)
- What question(s) did you address? (Why?)
- What data are needed to address this question?
- What data are available?
- How are they used to answer the question?
- What null hypothesis (or hypotheses) did you test?
- What quantities did you estimate?
2.DATA (D)
- Describe the data obtained (variables and their types)
- Number of variables, types of measurement scales
- Study design (if any), sampling plan (if any), data sources
- Limitations of available data (e.g., missing variables and data)
- Limitations of study design (e.g., non-random sampling, sample selection biases, variables not collected), and data collection (e.g., variable coding)
- Treatment of missing data
- If there are missing data values, what did you do about them (and why)? Did you drop them (e.g., using df <- na.omit(df), or using na.rm = TRUE in your functions), or impute them?
- Did you test the null hypothesis that values are missing completely at random (MCAR)? (https://cran.r-project.org/web/packages/finalfit/v…).
- Variable selection
- Did you select a subset of variables to analyze? If so, why and how (e.g., based on importance rankings, conditional independence tests (CART Trees), automated variable-selection algorithms, tests for multicollinearity (VIFs), or something else)?
- Conversely, did you drop some variables? If so, why?
- Pre-processing and feature engineering
- Did you rescale or recode the variables?
- Did you combine variables, use Principal Components Analysis, or otherwise change the original variables and data?
3.ANALYSIS AND RESULTS (A)
- Explore the data. Provide summary statistics and graphs, visualize with scatter plots etc., perform regression modeling and other analyses
- Test the null hypotheses specified in your Introduction. What are the p-values of the tests? What assumptions do the tests make, and were they tested?
- Estimate the quantities specified in your introduction.
- Report results of any model diagnostics and validation tests for the analyses you have done (e.g., if you used an SLR or MLR model, do Q-Q plots and other diagnostics indicate that the linear regression assumptions are not violated?).
4.DISCUSSION, INTERPRETATION, AND CONCLUSIONS (C)
- Communicate the conclusions/results/findings.
- What was the answer to the question?
- What are the limitations of the study?
- Uncertain assumptions, imperfect study design, imperfect data, remaining uncertainties
- Interpretation: Decision recommendations and remaining uncertainties
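The missing-data options named in the DATA part of the template (dropping rows, skipping missing values inside a function, or imputing) can be sketched as follows. The built-in airquality data and the choice of median imputation are assumptions for illustration; whichever option you use, the write-up should say which and why.

```r
# Sketch: three common treatments of missing values.
# Assumed data: built-in airquality, which has NAs in Ozone and Solar.R.
df <- airquality
na_counts <- colSums(is.na(df))  # missing values per variable
print(na_counts)

# Option 1: complete-case analysis (drop every row that has any NA).
df_complete <- na.omit(df)

# Option 2: keep all rows but skip NAs inside a summary function.
mean_ozone <- mean(df$Ozone, na.rm = TRUE)

# Option 3: simple median imputation (one of many imputation methods).
df_imputed <- df
df_imputed$Ozone[is.na(df_imputed$Ozone)] <- median(df$Ozone, na.rm = TRUE)

print(c(original = nrow(df), complete_cases = nrow(df_complete)))
```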
Grading
Grading of this assignment is based primarily on the clarity with which you explain the following:
·The problem or question you addressed and the null hypotheses tested and quantities estimated;
·How you answered it using analyses of relevant data (plus modeling assumptions if needed, e.g., if you used linear regression);
·What your conclusions were;
·How confident you are in your conclusions (and why); and
·Any remaining uncertainties and limitations.
Your write-up need not be long (e.g., it might be 5-10 pages if you follow this template, but there is no length requirement), but it must be clear about these points.
The rubric for grading includes the following:
·Problem statement: Were the null hypotheses tested and/or the quantities estimated clearly stated? Was the underlying business question (or scientific question, technology question, behavior question, or other applied research question) being addressed identified? Were the business (or other) motivations for addressing it explained, so that the case study is well motivated?
·Data and study design: Were appropriate data selected for addressing the question(s) in the problem statement? Were the data sources, variables and dataset(s) used clearly described? Does the discussion make it easy for a reader to download the same data and verify the analyses and results? Were limitations of the available data (e.g., omitted variables, missing data, inaccurate measurements, potential confounders, ambiguous coding of data values, etc.) discussed?
·Data exploration and visualization: Were appropriate exploratory plots and visualizations used to seek and display relevant patterns in the data? Were importance plots (in random forest, Boruta(), rpart(), or other packages), correlation networks or visualizations, stepwise variable selection in regression modeling, or other techniques used to help select key variables to include in the analysis?
·Analyses: Were appropriate models and methods used to analyze the data, including appropriate multivariate and non-parametric analyses (e.g., CART trees, randomForest, PDPs, etc.)? Were the assumptions of the analysis clearly identified? Were they tested (e.g., using regression diagnostics and gvlma() for regression models, or Q-Q plots and Shapiro-Wilk tests for normality, if normality was assumed)? Were the results of the tests shown and their implications for the validity of modeling assumptions (if any) clearly stated? Were multiple methods and nonparametric analyses used to establish that findings are robust, i.e., not dependent on specific modeling assumptions?
·Conclusions: Are the conclusions stated clearly? Are answers to the original question(s) in the problem statement presented, together with any appropriate caveats (e.g., identifying any unverified modeling assumptions on which the answers depend)? Are answers stated correctly (e.g., with exact p-values for tests of null hypotheses, rather than imprecise terms such as “significant”)? Are interpretations of the results and their implications correct, e.g., with association and causation clearly distinguished if necessary? Are limitations of the study and robustness of its conclusions discussed? Do suggestions for next steps follow from the insights gained in this application of the PPDAC cycle to the data?
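One way to report exact p-values rather than the word "significant" is to pull them from the fitted model's coefficient table. The sketch below assumes a linear regression on the built-in mtcars data; the model and the column names of the report are illustrative choices.

```r
# Sketch: tabulate estimates, 95% confidence intervals, and exact p-values.
# Assumed model: mpg ~ wt + hp on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef_table <- summary(fit)$coefficients
ci <- confint(fit)

report <- data.frame(
  term     = rownames(coef_table),
  estimate = signif(coef_table[, "Estimate"], 3),
  lower_95 = signif(ci[, 1], 3),
  upper_95 = signif(ci[, 2], 3),
  p_value  = format.pval(coef_table[, "Pr(>|t|)"], digits = 3),
  row.names = NULL
)
print(report)
```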
Some examples of explorations, visualizations, and analyses of datasets
https://www.r-bloggers.com/2018/10/exploring-hr-em…
https://rstudio-pubs-static.s3.amazonaws.com/28601…
https://www.kaggle.com/ragulram/hr-analytics-explo…
https://eugenividal.github.io/docs/CrimeChicago.ht…
https://www.r-bloggers.com/2019/09/heart-disease-p…
https://rpubs.com/thanrajks/med-ana
Typical feedback
Whenever you use assumption-dependent methods such as a linear regression model or a t-test, you should report how you tested whether the underlying assumptions are satisfied (e.g., by using gvlma() and shapiro.test()), and what the results were. You can reduce the number of assumptions that need to be checked by using nonparametric methods (e.g., Spearman’s rank correlation instead of Pearson’s linear correlation; or CART trees and random forest with PDPs instead of linear regression).
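A minimal sketch of this kind of check follows. It uses base R only (shapiro.test() and cor.test()) rather than gvlma(), which requires a separately installed package; the mtcars data and the mpg ~ wt model are assumptions for illustration.

```r
# Sketch: test a regression assumption, then compare a parametric and a
# nonparametric measure of association. Assumed data: built-in mtcars.
fit <- lm(mpg ~ wt, data = mtcars)

# Shapiro-Wilk test; null hypothesis: the residuals are normally distributed.
resid_normality <- shapiro.test(residuals(fit))

# In an interactive session, a Q-Q plot of the residuals can be drawn with:
# qqnorm(residuals(fit)); qqline(residuals(fit))

# Pearson assumes a linear relationship; Spearman only a monotone one.
pearson  <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
# exact = FALSE avoids the warning that exact p-values need untied data.
spearman <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)

print(resid_normality$p.value)
print(c(pearson = unname(pearson$estimate), spearman = unname(spearman$estimate)))
```

If the two correlation estimates tell the same story, the finding does not hinge on the linearity assumption.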
1.INTRODUCTION (P and P)
1. What question(s) did you address? (Why?):
To identify whether there were any predictive factors for patients who develop recurrence of pilonidal
disease following cleft lift surgery. Pilonidal disease is a debilitating
condition seen in young men and women that is associated with significant
pain, debilitation, loss of productive work, etc. Numerous treatment options are
available without any clear-cut superiority. One of the treatment options that
has been proposed to be associated with better outcomes is excision of
pilonidal disease with cleft lift. Data on outcomes after cleft lift surgery are
not widely available for the United States population, although they are available
in the European literature. We aimed to identify the predictive factors for
recurrence following the cleft lift procedure.
2. What data is needed to address this question?
Clinical data from patients who have undergone cleft lift surgery is essential. Ideally
this would be prospectively collected and multi-institutional information
containing patient demographics, comorbidity status, smoking, prior operations
including interventions such as abscess drainage, postoperative outcomes
including wound breakdown and recurrence.
3. What data is available?
A retrospectively collected data set from a single institution is available. Information from
large data sets such as registries or multi-institutional databases is not
available on this pertinent topic.
4. How are they used to answer the question?
With the information that is available, we tried to identify any predictive factors
associated with recurrence after cleft lift surgery.
5. What null hypothesis (or hypotheses) did you test?
We hypothesized that patient factors and prior surgical interventions are not
associated with increased risk of recurrence following cleft lift surgery.
6. What quantities did you estimate?

