For this project, we’re going to use cluster analysis to “tell a story” about our data. I’m asking you to divide the Oregonians in your sample into groups or clusters based on two quantitative variables. Your “story” will be an explanation of your data that highlights some interesting feature(s) or makes a point about the data.

Please note this project will likely take some trial and error. Please relax into it and have some fun with the process: think of it as an exploration. Trial and error is the spice of life!

You will begin by opening up the OregonPUMS data set. Take a small subset of this data (I recommend n=400, so as not to upset XLSTAT too much: some clustering algorithms grind to a halt with large data sets).

Process:

Step 1: Select your sample of n=400. Lucky for us, XLSTAT is quite good at taking a random sample. Check out: Simple Random Sampling in XLSTAT
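If you are curious what simple random sampling does behind the scenes, or want to double-check your work outside XLSTAT, here is a minimal Python sketch. It is not part of the required XLSTAT workflow. The stand-in row IDs, the population size, and the seed value are assumptions made up for illustration.

```python
import random

def simple_random_sample(rows, n=400, seed=None):
    """Return n rows drawn without replacement, each row equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(rows, n)

# Hypothetical stand-in for the data set: one ID per row.
population = list(range(1, 10001))
sample = simple_random_sample(population, n=400, seed=2024)
print(len(sample))
```

Because the draw is without replacement, no row can appear in your sample twice.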

Step 2: Choose two QUANTITATIVE variables that you would like to work with. Copy and paste your two variables and their corresponding sampled data (two columns, 400 rows of data) into a new sheet. I prefer to do this so that I am not overwhelmed by variables that I am not using. Next, remove any rows with missing observations (this may leave you with slightly fewer than 400 rows). This will save time later when you go to plot your clusters.
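The "remove any rows with missing observations" step can be sketched in Python as follows. This is only an illustration of the idea, not something you need to run for the project. The column names AGE and INCOME and all of the values are hypothetical, and treating blanks and None as "missing" is an assumption about how your data is coded.

```python
def drop_incomplete_rows(rows, columns):
    """Keep only rows where every listed column has a usable value."""
    def is_missing(value):
        # Assumption: missing values show up as None or as blank cells.
        return value is None or str(value).strip() == ""
    return [row for row in rows
            if not any(is_missing(row.get(col)) for col in columns)]

# Hypothetical column names and made-up values, for illustration only.
rows = [
    {"AGE": 34, "INCOME": 52000},
    {"AGE": 51, "INCOME": None},
    {"AGE": "", "INCOME": 18000},
    {"AGE": 27, "INCOME": 0},
]
clean = drop_incomplete_rows(rows, ["AGE", "INCOME"])
print(len(rows) - len(clean), "rows removed")
```

Note that the row with an income of 0 is kept: a zero is a real observation, not a missing one.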

For the following steps, be sure that you’ve logged into OSU’s remote desktop so you can make use of the XLSTAT add-in. Click on the XLSTAT tab at the top of your Excel window.

Step 3: Use different options in the software to create 5 different “data stories”: if you’re overwhelmed about what to pick, you can use these options:

*Scatterplots will have to be created separately using the Results by Object output. Under the colors tab, use whatever colors you would like, but be sure they are bold and distinct. For example, it would be a bad idea to use white, or to use both red and red-orange in the same plot.
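To make the idea of "cluster, then color by cluster" concrete, here is a minimal Python sketch of one common clustering method, k-means, followed by a color assignment for a scatterplot. This is an illustration only and is not necessarily how XLSTAT computes its clusters. The six data points, the choice of k=2, and the color names are all made up for the example.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means on (x, y) pairs; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # centers stopped moving, so the clusters are stable
        centers = new_centers
    return labels

# Bold, clearly different colors (no white, no near-duplicates).
COLORS = ["blue", "orange", "green", "purple", "black"]

# Made-up data: two obvious groups of three points each.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
labels = kmeans(points, k=2)
colors = [COLORS[lab] for lab in labels]  # one color per point, by cluster
print(list(zip(points, colors)))
```

The labels here play the same role as the per-row cluster assignments you use to build your scatterplots: every row gets a cluster, and every cluster gets its own color. Changing k, or swapping in a different clustering method, is what produces a different "data story" from the same two variables.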

Step 4: Write up your project! Which clustering method out of the five did you prefer? Why?

For your final report, compare and contrast each of the five clustering methods. You may choose to use your XLSTAT output or use Tableau/other software to make a prettier graph. Tell your story using your preferred clustering method, and explain how the clustering supports that story. Who are these groups? What does this clustering tell us about the people in Oregon?

To really impress, give a little flavor! Describe a set of particular individuals who exemplify each cluster.

Rubric for Project (40 points)

15 points: at least 5 different graphs, all using the same basic variables (Step 2) but different clustering choices (Step 3). Data process and data product both discussed, particularly for Method 5.

10 points: your narration of the progression of your thinking (data process story).

5 points: Instructor’s subjective take on the product story. Was it gripping, interesting, well done?

5 points: graph conventions, labels, etc.

5 points: writing conventions: correct punctuation, complete sentences, etc.

5 bonus points for early submission! Early submissions due by October 21st at 11:59 PM.