I’m working on an R presentation and need an explanation to help me understand it better.
5.1 University Rankings. The dataset on American College and University Rankings (available from www.dataminingbook.com) contains information on 1302 American colleges and universities. For each, there are 17 measurements, including continuous measurements (such as tuition and graduation rate) and categorical measurements (such as location by state and whether it is a private or public school).
Note that many records are missing some measurements. Our first goal is to estimate these missing values from “similar” records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.
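The "closest cluster" idea in the paragraph above can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the centroid values and the partial record are made-up numbers, the column `AcceptRate` is a hypothetical name, distance is computed over the observed measurements only, and the gap is filled with the cluster's mean. The exercise itself does not say which cluster information to use.

```r
# Hypothetical cluster centroids on the normalized scale (made-up numbers).
centroids <- rbind(
  c1 = c(Tuition = 1.0,  GradRate = 0.9,  AcceptRate = -0.8),
  c2 = c(Tuition = -0.9, GradRate = -0.7, AcceptRate = 0.6)
)

# A partial record, already normalized, with one measurement missing.
partial <- c(Tuition = 0.8, GradRate = NA, AcceptRate = -0.5)

# Euclidean distance to each centroid, using the observed measurements only.
observed <- !is.na(partial)
d <- apply(centroids[, observed, drop = FALSE], 1,
           function(ctr) sqrt(sum((ctr - partial[observed])^2)))
closest <- names(which.min(d))

# Assumption: fill the gap with the closest cluster's mean for that measurement.
imputed <- partial
imputed[!observed] <- centroids[closest, !observed]
```

With the real data, the centroids would come from the per-cluster means of the normalized complete records, and the partial records would need to be normalized with the same means and standard deviations.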
a. Remove all records with missing measurements from the dataset.
b. For all the continuous measurements, run hierarchical clustering using complete linkage and Euclidean distance. Make sure to normalize the measurements. From the dendrogram: how many clusters seem reasonable for describing these data?
c. Compare the summary statistics for each cluster and describe each cluster in this context (e.g., “Universities with high tuition, low acceptance rate…”). Hint: To obtain cluster statistics for hierarchical clustering, use the aggregate() function.
d. Use the categorical measurements that were not used in the analysis (State and Private/Public) to characterize the different clusters. Is there any relationship between the clusters and the categorical information?
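Parts a through d above could be approached along these lines in base R. This is a minimal sketch, not a full solution: the small data frame is a made-up stand-in for the real file, its column names are assumptions, and `k = 2` is chosen only because the toy data has two obvious groups. With the real data, `k` should be read off the dendrogram.

```r
# Toy stand-in for the real dataset; every value here is made up for illustration.
univ <- data.frame(
  College  = paste("College", LETTERS[1:8]),
  State    = c("MA", "MA", "CA", "CA", "TX", "TX", "NY", "NY"),
  Private  = c("private", "private", "public", "public",
               "public", "private", "private", "public"),
  Tuition  = c(30000, 28000, 9000, NA, 8000, 25000, 31000, 7500),
  GradRate = c(90, 85, 60, 55, NA, 80, 92, 58),
  stringsAsFactors = FALSE
)

# a. Keep only the records with no missing measurements.
complete <- univ[complete.cases(univ), ]

# b. Normalize the continuous columns, then cluster with
#    Euclidean distance and complete linkage.
num_cols <- c("Tuition", "GradRate")
z  <- scale(complete[, num_cols])
hc <- hclust(dist(z, method = "euclidean"), method = "complete")
# plot(hc, labels = complete$College)   # inspect the dendrogram to choose k
k <- 2                                   # assumption for the toy data
cluster <- cutree(hc, k = k)

# c. Per-cluster means of the original (unscaled) measurements.
cluster_means <- aggregate(complete[, num_cols],
                           by = list(cluster = cluster), FUN = mean)

# d. Cross-tabulate cluster membership against the categorical variables.
private_by_cluster <- table(cluster, complete$Private)
state_by_cluster   <- table(cluster, complete$State)
```

Printing `cluster_means` gives the numbers needed to describe each cluster in words, and the two tables show whether any cluster is dominated by private schools or by particular states.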

