data mining clustering

Daniel Kevins


The dataset gives corporate data on 22 public utilities in the United States. We are interested in forming groups of similar utilities. The records to be clustered are the utilities, and the clustering will be based on the eight measurements of each utility.

The features are:

Fixed = fixed-charge covering ratio (income/debt)

RoR = rate of return on capital

Cost = cost per kilowatt capacity in place

Load = annual load factor

Demand = peak kilowatt-hour demand growth from 1974 to 1975

Sales = sales (kilowatt-hour use per year)

Nuclear = percent nuclear

Fuel Cost = total fuel costs (cents per kilowatt-hour)

Please load the data as a pandas DataFrame and set the row names (index) to the utilities column (company). Convert all columns to float.
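A minimal sketch of the loading step follows. The source does not name the data file, so the CSV text below is a made-up placeholder with the same layout (a company column plus the eight features); the column spelling "Fuel_Cost" and all of the values are assumptions for illustration only.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the real file. The rows and values
# are invented for illustration and are NOT the actual utilities data.
csv_text = """company,Fixed,RoR,Cost,Load,Demand,Sales,Nuclear,Fuel_Cost
Alpha,1,9,150,54,2,9000,0,1
Beta,2,10,200,57,3,5000,25,2
Gamma,1,11,110,60,9,7000,0,1
"""

# With the real data, pass the path of the CSV file to pd.read_csv instead.
df = pd.read_csv(io.StringIO(csv_text))

# Use the company names as the row index, then force every column to float.
df = df.set_index("company").astype(float)

print(df.dtypes)
print(df)
```

Setting the index first keeps the company names out of the numeric conversion, so astype(float) only touches the eight measurement columns.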

a. Use “from sklearn.metrics import pairwise” to calculate the pairwise Euclidean distance between each pair of utilities, and show the distance matrix.

b. Standardize the features using the mean and standard deviation, then recalculate the pairwise distance matrix using Euclidean distance.
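The two distance tasks above could be approached along these lines. This is a sketch on a small made-up frame, not the real utilities data; the helper name distance_frame is hypothetical, and the z-score uses the pandas default sample standard deviation (ddof=1), which is one reasonable reading of "mean and std".

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise

# Placeholder frame with invented values on very different scales,
# standing in for the utilities DataFrame.
df = pd.DataFrame(
    {"Fixed": [1.0, 2.0, 1.5, 0.5], "Sales": [9000.0, 5000.0, 7000.0, 6000.0]},
    index=["Alpha", "Beta", "Gamma", "Delta"],
)


def distance_frame(frame):
    """Return the Euclidean distance matrix as a labelled DataFrame."""
    d = pairwise.pairwise_distances(frame.to_numpy(), metric="euclidean")
    return pd.DataFrame(d, index=frame.index, columns=frame.index)


# (a) Distances on the raw features: the large-scale column dominates.
raw_dist = distance_frame(df)
print(raw_dist.round(2))

# (b) Standardize each column to mean 0 and standard deviation 1.
# pandas' std() uses ddof=1; sklearn's StandardScaler would use ddof=0.
df_norm = (df - df.mean()) / df.std()
norm_dist = distance_frame(df_norm)
print(norm_dist.round(2))
```

Comparing the two printed matrices shows why standardizing matters: before scaling, the distances are driven almost entirely by the column with the largest numeric range.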

For the rest of the tasks, use the standardized (normalized) version of the DataFrame.

a. Use “from scipy.cluster.hierarchy import linkage” and plot the dendrogram using single linkage.

b. Plot the dendrogram using average linkage.

c. Use “from scipy.cluster.hierarchy import fcluster” and apply it to the single and average linkage results to separate the data points into 6 clusters, then print the clusters with their corresponding members. (Set criterion='maxclust' for fcluster.)
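One possible shape for the hierarchical clustering tasks is sketched below. The random frame is a stand-in for the standardized utilities data, and the helper names cluster_members and plot_dendrogram are hypothetical. Note that fcluster takes the linkage matrix returned by linkage, and that criterion="maxclust" yields at most the requested number of clusters.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Stand-in for the standardized DataFrame: 12 made-up records, 3 features.
rng = np.random.default_rng(0)
names = [f"U{i}" for i in range(12)]
df_norm = pd.DataFrame(rng.normal(size=(12, 3)), index=names,
                       columns=["f1", "f2", "f3"])


def cluster_members(frame, method, n_clusters=6):
    """Build the linkage matrix and group row names by fcluster label."""
    Z = linkage(frame.to_numpy(), method=method, metric="euclidean")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    members = {}
    for name, label in zip(frame.index, labels):
        members.setdefault(int(label), []).append(name)
    return Z, members


def plot_dendrogram(Z, labels, title):
    """Draw one dendrogram; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 4))
    dendrogram(Z, labels=labels, ax=ax, leaf_rotation=90)
    ax.set_title(title)
    ax.set_ylabel("Distance")
    fig.tight_layout()
    return fig


results = {}
for method in ("single", "average"):
    Z, members = cluster_members(df_norm, method)
    results[method] = (Z, members)
    print(f"{method} linkage")
    for label in sorted(members):
        print(f"  cluster {label}: {', '.join(members[label])}")
```

To draw the figures, call plot_dendrogram(Z, list(df_norm.index), "Single linkage") for each method and then matplotlib's plt.show(). Single linkage tends to peel off isolated records one at a time, so its six clusters are often much less balanced than those from average linkage.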

a. Use “from sklearn.cluster import KMeans” to cluster the data into 6 clusters. Set the random state for KMeans to 0 (random_state=0). Print the clusters and their members.

b. For the number of clusters from 1 to 7, plot the average SSE against the number of clusters as a line plot. Use the “inertia_” attribute of KMeans to get the SSE, and divide it by the number of clusters to get the average SSE.
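The k-means tasks might look roughly like the following sketch. Again the random frame only stands in for the standardized utilities data, plot_elbow is a hypothetical helper, and n_init=10 is set explicitly as an assumption so that the behaviour does not depend on the installed scikit-learn version's default.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the standardized DataFrame: 12 made-up records, 3 features.
rng = np.random.default_rng(0)
names = [f"U{i}" for i in range(12)]
df_norm = pd.DataFrame(rng.normal(size=(12, 3)), index=names,
                       columns=["f1", "f2", "f3"])
X = df_norm.to_numpy()

# (a) Six clusters with a fixed random state, then list the members.
km = KMeans(n_clusters=6, random_state=0, n_init=10).fit(X)
members = {}
for name, label in zip(df_norm.index, km.labels_):
    members.setdefault(int(label), []).append(name)
for label in sorted(members):
    print(f"cluster {label}: {', '.join(members[label])}")

# (b) Average SSE = inertia_ (total within-cluster SSE) divided by k.
ks = list(range(1, 8))
avg_sse = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    avg_sse.append(model.inertia_ / k)
print([round(v, 3) for v in avg_sse])


def plot_elbow(ks, avg_sse):
    """Line plot of average SSE against k; matplotlib imported lazily."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(ks, avg_sse, marker="o")
    ax.set_xlabel("Number of clusters (k)")
    ax.set_ylabel("Average SSE (inertia_ / k)")
    fig.tight_layout()
    return fig
```

Calling plot_elbow(ks, avg_sse) followed by matplotlib's plt.show() produces the line plot; the point where the curve flattens is a common informal guide for choosing k.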
