Utilities.csv (https://github.com/GauthamBest/Training_Data/blob/…)
gives corporate data on 22 public utilities in the United States. We are interested in
forming groups of similar utilities. The records to be clustered are the utilities, and the clustering
will be based on the eight measurements of each utility.
The features are:
Fixed = fixed-charge covering ratio (income/debt)
RoR = rate of return on capital
Cost = cost per kilowatt capacity in place
Load = annual load factor
Demand = peak kilowatt-hour demand growth from 1974 to 1975
Sales = sales (kilowatt-hour use per year)
Nuclear = percent nuclear
Fuel Cost = total fuel costs (cents per kilowatt-hour)
Please load the data as a pandas DataFrame and set the row names (index) to the utilities column
(company). Convert all columns to float.
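A minimal sketch of the loading step described above. The inline CSV text is a made-up placeholder so the snippet runs without the real file, and the header names (including `Fuel_Cost`) are assumptions; the actual Utilities.csv may spell its columns differently. With the real data, pass the path to your local copy of Utilities.csv instead of the `StringIO` object.

```python
import io

import pandas as pd

# Placeholder rows with made-up values, standing in for Utilities.csv.
# The header names are assumed from the feature list in the task.
csv_text = """company,Fixed,RoR,Cost,Load,Demand,Sales,Nuclear,Fuel_Cost
Utility A,1.0,9,150,54,1,9000,0,0.6
Utility B,1.2,11,200,57,3,5000,25,1.5
Utility C,0.9,12,110,60,-2,7000,10,0.9
"""


def load_utilities(source):
    """Read the CSV, index it by the company column, and cast all columns to float."""
    df = pd.read_csv(source)
    df = df.set_index("company")
    return df.astype(float)


utilities_df = load_utilities(io.StringIO(csv_text))
print(utilities_df.dtypes)
```

Casting with `astype(float)` after setting the index keeps the company names as row labels, so only the eight measurement columns are converted.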
1)
a. Use “from sklearn.metrics import pairwise” and calculate the pairwise Euclidean
distance between each pair of Utilities and show the distance matrix.
b. Standardize the features (subtract the mean and divide by the standard deviation) and
recalculate the pairwise distance matrix using Euclidean distance.
For the rest of the tasks, use the standardized (normalized) version of the dataframe.
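One possible way to approach task 1, sketched on a small made-up frame rather than the real data. The helper name `distance_frame` and the sample values are hypothetical. The standardization here uses pandas' `std()`, which is the sample standard deviation (ddof=1); if your course expects the population version, that is an easy change.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise

# Made-up placeholder values; replace with the DataFrame loaded from Utilities.csv.
df = pd.DataFrame(
    {
        "Fixed": [1.0, 1.2, 0.9, 1.1],
        "Sales": [9000.0, 5000.0, 7000.0, 12000.0],
        "Fuel_Cost": [0.6, 1.5, 0.9, 1.1],
    },
    index=pd.Index(["Utility A", "Utility B", "Utility C", "Utility D"], name="company"),
)


def distance_frame(frame):
    """Return the pairwise Euclidean distance matrix as a labeled DataFrame."""
    distances = pairwise.euclidean_distances(frame)
    return pd.DataFrame(distances, index=frame.index, columns=frame.index)


# 1a: distances on the raw measurements.
raw_dist = distance_frame(df)

# 1b: z-score each column (pandas std() uses ddof=1), then recompute distances.
df_norm = (df - df.mean()) / df.std()
norm_dist = distance_frame(df_norm)

print(raw_dist.round(2))
print(norm_dist.round(2))
```

On unscaled data the distances are dominated by whichever column has the largest numeric range (here `Sales`), which is why the later tasks work with the standardized frame.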
2)
a. Use “from scipy.cluster.hierarchy import linkage” and plot the dendrogram using
single linkage.
b. Plot the dendrogram using average linkage.
c. Use “from scipy.cluster.hierarchy import fcluster” and apply it to the linkage results for
both single and average linkage to separate the data points into 6 clusters, and
print the clusters with their corresponding members. (Set criterion='maxclust'
for fcluster.)
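A hedged sketch of task 2, using random placeholder data in place of the standardized utilities frame. The labels `U01` to `U22`, the helper `members_by_cluster`, and the output file name are all hypothetical. The `Agg` backend line only makes the snippet run without a display; in a notebook you would likely drop it and call `plt.show()`.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Random placeholder data (22 rows, 8 columns), standing in for the real measurements.
rng = np.random.default_rng(0)
names = [f"U{i:02d}" for i in range(1, 23)]
raw = pd.DataFrame(rng.normal(size=(22, 8)), index=names)
df_norm = (raw - raw.mean()) / raw.std()

# 2a and 2b: build one linkage matrix per method and draw the dendrograms.
linkages = {
    method: linkage(df_norm.values, method=method, metric="euclidean")
    for method in ("single", "average")
}
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, (method, Z) in zip(axes, linkages.items()):
    dendrogram(Z, labels=names, ax=ax)
    ax.set_title(f"{method.capitalize()} linkage")
fig.tight_layout()
fig.savefig("dendrograms.png")


def members_by_cluster(item_names, labels):
    """Group item names by their cluster label."""
    groups = {}
    for name, label in zip(item_names, labels):
        groups.setdefault(int(label), []).append(name)
    return dict(sorted(groups.items()))


# 2c: cut each tree into at most 6 flat clusters and print the members.
clusters = {}
for method, Z in linkages.items():
    labels = fcluster(Z, t=6, criterion="maxclust")
    clusters[method] = members_by_cluster(names, labels)
    print(method, clusters[method])
```

Note that `fcluster` takes the linkage matrix returned by `linkage`, not the dendrogram plot, and that `criterion="maxclust"` treats `t` as an upper bound on the number of clusters.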
3)
a. Use “from sklearn.cluster import KMeans” to cluster the data into 6 clusters. Set
the random state for KMeans to “0”. Print the clusters and their members.
b. For the number of clusters from 1 to 7, plot the average SSE against the number of
clusters as a line plot. Use the “inertia_” attribute of KMeans to get the SSE, and
divide it by the number of clusters to get the average SSE.
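A rough sketch of task 3, again on random placeholder data instead of the standardized utilities frame. The labels `U01` to `U22` and the output file name are hypothetical. Setting `n_init=10` is an assumption made here so the behaviour is the same across scikit-learn versions; the task itself only fixes `random_state=0`.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random placeholder data (22 rows, 8 columns), standing in for the real measurements.
rng = np.random.default_rng(0)
names = [f"U{i:02d}" for i in range(1, 23)]
raw = pd.DataFrame(rng.normal(size=(22, 8)), index=names)
df_norm = (raw - raw.mean()) / raw.std()

# 3a: fit KMeans with 6 clusters and list the members of each cluster.
km = KMeans(n_clusters=6, random_state=0, n_init=10).fit(df_norm)
members = pd.Series(km.labels_, index=df_norm.index).sort_values()
for label, group in members.groupby(members):
    print(label, list(group.index))

# 3b: SSE (inertia_) divided by the number of clusters, for k = 1..7.
ks = list(range(1, 8))
avg_sse = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(df_norm)
    avg_sse.append(model.inertia_ / k)

fig, ax = plt.subplots()
ax.plot(ks, avg_sse, marker="o")
ax.set_xlabel("Number of clusters")
ax.set_ylabel("Average SSE (inertia_ / k)")
fig.savefig("average_sse.png")
```

`inertia_` is the total within-cluster sum of squared distances, so dividing by `k` follows the task's definition of average SSE rather than an average per data point.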

