Answer 1
What are the characteristics of data?
Accuracy - The term “accuracy” refers to the degree to which information correctly reflects the event or object it describes.
Completeness - Data is considered “complete” when it fulfills expectations of comprehensiveness.
Uniqueness - “Unique” information means that there is only one instance of it appearing in a database (Rastin & Matei, 2018). A short pandas sketch below illustrates how completeness and uniqueness can be checked.
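To make these characteristics concrete, here is a minimal sketch, assuming pandas is available; the column names and the small sample table are hypothetical, not taken from the source:

```python
import pandas as pd

# Hypothetical customer records used only to illustrate the checks.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "age": [34, 29, 29, 41],
})

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()
print("Completeness per column:\n", completeness)

# Uniqueness: rows that appear more than once violate uniqueness.
duplicates = df[df.duplicated(keep=False)]
print("Duplicate rows:\n", duplicates)
```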
Compare the differences in each of the following clustering types: prototype-based, density-based, graph-based.
Prototype-Based Clusters (also called Center-Based Clusters):
- If the data is numerical, the prototype of the cluster is often a centroid, i.e., the average of all the points in the cluster.
- If the data has categorical attributes, the prototype of the cluster is often a medoid, i.e., the most representative point of the cluster.
- Objects in the cluster are closer to the prototype of their own cluster than to the prototype of any other cluster (Rastin & Matei, 2018).
- These clusters tend to be globular; a minimal k-means sketch follows this list.
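As a minimal sketch of prototype-based clustering, assuming scikit-learn and NumPy are installed (the synthetic blobs and parameter values are illustrative, not from the source), k-means computes a centroid for each cluster and assigns every point to its nearest centroid:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic globular clusters, the kind prototype-based methods handle well.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Each prototype (centroid) is, at convergence, the mean of its assigned points.
for k, centroid in enumerate(km.cluster_centers_):
    members = X[km.labels_ == k]
    print(f"cluster {k}: {len(members)} points, "
          f"centroid {centroid.round(2)}, member mean {members.mean(axis=0).round(2)}")
```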
Graph-Based Clusters (Contiguity-Based Clusters):
- Two objects are connected only if they are within a specified distance of each other.
- Each point in a cluster is closer to at least one point in the same cluster than to any point in a different cluster.
- Useful when clusters are irregular and intertwined.
- This approach does not work well when there is noise in the data: a small bridge of points can merge two distinct clusters into one (Rastin & Matei, 2018). A minimal graph-based sketch follows this list.
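A minimal sketch of contiguity-based clustering, assuming scikit-learn and SciPy (the two-moons data and the 0.2 distance threshold are illustrative assumptions): connect two points only if they lie within the specified distance, then treat each connected component of the resulting graph as one cluster.

```python
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_moons
from sklearn.neighbors import radius_neighbors_graph

# Two intertwined, non-globular clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Edge between two points only if they are within the specified distance.
graph = radius_neighbors_graph(X, radius=0.2, mode="connectivity", include_self=False)

# Each connected component of the graph is one contiguity-based cluster.
n_clusters, labels = connected_components(graph, directed=False)
print("number of clusters:", n_clusters)
```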
Density-Based Clusters:
- A cluster is a dense region of objects that is surrounded by a region of low density.
- Density-based clustering is employed when the clusters are irregular or intertwined and when noise and outliers are present.
- Points in low-density regions are classified as noise and omitted, so a small bridge of points between two dense clusters (the kind that would merge contiguity-based clusters) is eliminated (Rastin & Matei, 2018). A minimal DBSCAN sketch follows this list.
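A minimal DBSCAN sketch, assuming scikit-learn (the noisy two-moons data and the eps/min_samples values are illustrative assumptions), shows how low-density points are labelled as noise (label -1) rather than being forced into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Irregular, intertwined clusters plus a few scattered noise points.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
rng = np.random.default_rng(0)
X = np.vstack([X, rng.uniform(-1.5, 2.5, size=(15, 2))])

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks points in low-density regions (treated as noise and omitted).
labels = db.labels_
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labelled as noise:", int(np.sum(labels == -1)))
```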
What is a scalable clustering algorithm?
The growing size of data sets and the inability of many clustering algorithms to scale have drawn attention to distributed clustering as a method of partitioning massive data sets. A scalable approach clusters a large data set without clustering all of the data at once: the data is randomly divided into roughly equal-sized disjoint subsets, and each subset is clustered with the hard k-means or fuzzy k-means algorithm (Sharda, Delen, & Turban, 2020). The correspondence problem among the resulting ensemble of centroids is solved transitively by a centroid correspondence algorithm, and the matched centroids are combined into a global collection of centroids. Experimental findings show that the cluster structure produced this way is, most of the time, identical to the structure obtained by clustering all of the data at once.
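A minimal sketch of the idea, assuming scikit-learn and NumPy (the subset count, the value of k, and the final step of merging by clustering the centroids themselves are illustrative simplifications of the correspondence step described above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
k, n_subsets = 5, 8

# Randomly split the data into disjoint subsets of roughly equal size.
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))
subsets = np.array_split(indices, n_subsets)

# Cluster each subset independently (only one subset needs to be in memory at a time).
ensemble = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx]).cluster_centers_
            for idx in subsets]

# Combine the ensemble of subset centroids into one global set of k centroids.
# (Here the correspondence problem is handled by clustering the centroids themselves.)
all_centroids = np.vstack(ensemble)
global_centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_centroids).cluster_centers_
print(global_centroids.round(2))
```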
References
Rastin, P., & Matei, B. (2018, July). Prototype-based clustering for relational data using barycentric coordinates. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
Sharda, R., Delen, D., & Turban, E. (2020). Analytics, Data Science, & Artificial Intelligence. Pearson.
—————————————————————————————————————————————-
Answer 2
What are the characteristics of data?
Data integration and processing have advanced significantly, and a whole industry based on digital information processing has evolved to serve both commercial and public sector organisations. A database is a collection of useful data that serves as the foundation for data processing and consolidation. An important point is that not all data is of high quality, which limits its value; to fully reap the rewards of data, it must be of good quality, which means certain characteristics should be present in the data. A person or organisation in possession of well-structured information holds enormous influence, so it is critical to understand what data is and what its characteristics are. In computing, data is any sort of information that has been acquired and arranged in a meaningful way so that it can be processed further. In other words, data are recorded facts with an underlying theme. Data definition and data characteristics are among the most fundamental database topics to understand (Mikut & Reischl, 2011).
Compare the differences in each of the following clustering types: prototype-based, density-based, graph-based.
Data analysis is a prevalent approach in current scientific research, including communication studies, computer science, and the biological sciences, and clustering is a fundamental component of data analysis. On one hand, many techniques for cluster analysis have been developed as knowledge has grown and topics have overlapped; on the other hand, given the complexity of real data, each clustering method has its own strengths and limitations. In prototype-based clustering, each cluster is represented by a prototype (such as a centroid), and objects are assigned to the cluster whose prototype they are closest to. In density-based clustering, the essential principle is that data in high-density regions of the data space are deemed to belong to the same cluster. In graph-based clustering, clustering occurs on a graph in which each node represents a data point and each edge represents a link between data points.
What is a scalable clustering algorithm?
The growing volume of data sets, along with the limited scalability of many clustering methods, has shifted attention to distributed clustering for dividing huge data sets. The data is randomly split into disjoint subsets of roughly similar size, and each subset is then clustered with the hard k-means or fuzzy k-means method. The centroids of the subsets form an ensemble, and a centroid correspondence algorithm solves the correspondence problem among the ensemble of centroids in a transitive manner. The matched centroids are then combined to produce a global collection of centroids. Empirical findings show that the cluster structure created by this algorithm is, most of the time, comparable to the structure produced by clustering all of the data at once (Hennig, 2007).
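As a minimal sketch of the centroid correspondence step, assuming SciPy (matching by minimising the total distance between two sets of centroids is one simple way to solve the correspondence problem, not necessarily the exact method the cited work uses; the centroid values are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical centroids produced by clustering two different subsets (k = 3).
centroids_a = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
centroids_b = np.array([[5.1, 4.8], [9.8, 0.3], [0.2, -0.1]])

# Match each centroid in A to one centroid in B so that the total distance is minimal.
cost = cdist(centroids_a, centroids_b)
rows, cols = linear_sum_assignment(cost)

for i, j in zip(rows, cols):
    merged = (centroids_a[i] + centroids_b[j]) / 2  # simple averaging of the matched pair
    print(f"A[{i}] <-> B[{j}], merged centroid {merged.round(2)}")
```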
References
Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis, 52(1), 258-271. doi: 10.1016/j.csda.2006.11.025
Mikut, R., & Reischl, M. (2011). Data mining tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(5), 431-443. doi: 10.1002/widm.24

