Affinity Propagation Clustering
Affinity Propagation (AP) is an unsupervised machine learning algorithm that is particularly useful for clustering data without prior knowledge of the number of clusters. It works by sending messages between pairs of samples until convergence. A high-quality set of exemplars (cluster centers) and corresponding clusters are then chosen based on these messages.
Advantages of Affinity Propagation
-
Automatic Selection of Cluster Centers: AP does not require the number of clusters to be determined or estimated before running the algorithm. It automatically selects the most representative data points as cluster centers.
-
Flexibility: It can handle complex structures and is effective at finding clusters in a dataset with preference information.
-
Quality of Clustering: Often produces higher-quality clustering results compared to other algorithms like K-means, especially when the number of clusters is not known beforehand.
An example of clustering using Affinity Propagation. The squares represent selected exemplars.
Disadvantages of Affinity Propagation
-
Computational Complexity: AP has a higher computational complexity than many other clustering algorithms, which can make it impractical for very large datasets.
-
Memory Usage: The algorithm requires O(N²) memory for processing, where N is the number of samples, which can be prohibitive for large datasets.
-
Sensitivity to Parameters: While it does not require the number of clusters, it is sensitive to the preference and damping parameters, which can affect the number and quality of the resulting clusters.
Sample Code for Affinity Propagation
Here is a basic example of how to use the Affinity Propagation algorithm in Python with scikit-learn:
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.figure(figsize=(8, 6))
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
class_members = labels == k
cluster_center = X[cluster_centers_indices[k]]
plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
for x in X[class_members]:
plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Scenarios to Use Affinity Propagation
-
Small to Medium Datasets: Due to its computational and memory demands, AP is best suited for small to medium-sized datasets.
-
Unknown Number of Clusters: AP is ideal in scenarios where the number of clusters is not known beforehand.
-
Preference Information Available: If there is information available that can be used to inform the preference of each data point to be chosen as an exemplar, AP can leverage this to improve clustering.
-
Quality over Speed: In cases where the quality of the clustering result is more important than the speed or computational resources, AP can be a good choice.
Affinity Propagation can be a powerful tool for data analysis, but it is important to consider its computational demands and ensure that the parameters are tuned appropriately for the dataset at hand.