Friday, September 25, 2009

determine number of clusters?

At present, many community detection methods have been proposed. "Unfortunately", most of these methods require users to specify the number of clusters, which seems insane at first glimpse.

Imaging a conversation like this:

Person: powerful machine, can you tell me something about the data set I have?
Machine: You honored! Sure, but please tell me how many clusters are there.
Person: I do not know the data set. That's why I am asking you!
Machine: Sorry, I cannot tell you anything unless you tell me something.
Person: ....... (stupid machine!)
Machine: ...... (this user is an idiot about data mining!)

So when an algorithm claims that it automatically determines the number of clusters, everyone just embraces it. For instance, modularity is claimed to find the optimal number of clusters. So are Bayesian guys by enforcing a Chinese restaurant process as a prior. These methods, as I understand, is like setting a threshold for the clustering process or setting an objective to optimize. It basically tells a user that under this objective function, this number of clusters seems reasonable. Instead of setting the number of clusters, these algorithm implicitly set a parameter or an biased objective function to optimize.

But do we really need to determine the number of clusters?

If just for data exploration, I think a hierarchical organization of data objects are more reasonable. But it seems that current state of hierarchical clustering is too far away from satisfactory. Most methods just give you a binary tree, which does not reflect any interesting structure. Another disadvantage is high sensitivity of the hierarchical structure to the algorithm implementation and data processing order.

Instead of begging for machines to determine the number of communities, organization clusters of multi-resolutions is more interesting and more consistent with users' request.

No comments: