A Symmetric Measure of Clustering Separation
Bringing some order to bear on unsupervised learning
Suppose you want to cluster 100 stocks. That is, you’re interested in finding out which stocks covary with which ones without bothering about why. Ideally, you want crisp clustering that assigns each stock to one and only one of the clusters. You also want your clusters to display a certain separability property: in some informative low-dimensional space, the clusters should “cluster together.” For instance, one may want to examine whether the clusters “cluster together” when we project the feature matrix onto the first two principal components or some other informative projection/compression. This is the problem that motivated the following. Almost everything that follows is independent of what is being clustered or how it is being clustered. But let’s talk about stocks to stay concrete.
In supervised learning, you have the accuracy score, which measures how often a classifier predicts the correct label. There is no such immediate solution to the problem of a “goodness of fit” estimate in unsupervised learning. To fix ideas, consider the following scatterplot. The data is for US stocks from March 1993. Before we do any clustering, we compress high-dimensional information about stock features using manifold learning. Specifically, we use sklearn’s multidimensional scaling to compress a 200-feature matrix containing information on 100 stocks into two dimensions: 0 and 1. These will serve as our “pseudo-principal components” for data visualization and further analysis. (Other researchers may be more comfortable using principal components. The attraction of 2-d MDS is that, unlike principal components, it preserves the relative distances between stocks in feature space.) Separately, we learn clusters from the uncompressed feature matrix using affinity propagation. We throw out clusters with fewer than 5 stocks. This gives us four clusters, which we then represent in our 2-dimensional embedding in the following graph.
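For concreteness, here is a minimal sketch of that pipeline, assuming a NumPy feature matrix `X` of shape (100 stocks, 200 features). The variable names and the random stand-in data are illustrative only, not the original dataset.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))  # stand-in for the real 100-stock, 200-feature matrix

# Compress to two "pseudo-principal components" for visualization.
embedding = MDS(n_components=2, random_state=0).fit_transform(X)

# Learn clusters from the uncompressed feature matrix.
labels = AffinityPropagation(random_state=0).fit_predict(X)

# Throw out clusters with fewer than 5 stocks.
keep = [c for c in np.unique(labels) if (labels == c).sum() >= 5]
mask = np.isin(labels, keep)
embedding, labels = embedding[mask], labels[mask]
```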
If the clustering algo and feature compression are any good, we should see the clusters “clustering together.” That is indeed what we find on visual inspection. But how good is this separation quantitatively? The point of this late night dispatch is to alert researchers to a novel solution to this problem.
Let’s simplify the problem. Consider a pair of clusters. Say, the clusters we have called 0 and 1. A natural solution to the problem in this reduced form is to ask about the highest classification accuracy achievable by any hyperplane, i.e., the accuracy of the line that best separates the two clusters in the 2-d embedding. This is easily computed using a support vector machine. We report this quantity in the title of the graphs.
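In code, the per-pair score might look like the sketch below, which fits a linear SVM on the 2-d embedding (sklearn’s SVC with a large C, approximating a hard margin) and reports its in-sample accuracy. The function name pair_accuracy and its arguments are ours, introduced purely for illustration.

```python
from sklearn.svm import SVC

def pair_accuracy(embedding, labels, a, b):
    """In-sample accuracy of the best separating line for clusters a and b."""
    mask = (labels == a) | (labels == b)
    pts, y = embedding[mask], (labels[mask] == a).astype(int)
    # A linear SVM with a large C approximates the hard-margin separating line.
    svm = SVC(kernel="linear", C=1e6).fit(pts, y)
    return svm.score(pts, y)
```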
OK. So, we have a solution for a pair of clusters. We can naturally extend it to the clustering exercise as a whole as follows. We simply compute the accuracy of the optimal hyperplane for each pair of clusters and take the average. (There are “n choose 2” = n(n-1)/2 such pairs, where n is the number of clusters.) The average accuracy over all pairs is symmetric with respect to permutations of the clusters. It’s a sort of Haar measure. But what is really exciting is that it gives us a tidy summary measure of how much our clusters actually “cluster together.”
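A sketch of the full measure, building on the pair_accuracy helper above: average the pairwise accuracies over all “n choose 2” cluster pairs.

```python
from itertools import combinations

import numpy as np

def separation_score(embedding, labels):
    """Mean best-hyperplane accuracy over all pairs of clusters."""
    clusters = np.unique(labels)
    pairs = combinations(clusters, 2)
    return float(np.mean([pair_accuracy(embedding, labels, a, b) for a, b in pairs]))
```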
For our exercise, we obtain an accuracy of 96.4%, suggesting that our unsupervised learner and our compression algorithm are effective. Indeed, for two out of six pairs, we obtain perfect separability. Not bad for a day’s work. But it also highlights a potential problem. The measure can be gamed by choosing fewer clusters. A robust extension of this measure would sport a penalty for choosing fewer clusters. That would, however, require a hyperparameter. The elegance of our measure is that there are no hyperparameters to tune. And it comes with a natural interpretation as “the probability of separation by low-dimensional hyperplanes.” Finally, the measure is easily generalized beyond 2 dimensions to N dimensions, where N is any positive integer smaller than the dimension of the feature space.
"Internal Clustering Validation Measures" might be what you are looking for, but still there are many options e.g. https://doi.org/10.1109/ICDM.2010.35