by Eric Bunch
Posted on 01 Jun 2018
Spectral clustering is a way to cluster data that has a number of benefits and applications. It relies on the eigenvalue decomposition of a matrix, a useful factorization from matrix theory. We will look into the eigengap heuristic, which gives guidelines on how many clusters to choose, as well as an example using breast cancer proteome data.
Spectral clustering should really be viewed as a graph clustering algorithm, in the sense that a data clustering problem is first translated to a graph clustering problem, graph clusters are found, and then translated back to the data clustering problem. Spectral clustering gives a way of grouping together nodes in a graph that are similarly connected.
Here is a short list of some pros and cons of spectral clustering compared to other clustering methods.
Pros

- No assumption is made about the shape of the clusters: spectral clustering can separate non-convex clusters (like the interlocking moons we will see below) that centroid-based methods cannot.
- The heavy lifting reduces to a standard eigenvalue problem, for which fast, well-tested routines exist.
- The eigengap heuristic gives guidance on how many clusters to choose.

Cons

- Computing an eigenvalue decomposition is expensive for large data sets.
- A similarity graph must be chosen, and the results can be sensitive to its parameters ($\epsilon$, or the number of neighbors $k$).

It is worth pausing on the assumptions behind $k$-means here. $k$-means implicitly assumes that clusters are convex, roughly spherical blobs centered at their means, which is why it struggles on data like the noisy moons below; spectral clustering cares only about how points are connected, so it handles such shapes naturally.
For a square $n\times n$ matrix $M$ with $n$ linearly independent eigenvectors, we can find a factorization of $M$ as
\[M = UDU^{-1},\]
where $D$ is the diagonal matrix of eigenvalues of $M$, and the $i^{th}$ column ${\bf u_i}$ of $U$ is the eigenvector corresponding to the entry $d_{i,i}$ of $D$.
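As a quick sanity check, here is a small NumPy sketch of this factorization (the matrix is just a random one, and since `np.linalg.eig` may return complex values for a non-symmetric matrix, we compare with `np.allclose`):

```python
import numpy as np

# A random matrix almost surely has n distinct eigenvalues,
# hence n linearly independent eigenvectors.
M = np.random.rand(4, 4)

eigvals, U = np.linalg.eig(M)  # columns of U are the eigenvectors
D = np.diag(eigvals)           # diagonal matrix of eigenvalues

# Reconstruct M as U D U^{-1}.
print(np.allclose(M, U @ D @ np.linalg.inv(U)))  # True
```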
In order to formulate the spectral clustering algorithm, we need to translate the data into a graph representation. Our aim will be to form the Laplacian matrix of the graph, and then perform spectral clustering on that. To do this we need a few objects: a similarity graph built from the data, its adjacency matrix, its degree matrix, and finally its Laplacian matrix. Each is defined below.
There are a number of meaningful ways to represent a data set as a graph, and the most appropriate one to choose will of course depend on the particular data set and problem being addressed. We’ll outline a few standard methods here to get us started.
In the following three similarity graphs, the set of nodes of the graph is defined to be the set of data points in the data set; denote them by $\{v_1, \dots, v_n\}$.

**$\epsilon$-neighborhood graph.** Here a real number $\epsilon > 0$ is fixed, and any two nodes that are less than $\epsilon$ apart are connected with an edge. This graph is typically considered as an unweighted graph.

**$k$-nearest neighbor graph.** Here we place an edge between $v_i$ and $v_j$ if $v_j$ is among the $k$ nearest neighbors of $v_i$ *or* if $v_i$ is among the $k$ nearest neighbors of $v_j$.

**Mutual $k$-nearest neighbor graph.** Here we place an edge between $v_i$ and $v_j$ if $v_j$ is among the $k$ nearest neighbors of $v_i$ *and* if $v_i$ is among the $k$ nearest neighbors of $v_j$.
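All three graphs can be sketched with scikit-learn's neighbors utilities; the helper calls below are real scikit-learn functions, while the symmetrization steps are my own reading of the definitions above:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

X = np.random.rand(100, 2)  # stand-in data set, one point per row

# epsilon-neighborhood graph: connect points within distance epsilon.
eps_graph = radius_neighbors_graph(X, radius=0.1, mode="connectivity")

# k-nearest neighbor graph: edge if either point is among the other's
# k nearest neighbors, so symmetrize with an elementwise maximum.
knn = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
knn_graph = knn.maximum(knn.T)

# Mutual k-nearest neighbor graph: edge only if *both* points are among
# each other's k nearest neighbors, so use the elementwise minimum.
mutual_graph = knn.minimum(knn.T)
```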
For these definitions, we’ll think in general terms of a graph, and then come back to our special case. So let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, \dots, v_n\}$.
The degree matrix $D$ of a graph $G = (V, E)$ is the $\mid V\mid \times \mid V\mid$ matrix defined by
\[D_{i,j} := \begin{cases} \deg(v_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}\]
where the degree $\deg(v_i)$ of a vertex $v_i \in V$ is the number of edges that terminate at $v_i$.
For a simple graph $G = (V, E)$ (one with no edges starting and ending at the same vertex), the adjacency matrix $A$ is the $\mid V\mid \times \mid V\mid$ matrix defined by
\[A_{i,j} := \begin{cases} 1 & \text{if there is an edge between } v_i \text{ and } v_j \\ 0 & \text{otherwise} \end{cases}\]Notice that since we assumed $G$ was a simple graph, the diagonal entries of $A$ are all zero.
The Laplacian matrix $L$ of a simple graph $G$ is defined as
\[L = D - A.\]Stated differently,
\[L_{i, j} = \begin{cases} \deg(v_i) & \text{if } i = j \\ -1 & \text{if } i \neq j \text{ and there is an edge between } v_i \text{ and } v_j \\ 0 & \text{otherwise} \end{cases}\]
The normalized Laplacian matrix $L_{norm}$ of a simple graph $G$ is defined to be
\[L_{norm} := D^{-1}L = I - D^{-1}A.\]
With these definitions in hand, the spectral clustering algorithm can be stated as follows.

Input: Similarity matrix $S \in \mathbb{R}^{n\times n}$, number $k$ of clusters to construct.

1. Construct one of the similarity graphs described above from $S$, and let $A$ be its adjacency matrix.
2. Compute the normalized Laplacian $L_{norm}$.
3. Compute the eigenvectors ${\bf u_1}, \dots, {\bf u_k}$ of $L_{norm}$ corresponding to the $k$ smallest eigenvalues.
4. Form the matrix $U \in \mathbb{R}^{n\times k}$ whose columns are ${\bf u_1}, \dots, {\bf u_k}$, and regard each row of $U$ as a point in $\mathbb{R}^k$.
5. Cluster the rows of $U$ with $k$-means into clusters $C_1, \dots, C_k$.
6. Assign data point $v_i$ to cluster $j$ exactly when row $i$ of $U$ was assigned to $C_j$.
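Here is a minimal sketch of the whole algorithm in NumPy and scikit-learn. The function name is my own, and it assumes a dense adjacency matrix for a graph in which every node has at least one edge, so that $D$ is invertible:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Cluster the nodes of the graph with adjacency matrix A into k clusters."""
    D = np.diag(A.sum(axis=1))     # degree matrix
    L = D - A                      # graph Laplacian
    L_norm = np.linalg.inv(D) @ L  # normalized Laplacian D^{-1} L

    # L_norm is not symmetric, so use eig; its eigenvalues are still real,
    # so we can safely drop any tiny imaginary parts.
    eigvals, eigvecs = np.linalg.eig(L_norm)
    order = np.argsort(eigvals.real)

    # Columns of U are the k eigenvectors with the smallest eigenvalues.
    U = eigvecs[:, order[:k]].real

    # Cluster the rows of U; row i's label is node i's cluster.
    return KMeans(n_clusters=k).fit_predict(U)
```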
In this section we will use spectral clustering on a manufactured data set from Scikit-learn: the noisy moons.
I chose to generate this data set to have 120 points and a noise factor of 0.05. Below is a representation of the epsilon graph of this data set with $\epsilon = 0.1$.
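Generating the data and its epsilon graph takes only a few lines (the random seed is an arbitrary choice of mine):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import radius_neighbors_graph

# 120 points with a noise factor of 0.05, as described above.
X, _ = make_moons(n_samples=120, noise=0.05, random_state=0)

# epsilon-neighborhood graph with epsilon = 0.1, as a dense adjacency matrix.
A = radius_neighbors_graph(X, radius=0.1, mode="connectivity").toarray()
```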
As we can see from the plot of the data as well as the graph visualization, there are two distinct clusters represented. We then compute the normalized graph Laplacian matrix $L_{norm}$ and find its eigenvalue decomposition; that is, we find an invertible matrix $U$ and a diagonal matrix $D$ where
\[L_{norm} = UDU^{-1}\]Since we are able to see the data, we can “cheat” and say we want to cluster the data into two clusters. In general, it is not possible to look at data and tell so easily how many clusters we should have. But there is a heuristic to help us choose the number of clusters, which we will learn about in the next section.
Since we have already determined that we are looking for two clusters, we take the two columns ${\bf u_1}$ and ${\bf u_2}$ of $U$ corresponding to the two smallest eigenvalues of $L_{norm}$. Then we use $k$-means clustering with $k=2$ to obtain two clusters within the reduced data set $[{\bf u_1}, {\bf u_2}]$. We then “pull back” the cluster labels from $[{\bf u_1}, {\bf u_2}]$ to $U$, to $L_{norm}$, and finally to the data set. Labeling the clusters with different colors, we can see the graph depiction is colored as expected.
And if we plot the original data set with the clusters marked with different colors, we get an expected picture.
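With the `spectral_clustering` sketch from earlier and the `X` and `A` built above, the whole example comes down to a few lines:

```python
import matplotlib.pyplot as plt

# Two clusters, pulled back to the original points.
labels = spectral_clustering(A, k=2)

# Color the original data by cluster label.
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()
```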
With spectral clustering, there is a way to choose the number of clusters $k$ (of course it is not a hard rule, but a heuristic to help inform you). If we let $\lambda_1 \leq \dots \leq \lambda_n$ be the eigenvalues of the Laplacian in increasing order, the goal is to choose $k$ such that $\lambda_1, \dots, \lambda_k$ are relatively small, but $\lambda_{k+1}$ is relatively large. Unfortunately, there does not yet appear to be a simple, single-number metric for deciding how many clusters should be chosen; the eigengap is a guide, not a guarantee.
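In code, the heuristic amounts to sorting the eigenvalues and looking at the gaps between consecutive ones. A small sketch (the cap of ten eigenvalues is an arbitrary choice of mine, not part of the heuristic itself):

```python
import numpy as np

def eigengap_suggestion(L_norm, max_k=10):
    """Suggest k as the position of the largest gap among the smallest eigenvalues."""
    eigvals = np.sort(np.linalg.eigvals(L_norm).real)
    gaps = np.diff(eigvals[:max_k])
    # gaps[i] = lambda_{i+2} - lambda_{i+1} in 1-based indexing,
    # so the largest gap after lambda_k corresponds to k = argmax + 1.
    return int(np.argmax(gaps)) + 1
```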
In this example we look at breast cancer proteome data. This data can be found on Kaggle, and contains the proteome profiles for 77 breast cancer samples. After some preprocessing, a sample of the data looks like this:
Sample | NP_057427 | NP_002408 | NP_000415 | NP_000413 | NP_000517 | NP_004439 | NP_005219 | NP_058519 | NP_058518 | NP_001116539 | NP_061155 | NP_001035932 | NP_077006 | NP_000917 | NP_065178 | NP_006836 | NP_006614 | NP_001784 | NP_006092 | NP_001153651 | NP_001159403 | NP_000116 | NP_004314 | NP_060601 | NP_005931 | NP_003003 | NP_113611 | NP_002002 | NP_004487 | NP_008950 | NP_114172 | NP_001062 | NP_001444 | NP_057547 | NP_054895 | NP_001246 | NP_055606 | NP_036451 | NP_000624 | NP_569082 | NP_001159 | NP_001229 | NP_002458 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-A2-A0CM | 2.160157 | 2.623021 | 4.768355 | 0.639321 | 4.933663 | -4.419112 | -0.271711 | -6.013418 | -6.013418 | -6.318320 | 2.277710 | -4.029719 | -4.029719 | -6.509343 | 1.906685 | 2.215260 | -1.601524 | 1.631171 | 2.013217 | -6.395464 | 0.584219 | -4.937078 | -0.433346 | 3.126292 | -1.182743 | 0.286664 | 0.980958 | 1.509945 | -4.433806 | 2.549550 | 2.733226 | 0.896468 | 1.576068 | -1.292949 | 3.541400 | 3.177722 | 0.001309 | -1.792547 | -0.830502 | -0.538113 | 2.516489 | 2.556897 | -2.035927 |
TCGA-A2-A0D2 | 2.249702 | 3.576941 | 2.169868 | 2.968207 | 0.543251 | -5.421010 | -1.206443 | -5.297932 | -5.277974 | -5.311238 | 2.934943 | -5.567372 | -5.567372 | -5.287953 | 1.198555 | 3.264258 | 2.449287 | 2.332862 | 3.606879 | -7.124134 | -0.853843 | -4.153646 | -2.417258 | -0.604362 | 2.545753 | 7.053044 | 2.921637 | 3.743262 | -6.269245 | 2.376105 | 2.781928 | 6.836827 | 2.898353 | -3.694601 | 2.495856 | 2.722053 | 0.373604 | -1.342826 | -4.183584 | -2.889608 | 3.487128 | 1.029442 | -0.714133 |
TCGA-A2-A0EQ | -0.020957 | 1.884936 | -7.407249 | -7.675146 | -5.187535 | -2.795601 | 7.158672 | -9.114133 | -8.762041 | -9.573385 | -0.859091 | -1.241800 | -1.241800 | -7.047502 | 1.961478 | 0.863101 | 1.842838 | -0.369223 | 0.266075 | -3.055843 | -0.836128 | -6.308873 | -0.686872 | 0.162743 | -2.791774 | -5.554936 | 0.047930 | 3.388984 | 0.063239 | 0.545453 | -0.273546 | 1.460128 | -0.174083 | -1.410193 | 0.702364 | -1.402538 | 0.001309 | -0.210764 | 1.934688 | -0.538113 | 0.798041 | 2.003576 | -2.035927 |
TCGA-A2-A0EV | -1.364604 | -2.246793 | -3.750716 | -3.882344 | -2.252395 | -3.252209 | -1.574649 | -2.190781 | -2.871327 | -2.190781 | -4.632905 | -2.608071 | -2.608071 | -1.266583 | 1.158737 | -2.039549 | -1.160160 | -1.216172 | -1.185366 | 0.466989 | -1.924724 | 0.649028 | -0.488016 | -6.134027 | 0.534203 | 1.080321 | -0.709263 | -1.199369 | -1.723081 | -2.179579 | -3.311022 | 0.139319 | 0.733046 | 0.018893 | -1.574649 | -4.515280 | 0.001309 | -0.210764 | 2.049328 | -0.538113 | -0.266769 | -3.201798 | -7.724769 |
TCGA-A2-A0EX | -2.506723 | -2.953194 | -0.803378 | -2.315378 | -0.098028 | -1.643795 | -1.212331 | 4.186597 | 3.976493 | 3.942726 | -2.521730 | 0.427232 | 0.415977 | 1.545287 | 1.725376 | -1.812629 | -2.476708 | -0.923438 | -1.966455 | -2.037740 | -1.973959 | -2.409174 | 0.115828 | -0.882167 | 0.419729 | 1.031282 | -1.666306 | -0.083021 | -0.281869 | -3.752341 | -3.317125 | -2.769353 | -0.174083 | -0.822137 | -2.938187 | -3.395914 | -1.827636 | 0.082061 | 0.044543 | -2.079011 | -3.046991 | 2.554537 | -0.443199 |
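The preprocessing itself is not the focus here, but a loading sketch might look like the following. The file name and cleaning steps are assumptions on my part; the linked code at the end of the post has the real version:

```python
import pandas as pd

# Hypothetical file name for the Kaggle breast cancer proteomes download.
raw = pd.read_csv("77_cancer_proteomes_CPTAC_itraq.csv", index_col=0)

# Assumed cleaning: transpose so each row is a sample (TCGA ID) and each
# column a protein, then drop proteins with missing values.
data = raw.transpose().dropna(axis=1)
X = data.values  # feature matrix for building the similarity graphs
```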
We will go through this example twice: once forming the epsilon graph from the data and proceeding with spectral clustering, and once forming the mutual $k$-nearest neighbors graph instead.
Taking $\epsilon = 0.37$, we can form the epsilon graph from the data, and get the following graph:
Next, we perform the eigenvalue decomposition of the normalized Laplacian, $L_{norm} = UDU^{-1}$. We then plot the eigenvalues of $L_{norm}$ in increasing order:
Using the eigengap heuristic, we wish to choose an $n$ such that the first $n$ eigenvalues are relatively small, and $\mid \lambda_{n+1} - \lambda_n\mid$ is relatively large. It’s now up to our discretion what to choose, but looking at the eigenvalue plot as well as the network graph, I think it is reasonable to choose $n=4$. We then proceed with the spectral clustering algorithm: we choose the four eigenvectors in $U$ corresponding to the four smallest eigenvalues, cluster on the lower-dimensional data, and assign clusters to the original data set accordingly. We show the graph representation of the data with clusters colored:
This is a fairly decent clustering, although the graph is a little “clumpy”: there are two main tightly connected clusters and a lot of nodes that are more loosely connected.
Next, we look at the same proteome data set, but this time we will form the mutual $k$-nearest neighbors graph during the spectral clustering process. With a little experimentation, we can find a good number of neighbors; here I chose 9 neighbors for the following graph. Already, we can see that this graph does not have the same “clumpy” look as the epsilon graph, and looks to have better clusters.
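Concretely, the mutual graph can be built the same way as in the earlier sketch, now with 9 neighbors (again assuming `X` is the preprocessed feature matrix):

```python
from sklearn.neighbors import kneighbors_graph

# Directed kNN graph: row i marks the 9 nearest neighbors of sample i.
knn = kneighbors_graph(X, n_neighbors=9, mode="connectivity")

# Keep an edge only when both directions are present (mutual neighbors).
A_mutual = knn.minimum(knn.T).toarray()
```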
We perform the eigenvalue decomposition on the normalized Laplacian $L_{norm}$ and again plot the eigenvalues:
Using the eigengap heuristic, let’s choose 5 clusters to cluster the data into. Then we proceed with the spectral clustering algorithm, and display the graph with the clusters colored.
These clusters look much better than those from the epsilon graph.
We have seen how spectral clustering works, how it differs from other methods like $k$-means, and how the eigengap heuristic can be used to choose the number of clusters. The code used to generate the examples in this post can be found here.