Yinsen Miao Tech Blogs

Local Density versus k-means Performance on KDD 99 dataset

I compared two clustering methods: the local density approach from a recent Science paper, Clustering by fast search and find of density peaks, and k-means. Based on visual inspection and the summaries of the clustered results, the local density method wins. However, in terms of computational complexity and memory usage, k-means handles large data sets better than the local density method.

Cluster by local density

Rodriguez and Laio hypothesized that cluster centers are local density maxima that lie far away from any point of higher density. They defined two metrics, the local density $\rho_i$ and the distance $\delta_i$ to the nearest point of higher density, to measure these two attributes, and picked the observations with both large $\rho$ and large $\delta$ as cluster centers. Please refer to the paper or my previous summary post for details.
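
A minimal MATLAB sketch of the two quantities, assuming D is the N-by-N pairwise distance matrix and dc is a user-chosen cutoff distance (both names are mine, not from the paper's code):

N   = size(D, 1);
rho = sum(D < dc, 2) - 1;               % local density: number of neighbors within dc (minus self)
[~, ord] = sort(rho, 'descend');        % visit points from the highest density down
delta = zeros(N, 1);
delta(ord(1)) = max(D(ord(1), :));      % densest point: use its largest distance as delta
for k = 2:N
    i = ord(k);
    delta(i) = min(D(i, ord(1:k-1)));   % distance to the nearest point of higher density
end
% Candidate cluster centers are the points with both large rho and large delta.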

K-means

K-means randomly initializes $K$ observations as the cluster centers. It then iterates the assignment and update steps below, stopping when the assignments no longer change or the iteration count exceeds a predefined threshold.

| Variable | Definition | Type |
|---|---|---|
| $x_i$ | $i$-th observation vector | Feature vector |
| $S_j^{(t)}$ | Set of observations cluster $j$ contains at the $t$-th iteration | Set |
| $m_j^{(t)}$ | $j$-th cluster center at the $t$-th iteration | Feature vector |

Initial step: Randomly pick $K$ observations as cluster centers (or cluster means) $m_1^{(1)}, \ldots, m_K^{(1)}$. A cluster mean is defined as the average of all observations in that cluster:

$$m_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$$

where $|S_j|$ counts the number of observations in set $S_j$.

Assignment step:

$$S_j^{(t)} = \left\{ x_i : \left\| x_i - m_j^{(t)} \right\|^2 \le \left\| x_i - m_l^{(t)} \right\|^2 \;\; \forall\, 1 \le l \le K \right\}$$

where $S_j^{(t)}$ is the set of observations cluster $j$ contains at the $t$-th iteration. The formula says to assign each observation $x_i$ to its nearest cluster center $m_j^{(t)}$.

Update step: Recompute the cluster means (i.e. update the cluster centers) after the assignment step:

$$m_j^{(t+1)} = \frac{1}{\left| S_j^{(t)} \right|} \sum_{x_i \in S_j^{(t)}} x_i$$

The algorithm can only converge to a local minimum, since the optimization problem k-means tries to solve is NP-hard: k-means looks for the sets $S = \{S_1, \ldots, S_K\}$ that minimize the within-cluster sum of squares. However, we can run k-means many times and pick the best run, i.e. the one that minimizes

$$\sum_{j=1}^{K} \sum_{x_i \in S_j} \left\| x_i - m_j \right\|^2.$$
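
A minimal sketch of the plain algorithm in MATLAB, assuming X is an n-by-p data matrix and K is the number of clusters (variable names are mine):

n = size(X, 1);
maxIter = 100;                                      % predefined iteration threshold
centers = X(randsample(n, K), :);                   % initial step: pick K random observations
for t = 1:maxIter
    [~, assign] = min(pdist2(X, centers), [], 2);   % assignment step: nearest center
    newCenters  = centers;
    for j = 1:K
        members = X(assign == j, :);
        if ~isempty(members)
            newCenters(j, :) = mean(members, 1);    % update step: recompute the cluster mean
        end
    end
    if isequal(newCenters, centers), break; end     % stop when the centers no longer move
    centers = newCenters;
end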



Comparison on KDD99

KDD99 is a network security dataset. I used the 10 percent training subset, which contains 494,021 observations with 41 features. I disregarded all categorical variables, since they would cause trouble in the data visualization that follows.

For the local density method, I slightly revised the MATLAB code provided with the paper, since I didn't want to calculate the cluster halo. (I think the cluster halo, or the outliers, are the controversial points between clusters that could be assigned to two or more clusters; in the original code, a point is assigned to the halo if its density is below the average density of its cluster center's neighborhood.) Because the method needs the pairwise distances between all observations, it has to store at least $N(N-1)/2$ numbers for the distance matrix $D$. However, MATLAB's maximum memory for storing a single array is 677 MB on a 32-bit system, so I had to fall back on randomly sampling only 10K observations from that subset; my comparison is based on a sample drawn with seed 2014. You can set the seed and reproduce my result exactly via:

rng(2014);                           % fix the random seed for reproducibility
idx = randsample(size(M,1), 10000);  % sample 10K row indices from the data matrix M
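
The pairwise distances for the 10K sample can then be built along these lines (Xs is my name for the sampled feature matrix):

Xs = M(idx, :);               % keep only the sampled rows of the numeric feature matrix
D  = squareform(pdist(Xs));   % 10000-by-10000 Euclidean distance matrix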

For the k-means part, I used MATLAB's built-in kmeans function and ran it with 1000 replicates to select the best solution.
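
Roughly, the call looks like the sketch below, assuming Xs is the sampled feature matrix from above; 'Replicates' makes kmeans rerun with fresh random starts and keep the solution with the lowest within-cluster sum of squares.

opts = statset('MaxIter', 500);                                    % allow enough iterations per run
[labels, centers] = kmeans(Xs, 5, 'Replicates', 1000, 'Options', opts);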

Data Visualization

I used PCoA (principal coordinate analysis) to reconstruct the spatial relationship between points from their distance matrix $D$. In MATLAB you can run PCoA with cmdscale: pass it the distance matrix and it returns a reduced-dimension configuration matrix $Y$ together with the associated eigenvalues. If the first two eigenvalues are considerably larger than the rest, we can safely preserve most of the information for visualization using the first two columns of $Y$ while discarding the minor ones. You can also use mdscale, which performs nonmetric multidimensional scaling on a dissimilarity matrix (a distance matrix is one type of dissimilarity matrix).

>> [Y,eigvals] = cmdscale(D);
>> eigvals(1:4)

ans =

   1.0e+03 *

    1.9623
    1.3507
    0.2575
    0.1995

Since $\lambda_1$ and $\lambda_2$ are much larger than the rest, we can visualize the data using the first two columns of $Y$; a plotting sketch follows the ground-truth table below. All data points are colored by their ground truth, and there are five categories, summarized in the table below.

The ground truth

| Code | Type of attack | Number | Color |
|---|---|---|---|
| 1 | normal | 1984 | Blue |
| 2 | probe | 73 | Green |
| 3 | denial of service (DOS) | 7941 | Yellow |
| 4 | user-to-root (U2R) | 0 | NA |
| 5 | remote-to-local (R2L) | 2 | Red |
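
The 2-D plot itself can be produced with something like this sketch, where truth is an assumed vector holding the ground-truth codes (1 to 5) of the 10K sampled observations:

scatter(Y(:, 1), Y(:, 2), 8, truth, 'filled');   % color each point by its ground-truth code
xlabel('First principal coordinate');
ylabel('Second principal coordinate');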

Cluster Result 1

Local Density

LD Performance

Columns C1, C2, C3, and C5 count how many observations of each ground-truth class (codes from the table above) fall into each cluster; class 4 (U2R) has no observations in the sample.

| Cluster | Color | Elements | C1 | C2 | C3 | C5 | Purity | MDtC |
|---|---|---|---|---|---|---|---|---|
| 1 | Blue | 6107 | 289 | 1 | 5817 | 0 | 0.9525 | 2.75 |
| 2 | Yellow | 2230 | 58 | 48 | 2124 | 0 | 0.9525 | 3.24 |
| 3 | Red | 1663 | 1637 | 24 | 0 | 2 | 0.9844 | 5.11 |
| Total |  | 10K |  |  |  |  |  |  |

K-means: K = 5

K-means K = 5 Performance

| Cluster | Color | Elements | C1 | C2 | C3 | C5 | Purity | MDtC |
|---|---|---|---|---|---|---|---|---|
| 1 | Blue | 1200 | 2 | 29 | 1169 | 0 | 0.9742 | 1.76 |
| 2 | Green | 561 | 537 | 24 | 0 | 0 | 0.9572 | 4.46 |
| 3 | Yellow | 6105 | 287 | 1 | 5817 | 0 | 0.9528 | 3.03 |
| 4 | Orange | 1099 | 1097 | 0 | 0 | 2 | 0.9982 | 3.77 |
| 5 | Red | 1035 | 61 | 19 | 955 | 0 | 0.9227 | 3.81 |
| Total |  | 10K |  |  |  |  |  |  |

Cluster Result 2

K-means: K = 3

K-means K = 3 Performance

| Cluster | Color | Elements | C1 | C2 | C3 | C5 | Purity | MDtC |
|---|---|---|---|---|---|---|---|---|
| 1 | Blue | 6117 | 294 | 6 | 5817 | 0 | 0.9509 | 3.69 |
| 2 | Yellow | 1543 | 1522 | 19 | 0 | 2 | 0.9864 | 4.73 |
| 3 | Red | 2340 | 168 | 48 | 2124 | 0 | 0.9077 | 3.84 |
| Total |  | 10K |  |  |  |  |  |  |

Performance Comparison

Unsupervised learning is evaluated by how well it groups the ground-truth labels into clusters. We can define the purity of a cluster $S_j$ as the fraction of its members that belong to its dominant ground-truth class, $\mathrm{Purity}(S_j) = \max_k |S_j \cap C_k| / |S_j|$, where $C_k$ is the set of observations with ground-truth code $k$. As we can see from the Purity columns above, clustering by local density beats k-means by a large margin.
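
A small MATLAB sketch of this computation, assuming assign holds the cluster labels and truth holds the ground-truth codes for the same 10K observations (both names are mine):

K = max(assign);
purity = zeros(K, 1);
for j = 1:K
    members   = truth(assign == j);           % ground-truth codes of the points in cluster j
    counts    = histc(members, 1:5);          % how many of each ground-truth class
    purity(j) = max(counts) / numel(members); % fraction belonging to the dominant class
end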

It is somewhat difficult to calculate attack-type-specific precision, such as the DoS or probe detection rate. Since unsupervised learning lacks the label information, its performance on the multi-class problem should be considerably worse than that of supervised methods such as random forests or SVMs. However, it is more essential to separate attacks from normal traffic than to further identify the attack type. Life would be a lot easier if we could group all the different attacks into one class and treat the problem above as binary classification.

After inspecting the ground-truth visualization, we can easily see that the normal traffic (blue points) is more spread out than the attacks. We can measure the level of spread by the maximum sum of distances of all points to their group's center, defined as

$$D_g = \max_{d} \sum_{x_i \in S_g} \left| x_{i,d} - m_{g,d} \right|$$

where $x_{i,d}$ denotes the $d$-th dimension of vector $x_i$, $m_g$ is the center of group $g$, and $D_g$ is group $g$'s maximum sum of distances. Although $D_g$ might sound less straightforward than the sum of dimension-wise variances, I think it is computationally cheaper.
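
Under the definition above (my reading of it: the maximum, over dimensions, of the summed absolute deviations from the group center), the measure can be sketched as follows, with Xs, assign, and K as in the earlier snippets:

Dg = zeros(K, 1);
for g = 1:K
    members = Xs(assign == g, :);
    center  = mean(members, 1);                      % group center
    dev     = abs(bsxfun(@minus, members, center));  % per-dimension deviations from the center
    Dg(g)   = max(sum(dev, 1));                      % max over dimensions of the summed deviation
end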

Anyway, after labeling the group with the maximum $D_g$ as the normal cluster, we can calculate the performance table below (a short computation sketch follows it):

| Method | Precision | False Positive Rate |
|---|---|---|
| Local Density | 0.99676 | 0.17490 |
| K-means K = 5 | 0.99701 | 0.72933 |
| K-means K = 3 | 0.99738 | 0.23286 |
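
The two numbers can be obtained with a short sketch like the one below, assuming isAttackPred marks the points outside the cluster labeled as normal and isAttackTrue marks the ground-truth attacks (both are assumed logical vectors):

TP = sum( isAttackPred &  isAttackTrue);   % attacks correctly flagged
FP = sum( isAttackPred & ~isAttackTrue);   % normal traffic flagged as attack
FN = sum(~isAttackPred &  isAttackTrue);   % attacks missed
TN = sum(~isAttackPred & ~isAttackTrue);   % normal traffic kept as normal
precision = TP / (TP + FP);
fpr       = FP / (FP + TN);                % false positive rate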

In terms of false positive rate, the local density method wins. But the false positive rate is so high that the result is unfortunately of little practical use.

Revised OMP

Use the recovered coefficients' weights as votes for the labels, and break the loop if the residual's L2 norm falls below some threshold.

(Figures: the ground truth, iteration snapshots, and comparison plots.)
