Previously published on Today Software Magazine.
Clustering is a means of analyzing data obtained by measurements. It allows us to group data into classes and use the obtained classes as a basis for machine learning, for faster analysis of measurements, or for approximating future measurement values by extrapolation. In the following sections we will cover how to cluster data. The technique is especially useful when dealing with large amounts of data, a scenario that is not uncommon given the explosion of data and information we deal with nowadays.
Clustering is a process that examines a collection of “points” and groups those points into “clusters” according to some distance measure. The main goal of clustering is to achieve a state where points in the same cluster are at a small distance from one another, while points in different clusters are at a large distance from one another. What counts as “large” and “small” depends on the domain in which clustering is applied.
An example of clustering in a bi-dimensional space can be seen in the following image:
Modern clustering problems, however, often involve Euclidean spaces of very high dimension, or, more interestingly, spaces that are not Euclidean at all, which makes the distance measure far less intuitive.
A real world scenario for clustering might be the need to cluster documents by their topic based on the occurrence of common, unusual words in the documents or the requirement to cluster moviegoers by the type(s) of movies they like in the context of various business processes.
The concept of distance is a measure described by a few basic properties. A distance measure is always non-negative (only the distance between a point and itself is zero), it is symmetric (the order in which you consider the points does not matter when computing their distance), and it must obey the triangle inequality (the distance from X to Z via Y is never less than the distance from X to Z directly).
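These three axioms can be verified directly for the ordinary Euclidean distance. The sketch below checks them on a few hand-picked points (the points themselves are arbitrary illustrations):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(x, x) == 0.0                                 # distance to itself is zero
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
```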
There are two types of clustering strategies available in the state of the art: hierarchical algorithms and point-assignment algorithms.
Hierarchical algorithms start with each point in its own cluster, combine clusters based on various definitions of “closeness”, and stop when further combination would lead to undesirable clusters (for example, when we have reached a predetermined number of clusters for our domain, or when a resulting cluster would have points spread out over too large a region).
In point-assignment algorithms points are considered in some order and each one is assigned to the cluster into which it fits best. It is usually preceded by a short phase in which initial clusters are estimated. Variations of these algorithms occasionally combine or split clusters or allow points to be unassigned if they are too far from any of the current clusters in order to reduce noise.
Curse of Dimensionality
When dealing with large amounts of data, the so-called curse of dimensionality appears and needs to be considered during processing. This curse refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance: in order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. High-dimensional Euclidean spaces (and non-Euclidean spaces as well) have a number of unintuitive properties, such as almost all pairs of points being equally far away from one another, or almost any two vectors being orthogonal.
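The “almost all pairs of points are equally far away” effect can be observed empirically. The small experiment below (a sketch with arbitrarily chosen sample sizes) measures the ratio between the largest and smallest pairwise distance among random points in a unit hypercube; as the dimension grows, that ratio shrinks toward 1, meaning the distances concentrate:

```python
import math
import random

def distance_spread(dim, n_points=50, seed=0):
    """Ratio of max to min pairwise Euclidean distance for random points
    in the unit hypercube [0, 1]^dim. A ratio near 1 means all pairs of
    points are nearly equidistant."""
    rnd = random.Random(seed)
    pts = [[rnd.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return max(dists) / min(dists)

# In 2 dimensions the spread is large; in 1000 dimensions it is close to 1.
print(distance_spread(2), distance_spread(1000))
```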
For Euclidean spaces, hierarchical clustering begins with every point in its own cluster, and larger clusters are constructed by combining two smaller clusters. In this scenario we have to decide in advance how the clusters will be represented, how we will choose which two clusters to merge, and when we will stop combining clusters. The basic algorithm is illustrated in the figure below:
WHILE it is not time to stop DO
    pick the best two clusters to merge;
    combine those two clusters into one cluster;
END;
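The pseudocode above can be sketched in Python for the Euclidean case, using centroid distance as the definition of the “best” pair to merge and a target number of clusters as the stopping rule (both are just one choice among the options discussed above):

```python
import math

def hierarchical_cluster(points, k):
    """Merge the two closest clusters (by centroid distance) until only
    k clusters remain. A minimal sketch for points given as tuples."""
    clusters = [[p] for p in points]          # start: each point in its own cluster

    def centroid(cluster):
        return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

    while len(clusters) > k:                  # "it is not time to stop"
        # pick the best (closest) two clusters to merge
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        clusters[i] += clusters.pop(j)        # combine those two clusters into one
    return clusters
```

For example, `hierarchical_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` groups the two low points together and the two high points together.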
For non-Euclidean spaces we need a distance measure computed from the points themselves, such as the Jaccard, cosine, or edit distance. One restriction in this scenario is that we cannot base distances on the “location” of points. Another is that we cannot represent a cluster by its centroid; instead, we have to pick one of the points of the cluster as a representative, ideally a point close to all the other points of the cluster, so that in some sense it lies in the “center”. Such a point is called a clustroid and can be chosen by one of the following criteria:
- Minimize the sum of the distances to the other points in the cluster
- Minimize the maximum distance to another point in the cluster
- Minimize the sum of the squares of the distances to the other points in the cluster
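As an illustration, the first criterion can be sketched for sets compared with the Jaccard distance (the example cluster of word sets is invented for demonstration):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    return 1.0 - len(a & b) / len(a | b)

def clustroid(cluster, distance):
    """Clustroid chosen by minimizing the sum of the distances
    to the other points in the cluster (the first criterion above)."""
    return min(cluster, key=lambda p: sum(distance(p, q) for q in cluster))

# Toy cluster of "documents" represented as sets of words:
docs = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
print(clustroid(docs, jaccard_distance))
```

Swapping `sum(...)` for `max(...)` or `sum(distance(p, q) ** 2 ...)` gives the other two criteria.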
We will only present two point-assignment clustering algorithms in the following section but there can be variations depending on various domain requirements.
The K-means algorithm assumes a Euclidean space and also that the number of clusters, k, is known in advance. It is possible to deduce the value of k by trial and error or other methods. The basic algorithm is illustrated in the figure below:
Initially choose k points that are likely to be in different clusters;
Make these points the centroids of their clusters;
FOR each remaining point p DO
    find the centroid to which p is closest;
    Add p to the cluster of that centroid;
    Adjust the centroid of that cluster to account for p;
END;
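A common variant of the pseudocode above reassigns all points and recomputes the centroids over several passes. The sketch below follows that variant, choosing the initial centroids at random for simplicity (real implementations try harder to pick points likely to lie in different clusters):

```python
import math
import random

def k_means(points, k, iterations=10, seed=0):
    """Basic k-means for Euclidean points given as tuples: repeatedly assign
    each point to its closest centroid, then move each centroid to the mean
    of its cluster. A sketch, not an optimized implementation."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)         # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # find the centroid to which p is closest and assign p to it
            closest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[closest].append(p)
        # adjust each centroid to account for its assigned points
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

Running it on two well-separated groups of points recovers the groups regardless of which initial points happen to be sampled.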
A variation of this algorithm is the BFR (Bradley, Fayyad, and Reina) algorithm, which enables us to run k-means on data that is too large to fit in main memory. It assumes that clusters are normally distributed about a centroid, i.e., the axes of the cluster align with the axes of the space, an assumption illustrated in the following figure:
The BFR algorithm begins by selecting k points. The points of the data file are read in chunks, and each chunk must consist of few enough points that they can be processed in main memory. The algorithm stores in main memory summaries of the k clusters and some other helper data, but not the data being processed.
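The per-cluster summary BFR keeps in memory is the count of points N together with the per-dimension sums SUM and sums of squares SUMSQ, from which the centroid and the variance can be derived without retaining the points themselves. A minimal sketch of such a summary:

```python
class ClusterSummary:
    """BFR-style cluster summary: the count N plus per-dimension
    SUM and SUMSQ. Centroid and variance are derived on demand,
    so the raw points never need to stay in memory."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        """Fold one point from the current chunk into the summary."""
        self.n += 1
        for i, v in enumerate(point):
            self.sum[i] += v
            self.sumsq[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # per-dimension variance: E[X^2] - E[X]^2
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]
```

Merging two clusters is then just adding their N, SUM, and SUMSQ component-wise, which is what makes chunked processing cheap.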
The CURE (Clustering Using REpresentatives) algorithm is used for large-scale clustering and does not assume anything about the shape of the clusters. Instead of representing clusters by their centroid, it uses a collection of representative points that are as far away from each other as possible and that define the borders of the cluster. The algorithm then assigns new points to clusters based on the closest representative point. The clustering outcome is illustrated in the figure below, and it is clearly a different form of representation from the outputs of the previous algorithms:
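The assignment step described above, putting each new point into the cluster whose nearest representative is closest, can be sketched as follows (the representative sets here are invented for illustration; selecting the representatives themselves is a separate step of CURE not shown):

```python
import math

def assign_to_cluster(point, representatives):
    """CURE-style assignment sketch: `representatives` is a list of clusters,
    each given as a list of its representative points. Returns the index of
    the cluster whose closest representative is nearest to `point`."""
    def nearest_rep_distance(reps):
        return min(math.dist(point, r) for r in reps)
    return min(range(len(representatives)), key=lambda i: nearest_rep_distance(representatives[i]))

# Two toy clusters described only by their border representatives:
clusters = [[(0, 0), (0, 5)], [(10, 0), (10, 5)]]
print(assign_to_cluster((1, 4), clusters))
```

Because the decision uses border representatives rather than a centroid, elongated or irregular clusters claim the points that lie near their edges.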
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense or another, to each other than to those in other clusters. It is commonly used in exploratory data mining and statistical data analysis, in areas such as machine learning.
Cluster analysis is not defined by one specific algorithm but by the general task to be solved, which can be achieved by various algorithms that differ in their definitions of what constitutes a cluster and of how to find clusters optimally.
References
“Mining of Massive Datasets”, Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman, Stanford University
S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases”, Proc. ACM SIGMOD Intl. Conf. on Management of Data