Chapter 8 Clustering
Grouping objects into classes, or classification, is a basic cognitive ability and a fundamental scientific methodology and procedure. The definitions had different historical perspectives:
- Hastie et al. (2001): objects in a cluster “are more closely related to oneanother than objects assigned to different clusters.”
- Han & Kamber (2006): “A cluster is a collection of data objects that aresimilar to one another within the same cluster and are dissimilar to theobjects in other clusters.”
8.0.1 Clustering
The clustering problem: given a data set, divide it into groups such that the homogeneity of each group is maximized or the separation between groups is maximized, or both are maximized.
Homogeneity and separation can be formalized in different ways:
- Similarity or dissimilarity measure
- A domain-dependent real function of object pairs measuring their similarity or dissimilarity is assumed
- Similar objects are homogeneous, dissimilar ones are separated
- Density estimation
- A statistical estimate of the probability density function that generated the data is computed
- Homogeneity is high in regions where the estimated p.d.f. is large
- Frequency and expectation
- The number of objects in space volumes
- Homogeneity is high in the collection of space volumes where the number exceeds its expectation
8.0.2 Type of Clustering Models
Clustering models can be categorized according to nesting and coverageproperties:
- Hierarchical or flat
- Hierarchical: Clusters may be subdivided into smaller, contained sub-clusters, forming a hierarchy.
- Exclusive, overlapping or fuzzy
- Exclusive: Any object is an element of exactly one cluster.
- Overlapping: Any object may be an element of more than one cluster.
- Fuzzy: For each cluster, a membership function on objects onto [0,1] is defined.
- Complete or partial
- Complete: Any object is an element of some cluster
- Partial: An object may belong to no cluster; objects not belonging to any cluster can be deemed as noise or outliers.