silhouette {cluster} | R Documentation |
Compute silhouette information according to a given clustering in k clusters.
silhouette(x, ...)

## Default S3 method:
silhouette(x, dist, dmatrix, ...)

## S3 method for class 'partition':
silhouette(x, ...)

sortSilhouette(object, ...)

## S3 method for class 'silhouette':
summary(object, FUN = mean, ...)

## S3 method for class 'silhouette':
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL,
     xlab = expression("Silhouette width "* s[i]),
     col = "gray", do.col.sort = length(col) > 1,
     border = 0, cex.names = par("cex.axis"),
     do.n.k = TRUE, do.clus.stat = TRUE, ...)
x |
an object of appropriate class; for the default
method an integer vector with k different integer cluster
codes or a list with such an x$clustering
component. Note that silhouette statistics are only defined if
2 <= k <= n-1. |
dist |
a dissimilarity object inheriting from class
dist or coercible to one. If not specified,
dmatrix must be. |
dmatrix |
a symmetric dissimilarity matrix (n * n),
specified instead of dist, which can be more efficient. |
object |
an object of class silhouette. |
... |
further arguments passed to and from methods. |
FUN |
function used to summarize silhouette widths. |
nmax.lab |
integer giving the number of labels that is considered too large for labeling each observation by name in the silhouette plot. |
max.strlen |
positive integer giving the length to which strings are truncated in silhouette plot labeling. |
main, sub, xlab |
arguments to title(); they have
sensible non-NULL defaults here. |
col, border, cex.names |
arguments passed to
barplot(); note that the default used to be
col = heat.colors(n), border = par("fg") instead.
col can also be a color vector of length k for
clusterwise coloring, see also do.col.sort. |
do.col.sort |
logical indicating if the colors col should
be sorted “along” the silhouette; this is useful for casewise or
clusterwise coloring. |
do.n.k |
logical indicating if n and k “title text” should be written. |
do.clus.stat |
logical indicating if cluster sizes and average silhouette widths should be written to the right of the silhouettes. |
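The two ways of supplying dissimilarities to the default method (dist or dmatrix) can be compared with a short sketch. The toy data, the seed and the object names (x, cl, d, si1, si2) are assumptions made up for illustration; only silhouette() and its dist and dmatrix arguments come from this page.

```r
library(cluster)   # provides silhouette()

set.seed(1)        # toy data, purely illustrative
x  <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
            matrix(rnorm(20, mean = 4), ncol = 2))
cl <- rep(1:2, each = 10)                      # integer cluster codes, k = 2

d   <- dist(x)                                 # a "dist" object
si1 <- silhouette(cl, dist = d)                # dissimilarities via 'dist'
si2 <- silhouette(cl, dmatrix = as.matrix(d))  # full n x n matrix via 'dmatrix'

# Both routes describe the same dissimilarities, so the widths should agree
all.equal(as.vector(si1[, "sil_width"]), as.vector(si2[, "sil_width"]))
```

Passing dmatrix is mainly of interest when the full matrix already exists, since no conversion from a dist object is needed in that case.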
For each observation i, the silhouette width s(i) is
defined as follows:
Put a(i) = average dissimilarity between i and all other points of the
cluster to which i belongs (if i is the only observation in
its cluster, s(i) := 0 without further calculations).
For all other clusters C, put d(i,C) = average
dissimilarity of i to all observations of C. The smallest of these
d(i,C) is b(i) := min_C d(i,C),
and can be seen as the dissimilarity between i and its “neighbor”
cluster, i.e., the nearest one to which it does not belong.
Finally,
s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).
Observations with a large s(i) (close to 1) are very well clustered; a small s(i) (around 0) means that the observation lies between two clusters; and observations with a negative s(i) are probably placed in the wrong cluster.
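The definition above can be traced step by step in base R. The following is only a sketch under stated assumptions: manual_silhouette() is a hypothetical helper written for this illustration (it is not part of the cluster package and is not how silhouette() is implemented), and the toy data are made up.

```r
# Hypothetical helper: silhouette widths computed directly from the
# definition of a(i), d(i,C), b(i) and s(i), using base R only.
manual_silhouette <- function(d, cl) {
  d   <- as.matrix(d)              # full n x n dissimilarity matrix
  n   <- nrow(d)
  ids <- sort(unique(cl))          # the k cluster codes
  s   <- numeric(n)
  for (i in seq_len(n)) {
    own <- setdiff(which(cl == cl[i]), i)  # other members of i's cluster
    if (length(own) == 0) {                # i is alone: s(i) := 0
      s[i] <- 0
      next
    }
    a <- mean(d[i, own])                   # a(i)
    # d(i, C) for every other cluster C; b(i) is the smallest of them
    b <- min(vapply(setdiff(ids, cl[i]),
                    function(C) mean(d[i, cl == C]), numeric(1)))
    s[i] <- (b - a) / max(a, b)            # s(i)
  }
  s
}

# Toy data on a line: point 5 (value 7) is assigned to cluster 1
# although it lies closer to the members of cluster 2.
x  <- c(0, 1, 2, 10, 7, 11, 12)
cl <- c(1, 1, 1, 2, 1, 2, 2)
s  <- manual_silhouette(dist(x), cl)
round(s, 3)   # the width of point 5 is negative: a(5) = 6, b(5) = 4
```

For point 5, a(5) = 6 and b(5) = 4, so s(5) = (4 - 6) / 6 = -1/3, which matches the reading that a negative width indicates a probable misassignment.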
silhouette() returns an object, sil, of class silhouette which is an
[n x 3] matrix with attributes. For each observation i, sil[i,]
contains the cluster to which i belongs as well as the neighbor
cluster of i (the cluster, not containing i, for which the average
dissimilarity between its observations and i is minimal), and the
silhouette width s(i) of the observation. The colnames correspondingly
are c("cluster", "neighbor", "sil_width").
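As a sketch of how the returned matrix can be inspected, the following reuses the ruspini data and the "canberra" dissimilarity from the examples on this page; the object names pr4, sil and bad are chosen here for illustration.

```r
library(cluster)
data(ruspini)

pr4 <- pam(ruspini, 4)
sil <- silhouette(pr4$clustering, dist(ruspini, "canberra"))

dim(sil)        # n rows, 3 columns
colnames(sil)   # "cluster", "neighbor", "sil_width"

# Observations with a negative width, together with their own and
# their neighbor cluster (possibly an empty selection)
bad <- sil[, "sil_width"] < 0
sil[bad, , drop = FALSE]
```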
summary(sil) returns an object of class summary.silhouette, a list with components
si.summary |
numerical summary of the individual
silhouette widths s(i). |
clus.avg.widths |
numeric (rank 1) array of clusterwise
means of silhouette widths where mean = FUN is used. |
avg.width |
the total mean FUN(s) where s are the
individual silhouette widths. |
clus.sizes |
table of the k cluster sizes. |
call |
if available, the call creating sil. |
Ordered |
logical identical to attr(sil, "Ordered"), see
below. |
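The components listed above can be read off the summary object directly. A minimal sketch, again assuming the ruspini data and a pam() fit with 4 clusters as in the examples; si and ssi are illustrative names.

```r
library(cluster)
data(ruspini)

si  <- silhouette(pam(ruspini, 4))
ssi <- summary(si)               # FUN = mean by default

ssi$clus.sizes                   # table of the k cluster sizes
ssi$clus.avg.widths              # one summarized width per cluster
ssi$avg.width                    # FUN applied to all silhouette widths

# A different summary function, here the median instead of the mean
summary(si, FUN = median)$avg.width
```

Using a robust FUN such as median can be a reasonable choice when a few observations with very low widths would otherwise dominate the cluster averages.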
sortSilhouette(sil) orders the rows of sil as in the silhouette plot,
by cluster (increasingly) and decreasing silhouette width s(i).
attr(sil, "Ordered") is a logical indicating if sil is ordered as by
sortSilhouette(). In that case, rownames(sil) will contain case labels
or numbers, and attr(sil, "iOrd") the ordering index vector.
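A short sketch of the ordering and its attributes follows. It builds the silhouette with the default method so that the rows start out in data order; the names pr4, si and ssi are illustrative.

```r
library(cluster)
data(ruspini)

pr4 <- pam(ruspini, 4)
si  <- silhouette(pr4$clustering, dist(ruspini))  # default method
ssi <- sortSilhouette(si)

attr(ssi, "Ordered")             # TRUE after sorting
iOrd <- attr(ssi, "iOrd")        # ordering index vector
head(iOrd)

# The sorted rows are the original rows taken in the order given by iOrd
all(si[iOrd, "sil_width"] == ssi[, "sil_width"])

# Cluster codes now appear in increasing order
!is.unsorted(ssi[, "cluster"])
```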
While silhouette() is intrinsic to the partition clusterings, and
hence has a (trivial) method for these, it is straightforward to get
silhouettes from hierarchical clusterings via silhouette.default(),
with the result of cutree() and the distances as input.
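As an illustration of this route with a tree that does not come from the cluster package, the sketch below uses hclust() from the stats package. The choice of hclust(), of average linkage and of k = 4 are assumptions made for this example only.

```r
library(cluster)
data(ruspini)

d  <- dist(ruspini)                    # Euclidean distances
hc <- hclust(d, method = "average")    # hierarchical clustering (stats)

# Cut the tree into 4 groups and pass codes plus distances to silhouette()
si <- silhouette(cutree(hc, k = 4), d)
summary(si)$avg.width
```

The same distances should be used for the clustering and for the silhouette, otherwise the widths describe a different geometry than the one the tree was built on.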
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53–65.
Chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the references in plot.agnes.
partition.object, plot.partition.
library(cluster)  # needed when running these lines outside the package examples
data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot

si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax.lab = 80, cex.names = 0.6)

op <- par(mfrow = c(3,2), oma = c(0,0, 3, 0),
          mgp = c(1.6,.8,0), mar = .1+c(4,2,2,2))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main"))
par(op)

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax.lab = 80, cex.names = 0.5)

## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax.lab = 80, cex.names = 0.5)