varclus {Hmisc} | R Documentation |
Does a hierarchical cluster analysis on variables, using the Hoeffding
D statistic, squared Pearson or Spearman correlations, or proportion
of observations for which two variables are both positive as similarity
measures. Variable clustering is used for assessing collinearity,
redundancy, and for separating variables into clusters that can be
scored as a single variable, thus resulting in data reduction. For
computing any of these similarity measures, pairwise deletion of
NAs is done. The clustering is done by hclust(). A small function
naclus is also provided which depicts similarities in which
observations are missing for variables in a data frame. The
similarity measure is the fraction of NAs in common between any two
variables. The diagonals of this sim matrix are the fraction of NAs
in each variable by itself. naclus also computes na.per.obs, the
number of missing variables in each observation, and mean.na, a
vector whose ith element is the mean number of missing variables other
than variable i, for observations in which variable i is missing. The
naplot function makes several plots (see the which argument).
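The quantities that naclus is described as computing can be approximated in base R. The following is only a sketch of the definitions given above, not the Hmisc implementation; returning NA in mean.na for a variable that is never missing is an assumption made here for illustration.

```r
# Toy data frame with assorted missingness
df <- data.frame(a=c(1,2,3), b=c(1,2,3), c=c(1,2,NA), d=c(1,NA,3),
                 e=c(1,NA,3), f=c(NA,NA,NA), g=c(NA,2,3), h=c(NA,NA,3))

na <- is.na(df) * 1          # 0/1 indicator matrix of missingness
n  <- nrow(na)

# Fraction of observations in which both variables are NA;
# the diagonal is the fraction of NAs in each variable by itself
sim <- crossprod(na) / n

# Number of missing variables in each observation
na.per.obs <- rowSums(na)

# For each variable i: mean number of OTHER variables missing among
# observations where variable i is missing
# (assumption for this sketch: NA when variable i is never missing)
mean.na <- sapply(seq_len(ncol(na)), function(i) {
  miss <- na[, i] == 1
  if (any(miss)) mean(na.per.obs[miss]) - 1 else NA_real_
})
names(mean.na) <- colnames(na)
```

Here sim["d","e"] is the fraction of rows in which d and e are both missing, and mean.na["h"] averages the count of other missing variables over the rows where h is missing.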
So as to not generate too many dummy variables for multi-valued
character or categorical predictors, varclus will automatically
combine infrequent cells of such variables using an auxiliary
function combine.levels that is defined here.
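The pooling rule documented under the minlev argument can be sketched in base R as follows. The function name pool_rare_levels and the "/" separator used to join two level names are hypothetical choices for this illustration; the real combine.levels may name the merged level differently.

```r
# Hypothetical helper illustrating the documented pooling rule
pool_rare_levels <- function(x, minlev = 0.05) {
  x    <- as.factor(x)
  freq <- table(x) / sum(!is.na(x))       # proportion in each cell
  rare <- names(freq)[freq < minlev]
  if (length(rare) == 0) return(x)        # nothing to combine

  if (length(rare) > 1) {
    # Several infrequent cells: pool them all into "OTHER"
    pooled   <- rare
    new_name <- "OTHER"
  } else {
    # One infrequent cell: pool it with the next lowest frequency cell
    pooled   <- names(sort(freq))[1:2]
    new_name <- paste(pooled, collapse = "/")   # assumed naming scheme
  }

  lev <- levels(x)
  lev[lev %in% pooled] <- new_name
  levels(x) <- lev                        # duplicate names merge levels
  x
}

r1 <- pool_rare_levels(c(1, rep(2, 11), rep(3, 9)))   # one rare cell
r2 <- pool_rare_levels(c(1, 2, rep(3, 20)))           # two rare cells
```

Assigning a levels vector that contains duplicates is what merges the cells, so no recoding of the underlying data is needed.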
plotMultSim plots multiple similarity matrices, with the similarity
measure being on the x-axis of each subplot.
na.pattern prints a frequency table of all combinations of
missingness for multiple variables. If there are 3 variables, a
frequency table entry labeled 110 corresponds to the number of
observations for which the first and second variables were missing but
the third variable was not missing.
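The labeling scheme can be reproduced with a few lines of base R. This is a sketch of the idea only, under the assumption that each label is formed by pasting the 0/1 missingness indicators in variable order; the data frame d3 is made up for the illustration.

```r
# Three variables; 1 in a label means "missing", 0 means "observed"
d3 <- data.frame(v1 = c(NA, NA, 1, 2),
                 v2 = c(NA, NA, 3, NA),
                 v3 = c(5, 6, 7, 8))

# One label per observation, e.g. "110" = v1 and v2 missing, v3 present
pattern <- apply(is.na(d3) * 1, 1, paste, collapse = "")

# Frequency of each missingness pattern
freq <- table(pattern)
```

In this toy data the first two rows share the pattern "110", the third row is complete ("000"), and the fourth row is missing only v2 ("010").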
varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"),
        method=if(.R.)"complete" else "compact",
        data, subset, na.action, minlev=0.05)
## S3 method for class 'varclus':
print(x, abbrev=FALSE, ...)
## S3 method for class 'varclus':
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)
naclus(df, method)
naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)
combine.levels(x, minlev=.05)
plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE,
            add=FALSE, lty=par('lty'), col=par('col'),
            lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)
na.pattern(x)
x |
a formula,
a numeric matrix of predictors, or a similarity matrix. If x is
a formula, model.matrix is used to convert it to a design matrix.
If the formula excludes an intercept (e.g., ~ a + b - 1),
the first categorical (factor) variable in the formula will have
dummy variables generated for all levels instead of omitting one for
the first level. For combine.levels, x is a character, category,
or factor vector (or other vector that is converted to factor). For
plot and print, x is an object created by
varclus. For na.pattern, x is a list, data frame,
or numeric matrix.
For plotMultSim, x is a numeric vector specifying the ordered
unique values on the x-axis, corresponding to the third dimension of
s.
|
df |
a data frame |
s |
an array of similarity matrices. The third dimension of this array
corresponds to different computations of similarities. The first two
dimensions come from a single similarity matrix. This is useful for
displaying similarity matrices computed by varclus, for example. A
use for this might be to show pairwise similarities of variables
across time in a longitudinal study (see the example below). If
vname is not given, s must have dimnames.
|
similarity |
the default is to use squared Spearman correlation coefficients, which
will detect monotonic but nonlinear relationships. You can also
specify linear correlation or Hoeffding's (1948) D statistic, which
has the advantage of being sensitive to many types
of dependence, including highly non-monotonic relationships. For
binary data, or data to be made binary, similarity="bothpos" uses as
a similarity measure the proportion of observations for which two
variables are both positive. similarity="ccbothpos" uses a
chance-corrected measure which is the proportion of observations for
which both variables are positive minus the product of the two
marginal proportions. This difference is expected to be zero under
independence. For diagonals, "ccbothpos" still uses the proportion
of positives for the single variable. So "ccbothpos" is not really
a similarity measure, and clustering is not done. This measure is
useful for plotting with plotMultSim (see the last example).
|
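As a worked illustration of the two binary measures just described, the sketch below computes them in base R for a small made-up matrix b. It assumes no missing values, so the pairwise deletion that varclus performs is not shown.

```r
# Two made-up binary variables observed on 8 subjects
b <- cbind(b1 = c(1, 1, 1, 0, 0, 0, 1, 0),
           b2 = c(1, 1, 0, 0, 0, 1, 1, 0))

pos <- (b > 0) * 1           # 1 where the variable is positive
n   <- nrow(pos)

# "bothpos": proportion of observations where both are positive;
# the diagonal is the proportion positive for each variable alone
bothpos <- crossprod(pos) / n

# "ccbothpos": subtract the product of the marginal proportions,
# which is the joint proportion expected under independence
p         <- colMeans(pos)
ccbothpos <- bothpos - outer(p, p)
diag(ccbothpos) <- p         # diagonals keep the raw proportions
```

Both variables are positive in 3 of 8 observations (0.375) and each is positive half the time, so the chance-corrected off-diagonal value is 0.375 - 0.5 * 0.5 = 0.125.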
type |
if x is not a formula, it may be a data matrix or a similarity matrix.
By default, it is assumed to be a data matrix.
|
method |
see hclust. The default, for both varclus and naclus, is
"compact" (for R it is "complete").
|
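To show how a similarity matrix and a clustering method fit together, here is a base R sketch that clusters variables on squared Spearman correlations. Converting similarity to distance as 1 - similarity is an assumption made for this illustration, and the data are made up.

```r
x1 <- 1:10
x2 <- x1^3                              # monotonic but nonlinear in x1
x3 <- c(3, 9, 1, 7, 5, 10, 2, 8, 4, 6)  # arbitrary ordering
x  <- cbind(x1, x2, x3)

# Squared correlations with pairwise deletion of NAs
sim.spearman <- cor(x, method = "spearman",
                    use = "pairwise.complete.obs")^2
sim.pearson  <- cor(x, method = "pearson",
                    use = "pairwise.complete.obs")^2

# Assumed conversion: distance = 1 - similarity
hc <- hclust(as.dist(1 - sim.spearman), method = "complete")
```

Because x2 is a monotonic transformation of x1, their squared Spearman correlation is 1 while the squared Pearson correlation is smaller, and the two variables are joined first in the tree. Calling plot(hc) would draw the dendrogram.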
data |
|
subset |
|
na.action |
These may be specified if x is a formula. The default na.action is
na.retain, defined by varclus. This causes all observations to
be kept in the model frame, with later pairwise deletion of NAs.
|
ylab |
y-axis label. Default is constructed on the basis of similarity.
|
legend. |
set to TRUE to plot a legend defining the abbreviations
|
loc |
a list with elements x and y defining coordinates of the
upper left corner of the legend. Default is locator(1).
|
maxlen |
if a legend is plotted describing abbreviations, original labels
longer than maxlen characters are truncated at maxlen.
|
labels |
a vector of character strings containing labels corresponding to columns in the similarity matrix, if the column names of that matrix are not to be used |
... |
passed to plclust (or to dotchart or dotchart2 for naplot).
|
obj |
an object created by naclus |
which |
defaults to "all", meaning that naplot makes 4 separate plots. To
make only one of the plots, use which="na per var" (dot chart of
fraction of NAs for each variable), "na per obs" (dot chart showing
frequency distribution of number of variables having NAs in an
observation), "mean na" (dot chart showing mean number of other
variables missing when the indicated variable is missing), or
"na per var vs mean na", a scatterplot showing on the x-axis the
fraction of NAs in the variable and on the y-axis the mean number of
other variables that are NA when the indicated variable is NA.
|
minlev |
the minimum proportion of observations in a cell before that cell is
combined with one or more cells. If more than one cell has fewer than
minlev*n observations, all such cells are combined into a new cell
labeled "OTHER". Otherwise, the lowest frequency cell is combined
with the next lowest frequency cell, and the level name is the
combination of the two old level names.
|
abbrev |
set to TRUE to abbreviate variable names for plotting or
printing. Is set to TRUE automatically if legend.=TRUE.
|
slim |
2-vector specifying the range of similarity values for scaling the
y-axes. By default this is the observed range over all of s.
|
slimds |
set slimds to TRUE to scale diagonals and
off-diagonals separately |
add |
set to TRUE to add similarities to an existing plot (usually
specifying lty or col)
|
lty |
|
col |
|
lwd |
line type, color, or line thickness for plotMultSim
|
vname |
optional vector of variable names, in order, used in s
|
h |
relative height for subplot |
w |
relative width for subplot |
u |
relative extra height and width to leave unused inside the subplot. Also used as the space between y-axis tick mark labels and graph border. |
labelx |
set to FALSE to suppress drawing of labels in the x direction
|
xspace |
amount of space, on a scale of 1:n where n is the number
of variables, to set aside for y-axis labels
|
options(contrasts= c("contr.treatment", "contr.poly")) is issued
temporarily by varclus to make sure that ordinary dummy variables
are generated for factor variables. If a categorical or character
variable has no level containing at least a fraction minlev of the
data, that variable is omitted from consideration and a warning is
printed.
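The effect of treatment contrasts and of dropping the intercept can be seen directly with model.matrix. This is a minimal base R sketch on made-up data; saving and restoring the option as shown is one way to make the setting temporary and is not necessarily how varclus does it.

```r
# Made-up data: one numeric and one categorical predictor
d <- data.frame(age     = c(30, 40, 50, 60),
                country = factor(c("US", "UK", "FR", "US")))

# Set treatment contrasts temporarily, remembering the old setting
old <- options(contrasts = c("contr.treatment", "contr.poly"))

# With an intercept, the first level (FR) is omitted
with.int <- colnames(model.matrix(~ age + country, data = d))

# With - 1, the first factor gets a dummy variable for every level
no.int <- colnames(model.matrix(~ age + country - 1, data = d))

options(old)   # restore the previous contrasts setting
```

Here with.int contains an intercept column plus dummies for UK and US only, while no.int has no intercept and contains dummies for FR, UK, and US.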
For varclus or naclus, a list of class varclus with elements
call (containing the calling statement), sim (similarity matrix),
n (sample size used if x was not a correlation matrix already -
n is a matrix), hclust, the object created by hclust,
similarity, and method. For plot, returns the object created by
plclust. naclus also returns the two vectors listed under
description, and naplot returns an invisible vector that is the
frequency table of the number of missing variables per observation.
plotMultSim invisibly returns the limits of similarities used in
constructing the y-axes of each subplot. For similarity="ccbothpos"
the hclust object is NULL.
na.pattern creates an integer vector of frequencies.
plots
Frank Harrell
Department of Biostatistics, Vanderbilt University
f.harrell@vanderbilt.edu
Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990. Cary NC: SAS Institute, Inc.
Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.
hclust, plclust, hoeffd, rcorr, cor, model.matrix,
locator, na.pattern
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)

# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#

df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in if(.R.)c("ward","complete","median") else
         c("compact","connected","average")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)   # builtin function

x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)

# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive). Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions. On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=month==mon & sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)