scat1d {Hmisc} | R Documentation |
scat1d adds tick marks (bar codes, rug plot) on any of the four
sides of an existing plot, corresponding to non-missing values of a
vector x. This is used to show the data density. It can also place
the tick marks along a curve by specifying y-coordinates to go along
with the x values.
If any two values of x are within eps*w of each other, where eps
defaults to .001 and w is the span of the intended axis, values of
x are jittered by adding a value uniformly distributed in
[-jitfrac*w, jitfrac*w], where jitfrac defaults to .008.
Specifying preserve=TRUE invokes jitter2 with a different logic of
jittering. scat1d also allows plotting random sub-segments to handle
very large x vectors (see tfrac).
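The default jittering rule described above can be sketched in base R. This is only an illustration of the rule as stated in the text, not the code used by scat1d; the helper name jitter_if_close is hypothetical, and treating "within" as a strict inequality is an assumption.

```r
# Hypothetical helper (not part of Hmisc): jitter all values of x when any
# two of them are closer than eps * w, where w is the span of the axis.
jitter_if_close <- function(x, w, eps = 0.001, jitfrac = 0.008) {
  x <- x[!is.na(x)]                       # only non-missing values are drawn
  if (length(x) < 2 || jitfrac <= 0) return(x)
  gaps <- diff(sort(x))                   # distances between neighbouring values
  if (any(gaps < eps * w)) {              # assumption: strict "<" for "within"
    x <- x + runif(length(x), -jitfrac * w, jitfrac * w)
  }
  x
}
```

With w = 10, for example, c(1, 2, 3) is returned unchanged, while c(1, 1, 2) has every value shifted by at most 0.08.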
jitter2 is a generic method for jittering, which does not add
random noise. It retains unique values and ranks, and randomly
spreads duplicate values at equidistant positions within limits of
enclosing values. jitter2 is especially useful for numeric
variables with discrete values, like rating scales. Missing values
are allowed and are returned. Currently implemented methods are
jitter2.default for vectors and jitter2.data.frame, which returns
a data.frame with each numeric column jittered.
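The idea of spreading duplicates at equidistant positions between the enclosing values can be sketched in base R as follows. This is not the jitter2 source. The helper spread_duplicates is hypothetical, it uses the +/- fill*min(u-d,d-l)/2 rule from the fill argument, and it corresponds roughly to limit=FALSE (each duplicated value uses its own local spacing). How a value with only one neighbour is handled is an assumption.

```r
# Hypothetical helper (not part of Hmisc): spread tied values of x at
# equidistant positions around the tied value d, within
# +/- fill * min(u - d, d - l) / 2, where l and u are the neighbouring
# distinct values.
spread_duplicates <- function(x, fill = 1/3) {
  out  <- x
  vals <- sort(unique(x[!is.na(x)]))
  for (i in seq_along(vals)) {
    d   <- vals[i]
    idx <- which(x == d)                  # which() drops NA comparisons
    k   <- length(idx)
    if (k < 2) next                       # unique values are left untouched
    lower_gap <- if (i > 1) d - vals[i - 1] else Inf
    upper_gap <- if (i < length(vals)) vals[i + 1] - d else Inf
    gap <- min(lower_gap, upper_gap)      # assumption: use the one neighbour at the edges
    if (!is.finite(gap)) next             # only one distinct value: nothing to scale by
    half    <- fill * gap / 2
    offsets <- seq(-half, half, length.out = k)
    out[idx] <- d + offsets[sample.int(k)]  # assign positions to the ties at random
  }
  out
}
```

Because the spread never exceeds half the distance to a neighbouring value, the order of distinct values is preserved, and missing values pass through unchanged.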
datadensity is a generic method used to show data densities in more
complex situations. In the Design library there is a datadensity
method for use with plot.Design. Here, another datadensity method
is defined for data frames. Depending on the which argument, some
or all of the variables in a data frame will be displayed, with
scat1d used to display continuous variables and, by default, bars
used to display frequencies of categorical, character, or discrete
numeric variables. For such variables, when the total length of value
labels exceeds 200, only the first few characters from each level are used.
By default, datadensity.data.frame will construct
one axis (i.e., one strip) per variable in the data frame. Variable
names appear to the left of the axes, and the number of missing values
(if greater than zero) appears to the right of the axes. An optional
group variable can be used for stratification, where the different
strata are depicted using different colors. If the q vector is
specified, the desired quantiles (over all groups) are displayed
with solid triangles below each axis.
When the sample size exceeds 2000 (this value may be modified using
the nhistSpike argument), datadensity calls histSpike instead of
scat1d to show the data density for numeric variables. This results
in a histogram-like display that makes the resulting graphics file
much smaller. In this case, datadensity uses the minf argument
(see below) so that very infrequent data values will not be lost on
the variable's axis, although this will slightly distort the histogram.
histSpike is another method for showing a high-resolution data
distribution that is particularly good for very large datasets (say
n > 1000). By default, histSpike bins the continuous x variable into 100
equal-width bins and then computes the frequency counts within bins
(if n does not exceed 10, no binning is done).
If add=FALSE (the default), the function displays either proportions or
frequencies as in a vertical histogram. Instead of bars, spikes are
used to depict the frequencies. If add=TRUE, the function assumes you
are adding small density displays that are intended to take up a small
amount of space in the margins of the overall plot. The frac
argument is used as with scat1d to determine the relative length of
the whole plot that is used to represent the maximum frequency. No
jittering is done by histSpike.
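The binning step described for histSpike can be sketched in base R. This is a rough illustration rather than the function's own code. The helper spike_heights is hypothetical, n is taken to be the number of non-missing observations, and dropping empty bins is an assumption.

```r
# Hypothetical helper (not part of Hmisc): bin x into nint equal-width
# intervals and return one spike height per non-empty bin.
spike_heights <- function(x, nint = 100, type = c("proportion", "count")) {
  type <- match.arg(type)
  x <- x[!is.na(x)]
  n <- length(x)
  if (n <= 10 || diff(range(x)) == 0) {
    # no binning: tabulate the distinct values directly
    tab    <- table(x)
    mids   <- as.numeric(names(tab))
    counts <- as.vector(tab)
  } else {
    breaks <- seq(min(x), max(x), length.out = nint + 1)
    bin    <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)
    counts <- tabulate(bin, nbins = nint)
    mids   <- (breaks[-1] + breaks[-(nint + 1)]) / 2   # bin midpoints
    keep   <- counts > 0                               # assumption: drop empty bins
    counts <- counts[keep]
    mids   <- mids[keep]
  }
  height <- if (type == "count") counts else counts / n
  data.frame(x = mids, height = height)
}
```

Plotting height against x with type="h" in base graphics draws one vertical spike per bin, which gives a display in the same spirit.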
histSpike can also graph a kernel density estimate for x, or add a
small density curve to any of 4 sides of an existing plot. When y
or curve is specified, the density or spikes are drawn with respect
to the curve rather than the x-axis.
scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
       eps=ifelse(preserve,0,.001),
       lwd=0.1, col=par("col"), y=NULL, curve=NULL,
       bottom.align=FALSE, preserve=FALSE, fill=1/3, limit=TRUE,
       nhistSpike=2000, nint=100,
       type=c('proportion','count','density'), grid=FALSE, ...)

jitter2(x, ...)

## Default S3 method:
jitter2(x, fill=1/3, limit=TRUE, eps=0, presorted=FALSE, ...)

## S3 method for class 'data.frame':
jitter2(x, ...)

datadensity(object, ...)

## S3 method for class 'data.frame':
datadensity(object, group,
            which=c("all","continuous","categorical"),
            method.cat=c("bar","freq"),
            col.group=1:10,
            n.unique=10, show.na=TRUE, nint=1, naxes, q,
            bottom.align=nint>1,
            cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
            lmgp=NULL, tck=sc(-.009,-.002),
            ranges=NULL, labels=NULL, ...)
# sc(a,b) means default to a if number of axes <= 3, b if >=50, use
# linear interpolation within 3-50

histSpike(x, side=1, nint=100, frac=.05, minf=NULL, mult.width=1,
          type=c('proportion','count','density'),
          xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)),
          ylab=switch(type, proportion='Proportion',
                            count     ='Frequency',
                            density   ='Density'),
          y=NULL, curve=NULL, add=FALSE,
          bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
          grid=FALSE, ...)
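The sc(a,b) rule noted in the usage can be sketched with base R's approx. This is an illustration of the stated rule only. The name sc_sketch is hypothetical, and passing the number of axes as an explicit argument is an assumption made so the example is self-contained.

```r
# Hypothetical helper (not part of Hmisc): return a if naxes <= 3,
# b if naxes >= 50, and interpolate linearly in between.
sc_sketch <- function(a, b, naxes) {
  # rule = 2 makes approx() return the nearest end value outside [3, 50]
  approx(c(3, 50), c(a, b), xout = naxes, rule = 2)$y
}
```

For instance, sc_sketch(.5, .3, naxes) gives the cex.axis default for a given number of axes: .5 for up to 3 axes, .3 for 50 or more.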
x |
a vector of numeric data, or a data frame (for jitter2 )
|
object |
a data frame or list (even with unequal number of observations per
variable, as long as group is not specified)
|
side |
axis side to use (1=bottom (default for histSpike ), 2=left,
3=top (default for scat1d ), 4=right)
|
frac |
fraction of smaller of vertical and horizontal axes for tick mark lengths.
Can be negative to move tick marks outside of plot. For histSpike ,
this is the relative length to be used for the largest frequency.
When scat1d calls histSpike , it multiplies its frac argument by 2.5.
|
jitfrac |
fraction of axis for jittering. If <=0, no jittering is done. If
preserve=TRUE , the amount of jittering is independent of jitfrac.
|
tfrac |
fraction of tick mark to actually draw. If tfrac<1 ,
will draw a random fraction tfrac of the line segment at each point.
This is useful for very large samples or ones with some very dense points.
The default value is 1 if the number of non-missing observations n
is less than 125, and max(.1, 125/n) otherwise.
|
eps |
fraction of axis for determining overlapping points in x . For
preserve=TRUE the default is 0 and original unique values are
retained. Bigger values of eps tend to bias observations from dense
to sparse regions, but ranks are still preserved.
|
lwd |
line width for tick marks, passed to segments
|
col |
color for tick marks, passed to segments
|
y |
specify a vector the same length as x to draw tick marks along
a curve instead of by one of the axes. The y values are often
predicted values from a model. The side argument is ignored
when y is given. If the curve is already represented as a table
look-up, you may specify it using the curve argument instead. y
may be a scalar to use a constant vertical placement.
|
curve |
a list containing elements x and y for which linear interpolation
is used to derive y values corresponding to values of x . This
results in tick marks being drawn along the curve. For histSpike ,
interpolated y values are derived for bin midpoints.
|
bottom.align |
set to TRUE to have the bottoms of tick marks (for side=1 or
side=3 ) aligned at the y-coordinate. The default behavior is to
center the tick marks. For datadensity.data.frame , bottom.align
defaults to TRUE if nint>1 . In other words, if you are only labeling
the first and last axis tick mark, the scat1d tick marks are
centered on the variable's axis.
|
preserve |
set to TRUE to invoke jitter2
|
fill |
maximum fraction of the axis filled by jittered values. If d are
duplicated values between a lower value l and upper value u , then
d will be spread within +/- fill*min(u-d,d-l)/2 .
|
limit |
specifies a limit for maximum shift in jittered values. Duplicate
values will be spread within +/- fill*min(limit,min(u-d,d-l)/2) . The
default TRUE restricts jittering to the smallest min(u-d,d-l)/2 observed and
results in equal amount of jittering for all d. Setting to FALSE
allows for locally different amount of jittering, using maximum
space available.
|
nhistSpike |
If the number of observations exceeds or equals nhistSpike , scat1d
will automatically call histSpike to draw the data density, to
prevent the graphics file from being too large.
|
type |
used by or passed to histSpike . Set to "count" to display
frequency counts rather than relative frequencies, or "density" to
display a kernel density estimate computed using the density function.
|
grid |
set to TRUE if the R grid package is in effect for the
current plot
|
nint |
number of intervals to divide each continuous variable's axis for
datadensity .
For histSpike , nint is the number of equal-width intervals for which to
bin x . If nint is instead a character string (e.g.,
nint="all" ), the frequency tabulation is done with no binning. In
other words, frequencies for all unique values of x are derived and
plotted.
|
... |
optional arguments passed to scat1d from datadensity or to
histSpike from scat1d
|
presorted |
set to TRUE to skip the sorting used to determine the order l<d<u.
This is useful if an existing meaningful local order would be
destroyed by sorting, as in sin(pi*sort(round(runif(1000,0,10),1))).
|
group |
an optional stratification variable, which is converted to a factor
vector if it is not one already
|
which |
set which="continuous" to only plot continuous variables, or
which="categorical" to only plot categorical, character, or discrete
numeric ones. By default, all types of variables are depicted.
|
method.cat |
set method.cat="freq" to depict frequencies of categorical variables
with digits representing the cell frequencies, with size proportional
to the square root of the frequency. By default, vertical bars are used.
|
col.group |
colors representing the group strata. The vector of colors is
recycled to be the same length as the levels of group .
|
n.unique |
number of unique values a numeric variable must have before it is considered to be a continuous variable |
show.na |
set to FALSE to suppress drawing the number of NA s to the right of
each axis
|
naxes |
number of axes to draw on each page before starting a new plot. You
can set naxes larger than the number of variables in the data frame
if you want to compress the plot vertically.
|
q |
a vector of quantiles to display. By default, quantiles are not shown. |
cex.axis |
character size for drawing labels for axis tick marks |
cex.var |
character size for variable names and frequency of NA s
|
lmgp |
spacing between numeric axis labels and axis (see par for mgp )
|
tck |
see tck under par
|
ranges |
a list containing ranges for some or all of the numeric variables. If
ranges is not given or if a certain variable is not found in the
list, the empirical range, modified by pretty , is used. Example:
ranges=list(age=c(10,100), pressure=c(50,150)) .
|
labels |
a vector of labels to use in labeling the axes for
datadensity.data.frame . Default is to use the names of the
variables in the input data frame. Note: margin widths computed for
setting aside names of variables use the names, and not these labels.
|
minf |
For histSpike , if minf is specified low bin frequencies are set to
a minimum value of minf times the maximum bin frequency, so that
rare data points will remain visible. A good choice of minf is
0.075. datadensity.data.frame passes minf=0.075 to scat1d to
pass to histSpike . Note that specifying minf will cause the shape
of the histogram to be distorted somewhat.
|
mult.width |
multiplier for the smoothing window width computed by histSpike when
type="density"
|
xlim |
a 2-vector specifying the outer limits of x for binning (and
plotting, if add=FALSE and nint is a number)
|
ylim |
y -axis range for plotting (if add=FALSE )
|
xlab |
x -axis label (add=FALSE ); default is name of input argument x
|
ylab |
y -axis label (add=FALSE )
|
add |
set to TRUE to add the spike-histogram to an existing plot, to show
marginal data densities
|
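Two of the defaults described in the argument list above, the default for tfrac and the effect of minf, are simple enough to sketch in base R. The helper names default_tfrac and apply_minf are hypothetical and this is not the package's code. apply_minf raises every frequency it is given, so whether empty bins should be passed to it is left as an assumption.

```r
# Hypothetical helpers (not part of Hmisc).

# Default tfrac: 1 when n < 125, otherwise max(.1, 125/n),
# where n is the number of non-missing observations.
default_tfrac <- function(n) {
  if (n < 125) 1 else max(0.1, 125 / n)
}

# minf: low bin frequencies are raised to minf times the maximum frequency
# so that rare values stay visible.
apply_minf <- function(f, minf = NULL) {
  if (is.null(minf)) return(f)            # minf not specified: leave f as is
  pmax(f, minf * max(f))
}
```

With minf=0.075, bin frequencies c(100, 1, 50) become c(100, 7.5, 50), which shows the slight distortion of the histogram shape mentioned under minf.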
For scat1d the length of line segments used is
frac*min(par()$pin) / par()$uin[opp] data units, where opp
is the index of the opposite axis and frac defaults to .02.
Assumes that plot has already been called. Current par("usr")
is used to determine the range of data
for the axis of the current plot. This range is used in jittering and
in constructing line segments.
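The segment-length formula can be worked through in base R. The sketch below is an illustration under assumptions, not the scat1d source: uin (inches per data unit) is computed from pin and usr rather than read from par, opp is read as the axis perpendicular to the chosen side, and the helper name tick_length is hypothetical.

```r
# Hypothetical helper (not part of Hmisc): tick length in data units.
# pin: plot size in inches, c(width, height)
# usr: axis limits in data units, c(x1, x2, y1, y2)
tick_length <- function(frac, side, pin, usr) {
  uin <- pin / c(usr[2] - usr[1], usr[4] - usr[3])  # inches per data unit (x, y)
  opp <- if (side %in% c(1, 3)) 2 else 1            # assumption: ticks on a horizontal
                                                    # side are measured in y units
  frac * min(pin) / uin[opp]
}
```

For a 5 x 4 inch plot region with x running from 0 to 10 and y from 0 to 100, frac=.02 on side 3 gives ticks 2 y-units long. After a call to plot, the same helper can be given par("pin") and par("usr").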
histSpike returns the actual range of x used in its binning.
scat1d adds line segments to plot. datadensity.data.frame draws a
complete plot. histSpike draws a complete plot or adds to an
existing plot.
Frank Harrell
Department of Biostatistics
Vanderbilt University
Nashville TN, USA
f.harrell@vanderbilt.edu
Martin Maechler (improved scat1d)
Seminar fuer Statistik
ETH Zurich SWITZERLAND
maechler@stat.math.ethz.ch
Jens Oehlschlaegel-Akiyoshi (wrote jitter2)
Center for Psychotherapy Research
Christian-Belser-Strasse 79a
D-70597 Stuttgart Germany
oehl@psyres-stuttgart.de
segments, jitter, rug, plsmo, stripplot, hist.data.frame, ecdf,
hist, histogram, table, density
plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
scat1d(x)                 # density bars on top of graph
scat1d(y, 4)              # density bars at right
histSpike(x, add=TRUE)    # histogram instead, 100 bins
histSpike(y, 4, add=TRUE)
histSpike(x, type='density', add=TRUE)  # smooth density at bottom
histSpike(y, 4, type='density', add=TRUE)

smooth <- lowess(x, y)    # add nonparametric regression curve
lines(smooth)             # Note: plsmo() does this
scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
scat1d(x, curve=smooth)   # same effect as previous command
histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
histSpike(x, curve=smooth, type='density', add=TRUE)
                          # same but smooth density over curve

plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
scat1d(x, tfrac=0)        # dots randomly spaced from axis
scat1d(y, 4, frac=-.03)   # bars outside axis
scat1d(y, 2, tfrac=.2)    # same bars with smaller random fraction

x <- c(0:3,rep(4,3),5,rep(7,10),9)
plot(x, jitter2(x))       # original versus jittered values
abline(0,1)               # unique values unjittered on abline
points(x+0.1, jitter2(x, limit=FALSE), col=2)
                          # allow locally maximum jittering
points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
                          # fill 3/3 instead of 1/3

x <- rnorm(200,0,2)+1; y <- x^2
x2 <- round((x+rnorm(200))/2)*2
x3 <- round((x+rnorm(200))/4)*4
dfram <- data.frame(y,x,x2,x3)
plot(dfram$x2, dfram$y)   # jitter2 via scat1d
scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
scat1d(dfram$x2, preserve=TRUE, frac=-0.02, col=2)
scat1d(dfram$y, 4, preserve=TRUE, frac=-0.02, col=2)

pairs(jitter2(dfram))     # pairs for jittered data.frame
# This gets reasonable pairwise scatter plots for all combinations of
# variables where
#
# - continuous variables (with unique values) are not jittered at all, thus
#   all relations between continuous variables are shown as they are,
#   extreme values have exact positions.
#
# - discrete variables get a reasonable amount of jittering, whether they
#   have 2, 3, 5, 10, 20 ... levels
#
# - different from adding noise, jitter2() will use the available space
#   optimally and no value will randomly mask another
#
# If you want a scatterplot with lowess smooths on the *exact* values and
# the point clouds shown jittered, you just need
#
pairs( dfram ,panel=function(x,y) { points(jitter2(x),jitter2(y))
                                    lines(lowess(x,y)) } )

datadensity(dfram)        # graphical snapshot of entire data frame
datadensity(dfram, group=cut2(dfram$x2,g=3))
                          # stratify points and frequencies by
                          # x2 tertiles and use 3 colors

# datadensity.data.frame(split(x, grouping.variable))
# need to explicitly invoke datadensity.data.frame when the
# first argument is a list