transcan {Hmisc} | R Documentation |
transcan
is a nonlinear additive transformation and imputation
function, and there are several functions for using and operating on
its results. transcan
automatically transforms continuous and
categorical variables to have maximum correlation with the best linear
combination of the other variables. There is also an option to use a
substitute criterion - maximum correlation with the first principal
component of the other variables. Continuous variables are expanded
as restricted cubic splines and categorical variables are expanded as
contrasts (e.g., dummy variables). By default, the first canonical
variate is used to find optimum linear combinations of component
columns. This function is similar to ace
except that
transformations for continuous variables are fitted using restricted
cubic splines, monotonicity restrictions are not allowed, and NAs are
allowed. When a variable has any NAs, transformed scores for that
variable are imputed using least squares multiple regression
incorporating optimum transformations, or NAs are optionally set to
constants. Shrinkage can be used to safeguard against overfitting
when imputing. Optionally, imputed values on the original scale are
also computed and returned. For this purpose, recursive partitioning
or multinomial logistic models can
optionally be used to impute categorical variables, using what is
predicted to be the most probable category.
By default, transcan
imputes NAs with "best guess" expected values
of transformed variables, back transformed to the original scale.
Values thus imputed are most like conditional medians assuming the
transformations make variables' distributions symmetric (imputed
values are similar to conditional modes for categorical variables). By
instead specifying n.impute
, transcan
does approximate multiple imputation
from the distribution of each variable conditional on all other
variables. This is done by sampling n.impute
residuals from the
transformed variable, with replacement (a la bootstrapping), or by
default, using Rubin's approximate Bayesian bootstrap, where a sample
of size n with replacement is selected from the residuals on n
non-missing values of the target variable, and then a sample of size m
with replacement is chosen from this sample, where m is the number of
missing values needing imputation for the current multiple imputation
repetition. Neither of these bootstrap procedures
assumes normality or even symmetry of residuals.
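The two residual-sampling schemes described above can be sketched in base R. This is a minimal illustration and not the code used inside transcan; the helper name draw_residuals is hypothetical, and the skewed residual vector is simulated only to show that no symmetry is required.

```r
# Hypothetical helper (not part of Hmisc): draw m residuals for one
# multiple-imputation repetition.
draw_residuals <- function(res, m,
                           method = c("approximate bayesian", "simple")) {
  method <- match.arg(method)
  n <- length(res)
  if (method == "approximate bayesian") {
    # Stage 1: a sample of size n, with replacement, from the n residuals
    res <- res[sample.int(n, n, replace = TRUE)]
  }
  # Final stage: a sample of size m, with replacement, one per missing value
  res[sample.int(n, m, replace = TRUE)]
}

set.seed(1)
res   <- rexp(50) - 1                  # deliberately skewed residuals
draws <- draw_residuals(res, m = 8)    # 8 missing values to impute
simple_draws <- draw_residuals(res, m = 8, method = "simple")
```

Every drawn value is one of the observed residuals, so the imputations inherit whatever shape the residual distribution has.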
For sometimes-missing categorical variables, optimal scores are
computed by adding the "best guess" predicted mean score to random
residuals off this score. Then categories having scores closest to
these predicted scores are taken as the random multiple imputations
(impcat="tree"
or "rpart"
are not currently allowed with
n.impute
). The literature recommends using n.impute=5
or greater.
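As a rough sketch of the closest-score rule for categorical variables, the base R code below adds randomly drawn residuals to "best guess" scores and then picks the category whose score is nearest. The function closest_category and all numeric values are illustrative assumptions, not objects produced by transcan.

```r
# Hypothetical helper: map each predicted score to the category whose
# transformed score is nearest.
closest_category <- function(pred, cat_scores) {
  # cat_scores: named numeric vector, one transformed score per category
  idx <- vapply(pred,
                function(p) unname(which.min(abs(cat_scores - p))),
                integer(1))
  names(cat_scores)[idx]
}

cat_scores <- c(a = -1.2, b = 0.1, c = 1.4)  # assumed category scores
best_guess <- c(0.0, 1.0)       # predicted mean scores, two missing values
resid_pool <- c(-0.5, 0.2, 0.6) # assumed residuals on the transformed scale

set.seed(2)
pred    <- best_guess + resid_pool[sample.int(3, 2, replace = TRUE)]
imputed <- closest_category(pred, cat_scores)
```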
transcan
provides only an approximation to multiple imputation,
especially since it "freezes" the imputation model before drawing the
multiple imputations rather than using different estimates of
regression coefficients for each imputation. For multiple imputation,
the aregImpute
function provides a much better approximation to the
full Bayesian approach while still not requiring linearity assumptions.
When you specify n.impute
to transcan
you can use
fit.mult.impute
to re-fit any model n.impute
times based on
n.impute
completed datasets (if there are any sometimes missing
variables not specified to transcan
, some observations will still be
dropped from these fits). After fitting n.impute
models,
fit.mult.impute
will return the fit object from the last imputation,
with coefficients
replaced by the average of the n.impute
coefficient vectors and with a component var
equal to the
imputation-corrected variance-covariance matrix. fit.mult.impute
can also use the object created by the mice
function in the MICE
library to draw the multiple imputations, as well as objects created
by aregImpute
.
The summary
method for transcan
prints the function call,
R-squares achieved in transforming each variable, and for each variable
the coefficients of all other transformed variables that are used to
estimate the transformation of the initial variable. If
imputed=TRUE
was used in the call to transcan, summary also uses the
describe
function to print a summary of imputed values. If
long=TRUE
, also prints all imputed values with observation
identifiers. There is also a simple function print.transcan
which merely prints the transformation matrix and the function call. It
has an optional argument long
, which if set to TRUE
causes
detailed parameters to be printed. Instead of plotting while
transcan()
is running, you can plot the final transformations
after the fact using plot.transcan
, if the option
trantab=TRUE
was specified to transcan
. If in addition
the option imputed=TRUE
was specified to transcan
,
plot.transcan
will show the location of imputed values (including
multiples) along the axes.
impute
does imputations for a selected original data variable, on
the original scale (if imputed=TRUE
was given to
transcan
). If you do not specify a variable to impute
, it
will do imputations for all variables given to transcan
which had
at least one missing value. This assumes that the original variables
are accessible (i.e., they have been attached) and that you want
the imputed variables to have the same names as the original variables.
If n.impute
was specified to transcan
you must tell
impute
which imputation
to use.
predict
computes predicted variables and imputed values from a
matrix of new data. This matrix should have the same column variables
as the original matrix used with transcan
, and in the same order
(unless a formula was used with transcan
).
Function
is a generic function generator.
Function.transcan
creates S functions to transform variables using
transformations created by transcan
. These functions are useful
for getting predicted values with predictors set to values on the original
scale.
Varcov
methods are defined here so that imputation-corrected
variance-covariance matrices are readily extracted from
fit.mult.impute
objects, and so that fit.mult.impute
can easily
compute traditional covariance matrices for individual completed
datasets. Specific Varcov
methods are defined for lm
,
glm
, and multinom
fits.
The subscript function preserves attributes.
The invertTabulated
function does either inverse linear
interpolation or uses sampling to sample qualifying x-values having
y-values near the desired values. The latter is used to get inverse
values having a reasonable distribution (e.g., no floor or ceiling
effects) when the transformation has a flat or nearly flat segment,
resulting in a many-to-one transformation in that region. Sampling
weights are a combination of the frequency of occurrence of x-values
that are within tolInverse
times the range of y
and the squared
distance between the associated y-values and the target y-value (aty
).
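The difference between the two inversion strategies can be illustrated with a small base R sketch. It is not the invertTabulated source: the function name invert_sketch is hypothetical, and the exact way frequency and squared distance are combined into sampling weights is not given in this page, so the weighting used below is an assumption.

```r
# Hypothetical sketch, not Hmisc's invertTabulated().
invert_sketch <- function(x, y, freq = rep(1, length(x)), aty,
                          inverse = c("linearInterp", "sample"),
                          tolInverse = 0.05) {
  inverse <- match.arg(inverse)
  if (inverse == "linearInterp") {
    # Swap the roles of x and y; rule = 2 clamps targets outside range(y)
    return(approx(y, x, xout = aty, rule = 2, ties = mean)$y)
  }
  tol <- tolInverse * diff(range(y))
  vapply(aty, function(target) {
    d  <- abs(y - target)
    ok <- which(d <= tol)
    if (length(ok) == 0) ok <- which.min(d)  # fall back to the nearest point
    # ASSUMED weighting: frequency, down-weighted by squared distance
    w <- freq[ok] / (1 + (d[ok] / max(tol, .Machine$double.eps))^2)
    x[ok[sample.int(length(ok), 1, prob = w)]]
  }, numeric(1))
}

x <- as.numeric(1:10)
y <- c(0, 1, 2, 2, 2, 2, 2, 3, 4, 5)   # flat segment for x in 3..7

set.seed(3)
lin <- invert_sketch(x, y, aty = c(0.5, 4.5))
smp <- invert_sketch(x, y, aty = rep(2, 20), inverse = "sample")
```

On the flat segment, interpolation collapses every target of 2 to a single x-value, whereas sampling spreads the inverses across the qualifying x-values 3 through 7.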
transcan(x, method=c("canonical","pc"),
         categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
         boot.method=c('approximate bayesian', 'simple'),
         trantab=FALSE, transformed=FALSE,
         impcat=c("score", "multinom", "rpart", "tree"),
         mincut=40,
         inverse=c('linearInterp','sample'), tolInverse=.05,
         pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
         imputed.actual=c('none','datadensity','hist','qq','ecdf'),
         iter.max=50, eps=.1, curtail=TRUE,
         imp.con=FALSE, shrink=FALSE, init.cat="mode",
         nres=if(boot.method=='simple')200 else 400,
         data, subset, na.action, treeinfo=FALSE,
         rhsImp=c('mean','random'), details.impcat='', ...)

## S3 method for class 'transcan':
summary(object, long=FALSE, ...)

## S3 method for class 'transcan':
print(x, long=FALSE, ...)

## S3 method for class 'transcan':
plot(x, ...)

## S3 method for class 'transcan':
impute(x, var, imputation, name, where.in, data,
       where.out=1, frame.out, list.out=FALSE, pr=TRUE, check=TRUE, ...)

fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
                derived, pr=TRUE, subset, ...)

## S3 method for class 'transcan':
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
        type=c("transformed","original"),
        inverse, tolInverse, check=FALSE, ...)

Function(object, ...)

## S3 method for class 'transcan':
Function(object, prefix=".", suffix="", where=1, ...)

invertTabulated(x, y, freq=rep(1,length(x)),
                aty, name='value',
                inverse=c('linearInterp','sample'),
                tolInverse=0.05, rule=2)

Varcov(object, ...)

## Default S3 method:
Varcov(object, regcoef.only=FALSE, ...)

## S3 method for class 'lm':
Varcov(object, ...)

## S3 method for class 'glm':
Varcov(object, ...)

## S3 method for class 'multinom':
Varcov(object, ...)

## S3 method for class 'fit.mult.impute':
Varcov(object, ...)
x |
a matrix containing continuous variable values and codes for categorical
variables. The matrix must have column names (dimnames ). If row
names are present, they are used in forming the names attribute
of imputed values if imputed=TRUE . x may also be a formula, in which
case the model matrix is created automatically, using data in the calling
frame. Advantages of using a formula are that categorical variables
can be determined automatically by a variable being a factor
variable, and variables with two unique levels are modeled asis .
Variables with 3 unique values are considered to be categorical if
a formula is specified. For a formula you may also specify that a
variable is to remain untransformed by enclosing its name with the
identity function, e.g. I(x3) . The user may add other variable names to the
asis and categorical vectors. For invertTabulated , x is a
vector or a list with three components: the x vector, the
corresponding vector of transformed values, and the corresponding
vector of frequencies of the pair of original and transformed variables.
For print , plot , impute , and
predict , x is an object created by transcan .
|
formula |
any S model formula |
fitter |
any S or Design modeling function (not in quotes) that computes a
vector of coefficients and for which Varcov will return a
variance-covariance matrix. E.g., fitter=lm, glm, ols . At present models
involving non-regression parameters (e.g., scale parameters in
parametric survival models) are not handled fully.
|
xtrans |
an object created by transcan , aregImpute , or mice
|
method |
use method="canonical" or any abbreviation thereof, to use canonical
variates (the default).
method="pc" transforms a variable instead so as to maximize
the correlation with the first principal component of the other
variables.
|
categorical |
a character vector of names of variables in x which are categorical,
for which the ordering of re-scored values is not necessarily preserved.
If categorical is omitted, it is assumed that all variables are
continuous (or binary). Set categorical="*" to treat all variables
as categorical.
|
asis |
a character vector of names of variables that are not to be transformed.
For these variables, the guts of lm.fit.qr is used to impute missing values.
You may want to treat binary variables asis (this is automatic if
using a formula). If imputed=TRUE, you
may want to use "categorical" for binary variables if you want
to force imputed values to be one of the original data values.
Set asis="*" to treat all variables asis .
|
nk |
number of knots to use in expanding each continuous variable (not listed
in asis ) in a restricted cubic spline function. Default is 3 (yielding
2 parameters for a variable) if n < 30 , 4 if 30 <= n < 100 , and 5 if
n >= 100 (4 parameters).
|
imputed |
Set to TRUE to return a list containing imputed values on the original
scale.
If the transformation for a variable is non-monotonic, imputed
values are not unique. transcan uses the approx function,
which returns the highest value of the variable with the transformed
score equalling the imputed score. imputed=TRUE also causes original-scale imputed values to be shown as tick
marks on the top margin of each graph
when show.na=TRUE (for the final iteration only).
For categorical predictors, these imputed values are jitter ed so
that their frequencies can be visualized. When n.impute is used,
each NA will have n.impute tick marks.
|
n.impute |
number of multiple imputations. If omitted, single predicted expected
value imputation is used. n.impute=5 is frequently recommended.
|
boot.method |
default is to use the approximate Bayesian bootstrap (sample with
replacement from sample with replacement of the vector of residuals).
You can also specify boot.method="simple" to use the usual
bootstrap one-stage sampling with replacement.
|
trantab |
Set to TRUE to add an attribute trantab to the returned matrix. This
contains a vector of lists each with components x and y containing
the unique values and corresponding transformed values for the
columns of x . This is set up to be used easily with the approx
function. You must specify trantab=TRUE if you want to later use the
predict.transcan function with type="original" .
|
transformed |
set to TRUE to cause transcan to return an object transformed
containing the matrix of transformed variables
|
impcat |
This argument tells how to impute categorical variables on the original
scale.
The default is impcat="score" to impute the category
whose canonical variate score is closest to the predicted score.
Use impcat="tree" to impute categorical variables using the
tree() function, using the values of all other transformed
predictors. impcat="rpart" will use rpart . A better but somewhat
slower approach is to use impcat="multinom" to fit a multinomial
logistic model to the categorical variable, at the last iteration of
the transcan algorithm. This uses the multinom function in the
nnet library of the MASS package (which is assumed to have been
installed by the user) to fit a polytomous logistic model to the
current working transformations of all the other variables (using
conditional mean imputation for missing predictors). Multiple
imputations are made by drawing multinomial values from the vector of
predicted probabilities of category membership for the missing
categorical values.
|
mincut |
If imputed=TRUE , there are categorical variables, and impcat="tree" ,
mincut specifies the lowest node size that will be allowed to be
split by tree . The default is 40.
|
inverse |
By default, imputed values are back-solved on the original scale using
inverse linear interpolation on the fitted tabulated transformed values.
This will cause distorted distributions of imputed values (e.g., floor
and ceiling effects) when the estimated transformation has a flat or
nearly flat section. To instead use the invertTabulated function
(see above) with the "sample" option, specify inverse="sample" .
|
tolInverse |
the multiplier of the range of transformed values, weighted by freq
and by the distance measure, for determining the set of x
values having y values within a tolerance of the value of aty in
invertTabulated . For predict.transcan , inverse and
tolInverse are obtained from options that were specified to
transcan by default. Otherwise, if not specified by the user, these
default to the defaults used to invertTabulated .
|
pr |
For transcan , set to FALSE to suppress printing r-squares
and shrinkage factors. For impute.transcan set to FALSE
to suppress messages concerning the number of NAs imputed, or for
fit.mult.impute set to FALSE to suppress printing variance
inflation factors accounting for imputation, rate of missing
information, and degrees of freedom.
|
pl |
Set to FALSE to suppress plotting the final transformations with
distribution of scores for imputed values (if show.na=TRUE ).
|
allpl |
Set to TRUE to plot transformations for intermediate iterations.
|
show.na |
Set to FALSE to suppress the distribution of scores assigned to
missing values (as tick marks on the right margin of each graph).
See also imputed .
|
imputed.actual |
The default is "none" to suppress plotting of actual vs. imputed
values for all variables having any NAs. Other choices are
"datadensity" to use datadensity to make a single plot, "hist"
to make a series of back-to-back histograms, "qq" to make a series
of q-q plots, or "ecdf" to make a series of empirical cdfs. For
imputed.actual="datadensity" for example you get
a rug plot of the non-missing values for the variable with beneath it
a rug plot of the imputed values.
When imputed.actual is not "none" , imputed is automatically set
to TRUE .
|
iter.max |
maximum number of iterations to perform for transcan or predict .
For predict , only one iteration is used if there
are no NAs in the data or if imp.con was used.
|
eps |
convergence criterion for transcan and predict . eps is the
maximum change in transformed values from one iteration to the next.
If for a given iteration all new transformations
of variables differ by less than eps (with or without negating the
transformation to allow for "flipping") from the transformations in
the previous iteration, one more iteration is done for transcan .
During this
last iteration, individual transformations are not updated but
coefficients of transformations are. This improves stability of
coefficients of canonical variates on the right-hand-side.
eps is ignored when rhsImp="random" .
|
curtail |
for transcan , causes imputed values on the transformed scale to
be truncated so that their ranges are within the ranges of
non-imputed transformed values.
For predict , curtail defaults to TRUE to truncate predicted transformed
values to their ranges in the original fit (xt ).
|
imp.con |
for transcan , set to TRUE to impute NAs on the original scales with
constants (medians or most frequent category codes). Set to a vector
of constants to instead always use these constants for imputation.
These imputed values are ignored when fitting the current working
transformation for a single variable.
|
shrink |
default is FALSE to use ordinary least squares or canonical variate estimates.
For the purposes of imputing NAs, you may want to set shrink=TRUE to avoid
overfitting when developing a prediction equation to predict each variable
from all the others (see details below).
|
init.cat |
method for initializing scorings of categorical variables. Default is
"mode" to use a dummy variable set to 1 if the value is the most
frequent value.
Use "random" to use a random 0-1 variable. Set
to "asis" to use the original integer codes as starting scores.
|
nres |
number of residuals to store if n.impute is specified. If the
dataset has fewer than nres observations, all residuals are saved.
Otherwise a random sample of the residuals of length nres without
replacement is saved. The default for nres is higher if
boot.method="approximate bayesian" .
|
data |
|
subset |
an integer or logical vector specifying the subset of observations to fit |
na.action |
These may be used if x is a formula. The default na.action is
na.retain (defined by transcan ) which keeps all observations with
any NA s.
For impute.transcan , data is a data frame to use as the source of
variables to be imputed, rather than using where.in . For
fit.mult.impute , data is mandatory and is a data frame containing
the data to be used in fitting the model but before imputations
are applied. Variables omitted from data are assumed to be
available from frame 1 and do not need to be imputed.
|
treeinfo |
Set to TRUE to get additional information printed when impcat="tree" ,
such as the predicted probabilities of category membership.
|
rhsImp |
Set to "random" to use random draw imputation when a sometimes
missing variable is moved to be a predictor of other sometimes missing
variables. Default is rhsImp="mean" , which uses conditional mean
imputation on the transformed scale. Residuals used are residuals
from the transformed scale. When "random" is used, transcan runs
5 iterations and ignores eps .
|
details.impcat |
set to a character scalar that is the name of a
category variable to include in the resulting transcan object
an element details.impcat containing details of how the
categorical variable was multiply imputed. |
... |
arguments passed to scat1d or to the fitter function (for
fit.mult.impute )
|
long |
for summary , set to TRUE to print all imputed values.
For print , set to TRUE to print details of transformations/imputations.
|
var |
For impute , is a variable that was originally a column in x , for
which imputed values are to be filled in. imputed=TRUE must have been
used in transcan . Omit var to impute all variables, creating new
variables in search position where .
|
imputation |
specifies which of the multiple imputations to use for filling in NAs |
name |
name of variable to impute, for impute() . Default is character
string version of the second argument (var ) in the call to
impute . For invertTabulated , is the name of variable being
transformed (used only for warning messages).
|
where.in |
location in search list to find variables that need to be imputed, when
all variables are to be imputed automatically by impute.transcan
(i.e., when no input variable name is specified).
Default is first search position that contains the first variable to
be imputed.
|
where.out |
location in the search list for storing variables with missing values
set to imputed values, for impute.transcan when all variables with
missing values are being imputed automatically.
|
frame.out |
Instead of specifying where.out you can specify an S frame
number into which individual new imputed variables will be written.
For example, frame.out=1 is useful for putting new variables into a
temporary local frame when impute is called within another function
(see fit.mult.impute ). See assign for details about
frames. For R, where.out and frame.out are ignored and
results are stored in .GlobalEnv when list.out is not
specified (it is recommended to use list.out=TRUE ).
|
list.out |
If var is not specified, you can set list.out=TRUE to have
impute.transcan return a list containing variables with needed
values imputed. This list will contain a single imputation.
|
check |
set to FALSE to suppress certain warning messages
|
newdata |
a new data matrix for which to compute transformed variables.
Categorical variables must use the same integer codes as were used
in the call to transcan . If a formula was originally specified to
transcan (instead of a data matrix), newdata is optional and if
given must be a data frame; a model
frame is generated automatically from the previous formula. The
na.action is handled automatically, and the levels for factor variables
must be the same and in the same order as were used in the original
variables specified in the formula given to transcan .
|
fit.reps |
set to TRUE to save all fit objects from the fit for each imputation in
fit.mult.impute . Then the object returned will have a component
fits which is a list whose i th element is the i th fit object.
|
derived |
an expression containing S expressions for computing derived
variables that are used in the model formula. This is useful when
multiple imputations are done for component variables but the actual
model uses combinations of these (e.g., ratios or other derivations).
For a single derived variable you can specify for example
derived=expression(ratio <- weight/height) . For multiple derived
variables use the form derived=expression({ratio <- weight/height;
product <- weight*height}) or put the expression on separate input
lines. To monitor the multiply-imputed derived
variables you can add to the expression a command such as
print(describe(ratio)) . See the example below.
|
type |
By default, the matrix of transformed variables is returned, with imputed
values on the transformed scale. If you had specified trantab=TRUE to
transcan , specifying type="original" does the table look-ups with
linear interpolation to return the input matrix x but with imputed
values on the original scale inserted for NAs. For categorical variables,
the method used here is to select
the category code having a corresponding scaled value closest to the
predicted transformed value. This corresponds to the default impcat ;
a problem in getting predicted
values for tree objects prevented using tree for this. Note:
imputed values thus returned when type="original" are single
expected value imputations even if n.impute is given.
|
object |
an object created by transcan , or an object to be
converted to S function code, typically a model fit object of some sort |
prefix |
|
suffix |
When creating separate S functions for each variable in x , the name
of the new function will be prefix placed in front of the variable name,
and suffix placed in back of the name. The default is to use names
of the form .varname , where varname is the variable name.
|
where |
position in search list at which to store new functions (for Function ).
Default is position 1 in the search list. See the assign function for more
documentation on the where argument.
|
y |
a vector corresponding to x for invertTabulated , if its first
argument x is not a list
|
freq |
a vector of frequencies corresponding to cross-classified x and y
if x is not a list. Default is a vector of ones.
|
aty |
vector of transformed values at which inverses are desired |
rule |
see approx . transcan assumes rule is always 2
|
regcoef.only |
set to TRUE to make Varcov.default
delete positions in the covariance matrix for any non-regression
coefficients (e.g., log scale parameter from psm or survreg ) |
The starting approximation to the transformation for each variable
is taken to be the original coding of the variable. The initial
approximation for each missing value is taken to be the median of
the non-missing values for the variable (for continuous ones) or
the most frequent category (for categorical ones). Instead, if imp.con
is
a vector, its values are used for imputing NAs. When using each
variable as a dependent variable, NAs on that variable cause all
observations to be temporarily deleted. Once a new working transformation
is found for the variable, along with a model to predict that transformation
from all the other variables, that latter model is used to impute
NAs in the selected dependent variable if imp.con
is not specified.
When that variable is used
to predict a new dependent variable, the current working imputed values
are inserted. Transformations are updated after each variable becomes
a dependent variable, so the order of variables on x
could conceivably
make a difference in the final estimates. For obtaining out-of-sample
predictions/transformations, predict
uses the same iterative
procedure as transcan
for imputation, with the same starting
values for fill-ins as were used by transcan
. It also (by default)
uses a conservative approach of curtailing transformed variables to
be within the range of the original ones.
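The starting fill-ins described above (median for continuous variables, most frequent category for categorical ones) might look like the following in base R. The helper initial_fill is hypothetical and covers only the initial step, not the iterative updating.

```r
# Hypothetical helper: starting values for missing entries before iterating.
initial_fill <- function(v, categorical = FALSE) {
  if (categorical) {
    tab  <- table(v)                    # table() drops NAs by default
    fill <- names(tab)[which.max(tab)]  # most frequent category code
    if (is.numeric(v)) fill <- as.numeric(fill)
  } else {
    fill <- median(v, na.rm = TRUE)     # median of the non-missing values
  }
  v[is.na(v)] <- fill
  v
}

cont_filled <- initial_fill(c(1, 5, NA, 9))
cat_filled  <- initial_fill(c(2, 2, 3, NA), categorical = TRUE)
```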
Even when method="pc"
is specified, canonical variables are used
for imputing missing values.
Note that fitted transformations, when evaluated at imputed variable
values (on the original scale), will not precisely match the transformed
imputed values returned in xt
. This is because transcan
uses an
approximate method based on linear interpolation to back-solve for
imputed values on the original scale.
Shrinkage uses the method of Van Houwelingen and Le Cessie (1990) (similar to
Copas, 1983). The shrinkage factor is [1-(1-R2)(n-1)/(n-k-1)]/R2
, where
R2
is the apparent R-squared for predicting the variable, n
is the number
of non-missing values, and k
is the effective number of degrees of freedom
(aside from intercepts). A heuristic estimate is used for k
:
A - 1 + sum(max(0,Bi-1))/m + m
, where
A
is the number of d.f. required
to represent the variable being predicted, the Bi
are the number of
columns required to represent all the other variables, and m
is the
number of all other variables. Division by m
is done because the
transformations for the other variables are fixed at their current
transformations the last time they were being predicted. The + m
term
comes from the number of coefficients estimated on the right hand side,
whether by least squares or canonical variates. If a shrinkage factor
is negative, it is set to 0. The shrinkage factor is the ratio of
the adjusted R-squared to the ordinary R-squared.
The adjusted R-squared is 1 - (1 - R2)(n-1)/(n-k-1)
, which is also set to
zero if it is negative. If shrink=FALSE
and the adjusted R-squares are much
smaller than
the ordinary R-squares, you may want to run transcan
with shrink=TRUE
.
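The shrinkage arithmetic above can be worked through in a few lines of base R. The function shrinkage_sketch and the input values are illustrative assumptions; only the formulas for k, the adjusted R-squared, and the shrinkage factor come from the text.

```r
# Hypothetical helper implementing the formulas stated above.
#   R2 : apparent R-squared for predicting the variable
#   n  : number of non-missing values
#   A  : d.f. needed to represent the variable being predicted
#   B  : vector of column counts for each of the other m variables
shrinkage_sketch <- function(R2, n, A, B) {
  m   <- length(B)
  k   <- A - 1 + sum(pmax(0, B - 1)) / m + m   # heuristic effective d.f.
  adj <- max(0, 1 - (1 - R2) * (n - 1) / (n - k - 1))  # floored at zero
  list(k = k, rsq.adj = adj, shrink = adj / R2)
}

# One variable with 4 d.f. predicted from two spline variables (4 columns
# each) and one binary variable (1 column); values chosen for illustration.
s <- shrinkage_sketch(R2 = 0.40, n = 100, A = 4, B = c(4, 4, 1))
```

Here k works out to 3 + 6/3 + 3 = 8, so the adjusted R-squared is 1 - 0.6 * 99/91 and the shrinkage factor is that quantity divided by 0.40.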
Canonical variates are scaled to have variance of 1.0, by multiplying canonical
coefficients from cancor
by sqrt(n-1)
.
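The effect of the sqrt(n-1) rescaling can be checked directly with cancor from the stats package. The simulated matrices below are assumptions made only for the demonstration.

```r
set.seed(4)
n <- 200
x <- matrix(rnorm(n * 3), n, 3)
y <- cbind(x[, 1] + rnorm(n), x[, 2] - x[, 3] + rnorm(n))

cc <- cancor(x, y)

# Centre x the same way cancor() did, then form the first canonical variate
xc       <- sweep(x, 2, cc$xcenter)
v_raw    <- drop(xc %*% cc$xcoef[, 1])
v_scaled <- drop(xc %*% (cc$xcoef[, 1] * sqrt(n - 1)))

var(v_raw)      # approximately 1 / (n - 1)
var(v_scaled)   # approximately 1
```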
When specifying a non-Design library fitting function to
fit.mult.impute
(e.g., lm
, glm
), running the result of
fit.mult.impute
through that fit's summary
method will not use the
imputation-adjusted variances. You may obtain the new variances using
fit$var
or Varcov(fit)
.
When you specify a Design function to fit.mult.impute
(e.g.,
lrm, ols, cph, psm, bj
), automatically computed transformation
parameters (e.g., knot locations for rcs
) that are estimated for the
first imputation are used for all other imputations. This ensures
that knot locations will not vary, which would change the meaning of
the regression coefficients.
Warning: even though fit.mult.impute
takes imputation into account
when estimating variances of regression coefficients, it does not take
into account the variation that results from estimation of the shapes
and regression coefficients of the customized imputation equations.
Specifying shrink=TRUE
solves a small part of this problem. To fully
account for all sources of variation you should consider putting the
transcan
invocation inside a bootstrap or loop, if execution time
allows. Better still, use aregImpute
or one of the libraries such
as MICE that uses real Bayesian posterior realizations to multiply
impute missing values correctly.
It is strongly recommended that you use the Hmisc naclus
function to
determine if there is a good basis for imputation. naclus
will tell
you, for example, if systolic blood pressure is missing whenever
diastolic blood pressure is missing. If the only variable that is
well correlated with diastolic bp is systolic bp, there is no basis
for imputing diastolic bp in this case.
At present, predict
does not work with multiple imputation.
When calling fit.mult.impute
with glm
as the fitter
argument, if
you need to pass a family
argument to glm
do it by quoting the
family, e.g., family="binomial"
.
You should be able to use a variable in the formula given to
fit.mult.impute
as a numeric variable in the regression model even
though it was a factor variable in the invocation of transcan
. Use
for example fit.mult.impute(y ~ codes(x), lrm, trans)
(thanks to
Trevor Thompson trevor@hp5.eushc.org).
For transcan, a list of class transcan with elements call (with the
function call), iter (number of iterations done), and rsq and rsq.adj
containing the R-squares and adjusted R-squares achieved in predicting
each variable from all the others. It also has elements categorical,
asis, coef, xcoef, parms, fillin, ranges, scale, and formula
containing respectively the values supplied for categorical and asis,
the within-variable coefficients used to compute the first canonical
variate, the (possibly shrunk) across-variables coefficients of the
first canonical variate that predicts each variable in turn, the
parameters of the transformation (knots for splines, contrast matrix
for categorical variables), the initial estimates for missing values
(NA if variable never missing), the matrix of ranges of the
transformed variables (min and max in first and second row), a vector
of scales used to determine convergence for a transformation, the
formula (if x was a formula), and optionally a vector of shrinkage
factors used for predicting each variable from the others. For "asis"
variables, the scale is the average absolute difference about the
median. For other variables it is unity, since canonical variables
are standardized. For xcoef, row i has the coefficients to predict
transformed variable i, with the column for the coefficient of
variable i set to NA. If imputed=TRUE was given, an optional element
imputed also appears. This is a list with the vector of imputed
values (on the original scale) for each variable containing NAs.
Matrices rather than vectors are returned if n.impute is given. If
trantab=TRUE, the trantab element also appears, as described above.
If n.impute > 0, transcan also returns a list residuals that can be
used for future multiple imputation.
impute returns a vector (the same length as var) of class "impute"
with NAs imputed. predict returns a matrix with the same number of
columns or variables as were in x.
fit.mult.impute returns a fit object that is a modification of the
fit object created by fitting the completed dataset for the final
imputation. The var matrix in the fit object has the
imputation-corrected variance-covariance matrix. coefficients is the
average (over imputations) of the coefficient vectors;
variance.inflation.impute is a vector containing the ratios of the
diagonals of the between-imputation variance matrix to the diagonals
of the average apparent (within-imputation) variance matrix.
missingInfo is Rubin's "rate of missing information" and dfmi is
Rubin's degrees of freedom for a t-statistic for testing a single
parameter. The last two objects are vectors corresponding to the
diagonal of the variance matrix.
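
The quantities named above can be sketched with one common form of Rubin's combining rules in base R. The function pool_rubin and its toy inputs are hypothetical, and the formulas for missingInfo and dfmi are the usual textbook approximations; the text does not state the exact expressions fit.mult.impute uses internally, so treat this as an illustration rather than a reimplementation.

```r
# Hypothetical helper: combine m coefficient vectors and variance matrices
pool_rubin <- function(coefs, vars) {
  m     <- length(coefs)
  bar   <- Reduce(`+`, coefs) / m          # average coefficient vector
  U_mat <- Reduce(`+`, vars) / m           # average within-imputation variance
  B_mat <- var(do.call(rbind, coefs))      # between-imputation variance
  U     <- diag(U_mat)
  B     <- diag(B_mat)
  tau   <- (1 + 1 / m) * B / U             # relative increase in variance
  list(coefficients              = bar,
       var                       = U_mat + (1 + 1 / m) * B_mat,
       variance.inflation.impute = B / U,
       missingInfo               = tau / (1 + tau),
       dfmi                      = (m - 1) * (1 + 1 / tau)^2)
}

# Toy inputs (hypothetical): three imputations, two coefficients
coefs  <- list(c(1.0, 0.5), c(1.1, 0.4), c(0.9, 0.6))
vars   <- replicate(3, diag(c(0.04, 0.01)), simplify = FALSE)
pooled <- pool_rubin(coefs, vars)
print(pooled)
```

A large missingInfo or a small dfmi for a coefficient indicates that imputation uncertainty contributes heavily to that coefficient's variance.
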
prints, plots, and impute.transcan creates new variables.
Frank Harrell
Department of Biostatistics
Vanderbilt University
f.harrell@vanderbilt.edu
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265–1323, 1990.
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 8:1303–1325, 1990.
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.
He X, Shen L: Linear regression after spline transformation. Biometrika 84:474–481, 1997.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.
Rubin DB, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585–598, 1991.
Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184–191, 2002.
aregImpute, impute, naclus, naplot, ace, avas, cancor, prcomp,
rcspline.eval, lsfit, approx, datadensity, mice
## Not run:
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans)  #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed)   #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age            <- impute(x.trans, age)
disease        <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH             <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH)       #uses summary.impute to tell about imputations
                  #and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans     <- predict(x.trans, xnew)
w              <- predict(x.trans, xnew, type="original")
age            <- w[,"age"]            #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans)  #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
                    imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
        blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code),  # disease.code categorical
              transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
plot(z$transformed)
## End(Not run)

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y  <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
f <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)

f <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)

h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))

diag(Varcov(h))

h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(Varcov(h.complete))

# Note: had Design's ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances

# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable

## Not run:
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z  <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
## End(Not run)

# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these

set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run:
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio   <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)

# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame. The following is based on the fact
# that the default output location for impute.transcan is
# given by where.out=1 (search position 1)

## Not run:
xt <- transcan(~. , data=mine,
               imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, pos=1, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)

# Example of using invertTabulated outside transcan
x    <- c(1,2,3,4,5,6,7,8,9,10)
y    <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1)      # so can reproduce
for(inverse in c('linearInterp','sample'))
  print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))

# Test inverse='sample' when the estimated transformation is
# flat on the right. First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
  par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
  # cat('Click mouse on graph to proceed\n')
  # locator(1)
}