R: Convert a SAS Dataset to an S Data Frame

sas.get {Hmisc}

R Documentation

Description

Converts a SAS dataset into an S data frame. You may choose to extract only a subset of variables or a subset of observations in the SAS dataset. You may have the function automatically convert PROC FORMAT-coded variables to factor objects. The original SAS codes are stored in an attribute called sas.codes, and these may be added back to the levels of a factor variable using the code.levels function.

Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables.

The chron function is used to set up date, time, and date-time variables. If using S-Plus 5 or later, the timeDate function is used instead. Under R, Dates is used for dates and chron for date-times. Times without dates still need to be stored in POSIX date-time format. Such SAS time variables are given a major class of timePOSIXt and a format.timePOSIXt function so that the date portion (which will always be 1/1/1970) will not print by default. If a date variable represents a partial date (.5 added if month missing, .25 added if day missing, .75 if both), an attribute partial.date is added to the variable, and the variable is also given class imputed. The describe function uses information about partial dates and special missing values.

There is an option to automatically uncompress (or gunzip) compressed SAS datasets.
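
As a rough illustration of the partial-date convention described above, the following base R sketch decodes the fractional part of a numeric date. The helper name `partial_date_status` is hypothetical and is not part of Hmisc; it only assumes that dates arrive as day counts with .25, .5, or .75 added as stated above.

```r
# Hypothetical helper (not part of Hmisc): classify partial dates from the
# fractional part of a numeric day count, following the convention above.
partial_date_status <- function(x) {
  frac <- x - floor(x)
  status <- rep("complete", length(x))
  status[which(frac == 0.5)]  <- "month missing"
  status[which(frac == 0.25)] <- "day missing"
  status[which(frac == 0.75)] <- "month and day missing"
  status[is.na(x)] <- NA
  status
}

# Made-up day counts, for illustration only
x <- c(15402, 15402.25, 15402.5, 15402.75, NA)
partial_date_status(x)
```

The fractions .25, .5, and .75 are exactly representable in floating point, so the equality comparisons are safe here.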

Usage

sas.get(library, member, variables=character(0), ifs=character(0),
     format.library=library, id,
     dates.=c("sas","yymmdd","yearfrac","yearfrac2"),
     keep.log=TRUE, log.file="_temp_.log", macro=sas.get.macro,
     data.frame.out=existsFunction("data.frame"), clean.up=!.R., quiet=FALSE,
     temp=tempfile("SaS"), formats=TRUE, recode=formats,
     special.miss=FALSE, sasprog="sas", 
     as.is=.5, check.unique.id=TRUE, force.single=FALSE,
     where, uncompress=FALSE)

is.special.miss(x, code)

x[...]

## S3 method for class 'special.miss':
print(x, ...)

## S3 method for class 'special.miss':
format(x, ...)

sas.codes(object)

code.levels(object)

Arguments

`library`	character string naming the directory in which the dataset is kept.
`member`	character string giving the second part of the two part SAS dataset name. (The first part is irrelevant here - it is mapped to the UNIX directory name.)
`x`	a variable that may have been created by `sas.get` with `special.miss=T` or with `recode` in effect.
`variables`	vector of character strings naming the variables in the SAS dataset. The S dataset will contain only those variables from the SAS dataset. To get all of the variables (the default), an empty string may be given. It is a fatal error if any one of the variables is not in the SAS dataset. You can use `sas.contents` to get the variables in the SAS dataset. If you have retrieved a subset of the variables in the SAS dataset and wish to retrieve the same list of variables from another dataset, you can program the value of `variables` - see one of the last examples.
`ifs`	a vector of character strings, each containing one SAS "subsetting if" statement. These will be used to extract a subset of the observations in the SAS dataset.
`format.library`	The UNIX directory containing the file formats.sct, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data).
`formats`	Set `formats` to `F` to keep `sas.get` from telling the SAS macro to retrieve value label formats from `format.library`. When you do not specify `formats` or `recode`, `sas.get` will set `formats` to `T` if a SAS format catalog (`.sct` or `.sc2`) file exists in `format.library`. Value label formats, if present, are stored as the `formats` attribute of the returned object (see below). A format is used if it is referred to by one or more variables in the dataset, if it contains no ranges of values (i.e., it identifies value labels for single values), and if it is a character format or a numeric format that is not used just to label missing values. If you set `recode` to `TRUE`, 1, or 2, `formats` defaults to `TRUE`. To fetch the values and labels for variable `x` in the dataset `d` you could type: `f <- attr(d$x, "format")` `formats <- attr(d, "formats")` `formats[[f]]$values; formats[[f]]$labels`
`recode`	This parameter defaults to `TRUE` if `formats` is `TRUE`. If it is `TRUE`, variables that have an appropriate format (see above) are recoded as `factor` objects, which map the values to the value labels for the format. Alternatively, set `recode` to 1 to use labels of the form value:label, e.g. 1:good 2:better 3:best. Set `recode` to 2 to use labels such as good(1) better(2) best(3). Since `sas.codes` and `code.levels` add flexibility, the usual choice for `recode` is `T` or `TRUE`.
`special.miss`	For numeric variables, any missing values are stored as NA in S. You can recover special missing values by setting `special.miss` to `TRUE`. This will cause the `special.miss` attribute and the `special.miss` class to be added to each variable that has at least one special missing value. Suppose that variable `y` was .E in observation 3 and .G in observation 544. The `special.miss` attribute for `y` then has the value `list(codes=c("E","G"),obs=c(3,544))`. To fetch this information for variable `y` you would say, for example, `s <- attr(y, "special.miss")` `s$codes; s$obs`, or use `is.special.miss(y)` or the `print.special.miss` method, which will replace `NA` values for the variable with `E` or `G` if they correspond to special missing values. The `describe` function uses this information in printing a data summary.
`id`	The name of the variable to be used as the row names of the S dataset. The id variable becomes the `row.names` attribute of a data frame, but the id variable is still retained as a variable in the data frame. (If `data.frame.out` is `FALSE`, this will be the attribute `"id"` of the S dataset.) You can also specify a vector of variable names as the `id` parameter. After fetching the data from SAS, all these variables will be converted to character format and concatenated (with a space as a separator) to form a (hopefully) unique ID variable.
`dates.`	specifies the format for storing SAS dates in the resulting data frame
`as.is`	If `data.frame.out=T`, SAS character variables are converted to S factor objects if `as.is=F` or if `as.is` is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (`n`) times `as.is`. The default for `as.is` is .5, so character variables are converted to factors only if they have fewer than `n/2` unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.
`check.unique.id`	If `id` is specified, the row names are checked for uniqueness if `check.unique.id=T`. If any are duplicated, a warning is printed. Note that if a data frame is being created with duplicate row names, statements such as `my.data.frame["B23",]` will retrieve only the first row with a row name of `"B23"`.
`force.single`	By default, SAS numeric variables having `LENGTH`s > 4 are stored as S double precision numerics, which allow for the same precision as a SAS `LENGTH` 8 variable. Set `force.single=T` to store every numeric variable in single precision (7 digits of precision). This option is useful when the creator of the SAS dataset has failed to use a `LENGTH` statement. R does not have single precision, so no attempt is made to convert to single if running R.
`dates.`	One of the character strings `"sas"`, `"yearfrac"`, `"yearfrac2"`, `"yymmdd"`. If a SAS variable has a date format (one of "DATE", "MMDDYY", "YYMMDD", "DDMMYY", "YYQ", "MONYY", "JULIAN"), it will be converted to the format specified by `dates.` before being given to S. `"sas"` gives days from 1/1/1960 (from 1/1/1970 if using `chron`), `"yearfrac"` gives days from 1/1/1900 divided by 365.25, `"yearfrac2"` gives year plus fraction of current year, and `"yymmdd"` gives a 6 digit number YYMMDD (year%%100, month, day). Note that S will store these as numbers, not as character strings. If `dates.="sas"` and a variable has one of the SAS date formats listed above, the variable will be given a class of "date" to work with Terry Therneau's implementation of the "date" class in S. If the `chron` package or `timeDate` function is available, these are used instead.
`keep.log`	logical flag: if `FALSE`, delete the SAS log file upon completion.
`log.file`	the name of the SAS log file.
`macro`	the name of an S object in the current search path that contains the text of the SAS macro called by S. The S object is a character vector that can be edited using for example sas.get.macro <- editor(sas.get.macro).
`data.frame.out`	logical flag: if `TRUE`, the return value will be an S data frame, otherwise it will be a list.
`clean.up`	logical flag: if `TRUE`, remove all temporary files when finished. You may want to keep these while debugging the SAS macro. Not needed for R.
`quiet`	logical flag: if `FALSE`, print the contents of the SAS log file if there has been an error.
`temp`	the prefix to use for the temporary files. Two characters will be added to this prefix, and the resulting name must fit on your file system.
`sasprog`	the name of the system command to invoke SAS
`uncompress`	set to `T` to automatically invoke the UNIX `gunzip` command (if `member.ssd01.gz` exists) or the `uncompress` command (if `member.ssd01.Z` exists) to uncompress the SAS dataset before proceeding. This assumes you have the file permissions to allow uncompressing in place. If the file is already uncompressed, this option is ignored.
`where`	by default, a list or data frame which contains all the variables is returned. If you specify `where`, each individual variable is placed into a separate object (whose name is the name of the variable) using the `assign` function with the `where` argument. For example, you can put each variable in its own file in a directory, which in some cases may save memory over attaching a data frame.
`code`	a special missing value code (A through Z or underscore) to check against. If `code` is omitted, `is.special.miss` will return a `T` for each observation that has any special missing value.
`object`	a variable in a data frame created by `sas.get`
`...`	ignored
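
The `as.is` rule described above can be sketched in base R as follows. This is only an illustration of the stated rule under the assumption that it is applied per character variable; `convert_character` is a hypothetical name and not the code `sas.get` actually runs.

```r
# Hypothetical helper mirroring the documented as.is rule; not Hmisc internals.
convert_character <- function(x, as.is = 0.5) {
  n <- length(x)
  to_factor <- identical(as.is, FALSE) ||
    (is.numeric(as.is) && as.is >= 0 && as.is <= 1 &&
       length(unique(x)) < n * as.is)
  if (to_factor) factor(x) else x
}

id  <- c("A1", "B2", "C3", "D4")      # all values unique: stays character
sex <- c("m", "f", "m", "f", "m")     # 2 unique values < 5 * 0.5: becomes factor

class(convert_character(id))
class(convert_character(sex))
class(convert_character(id, as.is = FALSE))
```

With the default of .5, identifier-like variables stay character while low-cardinality variables become factors.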

Details

If you specify special.miss=T and there are no special missing values in the SAS dataset, the SAS step will fail.
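
The layout of the `special.miss` attribute described under Arguments can be mimicked in base R without SAS. The sketch below builds such a vector by hand and queries it; `has_special_code` is a hypothetical stand-in written for this illustration and is not the Hmisc implementation of `is.special.miss`.

```r
# Build a numeric vector whose layout mimics what is described for
# special.miss=TRUE: NA in the data, codes and positions in an attribute.
y <- rep(1.5, 600)
y[c(3, 544)] <- NA
attr(y, "special.miss") <- list(codes = c("E", "G"), obs = c(3, 544))

# Hypothetical stand-in for is.special.miss(); not the Hmisc implementation.
has_special_code <- function(x, code) {
  s <- attr(x, "special.miss")
  out <- rep(FALSE, length(x))
  if (is.null(s)) return(out)
  hit <- if (missing(code)) rep(TRUE, length(s$codes)) else s$codes == code
  out[s$obs[hit]] <- TRUE
  out
}

which(has_special_code(y, "E"))   # position carrying code .E
which(has_special_code(y))        # positions carrying any special code
```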

For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode.
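
The effect of undefined levels, and the label forms produced by the different `recode` settings, can be imitated with `factor()` in base R. The values and labels below are made up for illustration, and the sketch does not reproduce what `sas.get` does internally.

```r
# Made-up format definition: three labelled values; 9 has no label.
values <- c(1, 2, 3)
labels <- c("good", "better", "best")
codes  <- c(1, 2, 3, 9, 2)

# Plain labels (like recode=TRUE): the unlabelled value 9 becomes NA.
f0 <- factor(codes, levels = values, labels = labels)

# Labels of the form value:label (like recode=1).
f1 <- factor(codes, levels = values, labels = paste(values, labels, sep = ":"))

# Labels of the form label(value) (like recode=2).
f2 <- factor(codes, levels = values, labels = paste0(labels, "(", values, ")"))

f0
levels(f1)
levels(f2)
```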

The SAS macro sas_get uses record lengths of up to 4096 in two places. If you are exporting records that are very long (because of a large number of variables and/or long character variables), you may want to edit these LRECLs to quadruple them, for example.

Value

If data.frame.out is TRUE, the output will be a data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names. If formats is TRUE and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.
If data.frame.out is FALSE, the output will be a list of vectors, each containing a variable from the SAS dataset. If id was specified, that element of the list will be used as the id attribute of the entire list.
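
To make the attribute layout concrete, the following sketch builds a small data frame by hand with the label-free `format` and `formats` attributes described above, then looks up value labels. The format name `rating` and its contents are invented for the example; a real object returned by `sas.get` may carry additional attributes.

```r
# Hand-built stand-in for a sas.get() result (invented format "rating").
d <- data.frame(x = c(1, 3, 2))
attr(d$x, "format") <- "rating"
attr(d, "formats") <- list(
  rating = list(values = c(1, 2, 3), labels = c("good", "better", "best"))
)

# Look up the format by the name stored on the variable.
f   <- attr(d$x, "format")
fmt <- attr(d, "formats")[[f]]

# Map each stored value to its label.
x_labels <- fmt$labels[match(d$x, fmt$values)]
x_labels
```

Indexing with `[[f]]` matters here because `f` holds the format name as a character string.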

Side Effects

If a SAS error occurs and quiet is FALSE, then the SAS log file will be printed under the control of the less pager.

Background

The references cited below explain the structure of SAS datasets and how they are stored under UNIX. See SAS Language for a discussion of the "subsetting if" statement.

Note

You must be able to run SAS (by typing sas) on your system. If the S command !sas does not start SAS, then this function cannot work.

If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame if the timeDate function is not available.
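
The Description states that SAS times without dates are stored as date-times whose date portion is always 1/1/1970. A base R sketch of that idea, independent of chron, is shown below; the use of `as.POSIXct` with the UTC time zone is an assumption made for the illustration and is not necessarily how `sas.get` constructs these variables.

```r
# A time of day expressed as seconds since midnight (11:13:45).
secs <- 11 * 3600 + 13 * 60 + 45

# Store it as a date-time on 1970-01-01 (UTC assumed for this sketch).
t1 <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Print only the time portion, hiding the placeholder date.
format(t1, "%H:%M:%S", tz = "UTC")

# The placeholder date is still there if asked for.
format(t1, "%Y-%m-%d", tz = "UTC")
```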

Author(s)

Terry Therneau, Mayo Clinic
Frank Harrell, Vanderbilt University
Bill Dunlap, University of Washington and Insightful Corporation
Michael W. Kattan, Cleveland Clinic Foundation

References

SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.

SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.

SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.

Examples

## Not run: 
sas.contents("saslib", "mice")
# [1] "dose"  "ld50"  "strain"  "lab_no"
# attr(, "n"):
# [1] 117
mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
plot(mice$dose, mice$ld50)

nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
        ifs="if strain='nude'")

nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
        var=c("dose", "ld50"), ifs="if strain='nude'")

# Get a dataset from current directory, recode PROC FORMAT; VALUE ... 
# variables into factors with labels of the form "good(1)" "better(2)",
# get special missing values, recode missing codes .D and .R into new
# factor levels "Don't know" and "Refused" for variable q1
d <- sas.get(".", "mydata", recode=2, special.miss=TRUE)
attach(d)
nl <- length(levels(q1))
lev <- c(levels(q1), "Don't know", "Refused")
q1.new <- as.integer(q1)
q1.new[is.special.miss(q1,"D")] <- nl+1
q1.new[is.special.miss(q1,"R")] <- nl+2
q1.new <- factor(q1.new, 1:(nl+2), lev)
# Note: would like to use factor() in place of as.integer ... but
# factor in this case adds "NA" as a category level

d <- sas.get(".", "mydata")
sas.codes(d$x)    # for PROC FORMATted variables returns original data codes
d$x <- code.levels(d$x)   # or attach(d); x <- code.levels(x)
# This makes levels such as "good" "better" "best" into e.g.
# "1:good" "2:better" "3:best", if the original SAS values were 1,2,3

# Retrieve the same variables from another dataset (or an update of
# the original dataset)
mydata2 <- sas.get(".", "mydata2", var=names(d))
# This only works if none of the original SAS variable names contained _
mydata2 <- cleanup.import(mydata2) # will make true integer variables

# Code from Don MacQueen to generate SAS dataset to test import of
# date, time, date-time variables
# data ssd.test;
#     d1='3mar2002'd ;
#     dt1='3mar2002 9:31:02'dt;
#     t1='11:13:45't;
#     output;
#
#     d1='3jun2002'd ;
#     dt1='3jun2002 9:42:07'dt;
#     t1='11:14:13't;
#     output;
#     format d1 mmddyy10. dt1 datetime. t1 time.;
# run;
## End(Not run)
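
The date values in the SAS test step above can be used to illustrate the numeric representations described for the `"sas"`, `"yymmdd"`, and `"yearfrac"` options. This is a base R sketch that needs neither SAS nor Hmisc; it follows the definitions given under Arguments and is not taken from the `sas.get` source. `"yearfrac2"` is omitted because its exact formula is not spelled out here.

```r
# The two d1 values from the SAS test step, as R dates.
d1 <- as.Date(c("2002-03-03", "2002-06-03"))

# "sas": days from 1/1/1960.
sas_days <- as.numeric(d1 - as.Date("1960-01-01"))

# "yymmdd": 6 digit number built from year %% 100, month, day
# (stored as a number, so a leading zero is dropped).
yymmdd <- as.numeric(format(d1, "%y%m%d"))

# "yearfrac": days from 1/1/1900 divided by 365.25.
yearfrac <- as.numeric(d1 - as.Date("1900-01-01")) / 365.25

# Converting a SAS day count back to an R Date.
back <- as.Date(sas_days, origin = "1960-01-01")

data.frame(d1, sas_days, yymmdd, yearfrac, back)
```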

[Package Hmisc version 3.0-10 Index]