## STAT430 : ProjectTwo

Referers: Fall2007 :: (Remote :: Orphans :: Tree )
Dorman Wiki
Dorman Lab Wiki

### 0Project Two

##### 0Preliminaries
Let me summarize the bootstrap lectures.

Philosophy first.  Remember that the best way to estimate the sampling distribution a statistic $T ( x )$ would be to obtain multiple realizations of the random variable $T ( x )$ by producing multiple samples from the population and computing $T ( x )$ on each. Then, for example, a good unbiased estimate of variance would be obtained by applying R function var() to the vector of $T ( x )$ thus produced.  However, this is rarely feasible.  We cannot go and repeat a study 1000 times; we barely get enough money to do it once!

In the absence of abundant money, we turn to the poor person's solution: bs.  We PRETEND to redo the experiment 1000 times by resampling the original data from the first experiment $K$ times with replacement.  To get an estimate of $V a r [ T ( x ) ]$, we simply then pretend that these bs datasets result from multiple independent samples and apply var() to the bootstrap data sets.

In bootstrapping, you resample the original data $x$ with replacement to create bootstrap datasets $x i ⋅ , i = 1 , ... , K$.  From the original data, you compute some statistic $T ( x )$, for example the mle estimates $^ β$, that estimates a population parameter $θ$, for example the linear regression coefficients $β$.  From each bs dataset you can also estimate the statistic $T ( x i ⋅ ) : = T i ⋅$.  In some cases, you may also estimate the standard error of this estimate, $s ( x i ⋅ ) : = s i ⋅$, for example the standard errors of coefficients \beta produced by R.  Now, you have a list of bs data $T 1 ⋅ , ... , T K ⋅$ or, with standard error estimates, $( T 1 ⋅ , s 1 ⋅ ) , ... , ( T K ⋅ , s K ⋅ )$.  There are several different things you can do with this bs data.

1. Estimate the standard error of the statistic $T ( x )$.  Simply compute the sampler variance of the data points $T i ⋅$ and take square root.  This is distinct from the standard error estimate provided by R in summary(glm(...,data=x)) on the original data, but should be very similar when all linear model assumptions are satisfied and sample size is large.
2. Estimate confidence intervals for population parameter $θ$.  There are two different ways to do this, plus one introduced in the homework, which I won't summarize here.
1. bootstrap-t: Compute $R ( x i ⋅ ) = T ( x i ⋅ ) - T ( x ) S ( x i ⋅ ) : = R i ⋅$ for each bs data set, rank, choose limits $R L$ and $R U$, reverse map to $T L$ and $T U$.
2. bootstrap percentile method:  Rank $T I ⋅$, choose $T L$ and $T U$; they will be your CI

##### 0Introduction
In this project, you will study the risk factors that contribute to a person's risk for heart attack based on a sample of 100 individuals (heart attack victims have been sampled at a disproportionate rate so prediction of heart attack risk in average individuals would not be a valid use of this data set).  Risk factors considered include behavioral risks, physical traits, and genetic factors.  The cyp1A locus, for example, is a gene on chromosome 15 with various mutants associated with heart disease.  A SNP is a single nucleotide polymorphism.  These locations in the human genome are sites where there is person-to-person variability.  The SNP mentioned here on chromosome 9 is another genetic factor recently associated with heart disease.

get data
get 2nd data  (if first one gives you problems or if you haven't downloaded data yet, go straight here)

##### 0Description of Data
• reg.exer: does the individual engage in regular exercise?
• fr.and.veggies: does the individual frequently eat fruit and vegetables?
• smoker: is the individual a smoker?
• hi.blood.press: does the individual have a history of high blood pressure?
• stress: does the individual have persistently high psychological stress?
• cyp1A: does the individual have a mutant gene at the cyp1A locus?
• snpc9: what genotype does the individual have on the chromosome 9 SNP known to be associated with heart disease?
• coffee: does the individual drink a low, medium, high number of cups of coffee each day?
• waist.hip.ratio: the waist to hip circumference ratio of the individual
• apo.ratio: the ratio of apoA1 to apoB of the individual
• heart.att: has the individual experienced a heart attack?

##### 0Questions
1. Your starting model is heart.att ~ reg.exer + fr.and.veggies + smoker + hi.blood.press + stress + cyp1A:coffee + snpc9 + waist.hip.ratio + apo.ratio, which you have based on previous studies showing that each of these factors or covariates has a significant impact on heart disease.  Call this model the full model.  Use back regression to reduce this model to the simplest model with all included terms significant.  Call this model the reduced model.  Did all covariates that were insignificant (as measured by standard large sample approximation and the p-value reported by R) in the full model get excluded?  Did any covariates listed as significant in the full model get excluded?
Hints:
• Don't forget to use sep="," in read.table() because the data comes as a comma-delimited file.
• You are doing logistic regression, right?  glm(..., family=binomial)
• See step(..., direction="backward") and use the above model as your starting model.
2. Use bootstrap to estimate the standard error for each coefficient in the reduced model.  How well do they agree with the standard errors estimated by R?  Note:  Bootstrapping regression data consists of sampling individuals $( x i , y i )$, where $x i$ are the covariates and $y i$ is the response for individual $i$, with replacement.  For each sample, re-estimate the regression.  See R's function coef() for help on extracting the pseudosample statistics of interest.
Hints:
• Below is example code (you'll need to add additional code to it for answering questions 2 and 3), including a check for unconverged bootstrap model fits.  If you save it as file.R, you can run it like this:
$R --slave --no-save --quiet < file.R from the linux command line or > source("file.R") from the R prompt. • Please note that this code extracts all the data you'll need for Questions 2 and 3. [Updated as of 2007-12-06 5:18pm; it was OK before, but this is easier to reuse.] • Some people report problems with the code below in that the bootstrap sampling doesn't seem to work (maybe Windows machines behave differently?). Try instead, d.tmp <- d[sample(sample.size,replace=T),] m.bs[[i]] <- glm(m.r$formula, family = binomial, data = d.tmp)
d <- read.table("your file", sep=",")
m.f <- glm(heart.att ~ reg.exer + fr.and.veggies + smoker + hi.blood.press + stress + cyp1A:coffee + snpc9 + waist.hip.ratio + apo.ratio, data=d,family=binomial)

# And evals are done!
summary(m.f)
m.r <- step(m.f, direction="backward");
summary(m.r)

# sample size may vary, depending on data set
sample.size <- length(d[[1]])
# set to one if cyp1A:coffee is in reduced model
use.one.less <- 0
# number of bootstraps
K <- 1000
# standard errors bigger than the following will be indicative of unconverged results
too.big <- 100
# number of parameters in my model
df <- length(coef(m.r)) - use.one.less
# create my objects
m.bs <- list()
c.bs <- matrix(nrow=K,ncol=df)
s.bs <- matrix(nrow=K,ncol=df)

# loop until I get K good bs pseudo data sets
i <- 1
while(i<=K) {
# store model
m.bs[[i]] <- glm(m.r\$formula, family = binomial, data = d[sample(sample.size,replace=T),])
# extract beta estimates
c.bs[i,] <- coef(m.bs[[i]])[1:df]
# extract standard error estimates
s.bs[i,] <- sqrt(diag(vcov(m.bs[[i]])))[1:df]
# report progress
print(paste(i, ": ", max(s.bs[i,])))
# check for unconverged results (should not occur with large dataset 2)
if(max(s.bs[i,])>too.big) {
print(paste("repeating bootstrap ", i))
i <- i-1
}
i <- i + 1
}

3. Compute confidence intervals for the coefficient or effect of one covariate or factor with borderline significance using several different methods:
1. Normal approximation with R-estimated standard errors.
2. Normal approximate using bootstrap-estimated standard errors.
3. Bootstrap-t
Hints:
• see above code for the data you'll need
• see R function vcov to get R's estimates of the variances of the coefficients
• After ranking the $R i ⋅$ and choosing the lower and upper limits $R L ⋅ , R U ⋅$, the CI is $T ( x ) + V ( x ) R L ⋅ , T ( x ) + V ( x ) R U ⋅$.
4. Bootstrap quantiles (Note: This method sample with replacement, re-estimates the model, then sorts estimates of the coefficient in question and peels off lower and upper bound.)
Compare and contrast the confidence intervals you estimate.  If there are differences, give reasons why they might differ.
Note: If you save your results from question 2, you need not resample in this question.