## STAT430 : Project Two : Fall 2007
Dorman Lab Wiki

Philosophy first. Remember that the best way to estimate the sampling distribution of a statistic $T\left(x\right)$ would be to obtain multiple realizations of the random variable $T\left(x\right)$ by producing multiple samples from the population and computing $T\left(x\right)$ on each. Then, for example, a good unbiased estimate of its variance would be obtained by applying the R function var() to the vector of $T\left(x\right)$ values thus produced. However, this is rarely feasible. We cannot go and repeat a study 1000 times; we barely get enough money to do it once!

In the absence of abundant money, we turn to the poor person's solution: the bootstrap. We PRETEND to redo the experiment by resampling the original data from the first experiment $K$ times with replacement. To get an estimate of $Var\left[T\left(x\right)\right]$, we then pretend that these bootstrap datasets result from multiple independent samples and apply var() to the statistics computed on the bootstrap datasets.

In bootstrapping, you resample the original data $x$ with replacement to create bootstrap datasets ${x}_{i}^{*},i=1,...,K$. From the original data, you compute some statistic $T\left(x\right)$, for example the mle estimates $\hat{\beta }$, that estimates a population parameter $\theta $, for example the linear regression coefficients $\beta $. From each bootstrap dataset you can also estimate the statistic $T\left({x}_{i}^{*}\right):={T}_{i}^{*}$. In some cases, you may also estimate the standard error of this estimate, $s\left({x}_{i}^{*}\right):={s}_{i}^{*}$, for example the standard errors of the coefficient estimates $\hat{\beta }$ produced by R. Now, you have a list of bootstrap data ${T}_{1}^{*},...,{T}_{K}^{*}$ or, with standard error estimates, $({T}_{1}^{*},{s}_{1}^{*}),...,({T}_{K}^{*},{s}_{K}^{*})$. There are several different things you can do with this bootstrap data.
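As a minimal sketch of this resampling scheme, here is the generic bootstrap loop with the sample mean standing in for $T\left(x\right)$ (all variable names here are illustrative, not part of the assignment):

```r
set.seed(1)                    # for reproducibility
x <- rnorm(50, mean=2)         # stand-in for the original sample
K <- 1000                      # number of bootstrap datasets
T.star <- numeric(K)
for (i in 1:K) {
    x.star <- sample(x, replace=TRUE)  # resample the data with replacement
    T.star[i] <- mean(x.star)          # statistic on the bootstrap dataset
}
# bootstrap estimate of the standard error of T(x)
se.boot <- sd(T.star)
```

The vector T.star plays the role of ${T}_{1}^{*},...,{T}_{K}^{*}$ above, and applying var() or sd() to it is exactly the "pretend the bootstrap datasets are independent samples" step.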

- Estimate the standard error of the statistic $T\left(x\right)$. Simply compute the sample variance of the data points ${T}_{i}^{*}$ and take the square root. This is distinct from the standard error estimate provided by R in summary(glm(...,data=x)) on the original data, but should be very similar when all linear model assumptions are satisfied and the sample size is large.
- Estimate confidence intervals for population parameter $\theta $. There are two different ways to do this, plus one introduced in the homework, which I won't summarize here.
- bootstrap-t: Compute $R\left({x}_{i}^{*}\right)=\frac{T\left({x}_{i}^{*}\right)-T\left(x\right)}{s\left({x}_{i}^{*}\right)}:={R}_{i}^{*}$ for each bootstrap data set, rank, choose limits ${R}_{L}$ and ${R}_{U}$, and reverse map to ${T}_{L}$ and ${T}_{U}$.
- bootstrap percentile method: Rank the ${T}_{i}^{*}$, choose ${T}_{L}$ and ${T}_{U}$; these are your CI limits.
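For concreteness, the percentile method applied to a vector of bootstrap replicates amounts to sorting and reading off quantiles. A sketch with made-up replicates, assuming a 95% interval (names illustrative):

```r
set.seed(2)
T.star <- rnorm(1000, mean=0.8, sd=0.3)  # stand-in bootstrap replicates T_i^*
T.sorted <- sort(T.star)                  # rank the replicates
K <- length(T.star)
# choose the lower and upper limits for a 95% interval
T.L <- T.sorted[ceiling(0.025 * K)]
T.U <- T.sorted[floor(0.975 * K)]
c(T.L, T.U)                               # percentile CI
```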

get data

get 2nd data (if first one gives you problems or if you haven't downloaded data yet, go straight here)

- reg.exer: does the individual engage in regular exercise?
- fr.and.veggies: does the individual frequently eat fruit and vegetables?
- smoker: is the individual a smoker?
- hi.blood.press: does the individual have a history of high blood pressure?
- stress: does the individual have persistently high psychological stress?
- cyp1A: does the individual have a mutant gene at the cyp1A locus?
- snpc9: what genotype does the individual have on the chromosome 9 SNP known to be associated with heart disease?
- coffee: does the individual drink a low, medium, or high number of cups of coffee each day?
- waist.hip.ratio: the waist to hip circumference ratio of the individual
- apo.ratio: the ratio of apoA1 to apoB of the individual
- heart.att: has the individual experienced a heart attack?

- Your starting model is heart.att ~ reg.exer + fr.and.veggies + smoker + hi.blood.press + stress + cyp1A:coffee + snpc9 + waist.hip.ratio + apo.ratio, which you have based on previous studies showing that each of these factors or covariates has a significant impact on heart disease. Call this model the *full model*. Use backward regression to reduce this model to the simplest model in which all included terms are significant. Call this model the *reduced model*. Did all covariates that were insignificant in the full model (as measured by the standard large-sample approximation and the p-value reported by R) get excluded? Did any covariates listed as significant in the full model get excluded?

**Hints:**

- Don't forget to use sep="," in read.table() because the data comes as a comma-delimited file.
- You are doing logistic regression, right? glm(..., family=binomial)
- See step(..., direction="backward") and use the above model as your starting model.

- Use the bootstrap to estimate the standard error for each coefficient in the *reduced model*. How well do they agree with the standard errors estimated by R?

**Note:** Bootstrapping regression data consists of sampling individuals $({x}_{i},{y}_{i})$, where ${x}_{i}$ are the covariates and ${y}_{i}$ is the response for individual $i$, with replacement. For each sample, re-estimate the regression. See R's function coef() for help on extracting the pseudosample statistics of interest.

**Hints:**

- Below is example code (you'll need to add additional code to it for answering Questions 2 and 3), including a check for unconverged bootstrap model fits. If you save it as file.R, you can run it from the Linux command line as

$ R --slave --no-save --quiet < file.R

or from the R prompt as

> source("file.R")

- Please note that this code extracts all the data you'll need for Questions 2 *and* 3. [Updated as of 2007-12-06 5:18pm; it was OK before, but this is easier to reuse.]
- Some people report problems with the code below in that the bootstrap sampling doesn't seem to work (perhaps Windows machines behave differently?). Try instead:
```r
d.tmp <- d[sample(sample.size, replace=T),]
m.bs[[i]] <- glm(m.r$formula, family=binomial, data=d.tmp)
```

```r
d <- read.table("your file", sep=",")
m.f <- glm(heart.att ~ reg.exer + fr.and.veggies + smoker + hi.blood.press + stress + cyp1A:coffee + snpc9 + waist.hip.ratio + apo.ratio, data=d, family=binomial)
# And evals are done!
print("Answer to Q1:")
summary(m.f)
m.r <- step(m.f, direction="backward")
summary(m.r)
# sample size may vary, depending on data set
sample.size <- length(d[[1]])
# set to one if cyp1A:coffee is in reduced model
use.one.less <- 0
# number of bootstraps
K <- 1000
# standard errors bigger than the following will be indicative of unconverged results
too.big <- 100
# number of parameters in my model
df <- length(coef(m.r)) - use.one.less
# create my objects
m.bs <- list()
c.bs <- matrix(nrow=K, ncol=df)
s.bs <- matrix(nrow=K, ncol=df)
# loop until I get K good bootstrap pseudo data sets
i <- 1
while (i <= K) {
    # store model
    m.bs[[i]] <- glm(m.r$formula, family=binomial, data=d[sample(sample.size, replace=T),])
    # extract beta estimates
    c.bs[i,] <- coef(m.bs[[i]])[1:df]
    # extract standard error estimates
    s.bs[i,] <- sqrt(diag(vcov(m.bs[[i]])))[1:df]
    # report progress
    print(paste(i, ": ", max(s.bs[i,])))
    # check for unconverged results (should not occur with large dataset 2)
    if (max(s.bs[i,]) > too.big) {
        print(paste("repeating bootstrap ", i))
        i <- i - 1
    }
    i <- i + 1
}
```
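Once the loop has filled the matrix c.bs (one row per bootstrap fit, one column per coefficient), the bootstrap standard errors for Question 2 are just the column-wise sample standard deviations. A sketch with a stand-in matrix in place of the real c.bs:

```r
# stand-in for the K x df matrix c.bs filled by the bootstrap loop
set.seed(3)
c.bs <- matrix(rnorm(1000 * 3), nrow=1000, ncol=3)
# bootstrap standard error of each coefficient: sd of its column
se.bs <- apply(c.bs, 2, sd)
print(se.bs)
```

These can be compared directly with R's own estimates from the original fit, sqrt(diag(vcov(m.r))).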

- Compute confidence intervals for the coefficient or effect of one covariate or factor with borderline significance using several different methods:
  - Normal approximation with R-estimated standard errors.
  - Normal approximation using bootstrap-estimated standard errors.
  - Bootstrap-t.

**Hints:**

- see above code for the data you'll need
- see R function vcov to get R's estimates of the variances of the coefficients
- After ranking the ${R}_{i}^{*}$ and choosing the lower and upper limits ${R}_{L}^{*},{R}_{U}^{*}$, the CI is $\left(T\left(x\right)+\sqrt{V\left(x\right)}{R}_{L}^{*},T\left(x\right)+\sqrt{V\left(x\right)}{R}_{U}^{*}\right)$, where $V\left(x\right)$ is R's estimated variance of the coefficient from the original fit.
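A sketch of that reverse mapping, with stand-in values for the bootstrap pivots and for R's variance estimate (all names and numbers illustrative):

```r
set.seed(4)
T.x <- 0.8                     # T(x): estimate on the original data
V.x <- 0.04                    # V(x): R's variance estimate, e.g. from vcov()
R.star <- rt(1000, df=20)      # stand-in bootstrap pivots R_i^*
R.sorted <- sort(R.star)       # rank the pivots
K <- length(R.star)
R.L <- R.sorted[ceiling(0.025 * K)]
R.U <- R.sorted[floor(0.975 * K)]
# reverse map to a CI on the scale of T
CI <- c(T.x + sqrt(V.x) * R.L, T.x + sqrt(V.x) * R.U)
print(CI)
```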

- Bootstrap quantiles (**Note:** This method samples with replacement, re-estimates the model, then sorts the estimates of the coefficient in question and peels off the lower and upper bounds.)

**Note:**If you save your results from question 2, you need not resample in this question.
