Let me summarize the bootstrap lectures.
Philosophy first. Remember that the best way to estimate the sampling distribution of a statistic \hat{\theta}
would be to obtain multiple realizations of the random variable \hat{\theta}
by producing multiple samples from the population and computing \hat{\theta}
on each. Then, for example, a good unbiased estimate of the variance Var(\hat{\theta}) would be obtained by applying the R function var() to the vector of \hat{\theta} values
thus produced. However, this is rarely feasible. We cannot go and repeat a study 1000 times; we barely get enough money to do it once!
In the absence of abundant money, we turn to the poor person's solution: the bootstrap. We PRETEND to redo the experiment 1000 times by resampling the original data from the first experiment 1000
times with replacement. To get an estimate of Var(\hat{\theta}), we then simply pretend that these 1000 bootstrap datasets result from multiple independent samples and apply var() to the statistics computed on the bootstrap data sets.
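The "pretend" experiment above can be sketched in a few lines of R. This is a minimal illustration, not part of the assignment: the data, seed, and choice of the median as the statistic are all made up for the example.

```r
# Bootstrap estimate of the sampling variance of a statistic.
set.seed(1)                       # hypothetical seed, for reproducibility
x <- rexp(50)                     # the one sample we can afford
B <- 1000                         # number of pretend repetitions

# Each pretend experiment: resample x with replacement, recompute the statistic.
theta.star <- replicate(B, median(sample(x, replace = TRUE)))

# var() applied to the bootstrap statistics estimates Var(theta-hat);
# its square root is the bootstrap standard error.
var(theta.star)
sqrt(var(theta.star))
```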
In bootstrapping, you resample the original data x_1, ..., x_n
with replacement to create bootstrap datasets
x^{*1}, ..., x^{*B}. From the original data, you compute some statistic \hat{\theta}, for example the MLE estimates \hat{\beta}, that estimates a population parameter \theta, for example the linear regression coefficients \beta.
From each bootstrap dataset you can also estimate the statistic \hat{\theta}^{*b}.
In some cases, you may also estimate the standard error of this estimate, \hat{se}^{*b}, for example the standard errors of the coefficients \beta produced by R. Now, you have a list of bootstrap data
\hat{\theta}^{*1}, ..., \hat{\theta}^{*B} or, with standard error estimates,
(\hat{\theta}^{*1}, \hat{se}^{*1}), ..., (\hat{\theta}^{*B}, \hat{se}^{*B}). There are several different things you can do with this bootstrap data.
- Estimate the standard error of the statistic \hat{\theta}. Simply compute the sample variance of the data points \hat{\theta}^{*1}, ..., \hat{\theta}^{*B} and take the square root. This is distinct from the standard error estimate provided by R in summary(glm(...,data=x)) on the original data, but the two should be very similar when all linear model assumptions are satisfied and the sample size is large.
- Estimate confidence intervals for the population parameter \theta. There are two different ways to do this, plus one introduced in the homework, which I won't summarize here.
- bootstrap-t: Compute t^{*b} = (\hat{\theta}^{*b} - \hat{\theta}) / \hat{se}^{*b} for each bootstrap data set, rank the t^{*b}, choose limits t^{*(\alpha/2)} and t^{*(1-\alpha/2)}, and reverse map them to \hat{\theta} - t^{*(1-\alpha/2)} \hat{se} and \hat{\theta} - t^{*(\alpha/2)} \hat{se}.
- bootstrap percentile method: Rank the \hat{\theta}^{*b}, choose the \alpha/2 and 1 - \alpha/2 quantiles \hat{\theta}^{*(\alpha/2)} and \hat{\theta}^{*(1-\alpha/2)}; they will be your CI.
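Both CI methods can be sketched side by side. This toy example bootstraps the mean of a simulated sample; the data, seed, and 95% level are assumptions for illustration only.

```r
# Percentile and bootstrap-t confidence intervals for a mean.
set.seed(2)
x <- rnorm(40, mean = 5)
n <- length(x)
theta.hat <- mean(x)
se.hat <- sd(x) / sqrt(n)          # standard error on the original data

B <- 2000
theta.star <- numeric(B)
se.star <- numeric(B)
for (b in 1:B) {
  xb <- sample(x, replace = TRUE)  # one bootstrap data set
  theta.star[b] <- mean(xb)
  se.star[b] <- sd(xb) / sqrt(n)   # its own standard error, for bootstrap-t
}

# Percentile method: rank the theta.star and peel off the quantiles.
ci.pct <- quantile(theta.star, c(0.025, 0.975))

# Bootstrap-t: studentize, rank, then reverse-map around theta.hat.
t.star <- (theta.star - theta.hat) / se.star
t.lim <- quantile(t.star, c(0.025, 0.975))
ci.t <- c(theta.hat - t.lim[2] * se.hat,
          theta.hat - t.lim[1] * se.hat)
```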
In this project, you will study risk factors that contribute to heart attack, based on a sample of 100 individuals (heart attack victims have been sampled at a disproportionate rate, so prediction of heart attack risk in average individuals would not be a valid use of this data set). Risk factors considered include behavioral risks, physical traits, and genetic factors. The cyp1A locus, for example, is a gene on chromosome 15 with various mutants associated with heart disease. A SNP is a single nucleotide polymorphism. These locations in the human genome are sites where there is person-to-person variability. The SNP mentioned here on chromosome 9 is another genetic factor recently associated with heart disease.
Get the second data set (if the first one gives you problems, or if you haven't downloaded the data yet, go straight here).
Description of Data
- reg.exer: does the individual engage in regular exercise?
- fr.and.veggies: does the individual frequently eat fruit and vegetables?
- smoker: is the individual a smoker?
- hi.blood.press: does the individual have a history of high blood pressure?
- stress: does the individual have persistently high psychological stress?
- cyp1A: does the individual have a mutant gene at the cyp1A locus?
- snpc9: what genotype does the individual have on the chromosome 9 SNP known to be associated with heart disease?
- coffee: does the individual drink a low, medium, high number of cups of coffee each day?
- waist.hip.ratio: the waist to hip circumference ratio of the individual
- apo.ratio: the ratio of apoA1 to apoB of the individual
- heart.att: has the individual experienced a heart attack?
- Your starting model is heart.att ~ reg.exer + fr.and.veggies + smoker + hi.blood.press + stress + cyp1A:coffee + snpc9 + waist.hip.ratio + apo.ratio, which you have based on previous studies showing that each of these factors or covariates has a significant impact on heart disease. Call this model the full model. Use backward regression to reduce this model to the simplest model in which all included terms are significant. Call this model the reduced model. Did all covariates that were insignificant (as measured by the standard large-sample approximation and the p-value reported by R) in the full model get excluded? Did any covariates listed as significant in the full model get excluded?
- Don't forget to use sep="," in read.table() because the data comes as a comma-delimited file.
- You are doing logistic regression, right? glm(..., family=binomial)
- See step(..., direction="backward") and use the above model as your starting model.
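The hints above fit together as follows. The file name `heart.csv` and the use of `header = TRUE` are assumptions about your download, so adjust them to match your file; this is a sketch, not the definitive call sequence.

```r
# Read the comma-delimited data (file name is an assumption).
heart <- read.table("heart.csv", sep = ",", header = TRUE)

# Full logistic regression model from the assignment.
full <- glm(heart.att ~ reg.exer + fr.and.veggies + smoker + hi.blood.press +
              stress + cyp1A:coffee + snpc9 + waist.hip.ratio + apo.ratio,
            family = binomial, data = heart)

# Backward elimination starting from the full model.
reduced <- step(full, direction = "backward")

# Compare which terms survived against their full-model p-values.
summary(full)
summary(reduced)
```

Note that step() drops terms by AIC rather than by p-value, so the surviving terms may not line up exactly with the significance pattern in summary(full); that comparison is the point of the question.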
- Use the bootstrap to estimate the standard error of each coefficient in the reduced model. How well do they agree with the standard errors estimated by R? Note: Bootstrapping regression data consists of sampling individuals (x_i, y_i), where x_i are the covariates and y_i is the response for individual i, with replacement. For each sample, re-estimate the regression. See R's function coef() for help on extracting the pseudosample statistics of interest.
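Resampling individuals means resampling whole rows of the data frame. The sketch below uses a small simulated logistic-regression data set so it runs on its own; for the project you would replace `dat` and the formula with your heart data and reduced model.

```r
# Bootstrap standard errors of glm coefficients by resampling rows.
set.seed(3)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * dat$x1 + 0.8 * dat$x2))
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

B <- 500
beta.star <- matrix(NA, B, length(coef(fit)))
colnames(beta.star) <- names(coef(fit))
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)            # resample individuals (x_i, y_i)
  refit <- glm(y ~ x1 + x2, family = binomial, data = dat[idx, ])
  beta.star[b, ] <- coef(refit)               # coef() extracts the estimates
}

# Bootstrap SE = sd of each coefficient across pseudosamples,
# side by side with R's large-sample standard errors.
boot.se <- apply(beta.star, 2, sd)
cbind(bootstrap = boot.se,
      R = summary(fit)$coefficients[, "Std. Error"])
```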
- Compute confidence intervals for the coefficient or effect of one covariate or factor with borderline significance using several different methods:
Compare and contrast the confidence intervals you estimate. If there are differences, give reasons why they might differ.
- Normal approximation with R-estimated standard errors.
- Normal approximation using bootstrap-estimated standard errors.
- see above code for the data you'll need
- see R function vcov to get R's estimates of the variances of the coefficients
- After ranking the t^{*b} and choosing the lower and upper limits t^{*(\alpha/2)} and t^{*(1-\alpha/2)}, the CI is (\hat{\beta} - t^{*(1-\alpha/2)} \hat{se}, \hat{\beta} - t^{*(\alpha/2)} \hat{se}).
- Bootstrap quantiles (Note: This method samples with replacement, re-estimates the model, then sorts the estimates of the coefficient in question and peels off the lower and upper bounds.)
Note: If you save your results from question 2, you need not resample in this question.
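The normal-approximation and quantile methods for one coefficient can be sketched together. Again the data are simulated so the example is self-contained; `x1` stands in for your borderline covariate, and the 95% level is an assumption.

```r
# Three confidence intervals for one glm coefficient.
set.seed(4)
n <- 100
dat <- data.frame(x1 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.3 + 0.5 * dat$x1))
fit <- glm(y ~ x1, family = binomial, data = dat)

B <- 500
beta.star <- replicate(B, {
  idx <- sample(n, replace = TRUE)            # resample individuals
  coef(glm(y ~ x1, family = binomial, data = dat[idx, ]))["x1"]
})

b.hat <- coef(fit)["x1"]
z <- qnorm(0.975)

# 1) Normal approximation with R's standard error, via vcov().
se.R <- sqrt(diag(vcov(fit)))["x1"]
ci.normal.R <- b.hat + c(-z, z) * se.R

# 2) Normal approximation with the bootstrap standard error.
ci.normal.bs <- b.hat + c(-z, z) * sd(beta.star)

# 3) Bootstrap quantiles: sort the estimates, peel off the bounds.
ci.quantile <- quantile(beta.star, c(0.025, 0.975))
```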