## STAT536: HW7 Questions

HW7, question (1): Is the likelihood L(n|F.IS,p,c) a multinomial distribution (n.i1,n.i2,n.i3|P11,P12,P22), where P11/P12/P22 can be estimated from F.IS, q, and c? I am not sure I understand the question correctly.

The likelihood, which I'll write in Bayesian-style notation as $P(n \mid F_{IS}, p, c)$, can be written, using the law of total probability by conditioning on and integrating over the unobserved $q$, as

$P(n \mid F_{IS}, p, c) = \int_q P(n \mid F_{IS}, q) \, P(q \mid p, c) \, dq$

Here, I have dropped from the conditioning any quantity that does not affect the random variable in question. $p$ is dropped from the distribution of $n$ because once $q$ is known, $p$ is irrelevant. $F_{IS}$ is dropped from the distribution of $q$ because $F_{IS}$ is a within-population property and $q$ are allele frequencies across populations (also see notes). $P(q \mid p, c)$ was given explicitly in class; I'm just asking you to repeat it. The other conditional probability is multinomial, as you suggest.
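To make the multinomial piece concrete, here is a sketch for subpopulation $i$ with genotype counts $(n_{i1}, n_{i2}, n_{i3})$ for A1A1, A1A2, A2A2 (the indexing follows the question; it is an assumption, not the notation of the assignment). The genotype probabilities use the standard inbreeding parametrization, which depends only on $F_{IS}$ and $q_i$, not on $c$:

$P(n_i \mid F_{IS}, q_i) = \frac{n_i!}{n_{i1}! \, n_{i2}! \, n_{i3}!} \, P_{11}^{n_{i1}} P_{12}^{n_{i2}} P_{22}^{n_{i3}}$, where $n_i = n_{i1} + n_{i2} + n_{i3}$

$P_{11} = q_i^2 + F_{IS} \, q_i (1 - q_i)$

$P_{12} = 2 q_i (1 - q_i)(1 - F_{IS})$

$P_{22} = (1 - q_i)^2 + F_{IS} \, q_i (1 - q_i)$

If the subpopulations are sampled independently (again an assumption), $P(n \mid F_{IS}, q)$ is the product of these terms over $i$.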

HW7, question (2): Do I need equation (2) to estimate F.IS.hat and c.hat? I mean, will the data, if any, from lnL(n|F.IS,c) be used for estimation, or should I use some other formula for estimation, like var(q)=cp(1-p)?

Yes. Even if you use $\mathrm{Var}(q)$ (to get $F_{ST}$, for example), you are going to need something, namely $\hat{c}$, to plug in for $c$. You get $\hat{c}$ by maximizing eq. (2). [Note $\mathrm{Var}(q) \neq c p (1 - p)$ because the distribution assumed for $q$ is truncated at 0 and 1.]

Let me just try to explain what is going on. See equation (1). It is the likelihood that you technically need to maximize over $F_{IS}$, $p$, and $c$ in order to generate estimates of $F_{IS}$, $F_{ST}$, and $F_{IT}$ (the latter two are functions of the MLEs of $F_{IS}$, $p$, and $c$).

Unfortunately, (1) is difficult to maximize, so eq. (2) makes some assumptions. First, it assumes that the integral in (1) accumulates most of its mass at and near $q = \hat{q}$. Therefore, rather than computing the integral, it just substitutes $\hat{q}$ into the integrand and dispenses with the integral. Second, equation (2) assumes that the likelihood in (1), which is a function sitting in 4-dimensional space above the $(F_{IS}, p, c)$ space, has a sharply peaked ridge above the plane defined by $p = \hat{p}$. In other words, for all possible values of $F_{IS}$ and $c$, including the special values $\hat{F}_{IS}$ and $\hat{c}$, equation (2) assumes that $p = \hat{p}$ maximizes the likelihood.
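Written out, the two assumptions just described suggest an approximation of roughly the following shape. This is only a sketch of what the description implies; eq. (2) itself is not reproduced on this page, so its exact form (constants, indexing over subpopulations) may differ.

$P(n \mid F_{IS}, p, c) \approx P(n \mid F_{IS}, \hat{q}) \, P(\hat{q} \mid \hat{p}, c)$

$\ln L(F_{IS}, c) \approx \ln P(n \mid F_{IS}, \hat{q}) + \ln P(\hat{q} \mid \hat{p}, c)$

The integral over $q$ is gone, and only $F_{IS}$ and $c$ remain as free parameters.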

Do I estimate 12 F.IS, one for each subpopulation?

This is a good question, because I failed to state an assumption:

Assume $F_{IS}$ is constant across subpopulations. Please note that, for a subpopulation with allele frequency $q$,

$F_{IS} = \frac{E(A_1 A_1 \mid q) - q^2}{q (1 - q)}$

(from the notes, p. 3 of lec13).  Rearrange:

$E(A_1 A_1 \mid q) = q^2 + F_{IS} \, q (1 - q)$

$E(A_1 A_1 \mid q)$ is the expected proportion of the A1A1 genotype in the subpopulation. In other words, $F_{IS}$ is just the inbreeding $f$ we used before for single populations. Here, we are assuming the inbreeding within each subpopulation is the same.
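As an illustration of "one $F_{IS}$, many $q$", the sketch below computes genotype proportions for several subpopulations that share a single inbreeding coefficient. The function name `genotype_probs` and the allele frequencies are hypothetical; the A1A2 and A2A2 expressions are the standard companions of the A1A1 expression above and are not quoted from the notes.

```r
# Genotype proportions for allele frequencies q (one per subpopulation)
# under a single shared inbreeding coefficient f (playing the role of F_IS).
genotype_probs <- function(q, f) {
  cbind(A1A1 = q^2 + f * q * (1 - q),
        A1A2 = 2 * q * (1 - q) * (1 - f),
        A2A2 = (1 - q)^2 + f * q * (1 - q))
}

q <- c(0.2, 0.5, 0.7)            # hypothetical allele frequencies
P <- genotype_probs(q, f = 0.1)  # one row per subpopulation, same f throughout
P
rowSums(P)                       # each row sums to 1
```

Each row differs only because $q$ differs; the inbreeding coefficient is the same in every row.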

> 2. Similarly, would I have 12 F.is and c MLEs, one for each subpopulation? (What's the R code for n!, btw?)

$c$ is a parameter in the distribution of $q$.  There is one $q$ for each subpopulation; let's call them $q_1, q_2, \ldots, q_{12}$.  In particular, $c$ relates to the variance in $q_1, q_2, \ldots, q_{12}$.  It does not make sense to have a $c$ for each population, so there is only one $c$.
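In symbols, one way to read this (a sketch pieced together from the remarks on this page about truncation at 0 and 1 and the untruncated variance $c p (1 - p)$; the centering at $p$ and the independence across subpopulations are assumptions, and the form given in class is the authoritative one):

$q_i \mid p, c \sim N(p, \; c \, p (1 - p))$ truncated to $[0, 1]$, for $i = 1, \ldots, 12$

A single $c$ therefore governs the spread of all twelve $q_i$ around $p$.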

(n! = factorial(n))
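For a log likelihood, it is usually safer to work with log-factorials, since `factorial(n)` overflows to `Inf` once n exceeds 170. A minimal sketch, with hypothetical counts and genotype probabilities, showing the multinomial log probability computed by hand with `lfactorial` and checked against R's built-in `dmultinom`:

```r
# Hypothetical counts of A1A1, A1A2, A2A2 in one subpopulation, and
# hypothetical genotype probabilities (they sum to 1).
counts <- c(30, 40, 30)
probs  <- c(0.28, 0.44, 0.28)

# Log multinomial probability written out with log-factorials.
ll_manual <- lfactorial(sum(counts)) - sum(lfactorial(counts)) +
  sum(counts * log(probs))

# The same quantity from R's built-in multinomial density.
ll_builtin <- dmultinom(counts, prob = probs, log = TRUE)

all.equal(ll_manual, ll_builtin)
```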

2. When using optim to find the F.is and c estimates, what function am I supposed to supply?

You are trying to maximize the log likelihood of the data over the unknown parameters $F_{IS}$ and $c$.  What is the likelihood?  Parts (a) and (b) are all about obtaining a simplified version of the log likelihood, resulting finally in eq. (2).
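Mechanically, `optim` minimizes, so the function you hand it should return the negative log likelihood as a function of a parameter vector. The sketch below shows that pattern only; it is not eq. (2) from the assignment. The counts are made up, the plug-in estimates `q_hat` and `p_hat` are assumed definitions, and the truncated normal for $q$ is inferred from the discussion on this page.

```r
# Toy genotype counts (made up for illustration): rows = subpopulations,
# columns = counts of A1A1, A1A2, A2A2.
n <- matrix(c(30, 40, 30,
              20, 50, 30,
              45, 35, 20,
              25, 45, 30), ncol = 3, byrow = TRUE)

# Plug-in allele frequencies: q_hat per subpopulation, p_hat as their mean
# (assumption: the assignment may define these differently).
q_hat <- (2 * n[, 1] + n[, 2]) / (2 * rowSums(n))
p_hat <- mean(q_hat)

# Negative approximate log likelihood as a function of par = (F_IS, c).
neg_loglik <- function(par) {
  f <- par[1]
  cc <- par[2]
  if (cc <= 0) return(1e10)              # keep the search inside c > 0
  probs <- cbind(q_hat^2 + f * q_hat * (1 - q_hat),
                 2 * q_hat * (1 - q_hat) * (1 - f),
                 (1 - q_hat)^2 + f * q_hat * (1 - q_hat))
  if (any(probs <= 0)) return(1e10)      # reject invalid genotype probabilities
  # Multinomial part: one term per subpopulation.
  ll_n <- sum(sapply(seq_len(nrow(n)), function(i)
    dmultinom(n[i, ], prob = probs[i, ], log = TRUE)))
  # Normal density truncated to [0, 1], evaluated at each q_hat.
  s <- sqrt(cc * p_hat * (1 - p_hat))
  ll_q <- sum(dnorm(q_hat, p_hat, s, log = TRUE) -
                log(pnorm(1, p_hat, s) - pnorm(0, p_hat, s)))
  -(ll_n + ll_q)
}

fit <- optim(c(0.1, 0.1), neg_loglik)    # default Nelder-Mead search
fit$par                                  # estimates of (F_IS, c)
```

Returning a large finite value outside the valid region is a simple way to keep the default Nelder-Mead search from wandering into impossible parameter values; box constraints via `method = "L-BFGS-B"` would be an alternative.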

You ask for a numerical approach to get var(q), and I assume the bootstrap is one strategy to get it. Again, the same question: since q = (q1,q2,...,q12), how should I do the simulation then?

The bootstrap is for getting the variance of a statistic estimated from data.  Here you are asked ONLY to get the MLEs.  However, $\hat{F}_{ST}$ is a function of $\hat{F}_{IS}$ and $\hat{c}$.  Specifically, this function involves the variance of a truncated normal distribution.  If it weren't truncated, the variance would be just $c p (1 - p)$, and you would plug in $\hat{c}$ (and $\hat{p}$) for $c$ (and $p$).  However, now you are asked to get the variance of this truncated normal distribution.

Now, suppose you didn't know the variance of some arbitrary distribution $g$, but you could simulate from it.  How would you estimate the variance?  Yes, a concept RELATED to, but not the same thing as, the bootstrap enters here. ...

You simulate a lot of data $X$ from $g$ and then compute $\mathrm{var}(X)$. So, now the key question:  Can you simulate data from the truncated normal?
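One possible way to do this, sketched under assumptions: draw from the untruncated normal and keep only the draws that land in $[0, 1]$ (simple rejection sampling). The plug-in values below are hypothetical, and the mean $p$ and variance $c p (1 - p)$ of the untruncated normal are taken from the discussion above rather than from the assignment itself.

```r
set.seed(1)

p_hat <- 0.3   # hypothetical plug-in value for p
c_hat <- 0.2   # hypothetical plug-in value for c
s <- sqrt(c_hat * p_hat * (1 - p_hat))   # sd of the untruncated normal

# Rejection step: simulate from the normal, keep draws inside [0, 1].
x <- rnorm(1e6, mean = p_hat, sd = s)
x <- x[x >= 0 & x <= 1]

var_trunc <- var(x)            # simulated variance of the truncated normal
var_trunc
c_hat * p_hat * (1 - p_hat)    # untruncated variance, for comparison
```

The simulated variance comes out smaller than $c p (1 - p)$, which is the point of the earlier note that the two are not equal. Rejection sampling is wasteful when most of the normal's mass lies outside $[0, 1]$; in that case an inverse-CDF draw built from `runif`, `pnorm`, and `qnorm` is an alternative.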