## DormanResearch : CBrotherTutorial

Referers: cBrother :: CBrotherData :: CBrotherInstallation :: CBrotherManual :: CBrotherObtain :: CBrotherProcessingOutput :: CBrotherRunning :: SarahSpencer :: (Remote :: Orphans :: Tree )
Dorman Wiki
Dorman Lab Wiki

#### 0cBrother Tutorial

You are going to analyze two HBV recombinant sequences in a series of steps using the MCMC sampler cBrother and perl processing script procpost.  If you have not already downloaded and installed cBrother, do so now using the link above.  Installation of procpost is included as part of this tutorial.  This tutorial assumes that you have installed cBrother in a central location and can access it simply by typing the command cbrother.

Extra information and details are included in brown .  Specific unix commands are written in green .  Output is in red.  Estimated time to complete each step is boxed.  The tutorial in total will take 30-60 minutes, though you can speed it up considerably by taking indicated shortcuts.

1. 15 sec. Create a working directory where you will place all files and output (mkdir cbro).  Change into that directory now (cd cbro).

2. 30 sec. Download the data alignment cd.phy (right-click "Save Link As..." or similar).  It is in phylip format.

File "cd.phy" contains one putative HBV recombinant AF461043 and six representative sequences, one each from HBV genotypes A, B, C, D, F, and H.  The representative sequence used is the consensus generated from several known nonrecombinant sequences for each genotype.  We suspect AF461043 is a C/D recombinant, and we will use cBrother to verify this suspicion.

You may find the perl script ptop that comes with procpost to be useful in manipulating Fasta, Phylip, and ClustalW? alignments.

3. 30 sec.. Download the command file cd.cmdfile that will be used to analyze the alignment.

Only two parameters are altered from the command file defaults.  The length of the MCMC run is set using option length:, because we don't want to spend all day on this tutorial.  A parental tree is specified using option parent_tree: that fixes the relationship among all six representative sequences.  We assume there is no uncertainty in this relationship, something that is worth testing, but not addressed here.

4. 4 min. Run cBrother twice.  There will be a lot of output to the screen and it will take a few minutes.  Be patient!

cbrother $RANDOM cd.cmdfile cd.phy cda.post cbrother$RANDOM cd.cmdfile cd.phy cdb.post

producing sample output

Seed: 28011
DCPSamplerSetup (  0):  41.96  0.79  5 -18639.78 M5U  :|: -18639.78
DCPRun (  99): 2.81  0.03  2 -10101.04 M2U  :|: -10101.04
DCPRun (198): 2.77  0.03  2 -10100.69 M2U  :|: -10100.69
DCPRun (297): 3.08  0.04  2 -4692.58 M2U [1383 P]  2.35  0.03  2 -5390.67 M2U  :|: -10083.25
DCPRun (396): 2.80  0.04  2 -4676.30 M2U [1376 P]  2.35  0.03  2 -5407.55 M2U  :|: -10083.85
DCPRun (495): 2.95  0.04  2 -4617.45 M2U [1362 P]  2.35  0.03  2 -5467.10 M2U  :|: -10084.55
...

We want two posterior files so we can compare them using procpost to assess convergence.  The above commands assume linux with the Bash shell  that conveniently provides an environmental variable $RANDOM that generates a random integer on every invocation. You can replace$RANDOM in these commands with any random positive integer if you are not using Bash.

5. 2 min. Install procpost, available for download here.  Be sure to save this file into your working directory and extract the files.

tar xzvf procpostVERSION.tar.gz

Here, VERSION below should be replaced with the version you downloaded, e.g. 1.0.  You'll also want to obtain and install R  so procpost can help with convergence diagnostics.

The procpost script depends on several perl modules that are located in the directory lib after extraction.  The script procpost looks for perl modules in any central perl library location and in the directory ./lib in the current directory.  You will have to edit procpost and/or the location of the perl modules in order to use procpost from another directory or to install procpost centrally.

6. 1 min. Run procpost to obtain results and convergence statistics.

./procpost --post=cda.post:cdb.post --phy=cd.phy --genotypes=D:A:C:B:F:H --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=cd.out

We are interested in the inference of inter-genotype recombination of the query sequence AF461043 so we'll extract statistics about inter-genotype (aka ancestry) topology change points.  We'll also examine the inference of parameter change points as a second check on convergence.  Type ./procpost --help for more information about procpost options.

7. variable Examine critical lines of the procpost output, including the most likely recombinant structure (grep human_modal_ie_cop_structure: cd.out)

human_modal_ie_cop_structure: C-D-C

demonstrating that the recombinant has two crossover points.  Their posterior median locations are (grep median: cd.out)

ie_cop0_median: 1399
ie_cop1_median: 2177

showing the first topology change point at alignment position 1399, and the second at alignment position 2177 (your results may differ).

The procpost output is a collection of key/value pairs separated by a colon, much like the cmdfile format.  You can read the file in any text editor, or from the command line, the unix grep is very helpful for extracting relevant lines quickly. 95% Bayesian credible intervals are also computed (search for lbci and ubci).  For reasons beyond the scope of this tutorial, you should use the posterior median or mode (procpost argument --mode) as the point estimate for change point locations, rather than mean.

8. variable Check convergence.  If you have R installed, you can also use procpost to assess convergence.  Even if you don't have R, you must somehow assess convergence before finalizing your analysis.  Among the output values are potential scale reduction factors (PSRF; Gelman, 1992) and pvalues.  PSRFs should be close to 1, and everyone knows p-values should not be small.  For our runs, we find (grep gelman_rubin: cd.out && grep pvalue cd.out)

ie_cop0_gelman_rubin: 2.44315266173678
ie_cop1_gelman_rubin: 1.07957861403807
number_ie_cop_gelman_rubin: 1.15052539980724
number_pcp_gelman_rubin: 1.07083965214301
ie_cop0_bigger_smaller_than_median_proportions_test_pvalue01: 7.918443e-47
ie_cop0_gelman_rubin_pvalue: 0.6420397
ie_cop1_bigger_smaller_than_median_proportions_test_pvalue01: 0.2655302
ie_cop1_gelman_rubin_pvalue: 0.6859094
modal_ie_cop_structure_proportions_test_pvalue01: 6.417112e-20
number_ie_cop_gelman_rubin_pvalue: 0.6992192
number_pcp_gelman_rubin_pvalue: 0.6946738

indicating several reasons to worry about convergence for this example.

9. [Optional] 15 min. Not optional in reality, it is important to carry out longer MCMC runs.  Continue the current runs for another 400000 MCMC samples:

cbrother --length=510000 --continue=cda.post $RANDOM cd.cmdfile cd.phy cda2.post cbrother --length=510000 --continue=cdb.post$RANDOM cd.cmdfile cd.phy cdb2.post

and reanalyze for convergence

./procpost --post=cda2.post:cdb2.post --phy=cd.phy --genotypes=D:A:C:B:F:H --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=cd2.out

You may need to continue even further for good convergence.

10. 30 sec. Suppose you analyze another HBV sequence, accession AY817511, and find it is also a C-D-C recombinant.  Download the second alignment 2cd.phy, including both recombinants.  We will use cBrother to determine if the recombinants are related.

As opposed to the original analysis, this alignment includes three representative sequences each from genotypes D and C, the known parentals.  It includes one representative each from genotypes A and B that will serve as outgroups.  We will use cBrother to determine whether the two recombinants are always closest neighbors in the phylogeny at every point of the alignment.  If not, we must conclude that recombinants are not related, or at least one has gone through subsequent recombination.

It differs from the previous file in a few ways.  To speed up this tutorial, we will only infer the number and location of topology change points and not parameter change points.  We specify a genotype tree, allowing uncertainty in the branching order of representative sequences within a genotype.  We indicate which representative sequences belong to which genotypes using clade: options.  We also tell cBrother to allow split topologies where the genotype assigned to each query differs using the allow_split_ancestry:? option.  If the topology change points do not exactly match up for these two recombinants, split ancestry is expected.  Finally, we specify the monophyly prior on topology using the tau_prior: command file option.  This prior puts 50% prior probability on the event that both query sequences are monophyletic (nearest neighbors) throughout the alignment.  Posterior support for polyphyly above 95% would indicate strong evidence against the monophyly hypothesis.

12. 6 min. Run cBrother twice.  Be patient, or if you want to speed up this tutorial and ignore convergence testing, run cBrother only once.

cbrother $RANDOM 2cd.cmdfile 2cd.phy 2cda.post cbrother$RANDOM 2cd.cmdfile 2cd.phy 2cdb.post

13. variable Extract results and check convergence

./procpost --post=2cda.post:2cdb.post --phy=2cd.phy --genotypes=D:D:D:C:C:C:A:B --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=2cd.out --monophyly --cp_profile=ie --cp_profile=split --profile=ie --quiet > 2cd.profile

producing convergence statistics

ie_cop0_gelman_rubin: 1.00480744745207
ie_cop1_gelman_rubin: 0.999848465536918
ie_cop2_gelman_rubin: 2.54284524946627
number_ie_cop_gelman_rubin: 1.01390823377106
number_pcp_gelman_rubin: 1
ie_cop0_bigger_smaller_than_median_proportions_test_pvalue01: 0.04395422
ie_cop0_gelman_rubin_pvalue: 0.6836504
ie_cop1_bigger_smaller_than_median_proportions_test_pvalue01: 0.8007343
ie_cop1_gelman_rubin_pvalue: 0.6826527
ie_cop2_bigger_smaller_than_median_proportions_test_pvalue01: 1.699371e-78
ie_cop2_gelman_rubin_pvalue: 0.6511856
modal_ie_cop_structure_proportions_test_pvalue01: 3.093e-09
monophyly_proportions_test_pvalue01: 1
number_ie_cop_gelman_rubin_pvalue: 0.6825686
number_pcp_gelman_rubin_pvalue: 0.5

There is again an indication of failure to converge, and a proper analysis would include a longer run.  Also, we have used procpost to extract additional information.  --cp_profile=ie extracts all locations of inter-genotype topology changes point from all MCMC samples.  --profile=ie outputs the marginal distribution of parentage at each site along the genome.  Option --monophyly requests a test of the hypothesis of monophyly, whose results are reported as the Bayes factor in support of monophly (grep log10_bf_monophyly: 2cd.out)

log10_bf_monophyly: -inf

indicating no support at all for the monophyly hypothesis.

If you only ran cBrother once to produce output file 2cda.post, then replace the procpost command as follows:

./procpost --post=2cda.post --phy=2cd.phy --genotypes=D:D:D:C:C:C:A:B --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=2cd.out --monophyly --cp_profile=ie --cp_profile=split --profile=ie --quiet > 2cd.profile

14. 15 sec. Make a pretty plot

echo "source('plot.recomb.R'); plot.recomb(output.postscript=T)" | R --slave

and view the resulting postscript file, called 2cd.ps by default.  A sample plot is shown below.

The top plot in the resulting postscript file is a diagram of the most likely recombinant structure.  Topology change point estimates are indicated along with 95% Bayesian credible intervals in gray and genotype parentage in black text. The posterior support for each topology change point is indicated above the corresponding change point.  The second plot shows the distribution of topology change point locations, with blue indicating genotype change points and yellow highlighting the proportion that are split topology change points.  The bottom plot shows the posterior support for each genotype at each position along the alignment.  All plots show similar information in slightly different ways.  In this case, the overwhelming support for two split topology change points with a region of split topology between them indicates these two recombinants are not both descendents of the same recombination event.  The posterior probability that these recombinants are globally monophyletic is zero.