You are going to analyze two HBV recombinant sequences in a series of steps using the MCMC sampler cBrother and the Perl processing script procpost. If you have not already downloaded and installed cBrother, do so now using the link above. Installation of procpost is included as part of this tutorial. This tutorial assumes that you have installed cBrother in a central location and can access it simply by typing the command cbrother.
Extra information and details are included in brown. Specific unix commands are written in green. Output is in red. Estimated time to complete each step is boxed. The tutorial in total will take 30-60 minutes, though you can speed it up considerably by taking indicated shortcuts.
1. 15 sec. Create a working directory where you will place all files and output (mkdir cbro). Change into that directory now (cd cbro).
2. 30 sec. Download the data alignment cd.phy (right-click "Save Link As..." or similar). It is in phylip format.
File "cd.phy" contains one putative HBV recombinant AF461043 and six representative sequences, one each from HBV genotypes A, B, C, D, F, and H. The representative sequence used is the consensus generated from several known nonrecombinant sequences for each genotype. We suspect AF461043 is a C/D recombinant, and we will use cBrother to verify this suspicion.
You may find the perl script ptop that comes with procpost to be useful in manipulating Fasta, Phylip, and ClustalW files.
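Before running the sampler it can help to confirm that the download is intact. The following Python sketch reads only the first line of a Phylip file, which by convention holds the number of sequences and the number of alignment columns. The function name read_phylip_header is hypothetical, and the example parses a toy in-memory alignment, not cd.phy itself; for the tutorial file you would expect seven sequences (the query plus six representatives).

```python
import io


def read_phylip_header(handle):
    """Return (n_sequences, n_sites) from the first non-blank line of a Phylip file."""
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip leading blank lines
        if len(fields) < 2:
            raise ValueError("Phylip header must contain two integers")
        return int(fields[0]), int(fields[1])
    raise ValueError("empty file")


# Toy alignment for illustration only; this is NOT the content of cd.phy.
toy = io.StringIO(
    " 3 8\n"
    "seqA      ACGTACGT\n"
    "seqB      ACGTACGA\n"
    "seqC      ACGAACGT\n"
)
n_sequences, n_sites = read_phylip_header(toy)
print(n_sequences, n_sites)
```

To check the real file, pass open("cd.phy") in place of the toy handle.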
3. 30 sec. Download the command file cd.cmdfile that will be used to analyze the alignment.
Only two parameters are altered from the command file defaults. The length of the MCMC run is set using option length: to keep this tutorial short. A parental tree is specified using option parent_tree: that fixes the relationship among all six representative sequences. We assume there is no uncertainty in this relationship, an assumption that is worth testing but is not addressed here.
4. 4 min. Run cBrother twice. There will be a lot of output to the screen and it will take a few minutes. Be patient!
cbrother $RANDOM cd.cmdfile cd.phy cda.post
cbrother $RANDOM cd.cmdfile cd.phy cdb.post
producing sample output
DCPSamplerSetup ( 0): 41.96 0.79 5 -18639.78 M5U :|: -18639.78
DCPRun ( 99): 2.81 0.03 2 -10101.04 M2U :|: -10101.04
DCPRun (198): 2.77 0.03 2 -10100.69 M2U :|: -10100.69
DCPRun (297): 3.08 0.04 2 -4692.58 M2U [1383 P] 2.35 0.03 2 -5390.67 M2U :|: -10083.25
DCPRun (396): 2.80 0.04 2 -4676.30 M2U [1376 P] 2.35 0.03 2 -5407.55 M2U :|: -10083.85
DCPRun (495): 2.95 0.04 2 -4617.45 M2U [1362 P] 2.35 0.03 2 -5467.10 M2U :|: -10084.55
We want two posterior files so we can compare them using procpost to assess convergence. The above commands assume linux with the Bash shell, which conveniently provides an environment variable $RANDOM that generates a random integer on every invocation. You can replace $RANDOM in these commands with any random positive integer if you are not using Bash.
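If Bash is not available, the seed can be drawn some other way. The sketch below, in Python, picks a random positive integer and assembles the two command lines in the argument order used above. The helper names make_seed and cbrother_command are hypothetical, and the upper bound 32767 is simply the largest value Bash's $RANDOM produces; whether cBrother accepts larger seeds is not stated in this tutorial.

```python
import random


def make_seed(upper=32767):
    """Return a random integer in [1, upper]; 32767 mirrors the maximum of Bash's $RANDOM."""
    return random.SystemRandom().randint(1, upper)


def cbrother_command(seed, cmdfile, alignment, postfile):
    """Build the argument list in the order used by this tutorial."""
    return ["cbrother", str(seed), cmdfile, alignment, postfile]


# One independent seed per run, so the two posterior files come from different chains.
commands = [
    cbrother_command(make_seed(), "cd.cmdfile", "cd.phy", name)
    for name in ("cda.post", "cdb.post")
]
for cmd in commands:
    print(" ".join(cmd))
```

The printed lines can be pasted into any shell; recording the seeds also makes a run reproducible later.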
5. 2 min. Install procpost, available for download here. Be sure to save this file into your working directory and extract the files.
tar xzvf procpostVERSION.tar.gz
Here, VERSION should be replaced with the version you downloaded, e.g. 1.0. You'll also want to obtain and install R, which can help with convergence diagnostics.
The procpost script depends on several perl modules that are located in the directory lib after extraction. The script procpost looks for perl modules in any central perl library location and in the directory ./lib in the current directory. You will have to edit procpost and/or the location of the perl modules in order to use procpost from another directory or to install procpost centrally.
6. 1 min. Run procpost to obtain results and convergence statistics.
./procpost --post=cda.post:cdb.post --phy=cd.phy --genotypes=D:A:C:B:F:H --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=cd.out
We are interested in the inference of inter-genotype recombination of the query sequence AF461043, so we'll extract statistics about inter-genotype (aka ancestry) topology change points. We'll also examine the inference of parameter change points as a second check on convergence. Type ./procpost --help for more information about procpost.
7. variable Examine critical lines of the procpost output, including the most likely recombinant structure (grep human_modal_ie_cop_structure: cd.out), which demonstrates that the recombinant has two crossover points. Their posterior median locations (grep median: cd.out) show the first topology change point at alignment position 1399 and the second at alignment position 2177 (your results may differ).
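A posterior median summarizes the change point locations visited across MCMC samples. The Python sketch below shows one common way to compute a median and an equal-tailed 95% interval from such samples, and how the mean can be pulled away from the well-supported location by a few distant samples. The sample values are invented for illustration, the function name summarize_change_point is hypothetical, and procpost may compute its intervals differently.

```python
import statistics


def summarize_change_point(samples):
    """Return the posterior median and an equal-tailed 95% interval of the samples."""
    # quantiles(n=40) returns 39 cut points spaced 2.5% apart:
    # the first is the 2.5% quantile and the last is the 97.5% quantile.
    cuts = statistics.quantiles(samples, n=40, method="inclusive")
    return statistics.median(samples), cuts[0], cuts[-1]


# Hypothetical sampled locations of one change point (NOT real cBrother output).
samples = [1390, 1395, 1399, 1399, 1399, 1400, 1402, 1405, 2170, 2177]
median, lower, upper = summarize_change_point(samples)
mean = statistics.mean(samples)
print(f"median={median} interval=({lower}, {upper}) mean={mean}")
```

In this toy data the median stays among the bulk of the samples, while the mean lands at a position that almost no sample supports.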
The procpost output is a collection of key/value pairs separated by a colon, much like the cmdfile format. You can read the file in any text editor, or, from the command line, the unix grep command is very helpful for extracting relevant lines quickly. 95% Bayesian credible intervals are also computed (search for lbci and ubci). For reasons beyond the scope of this tutorial, you should use the posterior median or mode (procpost argument --mode) as the point estimate for change point locations, rather than the mean.
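Because the output is made of colon-separated key/value lines, it is also easy to read programmatically. The following Python sketch splits each line at the first colon and groups values by key. The key names and values in the sample are placeholders chosen for illustration; the real keys in cd.out (for example those matched by grep median:) may be named and laid out differently.

```python
def parse_key_values(lines):
    """Split each 'key: value' line at the first colon and collect values per key."""
    result = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        result.setdefault(key.strip(), []).append(value.strip())
    return result


# Hypothetical excerpt; real key names and value layout in cd.out may differ.
sample = [
    "example_median: 1399",
    "example_median: 2177",
    "example_lbci: 1362",
    "a line without a separator",
]
parsed = parse_key_values(sample)
medians = [int(v) for v in parsed["example_median"]]
print(medians)
```

To read a real file, pass an open file handle, e.g. parse_key_values(open("cd.out")), and look up whichever keys grep shows you.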
8. variable Check convergence. If you have R installed, you can also use procpost to assess convergence. Even if you don't have R, you must somehow assess convergence before finalizing your analysis. Among the output values are potential scale reduction factors (PSRF; Gelman and Rubin, 1992) and p-values. PSRFs should be close to 1, and p-values should not be small. For our runs, the values (grep gelman_rubin: cd.out && grep pvalue cd.out) indicate several reasons to worry about convergence for this example.
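To make the PSRF less of a black box, here is a minimal Python sketch of the basic potential scale reduction factor for one scalar quantity sampled in several equal-length chains. It uses the textbook within-chain and between-chain variance formula; the function name psrf and the toy chains are illustrative assumptions, and procpost (via R) may apply corrections that this sketch omits.

```python
import math
import statistics


def psrf(chains):
    """Basic potential scale reduction factor for equal-length chains of one scalar."""
    n = len(chains[0])
    if len(chains) < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [statistics.mean(c) for c in chains]
    within = statistics.mean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                      # B
    pooled = (n - 1) / n * within + between / n  # pooled variance estimate
    return math.sqrt(pooled / within)


# Toy chains (NOT cBrother output): one pair overlaps, the other pair does not.
close = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
apart = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
print(psrf(close), psrf(apart))
```

Chains that explore the same region give a value near 1, while chains stuck in different regions give a value well above 1, which is the warning sign to look for in the gelman_rubin lines.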
9. [Optional] 15 min. Although this step is marked optional, longer MCMC runs are important in a real analysis. Continue the current runs for another 400000 MCMC samples:
cbrother --length=510000 --continue=cda.post $RANDOM cd.cmdfile cd.phy cda2.post
cbrother --length=510000 --continue=cdb.post $RANDOM cd.cmdfile cd.phy cdb2.post
and reanalyze for convergence
./procpost --post=cda2.post:cdb2.post --phy=cd.phy --genotypes=D:A:C:B:F:H --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=cd2.out
You may need to continue even further for good convergence.
10. 30 sec. Suppose you analyze another HBV sequence, accession AY817511, and find it is also a C-D-C recombinant. Download the second alignment 2cd.phy, including both recombinants. We will use cBrother to determine if the recombinants are related.
Unlike the original analysis, this alignment includes three representative sequences each from genotypes D and C, the known parentals. It includes one representative each from genotypes A and B that will serve as outgroups. We will use cBrother to determine whether the two recombinants are always closest neighbors in the phylogeny at every point of the alignment. If not, we must conclude that the recombinants are not related, or that at least one has gone through subsequent recombination.
11. 30 sec. Download the command file 2cd.cmdfile for this analysis.
It differs from the previous file in a few ways. To speed up this tutorial, we will only infer the number and location of topology change points and not parameter change points. We specify a genotype tree, allowing uncertainty in the branching order of representative sequences within a genotype. We indicate which representative sequences belong to which genotypes using clade: options. We also tell cBrother to allow split topologies, where the genotype assigned to each query differs, using the allow_split_ancestry: option. If the topology change points do not exactly match up for these two recombinants, split ancestry is expected. Finally, we specify the monophyly prior on topology using the tau_prior: command file option. This prior puts 50% prior probability on the event that both query sequences are monophyletic (nearest neighbors) throughout the alignment. Posterior support for polyphyly above 95% would indicate strong evidence against the monophyly hypothesis.
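The link between the 50% prior and the 95% threshold can be made concrete with a Bayes factor, defined as posterior odds divided by prior odds. The Python sketch below is a worked illustration of that definition only; the function name log10_bayes_factor is hypothetical and this is not a description of how procpost computes its own statistics.

```python
import math


def log10_bayes_factor(posterior, prior=0.5):
    """log10 of (posterior odds / prior odds) in favor of a hypothesis."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if posterior <= 0.0:
        return -math.inf  # no posterior support at all
    if posterior >= 1.0:
        return math.inf   # all posterior support
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior / (1.0 - prior)
    return math.log10(posterior_odds / prior_odds)


# 95% posterior support for polyphyly leaves 5% for monophyly.
print(log10_bayes_factor(0.05))
# A posterior probability of zero gives an unbounded negative value.
print(log10_bayes_factor(0.0))
```

With a 50% prior the prior odds equal 1, so the Bayes factor reduces to the posterior odds; 5% posterior support for monophyly corresponds to odds of 1 to 19 against it.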
12. 6 min. Run cBrother twice. Be patient, or if you want to speed up this tutorial and ignore convergence testing, run cBrother only once.
cbrother $RANDOM 2cd.cmdfile 2cd.phy 2cda.post
cbrother $RANDOM 2cd.cmdfile 2cd.phy 2cdb.post
13. variable Extract results and check convergence
./procpost --post=2cda.post:2cdb.post --phy=2cd.phy --genotypes=D:D:D:C:C:C:A:B --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=2cd.out --monophyly --cp_profile=ie --cp_profile=split --profile=ie --quiet > 2cd.profile
producing convergence statistics
There is again an indication of failure to converge, and a proper analysis would include a longer run. Also, we have used procpost to extract additional information. --cp_profile=ie extracts all locations of inter-genotype topology change points from all MCMC samples. --profile=ie outputs the marginal distribution of parentage at each site along the genome. Option --monophyly requests a test of the hypothesis of monophyly, whose results are reported as the Bayes factor in support of monophyly (grep log10_bf_monophyly: 2cd.out), here indicating no support at all for the monophyly hypothesis.
If you only ran cBrother once to produce output file 2cda.post, then replace the procpost command as follows:
./procpost --post=2cda.post --phy=2cd.phy --genotypes=D:D:D:C:C:C:A:B --ie_cops --ie_cop_statistics --ie_cop_structure --pcp_cops --outfile=2cd.out --monophyly --cp_profile=ie --cp_profile=split --profile=ie --quiet > 2cd.profile
14. 15 sec. Make a pretty plot
echo "source('plot.recomb.R'); plot.recomb(output.postscript=T)" | R --slave
and view the resulting postscript file, called 2cd.ps by default. A sample plot is shown below.
The top plot in the resulting postscript file is a diagram of the most likely recombinant structure. Topology change point estimates are indicated along with 95% Bayesian credible intervals in gray and genotype parentage in black text. The posterior support for each topology change point is indicated above the corresponding change point. The second plot shows the distribution of topology change point locations, with blue indicating genotype change points and yellow highlighting the proportion that are split topology change points. The bottom plot shows the posterior support for each genotype at each position along the alignment. All plots show similar information in slightly different ways. In this case, the overwhelming support for two split topology change points with a region of split topology between them indicates these two recombinants are not both descendants of the same recombination event. The posterior probability that these recombinants are globally monophyletic is zero.