This tutorial uses functions implemented in the developmental branch dev_PoMo_SNP
. To download and install RevBayes from source, please follow the instructions here.
The polymorphism-aware phylogenetic models (PoMos) are alternative approaches to species tree estimation (De Maio et al. 2013) that add a new layer of complexity to the standard substitution models by accounting for population-level forces to describe the process of sequence evolution (De Maio et al. 2015; Schrempf et al. 2016; Borges et al. 2019). PoMos model the evolution of a population of individuals in which changes in allele content (e.g., due to mutations) and frequency (e.g., due to genetic drift or selection) are both possible ().
PoMos stand out from the standard phylogenetic substitution models and other species tree methods because they:
Overall, PoMos constitute a full-likelihood yet computationally efficient approach to species tree inference. PoMos are designed to cope with recent radiations, including incomplete lineage sorting, and long divergence times.
PoMos model the evolution of a population of $N$ individuals and $K$ alleles in which changes in allele content and frequency occur. These are mediated by population forces such as mutation, genetic drift, and selection. The PoMo state-space includes fixed (or boundary) states \(\{Na_i\}\), in which all $N$ individuals have the same allele \(i \in \{0,1,...,K-1\}\), and polymorphic states \(\{na_i,(N-n)a_j\}\), in which two alleles $a_i$ and $a_j$ are present in the population with absolute frequencies $n$ and $N-n$.
Like the standard substitution models, PoMos are continuous-time Markov models and are fully characterized by their rate matrices. The rates in \ref{equation1} and \ref{equation2} define the PoMos rate matrices. RevBayes includes the fnPoMoKN
rate matrices that permit modeling population dynamics with any number of alleles, reversible mutations (i.e., $\mu_{a_ia_j}=\rho_{a_ia_j}\pi_{a_j}$) and selection. You can check the input parameters of this function by typing its name right after the question mark: ?fnPoMoKN
.
Arguments
K : Number of alleles
Type: Natural, <any>, value
V : Number of virtual individuals
Type: Natural, <any>, value
N : Number of effective individuals
Type: RealPos, <any>, const reference
Default: NULL
mu : Vector of mutation rates: mu=(mu_a0a1,mu_a1a0,mu_a0a2,mu_a2a0,...)
Type: RealPos[], <any>, const reference
phi : Vector of fitness coefficients: phi=(phi_0,phi_1,...,phi_ak)
Type: RealPos[], <any>, const reference
This tutorial demonstrates how to set up and perform analyses using polymorphism-aware phylogenetic models. You will perform phylogeny inference using the virtual PoMo Three. We will do this by setting the number of virtual (and effective) individuals to 3. These models allow for very efficient species tree inferences under selection because they operate on a small state space (missing reference). You will perform a Markov chain Monte Carlo (MCMC) analysis to estimate phylogeny and other model parameters. By the end of this tutorial, we leave as an exercise to run the neutral version (PoMoTwo) and compare the resulting trees. The graphical model representation under PoMoThree is depicted in figure .
PoMos perform inferences based on allele frequency data, which is stored in count files. These files contain two header lines. The first line indicates the number of taxa and the number of sites (or loci) in the sequence alignment. You might have noticed that NPOP stands for the number of populations, but this is not necessarily the case. PoMos can be used to infer the evolutionary history of different species or even other systematic units of interest, such as species, subspecies, communities, and so forth.
The second line specifies the genomic position of each locus (chromosome and location) and the taxon names. The first two columns are not used for inference, so if you’re working with taxa for which this information is unavailable, you can input these columns with dummy values (e.g., NA or ?). The remaining lines the other lines in the count file include allelic counts separated by commas. All elements in the count file are separated by white spaces. Here is an example of some lines from the great ape count file we will analyze in this tutorial:
COUNTSFILE NPOP 3 NSITES 5
CHROM POS Gorilla_beringei_graueri Gorilla_gorilla_dielhi Gorilla_gorilla_gorilla
chr1 41275799 6,0,0,0 2,0,0,0 54,0,0,0
chr2 120104878 6,0,0,0 2,0,0,0 54,0,0,0
chr11 61364549 0,6,0,0 0,2,0,0 0,54,0,0
chr17 44837427 6,0,0,0 2,0,0,0 54,0,0,0
chr19 7495905 4,0,2,0 2,0,0,0 10,0,44,0
The four allelic counts in this count file represent the allelic counts of the A, C, G, and T, respectively. Therefore, we know that the Gorilla_gorilla_gorilla
has an AG polymorphism at position 7 495 905 on chromosome 19. The order of alleles in the allelic counts can vary, but it is important to remember that the vectors of mutation rates, exchangeabilities, base frequencies, and fitness coefficients all follow the order of the allele counts in the count file:
The first step in this tutorial is to convert the allelic counts into PoMo states. Open the terminal and navigate to your working directory, which we will call PoMos (but you can choose any name you prefer). Inside PoMos, create the usual data and output folders. Before loading the data, run RevBayes by typing ./rb
(or ./rb-mpi
) in the console. Open the great_apes_pomothree.Rev
file using a suitable text editor so you can follow what each command is doing. Once you understand the .Rev
script in detail, you can run it automatically as follows:
./rb great_apes_pomothree.Rev
As mentioned earlier, the PoMo state space includes both fixed and polymorphic population states. However, allele counts are typically sampled from a small number of individuals. For example, sampled fixed sites may not actually be fixed in the original population. It is possible that we only sampled individuals with the same allele from polymorphic locus, leading to an inaccurate representation of the population’s true genetic diversity. The fewer individuals sampled, or the rarer the allele in the original population (e.g., singletons or doubletons), the more likely we are to observe false fixed sites in the sequence alignment.
There are methods that help us correct for this bias by attributing to each of the allelic counts an appropriate PoMo state. One such method is the weighted-method (Schrempf et al. 2016), which weights each PoMo state based on binomial sampling. In RevBayes, this is done automatically when we use the readPoMoCountFile
function and set the weighting to Binomial
. Alternatively, you can assign the PoMo state closest to the observed frequency. This method is called Fixed
. In this tutorial, we will use Fixed
. To use the readPoMoCountFile
function, define the location of the counts file, set the virtual population size (which we set to 3
, as we are using the virtual PoMo Three), specify the data type format PoMo
and apply the Fixed
correction as shown below:
N <- 3
data <- readPoMoCountFile(countFile="data/great_apes_1000.cf", virtualPopulationSize=N, format="PoMo", samplingCorrection="Fixed")
Information about the alignment can be obtained by typing data
.
>data
PoMo character matrix with 12 taxa and 1000 characters
======================================================
Origination:
Number of taxa: 12
Number of included taxa: 12
Number of characters: 1000
Number of included characters: 1000
Datatype: PoMo
If, instead of a count file, you have a list of sequences per individual (in either fasta or nexus format), RevBayes can still convert it to PoMo data format. To do this, you need to read the sequences, provide a file with the taxon names, and perform the conversion to PoMo state space using pomoStateConvert
. Please ensure that individual sequences belonging to the same taxon have the same name. Here are the commands you will need:
data_char = readDiscreteCharacterData("data/individual_sequences.nex")
taxa = readTaxonData("data/taxon_names.txt")
data = pomoStateConvert(aln=data_char, k=4, virtualNe=N, taxa)
Next, we define some useful variables. These include the number of taxa, taxa names, and the number of branches, which will be important for setting up our model in later steps.
n_taxa <- data.ntaxa()
n_branches <- 2*n_taxa-3
taxa <- data.taxa()
Additionally, we will set up a variable that holds all the moves and monitors for our analysis. Recall that moves are algorithms used to propose new parameter values during the MCMC simulation, while monitors print the values of model parameters to the screen and/or log files during the MCMC analysis.
moves = VectorMoves()
monitors = VectorMonitors()
Estimating an unrooted tree under the virtual PoMos requires specifying two main components:
A given PoMo model is defined by its corresponding instantaneous rate matrix, Q
which depends on the virtual population size N
, the mutation rates, assumed to be reversible and dependent on the allele frequencies pi
, and the exchangeabilities rho
. PoMoThree additionally includes allele fitnesses phi
, as it accounts for selection. We will set up the virtual PoMoThree using the function fnPoMoKN
. In particular, we set N
to 3. Note that N
is a fixed node, as we had previously defined.
Since pi
, rho
, and gamma
are stochastic variables, we must specify a move to propose updates to them. A good move for variables drawn from a Dirichlet distribution (i.e., pi
) is the mvBetaSimplex
move. This move randomly selects an element from the allele frequency vector pi
, proposes a new value drawn from a beta distribution, and then rescales all values to sum to 1. The weight
option inside the moves specifies how often the move will be applied, either on average per iteration or relative to all other moves.
# allele frequencies
pi_prior <- [1,1,1,1]
pi ~ dnDirichlet(pi_prior)
moves.append( mvBetaSimplex(pi, weight=2) )
The rho
and phi
parameters must be positive real numbers and a natural choice for their prior distributions is the exponential distribution. Again, we need to specify a move for these stochastic variables, and a simple scaling move, mvScale
, typically works. In this tutorial, we want our model to capture the effect of GC-bias gene conversion. For that, we define gamma
, the GC-bias rate. The allele fitnesses phi
for G and C will be represented by gamma
, while those for A and T by 1.0. Note that phi
is a deterministic node that depends on the GC-bias rate gamma
.
# exchangeabilities
for (i in 1:6){
rho[i] ~ dnExponential(10.0)
moves.append(mvScale( rho[i], weight=2 ))
}
# fitness coefficients
gamma ~ dnExponential(1.0)
moves.append(mvScale( gamma, weight=2 ))
phi := [1.0,1.0+gamma,1.0+gamma,1.0]
Because we want the mutations to be reversible, we build the mutation rate vector as a deterministic variable depending on pi
and rho
:
# mutation rates
K <- 4
mu := fnPoMoReversibleMutationRates(K,pi,rho)
Alternatively, if we wanted to define a nonreversible mutation rate vector, we could have set mu
directly, similar to how we set rho
.
The function fnPoMoKN
will create an instantaneous rate matrix. This function requires that an effective population size be input, but in most cases, you will not know it. Therefore, simply set it to the virtual population size.
# rate matrix
Q := fnPoMoKN(K,N,N,mu,phi)
The tree topology and branch lengths are stochastic nodes in our phylogenetic model. We will assume that all possible labeled, unrooted tree topologies have equal probability. For an unrooted tree topology, we use the nearest-neighbor interchange move mvNNI
(a subtree-prune and regrafting move mvSPR
could also be used).
# topology
topology ~ dnUniformTopology(taxa)
moves.append( mvNNI(topology, weight=2*n_taxa) )
Next, we create a stochastic node representing the length of each of the 2*n_taxa−3
branches in our tree. We can use a “for” loop to create a vector of branch lengths and assign a move to it.
# branch lengths
for (i in 1:n_branches) {
branch_lengths[i] ~ dnExponential(10.0)
moves.append( mvScale(branch_lengths[i]) )
}
Finally, we combine the tree topology and branch lengths using the treeAssembly
function, which applies the value of the ith member of the branch_lengths
vector to the branch leading to the ith node in the topology. Thus, the psi
variable is a deterministic node:
psi := treeAssembly(topology, branch_lengths)
We have now fully specified all of the parameters of our phylogenetic model:
psi
;Q
;PoMo
.Collectively, these parameters comprise a distribution called the phylogenetic continuous-time Markov chain, and we use the dnPhyloCTMC
function to create this node. This distribution requires several input arguments:
sequences ~ dnPhyloCTMC(psi,Q=Q,type="PoMo")
Once the PhyloCTMC
model is created, we can attach our sequence data to the tip nodes of the tree. Although we assume that our sequence data are random variables, they are realizations of our phylogenetic model. For inference, we assume that the sequence data are clamped to their observed values.
sequences.clamp(data)
When this function is called, RevBayes sets each of the stochastic nodes representing the tree’s tips to the corresponding nucleotide sequence in the alignment, indicating that those sequences have been observed.
Finally, we wrap the entire model in a single object. To do this, we simply pass model
function one of the nodes previously defined.
pomo_model = model(pi)
For our MCMC analysis, we need to set up a vector of monitors to record the states of our Markov chain. First, we will initialize the model monitor using the mnModel
function. This creates a monitor variable that will output the states for all model parameters when passed into an MCMC function. We will sample every 10th generation, and the resulting file will be found in the output folder.
monitors.append( mnModel(filename="output/great_apes_pomothree.log", printgen=10) )
The mnFile
monitor will record the states for only the parameters passed as arguments. We use this monitor to specify the output for our sampled trees and branch lengths. Again, we sample every 10th generation.
monitors.append( mnFile(filename="output/great_apes_pomothree.trees", printgen=10, psi) )
Next, we create a screen monitor that will report the states of specified variables to the screen using mnScreen
. This monitor helps us track the progress of the MCMC run.
monitors.append( mnScreen(printgen=10) )
With a fully specified model, a set of monitors, and a set of moves, we can now set up the MCMC algorithm that will sample parameter values in proportion to their posterior probability. The mcmc
function will create our MCMC object. Additionally, we will perform two independent MCMC runs to ensure proper convergence and mixing.
pomo_mcmc = mcmc(pomo_model, monitors, moves, nruns=2, combine="mixed")
Now, we can start the MCMC run.
pomo_mcmc.run( generations=100000 )
Once the analysis is complete, you will find the monitored files in your output directory. Software like Tracer allows you to evaluate convergence and mixing. Look at the file output/great_apes_pomothree.log
in Tracer. There, you will see the posterior distribution of the continuous parameters. Let us examine the posterior distribution of the GC-bias rate $\gamma$. Is there any evidence of GC-bias in these great ape sequences?
In addition to continuous parameters, we also need to summarize the trees sampled from the posterior distribution. RevBayes can summarize the sampled trees by reading in the tree trace file:
trace = readTreeTrace("output/great_apes_pomothree.trees", treetype="non-clock", burnin= 0.2)
The mapTree
function will summarize the tree samples and write the maximum a posteriori (MAP) tree to the specified file. The MAP tree can be found in the output folder.
mapTree(trace, file="output/great_apes_pomothree_MAP.tree" )
You can look at the file output/great_apes_pomothree_MAP.tree
and open it in FigTree. The maximum a posteriori estimate of the great ape phylogeny under the PoMoThree model should look like that of .
We note that while visual inspection might be a good exercise to evaluate the convergence and mixing of the MCMC samples, quantitative methods exist and are recommended. These are implemented in R and ready to use; check tutorial Convergence assessment.
What is the GC-bias rate (this is the selection coefficient) for the great ape populations? Rescale it to its real value by assuming the great apes have an effective population size of about 10 000 individuals. Use the relation $(1+\gamma’)^{N-1}=(1+\gamma)^{N_e-1}$ to rescale $\gamma$, where $N$ and $N_e$ represent the virtual and effective population sizes, and $\gamma’$ and $\gamma$ are the GC-bias rates for the virtual and effective populations.
Using as your guide, draw the probabilistic graphical model of the neutral PoMoTwo model.
What changes are necessary in the great_apes_pomothree.Rev
file to make inferences under the neutral PoMoTwo model?
Run an MCMC analysis to estimate the posterior distribution under the PoMoTwo model. Are the resulting estimates of mutation rates (base frequencies and exchangeabilities) equal? If not, how much do they differ?
Compare the MAP trees estimated under PoMoTwo and PoMoThree. Are they equal? If not, how much do they differ?