In this tutorial, you will develop an intuition for continuous-time Markov models of the sort used by RevBayes, as well as by other programs, such as PAUP, MrBayes, PhyloBayes, and others. The basic phylogenetic model, used in all of these programs has several components. You are probably familiar with two of them: a phylogenetic tree describing the relationships among the species with the branch lengths specified in terms of expected amount of change. The difficult part of the phylogenetic model is how character change is modeled along the branches of the tree. All of these programs assume that characters evolve along the branches of the tree according to a continuous-time Markov model.
If you had a highschool social life as awkward as this author’s, you already own a 10-sided die (d10, in the gaming lingo). If you actually had friends in high school, however, you may need to buy such a die. Go to a gaming store and tell them that you want to buy a “d10.” Alternatively, buy a 10-sided die from Amazon.
Examine your new die. Note that it has ten faces, with each face numbered from 0 to 9. You can generate a random number on the interval (0,1) by repeatedly rolling your die. You will assume the “0.” of the number and use the die to randomly generate the digit in the tenths place (0._), hundreths place (0._ _), thousandths place (0. _ _ _) etc. until you have the random number to the desired precision. Let’s try it! I rolled my die three times and saw, in order, the numbers 0, 4, and 7. My uniform(0,1) random number, then, is $u = 0.047$. Try it yourself. With three rolls of the die, you can generate a random number to a precision of three decimal places. Do you understand why the number you generate in this manner is uniformly-distributed on the interval (0,1)? For reference, the uniform(0,1) probability distribution looks like,
and is sometimes referred to as the rectangular distribution, for obvious reasons.
The uniform distribution is only one of many distributions. You have probably heard of at least some of the more common distributions (the normal, log-normal, binomial, gamma, Poisson, exponential, $\chi^2$, Student’s t) and perhaps you’ve heard of some of the more obscure distributions, too (Wishart, Normal inverse Wishart, and Weibull, among others). When simulating DNA sequence evolution on a tree, in addition to the uniform(0,1) random numbers, we will need to generate exponential random variables.
The exponential distribution is used to model waiting times. Imagine that something occurs at a constant rate, $\lambda$. The time until that something occurs is exponentially distributed with parameter $\lambda$. The exponential distribution looks like,
While we can use the die to generate a uniformly-distributed random number, we cannot directly generate an exponentially-distributed number. That said, we can generate an exponential random variable from our uniformly-distributed random number using some math. First, generate a uniform(0,1) random number using your die, called $u$. We can convert this uniform(0,1) random number to an exponential random number using the following equation: \(t = -{\log(u) \over \lambda}\) where $\lambda$ is the rate at which something occurs and $\log$ is the natural log function. (You can access the natural log function on your smart phone by going to the calculator app and turning the phone on its side, thereby revealing the full functionality of the calculator.) The variable $t$ is exponentially distributed.
Let’s try it. We will generate an exponential random number when the rate parameter is $\lambda = 10$. With my die, I generated a uniform(0,1) random number: $u = 0.948$. Using my calculator, I convert it to an exponential random number: \(t = -{\log(0.948) \over 10} = 0.00534\)
We now have the machinery needed to generate uniform and exponential random numbers. For the simulation of DNA sequences on a tree, however, we need to choose some simulation parameters. Specifically, we need the tree topology, branch lengths, and rate matrix of the continuous-time Markov model that describes how the DNA sequences change over time.
We will assume the following tree for the simulations:
We will simulate on this tree for no particular reason except that I like this tree. Note the branch lengths on the tree. The branch lengths are in terms of expected number of substitutions per site. Again, the branch lengths were an arbitrary choice that I made.
The last part of the model that must be specified is the rate matrix of the continuous-time Markov process that describes how the DNA sequences change on the tree. We will assume that sequences evolve according to the HKY85 model of DNA substitution, that has rate matrix:
\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} \cdot & \pi_C & \kappa \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa \pi_T \\ \kappa \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa \pi_C & \pi_G & \cdot \\ \end{array} \right) \mu\]We will make a few important points about the rate matrix. First, the rate matrix may have free parameters. For example, the HKY85 model has the parameters $\kappa$, $\pi_A$, $\pi_C$, $\pi_G$, and $\pi_T$. The parameter $\kappa$ is the transition/transversion rate bias; when $\kappa = 1$ transitions occur at the same rate as transversions. Typically, the transition/transversion rate ratio, estimated using maximum likelihood or Bayesian inference, is greater than one; transitions occur at a higher rate than transversions. The other parameters – $\pi_A$, $\pi_C$, $\pi_G$, and $\pi_T$ – are the base frequencies, and have a biological interpretation as the frequency of the different nucleotides and are also, incidentally, the stationary probabilities of the process. Second, the rate matrix, ${\mathbf Q}$, can be used to calculate the transition probabilities and the stationary distribution of the substitution process. The transition probabilities and stationary distribution play a key role in calculating the likelihood.
We will assume the following values for the HKY85 parameters: $\kappa = 5$, $\pi_A = 0.4$, $\pi_C = 0.3$, $\pi_G = 0.2$, and $\pi_T = 0.1$. These values result in the following scaled rate matrix:
\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{rrrr} -0.886 & 0.190 & 0.633 & 0.063 \\ 0.253 & -0.696 & 0.127 & 0.316 \\ 1.266 & 0.190 & -1.519 & 0.063 \\ 0.253 & 0.949 & 0.127 & -1.329 \\ \end{array} \right)\]The stationary probabilities for this rate matrix are $\pi_A = 0.4$, $\pi_C = 0.3$, $\pi_G = 0.2$, and $\pi_T = 0.1$.
The rate matrix specifies how changes occur on a phylogenetic tree. Consider the very simple case of a single branch on a phylogenetic tree. Let’s assume that the branch is $v=0.5$ in length. Our first task is to determine the nucleotide at the root of this tree. Although it is tempting to simply pick a nucleotide at the root of the tree with each nucleotide having a probability of $1/4$, doing so is not consistent with the process we are assuming, as described in the rate matrix, ${\mathbf Q}$. Rather, we should choose the state at the root of the tree from the stationary probabilities. I made four intervals, with the following probabilities:
\[0.0 - 0.4 \rightarrow A \\ 0.4 - 0.7 \rightarrow C \\ 0.7 - 0.9 \rightarrow G \\ 0.9 - 1.0 \rightarrow T\]I rolled the die to generate a uniiform(0,1) random number and obtained $u = 0.709$. The nucleotide at the root, then, is the nucleotide $G$. The situation we have is something like this,
in which we have a single branch of length $v = 0.5$ starting in the nucleotide $G$. How can we simulate the evolution of the site starting from the $G$ at the ancestor? The rate matrix tells us how to do this. First of all, because the current state of the process is $G$, the only relevant row of the rate matrix is the third one:
\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ 1.266 & 0.190 & -1.519 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right)\]The overall rate of change away from nucleotide $G$ is $q_{GA} + q_{GC} + q_{GT} = 1.266 + 0.190 + 0.063 = 1.519$. Equivalently, the rate of change away from nucleotide $G$ is simply $-q_{GG} = 1.519$. In a continuous-time Markov model, the waiting time between substitutions is exponentially distributed. The exact shape of the exponential distribution is determined by its rate, which is the same as the rate of the corresponding process in the ${\mathbf Q}$ matrix. For instance, if we are in state $G$, we wait an exponentially distributed amount of time with rate 1.519 until the next substitution occurs.
I generated an exponential(1.519) random variable by first generating a uniform(0,1) random number with my die. The first number it generated is $u = 0.794$. This means that the next time at which a substitution occurs is 0.152 up from the root of the tree [i.e., $t = -{1 \over 1.519} \log(0.794)$]. We can now color a portion of the branch because we know the process was in state $G$ from the root of the single-branch tree ($t=0.0$) to $t=0.152$:
The rate matrix also specifies the probabilities of a change from $G$ to the nucleotides $A$, $C$, and $T$. These probabilities are \(\begin{array}{ccc} G \rightarrow A: {1.266\over 1.519}=0.833, & G \rightarrow C: {0.190\over 1.519}=0.125, & G \rightarrow T: {0.063 \over 1.519}=0.042 \\ \end{array}\) To determine what nucleotide the process changes to we would generate another uniform(0,1) random number (again called $u$). If $u$ is between 0 and 0.833, we will say that we had a change from $G$ to $A$. If the random number is between 0.833 and 0.958 we will say that we had a change from $G$ to $C$. Finally, if the random number $u$ is between 0.958 and 1.000, we will say we had a change from $G$ to $T$. The next number generated using the die was $u = 0.102$, which means the change was from $G$ to $A$. The process is now in a different state (the nucleotide $A$) and the relevant row of the rate matrix is
\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} -0.886 & 0.190 & 0.633 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right)\]We wait an exponentially distributed amount of time with parameter $\lambda = 0.886$ until the next substitution occurs. When the substitution occurs, it is to a $C$, $G$, or $T$ with probabilities ${0.190 \over 0.886} = 0.214$, ${0.633 \over 0.886} = 0.714$, and ${0.063 \over 0.886} = 0.072$, respectively.
This process of generating random and exponentially-distributed times until the next substitution occurs and then determining (randomly) what nucleotide the change is to is repeated until the process exceeds the length of the branch. The state the process is in when it passes the end of the branch is recorded. To complete the simulation on the branch, I generated another uniform random variable using the die. The number was $u = 0.371$, which means that the next substitution would occur 1.119 units above the substitution from $G \rightarrow A$. The process is in the state $A$ when it passed the end of the branch:
The only non-random part of the entire procedure was the initial choice of the parameters. All other aspects of the simulation used a uniform random number generator and our knowledge of the rate matrix to simulate a single realization of the HKY85 process of DNA substitution.
Simulating on the tree
is only slightly more complicated than simulating data on the single-branch tree. The steps are as follows:
The tree we simulate DNA sequence evolution on has only four tips. This means that there are a total of $4^4 = 256$ possible patterns of nucleotides we could have observed at the tips of the tree. For example, one of the possible patterns is GTTC: Species I has the nucletide G, Species II and Species III are assigned the nucleotide T, and Species IV is assigned C.
The probability of simulating any of the 256 patterns is given in the following table:
Pattern | Prob. | Pattern | Prob. | Pattern | Prob. | Pattern | Prob. |
---|---|---|---|---|---|---|---|
AAAA | 0.199465 | AGAA | 0.014711 | CAAA | 0.018317 | CGAA | 0.001490 |
AAAC | 0.004185 | AGAC | 0.000725 | CAAC | 0.000628 | CGAC | 0.000210 |
AAAG | 0.014711 | AGAG | 0.019868 | CAAG | 0.001490 | CGAG | 0.002878 |
AAAT | 0.001395 | AGAT | 0.000242 | CAAT | 0.000166 | CGAT | 0.000048 |
AACA | 0.009075 | AGCA | 0.000843 | CACA | 0.005277 | CGCA | 0.000669 |
AACC | 0.000703 | AGCC | 0.000315 | CACC | 0.004524 | CGCC | 0.002262 |
AACG | 0.000843 | AGCG | 0.002202 | CACG | 0.000669 | CGCG | 0.002304 |
AACT | 0.000121 | AGCT | 0.000048 | CACT | 0.000375 | CGCT | 0.000188 |
AAGA | 0.028625 | AGGA | 0.005985 | CAGA | 0.003304 | CGGA | 0.001065 |
AAGC | 0.000702 | AGGC | 0.000755 | CAGC | 0.000210 | CGGC | 0.000209 |
AAGG | 0.005985 | AGGG | 0.032738 | CAGG | 0.001065 | CGGG | 0.006655 |
AAGT | 0.000234 | AGGT | 0.000252 | CAGT | 0.000048 | CGGT | 0.000059 |
AATA | 0.003025 | AGTA | 0.000281 | CATA | 0.000959 | CGTA | 0.000120 |
AATC | 0.000121 | AGTC | 0.000048 | CATC | 0.000360 | CGTC | 0.000180 |
AATG | 0.000281 | AGTG | 0.000734 | CATG | 0.000120 | CGTG | 0.000420 |
AATT | 0.000154 | AGTT | 0.000073 | CATT | 0.000404 | CGTT | 0.000202 |
ACAA | 0.004185 | ATAA | 0.001395 | CCAA | 0.000628 | CTAA | 0.000166 |
ACAC | 0.005482 | ATAC | 0.000350 | CCAC | 0.009592 | CTAC | 0.000415 |
ACAG | 0.000725 | ATAG | 0.000242 | CCAG | 0.000210 | CTAG | 0.000048 |
ACAT | 0.000350 | ATAT | 0.001594 | CCAT | 0.000415 | CTAT | 0.001214 |
ACCA | 0.000703 | ATCA | 0.000121 | CCCA | 0.004524 | CTCA | 0.000375 |
ACCC | 0.019527 | ATCC | 0.000752 | CCCC | 0.167489 | CTCC | 0.005866 |
ACCG | 0.000315 | ATCG | 0.000048 | CCCG | 0.002262 | CTCG | 0.000188 |
ACCT | 0.000752 | ATCT | 0.001546 | CCCT | 0.005866 | CTCT | 0.007452 |
ACGA | 0.000702 | ATGA | 0.000234 | CCGA | 0.000210 | CTGA | 0.000048 |
ACGC | 0.001837 | ATGC | 0.000116 | CCGC | 0.004796 | CTGC | 0.000208 |
ACGG | 0.000755 | ATGG | 0.000252 | CCGG | 0.000209 | CTGG | 0.000059 |
ACGT | 0.000116 | ATGT | 0.000535 | CCGT | 0.000208 | CTGT | 0.000607 |
ACTA | 0.000121 | ATTA | 0.000154 | CCTA | 0.000360 | CTTA | 0.000404 |
ACTC | 0.001781 | ATTC | 0.000517 | CCTC | 0.011625 | CTTC | 0.001716 |
ACTG | 0.000048 | ATTG | 0.000073 | CCTG | 0.000180 | CTTG | 0.000202 |
ACTT | 0.000517 | ATTT | 0.004711 | CCTT | 0.001716 | CTTT | 0.013873 |
GAAA | 0.045565 | GGAA | 0.005060 | TAAA | 0.006106 | TGAA | 0.000497 |
GAAC | 0.001004 | GGAC | 0.000453 | TAAC | 0.000166 | TGAC | 0.000048 |
GAAG | 0.005060 | GGAG | 0.017648 | TAAG | 0.000497 | TGAG | 0.000959 |
GAAT | 0.000335 | GGAT | 0.000151 | TAAT | 0.000099 | TGAT | 0.000038 |
GACA | 0.002514 | GGCA | 0.000532 | TACA | 0.000959 | TGCA | 0.000120 |
GACC | 0.000315 | GGCC | 0.000194 | TACC | 0.000548 | TGCC | 0.000274 |
GACG | 0.000532 | GGCG | 0.002904 | TACG | 0.000120 | TGCG | 0.000420 |
GACT | 0.000048 | GGCT | 0.000036 | TACT | 0.000215 | TGCT | 0.000108 |
GAGA | 0.014437 | GGGA | 0.008240 | TAGA | 0.001101 | TGGA | 0.000355 |
GAGC | 0.000476 | GGGC | 0.001251 | TAGC | 0.000048 | TGGC | 0.000059 |
GAGG | 0.008240 | GGGG | 0.056794 | TAGG | 0.000355 | TGGG | 0.002218 |
GAGT | 0.000159 | GGGT | 0.000417 | TAGT | 0.000038 | TGGT | 0.000030 |
GATA | 0.000838 | GGTA | 0.000177 | TATA | 0.001119 | TGTA | 0.000143 |
GATC | 0.000048 | GGTC | 0.000036 | TATC | 0.000231 | TGTC | 0.000116 |
GATG | 0.000177 | GGTG | 0.000968 | TATG | 0.000143 | TGTG | 0.000488 |
GATT | 0.000073 | GGTT | 0.000040 | TATT | 0.000893 | TGTT | 0.000447 |
GCAA | 0.001004 | GTAA | 0.000335 | TCAA | 0.000166 | TTAA | 0.000099 |
GCAC | 0.001837 | GTAC | 0.000116 | TCAC | 0.001389 | TTAC | 0.000240 |
GCAG | 0.000453 | GTAG | 0.000151 | TCAG | 0.000048 | TTAG | 0.000038 |
GCAT | 0.000116 | GTAT | 0.000535 | TCAT | 0.000240 | TTAT | 0.002009 |
GCCA | 0.000315 | GTCA | 0.000048 | TCCA | 0.000548 | TTCA | 0.000215 |
GCCC | 0.009764 | GTCC | 0.000376 | TCCC | 0.019456 | TTCC | 0.001275 |
GCCG | 0.000194 | GTCG | 0.000036 | TCCG | 0.000274 | TTCG | 0.000108 |
GCCT | 0.000376 | GTCT | 0.000773 | TCCT | 0.001275 | TTCT | 0.006924 |
GCGA | 0.000476 | GTGA | 0.000159 | TCGA | 0.000048 | TTGA | 0.000038 |
GCGC | 0.001823 | GTGC | 0.000117 | TCGC | 0.000694 | TTGC | 0.000120 |
GCGG | 0.001251 | GTGG | 0.000417 | TCGG | 0.000059 | TTGG | 0.000030 |
GCGT | 0.000117 | GTGT | 0.000530 | TCGT | 0.000120 | TTGT | 0.001005 |
GCTA | 0.000048 | GTTA | 0.000073 | TCTA | 0.000231 | TTTA | 0.000893 |
GCTC | 0.000891 | GTTC | 0.000258 | TCTC | 0.004935 | TTTC | 0.003240 |
GCTG | 0.000036 | GTTG | 0.000040 | TCTG | 0.000116 | TTTG | 0.000447 |
GCTT | 0.000258 | GTTT | 0.002355 | TCTT | 0.003240 | TTTT | 0.031522 |
Simulate a site on the four-species tree described in this lab using the rate matrix.
\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{rrrr} -0.886 & 0.190 & 0.633 & 0.063 \\ 0.253 & -0.696 & 0.127 & 0.316 \\ 1.266 & 0.190 & -1.519 & 0.063 \\ 0.253 & 0.949 & 0.127 & -1.329 \\ \end{array} \right)\]Why do we start the simulation by drawing from the stationary distribution?