Understanding Continuous-Time Markov Models

Simulating DNA sequence evolution with a die

John Huelsenbeck

Last modified on September 12, 2019

Simulating DNA Sequences

In this tutorial, you will develop an intuition for continuous-time Markov models of the sort used by RevBayes, as well as by other programs, such as PAUP, MrBayes, PhyloBayes, and others. The basic phylogenetic model, used in all of these programs has several components. You are probably familiar with two of them: a phylogenetic tree describing the relationships among the species with the branch lengths specified in terms of expected amount of change. The difficult part of the phylogenetic model is how character change is modeled along the branches of the tree. All of these programs assume that characters evolve along the branches of the tree according to a continuous-time Markov model.

Generating uniform and exponential random variables

If you had a highschool social life as awkward as this author’s, you already own a 10-sided die (d10, in the gaming lingo). If you actually had friends in high school, however, you may need to buy such a die. Go to a gaming store and tell them that you want to buy a “d10.” Alternatively, buy a 10-sided die from Amazon.

Examine your new die. Note that it has ten faces, with each face numbered from 0 to 9. You can generate a random number on the interval (0,1) by repeatedly rolling your die. You will assume the “0.” of the number and use the die to randomly generate the digit in the tenths place (0._), hundreths place (0._ _), thousandths place (0. _ _ _) etc. until you have the random number to the desired precision. Let’s try it! I rolled my die three times and saw, in order, the numbers 0, 4, and 7. My uniform(0,1) random number, then, is $u = 0.047$. Try it yourself. With three rolls of the die, you can generate a random number to a precision of three decimal places. Do you understand why the number you generate in this manner is uniformly-distributed on the interval (0,1)? For reference, the uniform(0,1) probability distribution looks like,

and is sometimes referred to as the rectangular distribution, for obvious reasons.

The uniform distribution is only one of many distributions. You have probably heard of at least some of the more common distributions (the normal, log-normal, binomial, gamma, Poisson, exponential, $\chi^2$, Student’s t) and perhaps you’ve heard of some of the more obscure distributions, too (Wishart, Normal inverse Wishart, and Weibull, among others). When simulating DNA sequence evolution on a tree, in addition to the uniform(0,1) random numbers, we will need to generate exponential random variables.

The exponential distribution is used to model waiting times. Imagine that something occurs at a constant rate, $\lambda$. The time until that something occurs is exponentially distributed with parameter $\lambda$. The exponential distribution looks like,

While we can use the die to generate a uniformly-distributed random number, we cannot directly generate an exponentially-distributed number. That said, we can generate an exponential random variable from our uniformly-distributed random number using some math. First, generate a uniform(0,1) random number using your die, called $u$. We can convert this uniform(0,1) random number to an exponential random number using the following equation: $t = -{\log(u) \over \lambda}$ where $\lambda$ is the rate at which something occurs and $\log$ is the natural log function. (You can access the natural log function on your smart phone by going to the calculator app and turning the phone on its side, thereby revealing the full functionality of the calculator.) The variable $t$ is exponentially distributed.

Let’s try it. We will generate an exponential random number when the rate parameter is $\lambda = 10$. With my die, I generated a uniform(0,1) random number: $u = 0.948$. Using my calculator, I convert it to an exponential random number: $t = -{\log(0.948) \over 10} = 0.00534$

The parameters for the simulation

We now have the machinery needed to generate uniform and exponential random numbers. For the simulation of DNA sequences on a tree, however, we need to choose some simulation parameters. Specifically, we need the tree topology, branch lengths, and rate matrix of the continuous-time Markov model that describes how the DNA sequences change over time.

We will assume the following tree for the simulations:

We will simulate on this tree for no particular reason except that I like this tree. Note the branch lengths on the tree. The branch lengths are in terms of expected number of substitutions per site. Again, the branch lengths were an arbitrary choice that I made.

The last part of the model that must be specified is the rate matrix of the continuous-time Markov process that describes how the DNA sequences change on the tree. We will assume that sequences evolve according to the HKY85 model of DNA substitution, that has rate matrix:

\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} \cdot & \pi_C & \kappa \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa \pi_T \\ \kappa \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa \pi_C & \pi_G & \cdot \\ \end{array} \right) \mu\]

We will make a few important points about the rate matrix. First, the rate matrix may have free parameters. For example, the HKY85 model has the parameters $\kappa$, $\pi_A$, $\pi_C$, $\pi_G$, and $\pi_T$. The parameter $\kappa$ is the transition/transversion rate bias; when $\kappa = 1$ transitions occur at the same rate as transversions. Typically, the transition/transversion rate ratio, estimated using maximum likelihood or Bayesian inference, is greater than one; transitions occur at a higher rate than transversions. The other parameters – $\pi_A$, $\pi_C$, $\pi_G$, and $\pi_T$ – are the base frequencies, and have a biological interpretation as the frequency of the different nucleotides and are also, incidentally, the stationary probabilities of the process. Second, the rate matrix, ${\mathbf Q}$, can be used to calculate the transition probabilities and the stationary distribution of the substitution process. The transition probabilities and stationary distribution play a key role in calculating the likelihood.

We will assume the following values for the HKY85 parameters: $\kappa = 5$, $\pi_A = 0.4$, $\pi_C = 0.3$, $\pi_G = 0.2$, and $\pi_T = 0.1$. These values result in the following scaled rate matrix:

\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{rrrr} -0.886 & 0.190 & 0.633 & 0.063 \\ 0.253 & -0.696 & 0.127 & 0.316 \\ 1.266 & 0.190 & -1.519 & 0.063 \\ 0.253 & 0.949 & 0.127 & -1.329 \\ \end{array} \right)\]

The stationary probabilities for this rate matrix are $\pi_A = 0.4$, $\pi_C = 0.3$, $\pi_G = 0.2$, and $\pi_T = 0.1$.

Interpreting the rate matrix

The rate matrix specifies how changes occur on a phylogenetic tree. Consider the very simple case of a single branch on a phylogenetic tree. Let’s assume that the branch is $v=0.5$ in length. Our first task is to determine the nucleotide at the root of this tree. Although it is tempting to simply pick a nucleotide at the root of the tree with each nucleotide having a probability of $1/4$, doing so is not consistent with the process we are assuming, as described in the rate matrix, ${\mathbf Q}$. Rather, we should choose the state at the root of the tree from the stationary probabilities. I made four intervals, with the following probabilities:

\[0.0 - 0.4 \rightarrow A \\ 0.4 - 0.7 \rightarrow C \\ 0.7 - 0.9 \rightarrow G \\ 0.9 - 1.0 \rightarrow T\]

I rolled the die to generate a uniiform(0,1) random number and obtained $u = 0.709$. The nucleotide at the root, then, is the nucleotide $G$. The situation we have is something like this,

in which we have a single branch of length $v = 0.5$ starting in the nucleotide $G$. How can we simulate the evolution of the site starting from the $G$ at the ancestor? The rate matrix tells us how to do this. First of all, because the current state of the process is $G$, the only relevant row of the rate matrix is the third one:

\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ 1.266 & 0.190 & -1.519 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right)\]

The overall rate of change away from nucleotide $G$ is $q_{GA} + q_{GC} + q_{GT} = 1.266 + 0.190 + 0.063 = 1.519$. Equivalently, the rate of change away from nucleotide $G$ is simply $-q_{GG} = 1.519$. In a continuous-time Markov model, the waiting time between substitutions is exponentially distributed. The exact shape of the exponential distribution is determined by its rate, which is the same as the rate of the corresponding process in the ${\mathbf Q}$ matrix. For instance, if we are in state $G$, we wait an exponentially distributed amount of time with rate 1.519 until the next substitution occurs.

I generated an exponential(1.519) random variable by first generating a uniform(0,1) random number with my die. The first number it generated is $u = 0.794$. This means that the next time at which a substitution occurs is 0.152 up from the root of the tree [i.e., $t = -{1 \over 1.519} \log(0.794)$]. We can now color a portion of the branch because we know the process was in state $G$ from the root of the single-branch tree ($t=0.0$) to $t=0.152$:

The rate matrix also specifies the probabilities of a change from $G$ to the nucleotides $A$, $C$, and $T$. These probabilities are $\begin{array}{ccc} G \rightarrow A: {1.266\over 1.519}=0.833, & G \rightarrow C: {0.190\over 1.519}=0.125, & G \rightarrow T: {0.063 \over 1.519}=0.042 \\ \end{array}$ To determine what nucleotide the process changes to we would generate another uniform(0,1) random number (again called $u$). If $u$ is between 0 and 0.833, we will say that we had a change from $G$ to $A$. If the random number is between 0.833 and 0.958 we will say that we had a change from $G$ to $C$. Finally, if the random number $u$ is between 0.958 and 1.000, we will say we had a change from $G$ to $T$. The next number generated using the die was $u = 0.102$, which means the change was from $G$ to $A$. The process is now in a different state (the nucleotide $A$) and the relevant row of the rate matrix is

\[{\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} -0.886 & 0.190 & 0.633 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right)\]

We wait an exponentially distributed amount of time with parameter $\lambda = 0.886$ until the next substitution occurs. When the substitution occurs, it is to a $C$, $G$, or $T$ with probabilities ${0.190 \over 0.886} = 0.214$, ${0.633 \over 0.886} = 0.714$, and ${0.063 \over 0.886} = 0.072$, respectively.

This process of generating random and exponentially-distributed times until the next substitution occurs and then determining (randomly) what nucleotide the change is to is repeated until the process exceeds the length of the branch. The state the process is in when it passes the end of the branch is recorded. To complete the simulation on the branch, I generated another uniform random variable using the die. The number was $u = 0.371$, which means that the next substitution would occur 1.119 units above the substitution from $G \rightarrow A$. The process is in the state $A$ when it passed the end of the branch:

The only non-random part of the entire procedure was the initial choice of the parameters. All other aspects of the simulation used a uniform random number generator and our knowledge of the rate matrix to simulate a single realization of the HKY85 process of DNA substitution.

Simulating on a more complicated tree

Simulating on the tree

is only slightly more complicated than simulating data on the single-branch tree. The steps are as follows:

First, generate the state at the root of the tree. This step requires knowledge of the stationary probabilities for the Markov process specified by the rate matrix, ${\mathbf Q}$. The stationary distribution is the probability of capturing the process in a particular state when it has been run for a very long (technically, infinitely long) time. The stationary probabilities for the rate matrix we chose are $\pi_A = 0.4$, $\pi_C = 0.3$, $\pi_G = 0.2$, and $\pi_T = 0.1$.
Visit each branch in turn in preorder sequence (that is, from the root to the tips of the tree). If you visit the branches in preorder sequence, you will know the state at the root of the branch.

Pattern probabilities

The tree we simulate DNA sequence evolution on has only four tips. This means that there are a total of $4^4 = 256$ possible patterns of nucleotides we could have observed at the tips of the tree. For example, one of the possible patterns is GTTC: Species I has the nucletide G, Species II and Species III are assigned the nucleotide T, and Species IV is assigned C.

The probability of simulating any of the 256 patterns is given in the following table:

Pattern	Prob.	Pattern	Prob.	Pattern	Prob.	Pattern	Prob.
AAAA	0.199465	AGAA	0.014711	CAAA	0.018317	CGAA	0.001490
AAAC	0.004185	AGAC	0.000725	CAAC	0.000628	CGAC	0.000210
AAAG	0.014711	AGAG	0.019868	CAAG	0.001490	CGAG	0.002878
AAAT	0.001395	AGAT	0.000242	CAAT	0.000166	CGAT	0.000048
AACA	0.009075	AGCA	0.000843	CACA	0.005277	CGCA	0.000669
AACC	0.000703	AGCC	0.000315	CACC	0.004524	CGCC	0.002262
AACG	0.000843	AGCG	0.002202	CACG	0.000669	CGCG	0.002304
AACT	0.000121	AGCT	0.000048	CACT	0.000375	CGCT	0.000188
AAGA	0.028625	AGGA	0.005985	CAGA	0.003304	CGGA	0.001065
AAGC	0.000702	AGGC	0.000755	CAGC	0.000210	CGGC	0.000209
AAGG	0.005985	AGGG	0.032738	CAGG	0.001065	CGGG	0.006655
AAGT	0.000234	AGGT	0.000252	CAGT	0.000048	CGGT	0.000059
AATA	0.003025	AGTA	0.000281	CATA	0.000959	CGTA	0.000120
AATC	0.000121	AGTC	0.000048	CATC	0.000360	CGTC	0.000180
AATG	0.000281	AGTG	0.000734	CATG	0.000120	CGTG	0.000420
AATT	0.000154	AGTT	0.000073	CATT	0.000404	CGTT	0.000202
ACAA	0.004185	ATAA	0.001395	CCAA	0.000628	CTAA	0.000166
ACAC	0.005482	ATAC	0.000350	CCAC	0.009592	CTAC	0.000415
ACAG	0.000725	ATAG	0.000242	CCAG	0.000210	CTAG	0.000048
ACAT	0.000350	ATAT	0.001594	CCAT	0.000415	CTAT	0.001214
ACCA	0.000703	ATCA	0.000121	CCCA	0.004524	CTCA	0.000375
ACCC	0.019527	ATCC	0.000752	CCCC	0.167489	CTCC	0.005866
ACCG	0.000315	ATCG	0.000048	CCCG	0.002262	CTCG	0.000188
ACCT	0.000752	ATCT	0.001546	CCCT	0.005866	CTCT	0.007452
ACGA	0.000702	ATGA	0.000234	CCGA	0.000210	CTGA	0.000048
ACGC	0.001837	ATGC	0.000116	CCGC	0.004796	CTGC	0.000208
ACGG	0.000755	ATGG	0.000252	CCGG	0.000209	CTGG	0.000059
ACGT	0.000116	ATGT	0.000535	CCGT	0.000208	CTGT	0.000607
ACTA	0.000121	ATTA	0.000154	CCTA	0.000360	CTTA	0.000404
ACTC	0.001781	ATTC	0.000517	CCTC	0.011625	CTTC	0.001716
ACTG	0.000048	ATTG	0.000073	CCTG	0.000180	CTTG	0.000202
ACTT	0.000517	ATTT	0.004711	CCTT	0.001716	CTTT	0.013873
GAAA	0.045565	GGAA	0.005060	TAAA	0.006106	TGAA	0.000497
GAAC	0.001004	GGAC	0.000453	TAAC	0.000166	TGAC	0.000048
GAAG	0.005060	GGAG	0.017648	TAAG	0.000497	TGAG	0.000959
GAAT	0.000335	GGAT	0.000151	TAAT	0.000099	TGAT	0.000038
GACA	0.002514	GGCA	0.000532	TACA	0.000959	TGCA	0.000120
GACC	0.000315	GGCC	0.000194	TACC	0.000548	TGCC	0.000274
GACG	0.000532	GGCG	0.002904	TACG	0.000120	TGCG	0.000420
GACT	0.000048	GGCT	0.000036	TACT	0.000215	TGCT	0.000108
GAGA	0.014437	GGGA	0.008240	TAGA	0.001101	TGGA	0.000355
GAGC	0.000476	GGGC	0.001251	TAGC	0.000048	TGGC	0.000059
GAGG	0.008240	GGGG	0.056794	TAGG	0.000355	TGGG	0.002218
GAGT	0.000159	GGGT	0.000417	TAGT	0.000038	TGGT	0.000030
GATA	0.000838	GGTA	0.000177	TATA	0.001119	TGTA	0.000143
GATC	0.000048	GGTC	0.000036	TATC	0.000231	TGTC	0.000116
GATG	0.000177	GGTG	0.000968	TATG	0.000143	TGTG	0.000488
GATT	0.000073	GGTT	0.000040	TATT	0.000893	TGTT	0.000447
GCAA	0.001004	GTAA	0.000335	TCAA	0.000166	TTAA	0.000099
GCAC	0.001837	GTAC	0.000116	TCAC	0.001389	TTAC	0.000240
GCAG	0.000453	GTAG	0.000151	TCAG	0.000048	TTAG	0.000038
GCAT	0.000116	GTAT	0.000535	TCAT	0.000240	TTAT	0.002009
GCCA	0.000315	GTCA	0.000048	TCCA	0.000548	TTCA	0.000215
GCCC	0.009764	GTCC	0.000376	TCCC	0.019456	TTCC	0.001275
GCCG	0.000194	GTCG	0.000036	TCCG	0.000274	TTCG	0.000108
GCCT	0.000376	GTCT	0.000773	TCCT	0.001275	TTCT	0.006924
GCGA	0.000476	GTGA	0.000159	TCGA	0.000048	TTGA	0.000038
GCGC	0.001823	GTGC	0.000117	TCGC	0.000694	TTGC	0.000120
GCGG	0.001251	GTGG	0.000417	TCGG	0.000059	TTGG	0.000030
GCGT	0.000117	GTGT	0.000530	TCGT	0.000120	TTGT	0.001005
GCTA	0.000048	GTTA	0.000073	TCTA	0.000231	TTTA	0.000893
GCTC	0.000891	GTTC	0.000258	TCTC	0.004935	TTTC	0.003240
GCTG	0.000036	GTTG	0.000040	TCTG	0.000116	TTTG	0.000447
GCTT	0.000258	GTTT	0.002355	TCTT	0.003240	TTTT	0.031522

Exercises

Simulate a site on the four-species tree described in this lab using the rate matrix.

Why do we start the simulation by drawing from the stationary distribution?