Model selection of common substitution models for one locus

Comparing relative model fit with Bayes factors

Sebastian Höhna, Michael J Landis, Tracy A Heath

Last modified on April 9, 2024

Overview


This tutorial provides the third protocol from our recent publication (Höhna et al. 2017). The first protocol is described in the Nucleotide substitution models and the second protocol is described in the Partitioned data analysis.

You should read first the General Introduction to Model selection tutorial, which explains the theory and standard algorithms for estimating marginal likelihoods and Bayes factors.

Substitution Models


The models we use here are equivalent to the models described in the previous exercise on substitution models (continuous time Markov models). To specify the model please consult the previous exercise. Specifically, you will need to specify the following substitution models:

Just to be safe, it is better to clear the workspace (if you did not just restart RevBayes):

clear()

Now set up the model as in the previous exercise. You should start with the simple Jukes-Cantor substitution model. Setting up the model requires:

  1. Loading the data and retrieving useful variables about it (e.g., number of sequences and taxon names).
  2. Specifying the instantaneous-rate matrix of the substitution model.
  3. Specifying the tree model including branch-length variables.
  4. Creating a random variable for the sequences that evolved under the PhyloCTMC.
  5. Clamping the data.
  6. Creating a model object.
  7. Specifying the moves for parameter updates.

The following procedure for estimating marginal likelihoods is valid for any model in RevBayes. You will need to repeat this later for other models. First, we create the variable containing the power-posterior analysis. This requires that we provide a model and vector of moves, as well as an output file name. The cats argument sets the number of stepping stones.

pow_p = powerPosterior(mymodel, moves, monitors, "output/model1.out", cats=50)

We can start the power-posterior analysis by first burning in the chain and and discarding the first 10000 states. This will help ensure that analysis starts from a region of high posterior probability, rather than from some random point.

pow_p.burnin(generations=10000,tuningInterval=1000)

Now execute the run with the .run() function:

pow_p.run(generations=1000)

Once the power posteriors have been saved to file, create a stepping stone sampler. This function can read any file of power posteriors and compute the marginal likelihood using stepping-stone sampling.

ss = steppingStoneSampler(file="output/model1.out", powerColumnName="power", likelihoodColumnName="likelihood")

These commands will execute a stepping-stone simulation with 50 stepping stones, sampling 1000 states from each step. Compute the marginal likelihood under stepping-stone sampling using the member function marginal() of the ss variable and record the value in Table [tab:ml_cytb].

ss.marginal()

Path sampling is an alternative to stepping-stone sampling and also takes the same power posteriors as input.

ps = pathSampler(file="output/model1.out", powerColumnName="power", likelihoodColumnName="likelihood")

Compute the marginal likelihood under stepping-stone sampling using the member function marginal() of the ps variable and record the value in Table [tab_ml_subst_models].

ps.marginal()

We have kept this description of how to use stepping-stone-sampling and path-sampling very generic and did not provide the information about the model here. Our main motivation is to show that the marginal likelihood estimation algorithms are independent of the model. Thus, you can apply these algorithms to any model, e.g., relaxed clock models and birth-death models, as well.

Exercises


Model Path-Sampling Stepping-Stone-Sampling
JC ($M_1$)    
HKY ($M_2$)    
GTR ($M_3$)    
GTR+$\Gamma$ ($M_4$)    
GTR+I ($M_5$)    
GTR+$\Gamma$+I ($M_6$)    
Marginal likelihoods for different substitution models.

Now you can continue to the next tutorial: Model selection of partition models or Model averaging of substitution models

  1. Hasegawa M., Kishino H., Yano T. 1985. Dating of the Human-Ape Splitting by a molecular Clock of Mitochondrial DNA. Journal of Molecular Evolution. 22:160–174. 10.1007/BF02101694
  2. Höhna S., Landis M.J., Heath T.A. 2017. Phylogenetic Inference using RevBayes. Current Protocols in Bioinformatics. 10.1002/cpbi.22
  3. Jukes T.H., Cantor C.R. 1969. Evolution of Protein Molecules. Mammalian Protein Metabolism. 3:21–132. 10.1016/B978-1-4832-3211-9.50009-7
  4. Tavaré S. 1986. Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Some Mathematical Questions in Biology: DNA Sequence Analysis. 17:57–86.
  5. Yang Z. 1994. Maximum Likelihood Phylogenetic Estimation from DNA Sequences with Variable Rates Over Sites: Approximate Methods. Journal of Molecular Evolution. 39:306–314. 10.1007/BF00160154