# Coalescent Analyses

### Estimating Demographic Histories with Coalescent Models

#### Ronja Billenstein and Sebastian Höhna

This tutorial and the included exercises are currently under construction. If you want to run the analyses, please compile RevBayes from the dev-coalescent branch as described here (for the development branch). For plotting with the R package RevGadgets, please install the package from the “dev_pop_size” branch using devtools: devtools::install_github("revbayes/RevGadgets", ref = "dev_pop_size").

## Overview

This tutorial describes how to run a demographic analysis using the coalescent process in RevBayes. Demographic inference is about estimating population dynamics and in this tutorial, we will specifically focus on population size estimation. The coalescent process provides a flexible way of estimating population size trajectories through time. We are primarily interested in asking: (1) what the population size for our given population was? and (2) how population sizes have changed over time, e.g., increased or decreased towards the present or experienced a bottleneck. The input data are usually sequence alignments or estimated trees. In case of sequence data, trees and population size parameters will be jointly estimated. Here, we consider different types of analysis, starting with a constant demographic history and doing more complex analyses in later exercises. The first analyses will consider data from isochronous samples, i.e., data from samples that have all been collected at the same time. In the second part of the exercises, you will be asked to also analyze data from heterochronous samples, i.e., samples that have been collected at different points in time. The data for these analyes are taken from Vershinina et al. (2021).

The following table provides an overview of different coalescent models. The tutorials column links directly to the respective tutorial with isochronous data. The tutorial on heterochronous data shows how to apply the models to data from different points in time.

Model Description Demographic function Prior on the population sizes Interval change point method Multiple loci analysis possible? Tutorial
Constant A simple coalescent model with one constant population size. constant individually chosen - Yes Constant
Linear A coalescent model with a linear population size trajectory. linear individually chosen - Yes
Exponential A coalescent model with an exponential population size trajectory. exponential individually chosen - Yes
Bayesian Skyline Plot without autocorrelation A coalescent model with a piecewise constant population size trajectory. The different pieces (intervals) are not correlated. The times of interval change are dependent on coalescent events (“event-based”). constant individually chosen, independent and identically distributed (iid) - Yes Skyline
Bayesian Skyline Plot with autocorrelation
(Drummond et al. 2005)
A coalescent model with a piecewise constant population size trajectory. The different pieces (intervals) are correlated, each population size is drawn from an exponential distribution with the mean being the previous population size. The times of interval change are dependent on coalescent events (“event-based”). constant $\theta_i | \theta_{i-1}$ (exponential distribution with mean $\theta_{i-1}$) event-based No, only single loci Little description included in skyline

Example script
Extended Bayesian Skyline Plot
(Heled and Drummond 2008)
A coalescent model with a piecewise constant or linear population size trajectory. The different pieces are not correlated. The times of interval change are independent from coalescent events, they are independent and identically distributed (iid). The number of changes and thus the number of intervals is determined by stochastic variable search. constant independent and identically distributed (iid) event-based - number and position of change points estimated Yes Little description included in skyline

Example script
Skyride
(Minin et al. 2008)
A coalescent model with a piecewise constant population size trajectory. The different pieces are correlated via a Gaussian Markov Random Field (GMRF) Prior, the degree of smoothing is determined by a precision parameter. The times of interval change are dependent on coalescent events (“event-based”). constant $\theta_i | \theta_{i-1}$ (GMRF Prior with precision parameter) event-based No, only single loci Little description included in skyline

Example script
Skygrid
(Gill et al. 2012)
A coalescent model with a piecewise constant population size trajectory. The different pieces are correlated via a Gaussian Markov Random Field (GMRF) Prior and are equally spaced in time, the degree of smoothing is determined by a precision parameter. The times of interval change are independent from coalescent events, they are independent and identically distributed (iid). Can be considered as extension of the skyride model. constant or linear $\theta_i | \theta_{i-1}$ (GMRF Prior with precision parameter) equally sized Yes
Gaussian Markov Random Field (GMRF) Prior with interval change points independent from coalescent events A coalescent model with a piecewise constant or linear population size trajectory. The different pieces are correlated via a Gaussian Markov Random Field (GMRF) Prior. The times of interval change are independent from coalescent events, they are independent and identically distributed (iid). constant or linear $\theta_i | \theta_{i-1}$ (GMRF Prior) specified - here equally sized Yes With sequences as input data: GMRF

With trees as input data: GMRF treebased
Horseshoe Markov Random Field (HSMRF) Prior with interval change points independent from coalescent events (Faulkner et al. 2020) A coalescent model with a piecewise constant or linear population size trajectory. The different pieces are correlated via a Horseshoe Markov Random Field (HSMRF) Prior. The times of interval change are independent from coalescent events, they are independent and identically distributed (iid). constant or linear $\theta_i | \theta_{i-1}$ (HSMRF Prior) specified - here equally sized Yes Little description included in GMRF

Script in HSMRF
Compound Poisson Process (CPP) Prior
(similar to Opgen-Rhein et al. (2005))
A coalescent model with a piecewise constant or linear population size trajectory. The number of pieces (intervals) is not fixed. A Poisson Prior is used on the number of interval change points, additional priors need to be defined for population sizes and change points. The number of change points is determined via reversible jump MCMC (rjMCMC). constant or linear independent and identically distributed (iid) specified - number and position of change points estimated Yes Little description included in GMRF

Script in CPP
Piecewise
(similar to Pybus and Rambaut (2002), but with flexible combinations)
A coalescent model with user-defined demographic functions for a number of intervals. Base demographic functions included in RevBayes are constant, linear and exponential population size trajectories. These can be combined in an arbitrary way, but the most ancient one should always be constant due to computational reasons. constant and linear and exponential individually chosen, independent and identically distributed (iid) specified - individually chosen No, only single loci Piecewise

## Preparation

In order to perform the different analyses in this tutorial, you will need to create a directory on your computer for this tutorial and download a few files.

### The Data

You should now have the following files in your data folder:

• horses_isochronous_sequences.fasta: an alignment in FASTA format, containing sequences of 36 horse taxa, all sampled at the same time.

• horses_heterochronous_sequences.fasta: an alignment in FASTA format, containing sequences of 173 horse taxa, sampled at various times.

• horses_heterochronous_ages.tsv: a tab seperated table listing the heterochronous horse samples and their ages. For extant taxa, the minimum age is 0.0 (i.e. the present).

### The Scripts

It is useful to create .Rev scripts for the different analyes. We will also provide a full script in every tutorial that you can easily run for the whole analysis.

Please create a “scripts” directory in the tutorial directory.

## Exercises

You can begin by clicking on the first exercise!

If you want to turn back to this page, it is listed on the left as prerequisite in all the exercises. The exercise for heterochronous data is based on the exercises for isochronous data. It aims at highlighting the changes you have to make when considering samples with different ages. It is therefore recommended to do the exercises for isochronous data first.

The first five exercises work with isochronous data, the last one with heterochronous data:

## Summary

After doing all the exercises, you can compare the resulting population sizes. Have a look at the summary.

1. Drummond A.J., Rambaut A., Shapiro B., Pybus O.G. 2005. Bayesian Coalescent Inference of Past Population Dynamics from Molecular Sequences. Molecular Biology and Evolution. 22:1185–1192. 10.1093/molbev/msi103
2. Faulkner J.R., Magee A.F., Shapiro B., Minin V.N. 2020. Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories. Biometrics. 76:677–690. https://doi.org/10.1111/biom.13276
3. Gill M.S., Lemey P., Faria N.R., Rambaut A., Shapiro B., Suchard M.A. 2012. Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci. Molecular Biology and Evolution. 30:713–724. 10.1093/molbev/mss265
4. Griffiths R.C., Tavare S., Bodmer W.F., Donnelly P.J. 1994. Sampling theory for neutral alleles in a varying environment. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences. 344:403–410. 10.1098/rstb.1994.0079
5. Heled J., Drummond A.J. 2008. Bayesian inference of population size history from multiple loci. BMC Evolutionary Biology. 8:289. 10.1186/1471-2148-8-289
6. Kingman J.F.C. 1982. On the Genealogy of Large Populations. Journal of Applied Probability. 19:27–43. 10.2307/3213548
7. Minin V.N., Bloomquist E.W., Suchard M.A. 2008. Smooth Skyride through a Rough Skyline: Bayesian Coalescent-Based Inference of Population Dynamics. Molecular Biology and Evolution. 25:1459–1471. 10.1093/molbev/msn090
8. Opgen-Rhein R., Fahrmeir L., Strimmer K. 2005. Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo. BMC Evolutionary Biology. 5:6. 10.1186/1471-2148-5-6
9. Pybus O.G., Rambaut A. 2002. GENIE: estimating demographic history from molecular phylogenies. Bioinformatics. 18:1404–1405. 10.1093/bioinformatics/18.10.1404
10. Vershinina A.O., Heintzman P.D., Froese D.G., Zazula G., Cassatt-Johnstone M., Dalén L., Der Sarkissian C., Dunn S.G., Ermini L., Gamba C., Groves P., Kapp J.D., Mann D.H., Seguin-Orlando A., Southon J., Stiller M., Wooller M.J., Baryshnikov G., Gimranov D., Scott E., Hall E., Hewitson S., Kirillova I., Kosintsev P., Shidlovsky F., Tong H.-W., Tiunov M.P., Vartanyan S., Orlando L., Corbett-Detig R., MacPhee R.D., Shapiro B. 2021. Ancient horse genomes reveal the timing and extent of dispersals across the Bering Land Bridge. Molecular Ecology. 30:6144–6161. https://doi.org/10.1111/mec.15977