Convergence assessment

Phylogenetic convergence assessment using the R package Convenience

Luiza Fabreti and Sebastian Höhna

Last modified on October 28, 2020

Overview


This tutorial covers convergence assessment of a Bayesian phylogenetic analysis using the R package Convenience.

Convergence of an MCMC analysis is crucial to assure that the chain has sampled from the stationary distribution and that we have sufficiently many samples to approximate the posterior distribution. That is, the MCMC has explored the parameter space long enough to reach the true posterior distribution of the parameters and the values we are sampling belong to that distribution. Theory says that a chain that runs through an infinite time, will reach convergence. For our practical problem, we need to make a decision of when we have sampled enough to take a good estimate of the desired parameters. Here we will show how to test if we have enough samples or not.

An ideal solution would be to analytically calculate the number of steps needed to reach convergence. This, however, has turned out to be unfeasible because the number of samples depends on the specific model, the specific moves applied within the MCMC simulation, and the given dataset.

Since we lack a theoretical convergence assessment, what is broadly done in the MCMC field is analyze the output from the MCMC for lack of convergence. To do so, we have to keep in mind two aspects of an analysis that has reached convergence: precision and reproducibility. Precision means that if we run the chain longer, we do not change the estimates (e.g., the posterior mean estimate). While reproducibility means that if we run another independent chain, we get the same estimates. Precision can be evaluated by checking that we have sufficiently many samples because more samples lead to less variance in the estimates (e.g., the standard error of the mean). Reproducibility, on the other hand, can be evaluated by comparing independent chains run under the same model. Therefore, it’s recommended to run at least two or better four replicates when performing MCMC analyses.

Another best practice is to remove the initial samples from the chain. Those initial iterations are called burn-in. By that we try to get rid of the samples that are not taken from the stationary distribution.

One last concept we need to keep in mind is the Effective Sample Size (ESS), i.e., the number of independent samples generated by our MCMC sampler. The ESS takes into account the correlation between samples within a chain. Low ESS values represent high autocorrelation in the chain. If the autocorrelation is higher, then the uncertainty in our estimates is also higher.

Criteria for Convergence using in Convenience


Now that we learned about convergence, let’s take a look into the criteria in the Convenience package:

The output of a phylogenetic analysis most commonly consists of two types of parameters:

To assess convergence for these parameters, the Convenience package evaluates:

The comparison between windows of the same run works by dividing the full length of the run into 5 windows (subsets) and comparing the third and fifth window. This comparison is used to determine the size of the burn-in. A sufficient burn-in will lead to windows that sampled values from the same distribution. Finding the appropriate burn-in size is done automatically in the Convenience package. The package tests burn-in of 0, 10%, 20%, 30%, 40% and 50%. If the required burn-in is higher than 50% of the length of the MCMC, we recommend re-running the MCMC.
In Figure we can see a trace plot for the tree length from the example provided in this tutorial. The trace plot shows the sampled values over the iterations of the MCMC. The highlighted areas of the figure show the third and fifth window of the run.

Trace plot of the tree length for our example analysis. The shaded areas show the third and fifth windows of the run.

How do we compare windows and runs?


For the continuous parameters, the comparison is made with the two-sample Kolmogorov-Smirnov (KS) test, a non-parametric statistical test for equality of probability distributions. Two samples will be equal when the KS value is below a given threshold. The KS value (D) is calculated:

\[{D}_{m,n} = \max_{x} |{F_{1,m}(x) - G_{2,n}(x)}|\]

F(x) and G(x) are the empirical distribution functions for the samples with size m and n, respectively. The two samples will be drawn from different distributions, at level $\alpha$, when:

\[{D}_{m,n} > c(\alpha) \sqrt{\frac{m + n}{m\times n}}\]

with

\[c(\alpha) = \sqrt{-\ln({\frac{\alpha}{2})\times \frac{1}{2}}}\]

The phylogenetic tree is evaluated regarding the bipartitions or splits. Therefore, the comparisons are made using the frequency of a given split between intervals of the same run or between different runs.

Two example trees with tips A, B, C, D and the splits seen at each tree.

Thresholds


The current state of convergence assessment in Bayesian phylogenetics relies mainly on visual tools (e.g., Tracer) and ESS thresholds that have no clear theory to support them (Rambaut et al. 2018). The motivation for the Convenience package is to provide an easy-to-use framework with clear thresholds for each convergence criterion.

We derive a minimum value for the ESS based on a normal distribution and the standard error of the mean (SEM). How much error in our estimate of the posterior mean should we find acceptable? Clearly, the mean estimate for a distribution with a large variance does not need to be as precise as the mean estimate for a distribution with a small variance. However, relative to the variance/spread of the distribution, what percentage is acceptable? We suggest to use a SEM smaller of 1% of the 95% probability interval of the distribution, which is equivalent to say that the allowed error of the mean is four times the standard deviation of the distribution. (Note that you can derive a different ESS value for any other threshold that you like.) From this SEM, we can derive the ESS with:

\[SEM = \frac{\sigma}{\sqrt{ESS}}\] \[\frac{\sigma}{\sqrt{ESS}} < 1\% \times 4 \times \sigma\] \[ESS > \frac{1}{0.04^2}\] \[ESS > 625\]

An ESS of 625 is therefore the default value for the convenience package.

For the KS test, the threshold is the critical value for $\alpha$ = 0.01. We are currently still evaluating the choice of $\alpha$ on the power and false-discovery rate to detect failure of convergence.

Split Frequencies


To date, the most often test to assess convergence of split frequencies is the average standard deviation of split frequencies (ASDSF) (Nylander et al. 2008). The frequency of each split is computed for two separate MCMC runs and the difference between the two split frequency estimates is used. The ASDSF is problematic for two reasons: (1) for large trees with many splits that have posterior probabilities close to 0.0 or 1.0 will overwhelm the ASDSF and hence even a single split that is present in all samples in run 1 (thus a posterior probability of 1.0) and is never present in any sample in run 2 (thus a posterior probability of 0.0) might not be detected, and (2) the expected difference in split frequency depends on the true split frequency (see Figure ).

The expected difference in split frequencies for ESS of 100, 200 and 625. The x-axis is the true value of the split frequency. The y-axis is the expected difference in split frequencies. The effect of increasing the ESS is the decrease of differences in frequency of sampled splits.

Instead of the ASDSF we use the ESS of each split. We transform each split into a chain of absence and presence values; if the split was present in the i-th tree then we score the i-th value of the chain as a 1 and 0 otherwise. This sequence of absence and presence observations (0s and 1s) allows us to apply standard methods to compute ESS values and thus we can use the same ESS threshold of 625 as for our continuous parameters.

With the ESS threshold for the splits, we can estimate the expected difference in the splits and use this value as a threshold for the split differencies. The expected difference ($ {E}[\Delta^{sf}_{p}] $) between two samples is calculated as the ‘mean absolute difference’, with N as the ESS:

\[{E}[\Delta^{sf}_{p}] = \sum\limits_{i=0}^N \sum\limits_{j=0}^N \left(|\frac{i}{N} - \frac{j}{N}| \times P_{binom}(i|N,p) \times P_{binom}(j|N,p) \right)\]

Summary


provides an overview of the convergence assessment described before and implemented in the package Convenience.

Overview of the workflow in the convergence assessment.

Convenience


Install


To install Convenience, we need first to install the package devtools. In R, type the commands:

install.packages("devtools")
library(devtools)
install_github("lfabreti/convenience")
library(convenience)

Functions


Here is a list of the functions the package uses to assess convergence:

Example


First, download the files listed as example output files on the top left of this page. Save them in a folder called output. These files are the output from a phylogenetic analysis performed with a dataset from bears. The nucleotide substitution model was GTR+$\Gamma$+I and the MCMC was set to run 2 independent runs. The package also works if your analysis has only one run of the MCMC. But the part to compare runs will not be evaluated. Therefore, it is not possible to say that your MCMC result is reproducible. We strongly advise on running more than 1 run.

Let’s run the checkConvergence function with our example output in a directory (this step may take a few minutes):

check_bears <- checkConvergence("output/")

We can also list the names of the files:

check_bears <- checkConvergence( list_files = c("bears_cytb_GTR_run_1.log", "bears_cytb_GTR_run_1.trees", "bears_cytb_GTR_run_2.log", "bears_cytb_GTR_run_2.trees") )

To better understand what checkConvergence is doing, take a look at . The first step consists of the user setting up the precision in the standard error of the mean for our estimates. The thresholds for ESS, KS test and difference in split frequencies will be calculated based on the precision. Then the function reads in the MCMC output, from the files provided by the user, and computes the quantities used in the criteria for convergence. Afterwards, the function checks if the computed quantities are within the calculated thresholds. Finally, the function reports wheter the MCMC has converged or not, the computed quantities and, in case of non convergence, the parameters that failed to achieve the desired thresholds.

Summary of the process in the checkConvergence function. SF is the abbreviation for split frequencies.

Now, let’s see the output from checkConvergence by typing check_bears. The output message includes:

To check the calculated parameters for the continuous parameters type:
Means: output$continuous_parameters$means
ESS: output$continuous_parameters$ess
KS score: output$continuous_parameters$compare_runs

To check the calculated parameters for the splits type:
Frequencies of splits: output$tree_parameters$frequencies
ESS: output$tree_parameters$ess
Difference in frequencies: output$tree_parameters$compare_runs

We can see that check_bears has 4 elements: message, converged, continuous_parameters and tree_parameters.

  1. message: a summary of the convergence assessment;
  2. converged: a boolean that has TRUE if the analysis converged and FALSE if it did not converge;
  3. continuous_parameters: a list with means, ESS and KS-scores between runs;
  4. tree_parameters: a list with frequencies, ESS and split frequencies between runs.

In case the analysis has failed to converge, another element will be on the output list: failed with the parameters that failed the criteria for convergence.

We can generate tables with general information for the continuous parameters the splits with the commands:

printTableContinuous(check_bears)

  means ESS
alpha 1.603954589 9031.384
er.1. 0.015576521 11070.400
er.2. 0.505904352 12114.042
er.3. 0.006423004 6343.636
er.4. 0.009572443 18549.306
er.5. 0.454927376 16764.042
er.6. 0.007596305 5534.844
p_inv 0.532339279 4382.799
pi.1. 0.293497210 10027.691
pi.2. 0.301726522 12104.489
pi.3. 0.135420687 9974.490
pi.4. 0.269355587 11021.611
sr.1. 0.216077996 10773.049
sr.2. 0.550663747 15892.737
sr.3. 1.015625752 11885.672
sr.4. 2.217632528 3654.626
TL 1.112491969 8041.564

printTableSplits(check_bears)

  frequencies ESS
Helarctos_malayanus Ursus_americanus 0.08814119 18594.19
Melursus_ursinus Ursus_arctos Ursus_maritimus 0.17603240 17116.12
Helarctos_malayanus Melursus_ursinus Ursus_americanus Ursus_thibetanus 0.32441756 16318.73
Helarctos_malayanus Ursus_americanus Ursus_arctos Ursus_maritimus Ursus_thibetanus 0.49535046 18692.89
Ursus_americanus Ursus_thibetanus 0.88891111 15668.33

Convenience also provides plot functions to facilitate showing that convergence has been achieved. The functions plotEssContinuous and plotEssSplits plot the histogram of the ESS values for the continuous parameters and the splits, respectively. Typing plotEssContinuous(check_bears) and plotEssSplits(check_bears) yields these figures:

Histograms of the calculated ESS values. The left panel shows the ESS values for the continuous parameters and the right panel show the ESS for the splits. Both plots have the ESS ranges on the x-axis and the frequency on the y-axis. The dotted red line represents the threshold for the ESS.

We can see that, for both figures, all continuous parameters and splits have an ESS above the threshold. This is what we would have expected for a MCMC that has converged. Another function to summarize the achievment of convergence is the function plotKS. It plots a histogram of the KS values for the continuous parameters. The plot for the command plotKS(check_bears) generates this figure:

Histogram of the calculated KS values. The x-axis is the KS values and the y-axis is the frequency. The red dotted line represents the threshold for the KS score.

In this case, all KS values calculated for the continuous parameters are below the threshold. This means that the parameters are drawn from the same distribution for both runs of our analysis. Therefore, if convergence is achieved for a given analysis, the histogram for the KS values should be similiar to this plot. It is also possible to plot the histogram of the KS values for the comparisons one-on-one when the MCMC analysis have more than 2 runs. The function for this plot is plotKS.pooled. The following figure is an example of this plot.

Histogram of the calculated KS values for the one-on-one comparison between runs. Each color represents a different comparison of runs, as shown in the legend. The x-axis is the KS values and the y-axis is the frequency. The red dotted line represents the threshold for the KS score.

In this example, we observe again that all KS values are below the threshold. Which means that, for this criterion, convergence has been achieved. Now that we learned how to use the package and how to interpret the results, let’s practice with some exercises.

Exercise 1


Check for convergence in the output generated in the Nucleotide Substitution Models tutorial.

Which analysis have converged?

Convergence failure


You can see that the output is different when we have a failure in convergence. We have more information in the text output and 2 extra elements failed and failed_names. The element failed has a summary text of what failed in our analysis. The element failed_names has the specifics about the parameters that failed, like the name of the parameter or the split and if it failed in checking for the ESS, the comparison between windows or between runs.

shows the failed message for the convergence assessment on the analysis from Exercise 1 with the GTR+$\Gamma$+I nucleotide substitution model.

The output from the command check_primates$failed showing a summary of the number of parameters that have failed a certain criterion.

Exercise 2


Exercise 3


In this case we should check only the continuous parameters (log files). Because the trees sampled throughout the MCMC have different number of tips due to the fossil record.

What to do


When we face a convergence failure, there are a few options of what to do in our MCMC to overcome this problem. The suggestions here come from experience, rather than theoretical proofs. We can divide our MCMC that lack convergence by the number of parameters that failed:

For the first case, we should run the MCMC longer to get more samples. In the second case, we should increase the weights on the moves or even add more moves for the specific parameters that failed.

  1. Nylander J.A.A., Wilgenbusch J.C., Warren D.L., Swofford D.L. 2008. AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics. Bioinformatics. 24:581.
  2. Rambaut A., Drummond A.J., Xie D., Baele G., Suchard M.A. 2018. Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7. Systematic Biology. 67:901–904. 10.1093/sysbio/syy032
  3. Warren D.L., Geneva A., Swofford D.L., Lanfear R. 2016. rwty: R We There Yet. A package for visualizing MCMC convergence in phylogenetics.