StatAlign user's manual

For StatAlign v3.2 – last updated 9 Feb 2015

The most up-to-date version of this document is available from StatAlign's website.

Contents

[ Introduction ]
[ About this manual ]
[ About StatAlign ]
[ The menu bar ]
[ File ]
[ Analysis ]
[ Model ]
[ Help ]
[ Analysis settings ]
[ Output preferences ]
[ MCMC parameters ]
[ The graphical interface panels ]
[ Sequences ]
[ Alignment ]
[ LogLikelihood ]
[ MPD ]
[ Tree ]
[ Consensus tree ]
[ Consensus network ]
[ The protein structure (StructAlign) plugin ]
[ Introduction ]
[ Loading structural data ]
[ Visualising parameter convergence ]
[ Structural annotations on alignment ]
[ Output generated ]
[ References ]
[ The RNA plugin ]
[ Tutorial ]
[ Introduction ]
[ Activating the RNA mode ]
[ RNA mode settings ]
[ RNA mode visualisation ]
[ Information entropy ]
[ Similarity ]
[ RNA mode file output ]

Introduction

[ Tutorial ] [ About this manual ] [ About StatAlign ]

About this manual

This is a comprehensive manual for the graphical interface to StatAlign. Command-line-specific information can be obtained by running

		  java -jar StatAlign.jar -help
		
The documentation on this page can be navigated using the hyperlinks and the Forward and Backward buttons of your browser. This manual can also be accessed from within StatAlign, via the Help → User's manual menu. When using the latter method, clicking the Home button reloads this page.

The topics are arranged in hierarchical order; the arrow moves up one level.

The arrow indicates a hyperlink outside of the program. You cannot open such a link if your computer is not connected to the internet. Since these hyperlinks point to third party pages, we cannot guarantee the content of the linked pages. If you discover a broken link, please let us know.

[ Introduction ]

About StatAlign

StatAlign provides a unified approach for inferring multiple sequence alignments, evolutionary trees and model parameters within a joint Bayesian framework.

Most other methods for phylogenetic inference use a single fixed alignment as input. However, this can result in significant bias in the resulting trees, since the results may be very sensitive to the specific choice of alignment. In addition, these methods typically treat insertions and deletions (gaps) as missing data, discarding a great deal of important information in the process, and potentially further biasing the results.

To address these problems, StatAlign jointly samples multiple alignments, evolutionary trees and model parameters under a stochastic model of substitution, insertion and deletion, making use of a Markov chain Monte Carlo (MCMC) scheme to generate samples from the desired posterior distribution. The program allows a range of different substitution models; the insertion-deletion model is a modification of the TKF92 model, which allows 'refragmentation' at internal nodes. For more details, see Miklós et al. (2008).

Users can load sequences, and choose a substitution model with which to analyze the sequences. Once the parameters of the Markov chain and desired output are specified (e.g. log-likelihood trace, alignment, tree samples etc.), the MCMC chain can be initialised, generating a set of samples for the parameters of interest, as well as consensus alignments and trees. The following sections contain descriptions of the graphical elements of the program (menu bar, tabulated panels, pop-up dialog windows), with examples illustrating how an analysis can be carried out.

[ Introduction ] [ Contents ]

The menu bar

[ File ] [ Analysis ] [ Model ] [ Help ]

File

In the File menu, we can load sequences (and other types of data if supported by plugins), set the output preferences and exit the program.

Add sequence(s)...

Clicking this menu item opens a file dialog window, in which we can browse the directories and open a file. The file to be opened should contain sequences in FASTA format (and potentially other types of data, when appropriate plugins exist). Sequences can be viewed and removed via the Sequences panel.

Output settings...

Clicking this menu item opens a dialog window where you can set up your output preferences.

Exit

Exits StatAlign.

[ The menu bar ]

Analysis

Run

After loading your sequences and choosing your Model and Output settings, you can click this menu item to set the MCMC parameters and start the MCMC analysis.

Information about the progress of the MCMC sampling, such as the number of burn-in steps completed and the number of MCMC samples taken, is given in the Status bar. The Postprocessing panels show both snapshots from the MCMC chain and various summaries calculated from the MCMC samples.

Pause

We can pause an active MCMC run by clicking this button.

Resume

A suspended run can be continued by clicking here.

Stop

This item stops an MCMC run. Output files will still be created from the samples that have been taken so far, but you won't be able to restart the process from the same point. This should be used only if you indeed want to terminate the MCMC run early.

[ The menu bar ]

Model

This menu is for selecting the substitution model used to analyse the sequences. StatAlign automatically recognises the sequence type, and if the sequences cannot be analysed by the chosen model (for example, if a nucleotide model is chosen for protein sequences), a pop-up error box notifies the user.

Implemented models are:

[ The menu bar ]

Help

The Help menu provides access to help files and other useful information in a browser or Java's mini-browser.
[ The menu bar ] [ Contents ]

Analysis settings

[ Output preferences ] [ MCMC parameters ]

Output preferences

[ Logfile ] [ Postprocess files ] [ Alignment formats ]

When StatAlign completes a run, it outputs files containing the results of the MCMC sampling. By default, the alignment and total log-likelihood for each MCMC sample are written to a .log file, which also contains a report of the acceptance rates for each MCMC move.

Additional MCMC output is generated by postprocessing plugins, which extract properties of interest from each MCMC sample. This output can either be added to the .log file, or sent to individual files.

The table below shows the default set of output files generated:

File extension       Contents
.tree                Sampled trees (Nexus format)
.ctree               Consensus tree taken from samples so far. Internal nodes are labelled with the posterior probability for the split preceding the node.
.coreModel.params    Parameters of the core evolutionary model (usually indel and substitution rates)
.ll                  Log-likelihood for each MCMC sample. The first column contains the contribution from the core evolutionary model, and the second column contains the total including contributions from model extension plugins.
.mpd.ali             The minimum-risk (also termed maximum posterior decoding) summary alignment.
.mpd.scores          Marginal posterior probabilities for each alignment column in the MPD alignment.
.ali                 Sampled alignments (several possible formats). By default this file is not created, since the alignments are printed to the .log file instead.

These output preferences can be modified via File → Output settings..., which opens a pop-up window.

Below we describe the possible settings in this window.

Logfile

In the top-left corner of the output preferences window, users can choose which postprocesses write into the logfile. The format of the output is the following:

		Sample [sample_number] [TAB] postprocess_name: [TAB] data
			

For example, selecting to output just the tree and log-likelihood to the log file, we will obtain a file containing the following:

		Sample 0        Loglikelihood:  -3718.11540541795
		Sample 0        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.04547):0.01,(P1_1ala:0.02321,P1_1avha:0.01238):0.07713):0.01):0.03021,P1_2ran:0.08613);
		Sample 1        Loglikelihood:  -3620.5414903951587
		Sample 1        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.04547):0.01,(P1_1ala:0.02321,P1_1avha:0.07235):0.07713):0.01):0.03021,P1_2ran:0.12924);
		Sample 2        Loglikelihood:  -3618.189925496592
		Sample 2        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.04547):0.01,(P1_1ala:0.02321,P1_1avha:0.11236):0.07713):0.01):0.04352,P1_2ran:0.11424);
		Sample 3        Loglikelihood:  -3615.7331381880526
		Sample 3        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.04547):0.01,(P1_1ala:0.02321,P1_1avha:0.11236):0.07713):0.01837):0.04352,P1_2ran:0.11424);
		Sample 4        Loglikelihood:  -3587.894867267956
		Sample 4        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.06347):0.01,(P1_1ala:0.02321,P1_1avha:0.11236):0.07713):0.01837):0.04352,P1_2ran:0.11424);
		Sample 5        Loglikelihood:  -3585.9665576563943
		Sample 5        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.06347):0.01,(P1_1ala:0.02321,P1_1avha:0.11236):0.08542):0.01837):0.04352,P1_2ran:0.11902);
		Sample 6        Loglikelihood:  -3551.4723341911595
		Sample 6        Tree:   ((P1_1aeia:0.05535,((P1_1ann:0.10414,P1_1axn:0.06347):0.05301,(P1_1ala:0.02321,P1_1avha:0.11236):0.08542):0.03414):0.04352,P1_2ran:0.11902);
			.
			.
			.
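
Entries in this format are straightforward to read back for downstream analysis. The following Python sketch groups log entries by sample number. It assumes that postprocess names contain no whitespace and that each entry fits on one line; neither assumption is guaranteed by StatAlign, and the function name parse_log_lines is purely illustrative.

```python
from collections import defaultdict


def parse_log_lines(lines):
    """Group log entries by sample number.

    Each entry is expected to look like
    'Sample <n> <postprocess_name>: <data>' with tabs or spaces as separators.
    Lines that do not match this shape are skipped.
    """
    samples = defaultdict(dict)
    for line in lines:
        # Split into at most four fields so that the data field keeps any spaces.
        parts = line.strip().split(None, 3)
        if len(parts) < 4 or parts[0] != "Sample":
            continue
        if not parts[1].isdigit() or not parts[2].endswith(":"):
            continue
        name = parts[2][:-1]  # drop the trailing colon
        samples[int(parts[1])][name] = parts[3]
    return dict(samples)


example = [
    "Sample 0\tLoglikelihood:\t-3718.11540541795",
    "Sample 0\tTree:\t(A:0.1,B:0.2);",
    "Sample 1\tLoglikelihood:\t-3620.5414903951587",
]
parsed = parse_log_lines(example)
# Log-likelihood trace in sample order, ready for plotting.
trace = [float(parsed[n]["Loglikelihood"]) for n in sorted(parsed)]
```

The tree strings in the example are shortened placeholders, not real StatAlign output.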
			

[ Output preferences ]

Postprocess files

As discussed above, users can also choose which postprocesses generate their own output file. Such a file has its own format, which may differ from the format written to the log file. For example, the tree postprocess writes the sampled trees into a standard Nexus file as its own output file. Model extension plugins may also generate their own output files; when the extension is selected and activated, the corresponding output options will be visible in the Output preferences dialogue box.

[ Output preferences ]

Alignment formats

Supported alignment formats are:

[ Output preferences ] [ Analysis settings ]

Setting the MCMC parameters

The MCMC parameters dialog window can be opened by clicking on the tool icon.


In this panel, the following parameters can be set before launching the MCMC run:


[ Analysis settings ] [ Contents ]

The graphical interface panels

[ Sequences ] [ Alignment ] [ LogLikelihood ] [ MPD ] [ Tree ] [ Consensus tree ] [ Consensus network ]

Sequences (and other data)

Sequences can be loaded from the menu, following File → Add sequence(s)..., or by selecting the link in the welcome page.


The loaded sequences are shown on the Sequences tabulated panel, in standard FASTA format. For easier visualisation, the names of the sequences are highlighted in dark blue and printed in a different font.


To remove a sequence from the list, click on the desired sequence and then on the Remove button. Further sequences can be added by following File → Add sequence(s)....
StatAlign automatically recognises whether the sequences are nucleotide or protein sequences, and associates a default substitution model with them. If the sequences are not recognised, an error message is shown, listing the unknown characters in the input sequences.

When other types of data are added and associated with each sequence, for example protein structures, these will also appear alongside the sequence name.

[ The Post-processing tabulated panels ]

Alignment

The MCMC algorithm takes a random walk on the joint distribution of alignments, trees and evolutionary parameters. The current alignment can be seen on the Alignment panel. Above the alignment is shown the posterior probability of each alignment column, as estimated from the samples taken so far. This is shown in blue once the burn-in period is over.
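
As a rough illustration of how a column posterior can be estimated from samples, the Python sketch below counts the fraction of sampled alignments in which each column occurs. The column representation (a tuple giving, for each sequence, the index of the residue in that column, or None for a gap) and the function name column_posteriors are assumptions made for this example, and the sketch is not necessarily how StatAlign computes these values internally.

```python
from collections import Counter


def column_posteriors(sampled_alignments):
    """Estimate the posterior probability of each alignment column.

    Each alignment is a list of columns. Each column is a tuple holding,
    for every sequence, the index of the residue placed in that column
    (or None for a gap). The estimate for a column is the fraction of
    samples in which exactly that column appears.
    """
    counts = Counter()
    for alignment in sampled_alignments:
        # A column can occur at most once per alignment, so count distinct columns.
        counts.update(set(alignment))
    n_samples = len(sampled_alignments)
    return {column: count / n_samples for column, count in counts.items()}


# Four toy samples of a two-sequence alignment.
samples = [
    [(0, 0), (1, None), (2, 1)],
    [(0, 0), (1, 1), (2, None)],
    [(0, 0), (1, None), (2, 1)],
    [(0, 0), (1, None), (2, 1)],
]
posteriors = column_posteriors(samples)
```

Here the column pairing the first residues of both sequences appears in every sample, while the two competing placements of the second residue split the remaining probability.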


[ The Post-processing tabulated panels ]

LogLikelihood

The log-likelihood trace is shown on the screen, to assist with assessing convergence of the chain. The burn-in phase is coloured red, and the post-burn-in phase blue.


Note that convergence of the log-likelihood is not sufficient to judge convergence of the entire MCMC chain, but lack of convergence of the log-likelihood certainly indicates that the chain has not converged. Once the log-likelihood has converged, the distributions of other parameters can be checked to determine whether convergence has been achieved.

[ The Post-processing tabulated panels ]

Maximum Posterior Decoding alignment

The program summarises the alignment samples taken so far to produce a type of consensus alignment, using maximum posterior decoding (MPD). The posterior probabilities for each column are printed on top of the alignment, indicating the reliability of the corresponding MPD alignment column.


The MPD alignment is estimated using only alignment samples taken after the burn-in period, hence the alignment is not shown in this panel during the burn-in phase.

[ The Post-processing tabulated panels ]

Tree

This panel shows the current tree in the Markov chain.


[ The Post-processing tabulated panels ]

Consensus tree

This panel shows the majority consensus tree constructed from all the tree samples taken so far.


[ The Post-processing tabulated panels ]

Consensus network

This is similar to the consensus tree, except that unresolved splits are shown as cycles in a graph, indicating all sampled relationships.


[ The Post-processing tabulated panels ] [ Contents ]

The protein structure (StructAlign) plugin

[ Introduction ]
[ Loading structural data ]
[ Visualising parameter convergence ]
[ Structural annotations on alignment ]
[ Output generated ]
[ References ]

Introduction

The protein structure plugin, StructAlign, extends the evolutionary model of StatAlign, allowing for protein structures to be used to help with estimating alignments and phylogenies. Since protein structures typically diverge much more slowly than sequences, this enables evolutionary relationships to be more accurately and reliably recovered even for distantly related proteins. StructAlign also estimates branch-specific rates of structural evolution, which can be used to help infer episodes of increased or decreased selective pressure on structures. (See References for more details.)

Here we describe the StructAlign options available through the graphical interface. Additional options may be available via the command line interface, and can be listed by running

		  java -jar StatAlign.jar -help:structal

Loading structural data

Versions of the StructAlign plugin prior to v1.1 required that each sequence have an associated structure before the MCMC run could be started.

Protein structures can be loaded and associated with sequences in one of two ways:
  1. reading in sequences and structures directly from a .pdb file
  2. separately or simultaneously loading sequences, and structures from a .coor file
The first method is the simplest way to read in sequences and structures together, and simply requires choosing File → Add sequence(s)... and selecting a valid PDB file from the filesystem. This file should contain only a single chain. If multiple conformations are present for any atoms, the first conformation will be used. The alpha carbon atom is used as the location for each residue. Several structures can be added simultaneously by selecting multiple PDB files at the same time, holding down the Ctrl key (or the Command key on a Mac) while clicking.



The second method requires that sequence files already be loaded, or selected and read in at the same time as the structures. Although this method requires more preliminary steps than the PDB-based method, it allows for additional flexibility for users with advanced requirements. The coordinates of the protein structures should be contained in a .coor file with the following format:
>name1
36.587	49.012	31.672	 54.01
35.023	49.494	28.184	 53.01
...     ...     ...      ...
>name2
58.559	46.653	71.532	 62.34
57.828	45.597	67.975	 55.06
...     ...     ...      ...
The first three columns represent the x, y and z coordinates of a single atom associated with each residue in the structure. Typically this will be the alpha carbon. The name of the structure should appear before the start of the coordinates, and the names (name1 and name2 in the above example) should match the names of the sequences to which the structures will be associated. The fourth column is optional, and contains the crystallographic B-factors associated with each triplet of atomic coordinates. If present, the B-factor data is used to allow structural heterogeneity to be incorporated into the evolutionary model. (This information is included by default when structures are read in from a PDB file.)
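
To make the layout concrete, the following Python sketch reads text in the .coor format described above into a dictionary keyed by structure name. It is an illustrative reader written for this manual, not part of StatAlign; the function name parse_coor is hypothetical. It expects a complete file, so the "..." rows used above to abbreviate the example would be rejected.

```python
def parse_coor(text):
    """Parse .coor-style text into {name: [(x, y, z, b_factor_or_None), ...]}."""
    structures = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new structure; its name follows the '>'.
            current = line[1:].strip()
            structures[current] = []
            continue
        if current is None:
            raise ValueError("coordinates found before any '>name' header")
        fields = line.split()
        if len(fields) not in (3, 4):
            raise ValueError(f"expected 3 or 4 columns, got {len(fields)}: {line!r}")
        x, y, z = (float(value) for value in fields[:3])
        # The fourth column (B-factor) is optional.
        b_factor = float(fields[3]) if len(fields) == 4 else None
        structures[current].append((x, y, z, b_factor))
    return structures


coor_text = """>name1
36.587 49.012 31.672 54.01
35.023 49.494 28.184 53.01
>name2
58.559 46.653 71.532
"""
structures = parse_coor(coor_text)
```

The keys of the returned dictionary can then be compared against the sequence names in the FASTA file to check that every structure will be matched to a sequence.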



As of StructAlign v1.1 (released with StatAlign v3.2), analysis is also possible in cases where structures are available only for subsets of the sequences in the dataset, provided that at least two structures are included.

Whichever way structures are read in, summary information will be displayed alongside the associated sequence in the Sequences panel. Once the required structures have been loaded, the MCMC run can be started.

[ The protein structure (StructAlign) plugin ]

Visualising parameter convergence

To assist with monitoring convergence of the parameters associated with the structural model, there is a GUI panel that shows trace plots for each parameter. Values in the burn-in period are shown in red, and after the burn-in, in blue. When all of these trace plots have stabilised, this is a good indicator that the MCMC chain has converged to the desired stationary distribution. Also shown in the top-left of these plots is the current acceptance ratio for MCMC moves on each parameter. Ideally these should end up between 0.2 and 0.4. If any of these values are significantly outside this range once the burn-in is complete, it may imply that a longer burn-in period should be used.
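
The acceptance ratio is the number of accepted proposals divided by the number of proposals made. The Python sketch below computes it and flags parameters that fall outside the 0.2 to 0.4 range mentioned above. The parameter labels and counts are made up for illustration and are not taken from StatAlign output.

```python
def acceptance_ratio(accepted, proposed):
    """Fraction of proposed MCMC moves that were accepted."""
    if proposed <= 0:
        raise ValueError("at least one move must have been proposed")
    return accepted / proposed


def outside_target(ratios, low=0.2, high=0.4):
    """Return the parameters whose acceptance ratio lies outside [low, high]."""
    return {name: r for name, r in ratios.items() if not (low <= r <= high)}


# Hypothetical counts of accepted and proposed moves per parameter.
ratios = {
    "tau": acceptance_ratio(30, 100),
    "epsilon": acceptance_ratio(5, 100),
}
flagged = outside_target(ratios)
```

In this toy example only "epsilon" is flagged, since 5 accepted moves out of 100 is well below the lower bound.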

Structural annotations on alignment

As an additional aid to analysing the output, two additional tracks are added above the current alignment. These correspond to the average pairwise RMSD for each column (red), and the predicted heterogeneity derived from crystallographic B-factors when available (green). These are shown alongside the posterior probability of the alignment column (blue). When the RMSD postprocessing plugin is enabled, the pairwise RMSD and B-factor associated with each column in the maximum likelihood alignment are output to a .rmsd.mle file.

[ The protein structure (StructAlign) plugin ]

Output generated

For each run with the StructAlign plugin enabled, a set of additional output files is generated. These files all contain as the base name a prefix corresponding to the name of the first data file read in, denoted by FILE below. The .struct.params file contains posterior samples for each of the unknown parameters that are estimated as part of the StructAlign model:

Parameter name    Meaning
τ                 Overall structural variance (similar to squared radius of gyration, units of Å²)
ε                 Amount of structural variability attributed to background (non-evolutionary) fluctuations
σ²_g              Global structural diffusivity (Å² / subst. per site)
σ²_k              Branch-specific structural diffusivity (Å² / subst. per site)
ν                 Variance of branch-specific diffusivity parameters (on a log scale)

The first line of the .struct.params file contains the names of the parameters, and each column contains a set of posterior samples for the specified parameter. Each row corresponds to one MCMC iteration, beginning after the burn-in period has ended.
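
A file laid out this way can be summarised with a few lines of Python. The sketch below assumes that columns are separated by whitespace and that parameter names contain no spaces; the exact delimiter and the column names used by StatAlign are not specified here, so the names in the example are placeholders.

```python
import statistics


def read_params(text):
    """Return {parameter_name: [samples]} from header-plus-columns text."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    columns = {name: [] for name in header}
    for row in data:
        if len(row) != len(header):
            raise ValueError(f"row has {len(row)} fields, expected {len(header)}")
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


# Placeholder column names and values, for illustration only.
params_text = """tau epsilon
1.0 0.5
2.0 0.5
3.0 0.5
"""
params = read_params(params_text)
# Posterior mean of each parameter over the post-burn-in samples.
posterior_means = {name: statistics.mean(values) for name, values in params.items()}
```

Because every row is a post-burn-in sample, no rows need to be discarded before computing such summaries.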

The RMSD trace can be switched on via the Output settings menu. The maximum likelihood superposition is written to a PDB file, in which each structure is present as a separate chain, beginning at 'A'. This can be read by most structural viewing programs, for example VMD.



[ The protein structure (StructAlign) plugin ] [ Contents ]

References

  1. Herman JL, Challis CJ, Novák Á, Hein J and Schmidler SC (2014) Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Molecular Biology and Evolution, 31(9):2251-2266.
  2. Herman JL, Novák Á, Challis CJ, Schmidler SC and Hein J. StatAlign 3: statistical alignment and phylogenetics with protein structures. (submitted, November 2013)


[ The protein structure (StructAlign) plugin ] [ Contents ]

The RNA plugin

[ Tutorial ]
[ Introduction ]
[ Activating the RNA mode ]
[ RNA mode settings ]
[ RNA mode visualisation ]
[ Information entropy ]
[ Similarity ]
[ RNA mode file output ]

Tutorial

If you are a first-time StatAlign user and are interested in RNA secondary structures, the best place to start is probably the StatAlign RNA tutorial, which is a concise description of the steps needed to perform a typical analysis on RNA sequences.

Introduction

The RNA plugin performs secondary structure predictions from multiple alignment samples generated by StatAlign. Both an SCFG approach (PPfold) and a thermodynamic approach (RNAalifold) are available.

The advantage of this method of secondary structure prediction is that it considers multiple sampled alignments instead of just a single fixed alignment, as is typically the case with comparative secondary structure prediction.

Here we describe the options available through the graphical interface. Options available through the command line interface can be listed by running

java -jar StatAlign.jar -help:rnaalifold
java -jar StatAlign.jar -help:ppfold

The following flowchart summarises what StatAlign does when RNA mode is enabled:
[ The RNA plugin ]

Activating the RNA mode

Before the RNA mode can be activated, a nucleotide alignment needs to be loaded. Once this is done, the RNA mode can be activated by clicking on the 'RNA mode' icon.



[ The RNA plugin ]

RNA mode settings

Activating the RNA mode brings up a settings dialog. Two different prediction methods are available for predicting structures from multiple alignment samples. The first is an SCFG (Stochastic Context-Free Grammar) approach as implemented in PPfold ("Sampling and averaging (PPfold)"). Two predictions are produced using this approach: the standard sampling and averaging prediction, which produces a secondary structure from an averaged base-pairing probability matrix, and a consensus evolutionary approach, which provides a secondary structure prediction together with an information entropy value that takes into account contributions from alignment samples.

A thermodynamic method ("Sampling and averaging (RNAalifold)") can also be used. This method requires that the user specify the RNAalifold executable (this shouldn't be necessary on Windows). It allows the user to specify various folding parameters, such as the folding temperature and the genome conformation. The StatAlign GUI, however, only provides some of these options. By running StatAlign from the command line, additional RNAalifold parameters can be specified (see the RNAalifold manpage for a list of parameters).




The GUI RNAalifold options are as follows:

[ The RNA plugin ]

RNA mode run

Once the RNA mode is activated it will run alongside the normal StatAlign run.

[ The RNA plugin ]

RNA mode visualisation

During and after the run there are various tabs available for secondary structure visualisation. The first is the "Consensus Structure" tab, which utilises VARNA (Darty et al. (2009)) to display the consensus secondary structure. Two modes are available: "Normal" mode, which displays the nucleotides using a set of 4 colours to represent each base, and "Probability" mode, which represents unpaired nucleotides in red and base-paired nucleotides in blue. In the latter mode, the intensity of the red or blue represents the probability that a nucleotide is unpaired or the probability that a pair of nucleotides is base-paired, respectively.



A second type of visualisation is available in the "Base-pairing matrix" tab which displays the average base-pairing probability matrix for alignment samples taken up to that point in the run. The blue intensity of a cell indicates the expected probability that the particular pair of nucleotides at that position are base-paired in the structure.



Note: only the sampling and averaging methods provide visualisations, and by default only the PPfold method is shown; the RNAalifold visualisation is shown if the PPfold method is deactivated.

[ The RNA plugin ]

Information entropy

Information entropy is a measure of the spread of a probability distribution. The PPfold method is an SCFG (Stochastic Context-Free Grammar) method used to predict secondary structures. This SCFG method places a probability distribution on secondary structures, which means that the information entropy of this probability distribution can be calculated (Anderson et al. (2012), in preparation, available upon request). A low entropy corresponds to a probability distribution where the probability mass is concentrated on a few structures, whereas a high entropy indicates that the probability mass is spread over many structures. The latter is undesirable, as it can be an indication that there are many alternative secondary structures which are almost as probable as the most probable structure.
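
The contrast between low and high entropy can be seen in a toy calculation. The Python sketch below computes the Shannon entropy, in bits, of an explicitly listed probability distribution over four candidate structures. This is only an illustration of the quantity itself: the probabilities are invented, and an SCFG method would compute the entropy over all possible structures without listing them individually.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    total = sum(probabilities)
    if not math.isclose(total, 1.0, rel_tol=0.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Probability mass concentrated on one structure: low entropy.
concentrated = [0.97, 0.01, 0.01, 0.01]
# Probability mass spread evenly over four structures: the maximum of 2 bits.
spread = [0.25, 0.25, 0.25, 0.25]

low = entropy_bits(concentrated)
high = entropy_bits(spread)
```

With four equally likely structures the entropy is exactly 2 bits, the largest value possible for four outcomes, while the concentrated distribution gives a much smaller value.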

The information entropy calculation provided in StatAlign has been extended to reflect the fact that the PPfold sampling and averaging method samples over alignments, instead of using a single fixed alignment.

In the 'Entropy' tab, two line graphs are displayed depicting the 'Sample Entropy', which is the information entropy of the PPfold structure predictions on individual alignment samples measured in bits, and 'Consensus Entropy' which is the extended information entropy that takes into account alignment space. The 'Consensus Entropy' should approach a constant as the number of samples increases.

The final information entropy value can be found in the ".info" file on the header line corresponding to the consensus evolutionary prediction.

[ The RNA plugin ]

Alignment similarity

In the 'Similarity' tab, a graph is shown depicting the similarity between the first alignment sample and each of the subsequent alignment samples. This provides a visual representation of auto-correlation between alignment samples. The similarity should gradually decrease and reach a plateau as the time between the first sample and a given alignment becomes large, indicating that enough cycles have been taken between samples that they are no longer significantly auto-correlated.

[ The RNA plugin ]

RNA mode file output

Various secondary structure output files are available for each method in the StatAlign folder at the end of a run.
The naming convention used is: <dataset title>.<method>.<format extension>. The terms are defined as follows:

e.g. 'RNAData5.ppfold.ct' is the maximum posterior decoding structure produced by the PPfold sampling and averaging approach for the dataset 'RNAData5' in connect format.
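
When collecting results from many runs, it can be handy to split such file names back into their three parts. The Python sketch below does this under the assumption that neither the method nor the format extension contains a dot (the dataset title may); the helper name split_output_name is hypothetical.

```python
def split_output_name(filename):
    """Split '<dataset title>.<method>.<format extension>' into its three parts."""
    # Split from the right so that dots inside the dataset title are preserved.
    parts = filename.rsplit(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not of the form title.method.extension: {filename!r}")
    title, method, extension = parts
    return title, method, extension


title, method, extension = split_output_name("RNAData5.ppfold.ct")
```

For the example above this yields the dataset title 'RNAData5', the method 'ppfold' and the format extension 'ct'.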

Finally, a summary file is also available (<dataset title>.info) that lists the maximum posterior decoding consensus structure for each method that was run, along with some metadata fields in the title of each. The fields are described as follows: Example summary file:

[ The RNA plugin ] [ Contents ]