StatAlign – Developer's manual

For StatAlign v3.2 – last updated 9 Feb 2015

The most up-to-date version of this document is available from StatAlign's website.

[Introduction]
[About StatAlign]
[Getting started]
[Adding substitution models]
[Developing additional postprocessing panels]
[Javadocs and source code]

Introduction

This help page is for developers, and describes how to implement extensions to the StatAlign package, such as new substitution models or postprocessing plugins.

You do not need to understand these topics if you only want to use StatAlign; in that case, see the user's manual instead.

About StatAlign

StatAlign is an extendable Java program that implements Markov chain Monte Carlo (MCMC) sampling of evolutionary trees, sequence alignments and model parameters from a statistical sequence evolution model. Sequence insertions and deletions are described by the TKF92 (Thorne et al., 1992) indel model and substitutions are assumed to follow a continuous-time Markov process. The latter can be parameterised to varying degrees, thus giving rise to different substitution models, of which new ones can also be defined.

Additionally, StatAlign allows the alignment, tree and parameter samples to be processed in innovative ways and to this end provides an easy-to-use interface for postprocessing plugins. The possibilities range from visualising different dimensions of the state of the Markov chain to summarising alignment or tree samples into one consensus entity to running external applications such as RNA folding on the samples with the aim of incorporating alignment uncertainty into the analyses.

In StatAlign, adding functionality that falls into the above categories boils down to creating a single class in a specific plugins package, which extends the abstract superclass corresponding to the plugin type, and making the compiled class available in the classpath together with the StatAlign core package when running. StatAlign's auto-discovery mechanism ensures that the classpath is traversed to locate all available plugins of each type.

Below we describe in detail how to implement new plugins of different types.

Getting started

StatAlign was written in Java 6 using the Eclipse framework. If you would like to start developing extensions for StatAlign, the first step is to download Eclipse. We recommend an Eclipse distribution that comes with EGit and Eclipse Marketplace support, such as Eclipse for RCP and RAP Developers.

Once you have a working Eclipse installation, use the EGit plugin's Git Repository Exploring Perspective to clone the StatAlign repository at https://github.com/statalign/statalign.git to your project directory. Please refer to EGit's documentation for the details.

Right click the newly cloned repository in the list of repositories and use Import Projects... This will import StatAlign as a Java project and you can get straight into coding. Good luck!

Adding substitution models

[Description of the fields in SubstitutionModel]
[Description of the methods of SubstitutionModel]

A novel substitution model has to extend the abstract class SubstitutionModel. Once such a descendant class has been developed, its compiled class should be copied into the statalign.model.subst.plugins sub-package. StatAlign automatically recognizes any novel substitution model. Recognized models are listed in the Model menu, and can be selected as the substitution model that accompanies the insertion-deletion model.

Description of the fields in `SubstitutionModel`

double[][] v, w, double[] d
The rate matrix must be diagonalized and represented as a product v d w, where d is a diagonal matrix. v and w are two-dimensional arrays; d is a one-dimensional double array containing the diagonal values (i.e., the eigenvalues). The indexes of the arrays must agree with the representation of characters: the entries at index 0 correspond to the character of the alphabet that has code 0, and so on.
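The following is a minimal sketch of why the v d w representation is convenient, assuming that w is the inverse of v (so that the rate matrix equals v d w). Under that assumption the transition probabilities over a branch of length t are v exp(d t) w, which only requires exponentiating the eigenvalues. The class name, the method name and the two-letter example model are illustrative and are not part of StatAlign.

```java
import java.util.Arrays;

/** Sketch: transition probabilities from a diagonalized rate matrix Q = v * diag(d) * w. */
class DiagonalizedRateSketch {

    /** Returns P(t) = v * diag(exp(d[k] * t)) * w, assuming w is the inverse of v. */
    static double[][] transitionMatrix(double[][] v, double[] d, double[][] w, double t) {
        int n = d.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += v[i][k] * Math.exp(d[k] * t) * w[k][j];
                }
                p[i][j] = sum;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical two-letter alphabet: rate a from character 0 to 1, rate b from 1 to 0.
        double a = 0.3, b = 0.7;
        double[][] v = {{1.0, a}, {1.0, -b}};          // right eigenvectors as columns
        double[] d = {0.0, -(a + b)};                  // eigenvalues
        double[][] w = {{b / (a + b), a / (a + b)},    // inverse of v
                        {1.0 / (a + b), -1.0 / (a + b)}};
        System.out.println(Arrays.deepToString(transitionMatrix(v, d, w, 0.5)));
    }
}
```

At t = 0 the result is the identity matrix, and for large t every row approaches the equilibrium distribution, which for this toy model is (b/(a+b), a/(a+b)).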

double[] e
The one-dimensional double array e contains the equilibrium probabilities. Its indexing must be consistent with the indexing of the arrays v, w and d.

char[] alphabet
The one dimensional char array gives the list of possible characters that the model accepts. Used in writing nexus alignments.

double[] parameters
The one dimensional double array contains the parameters of the model. The corresponding methods must be aware which parameter is available at which index.

String type
This String tells whether the model is a nucleic acid model or a protein model. It is used when creating nexus alignments and when sorting models in the Models menu. In the Models menu, models with type "protein" are listed first, then models with type "nucleotide", and then the rest. Currently the 'rest' part is empty.

SubstitutionScore attachedScoringScheme
This SubstitutionScore is the substitution score class that belongs to the model. Note that this class transforms the characters in the input sequences into arrays containing Felsenstein's likelihoods. There are two substitution score classes available at the moment in StatAlign. The Blosum62 class implements the BLOSUM 62 scoring matrix and transforms non-ambiguous IUPAC one-letter amino acid codes into Felsenstein's likelihood arrays. The DNAScore class implements a simple scoring matrix for nucleic acids, and transforms ambiguous and non-ambiguous IUPAC one-letter nucleic acid codes into Felsenstein's likelihood arrays. Codon models are not supported directly in the current version of StatAlign. A possible indirect way would be to first transform coding DNA sequences into one-letter codon codes (three nucleic acids are represented by a single character), and to implement both a codon SubstitutionScore class that recognizes this one-letter code and a corresponding SubstitutionModel class.

Description of the methods of `SubstitutionModel`

double acceptable(RawSequences r)
This function decides whether the model can accept a set of sequences represented as RawSequences. r.sequences is a String array that contains the input sequences. The function may throw a RecognizingError; the MainFrame handles the thrown error and shows a pop-up window with the error message. If the jth character of the ith sequence cannot be recognized, the standard way to construct the RecognizingError is:

throw new RecognizingError(getMenuName()+" cannot accept the sequences because it contains character '"+r.sequences[i].charAt(j)+"'!\n");

If the model can accept the sequences, the function returns a number between 0 and 1 that expresses how well the sequences fit the model. For example, sequences containing only 'a's and 'c's could be either DNA or protein sequences, though they are more likely to be DNA sequences.
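As a rough illustration of the contract described above, the sketch below scores sequences for a hypothetical protein model. It is self-contained, so it takes a plain String array instead of RawSequences and throws a stand-in exception instead of RecognizingError. The scoring rule (penalising characters that are also nucleotide codes) and the 0.9 weight are arbitrary assumptions, not StatAlign's actual heuristic.

```java
import java.util.Locale;

/** Sketch of an acceptable()-style check for a hypothetical protein model. */
class AcceptabilitySketch {

    /** Stand-in for StatAlign's RecognizingError, declared here so the sketch compiles alone. */
    static class UnrecognizedCharacter extends RuntimeException {
        UnrecognizedCharacter(String message) {
            super(message);
        }
    }

    static final String AMINO_ACIDS = "acdefghiklmnpqrstvwy";
    static final String ALSO_NUCLEOTIDES = "acgt";

    /** Returns a score in [0, 1]; lower when the sequences look like DNA. */
    static double proteinAcceptability(String[] sequences) {
        int total = 0;
        int nucleotideLike = 0;
        for (int i = 0; i < sequences.length; i++) {
            String s = sequences[i].toLowerCase(Locale.ROOT);
            for (int j = 0; j < s.length(); j++) {
                char c = s.charAt(j);
                if (AMINO_ACIDS.indexOf(c) < 0) {
                    throw new UnrecognizedCharacter(
                            "Sketch model cannot accept the sequences because they contain character '" + c + "'!");
                }
                total++;
                if (ALSO_NUCLEOTIDES.indexOf(c) >= 0) {
                    nucleotideLike++;
                }
            }
        }
        if (total == 0) {
            return 0.0; // nothing to judge
        }
        // Arbitrary heuristic: an all-nucleotide-like input still scores 0.1, not 0.
        return 1.0 - 0.9 * nucleotideLike / total;
    }

    public static void main(String[] args) {
        System.out.println(proteinAcceptability(new String[] {"acca", "mkvl"}));
    }
}
```

A nucleotide model could use the mirror-image rule, so that for ambiguous input such as "acca" the nucleotide model reports the higher score.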

String getMenuName()
This function should return a String containing the name of the model. This name will appear in the Model menu.

double sampleParameter()
This function should propose a change in the parameters and return the logarithm of the ratio of the back-proposal and proposal probabilities. The v, w, d matrices and the equilibrium array e must be updated according to the proposed parameters. The old parameter values must be stored in case the proposal is rejected.

void restoreParameter()
This function has to restore the old parameter values and the v, w, d matrices and the equilibrium array e accordingly.
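The propose-and-restore pattern of sampleParameter() and restoreParameter() can be sketched as follows for a model with a single positive rate parameter. The sketch uses a multiplier proposal, chosen here only for illustration: the parameter is multiplied by m = exp(SPAN * (u - 0.5)) with u uniform on [0, 1), for which the ratio of back-proposal to proposal densities is m itself. The class name, the SPAN constant and the seeded Random are assumptions; a real model would also recompute v, w, d and e where the comments indicate.

```java
import java.util.Random;

/** Sketch of the propose/restore pattern for one positive parameter. */
class ProposalSketch {

    static final double SPAN = 0.5;   // hypothetical tuning constant

    double[] parameters = {1.0};      // mirrors the 'parameters' field
    private double oldValue;          // saved in case the proposal is rejected
    private final Random random;

    ProposalSketch(long seed) {
        random = new Random(seed);
    }

    /** Proposes a new value and returns log(back-proposal density / proposal density). */
    double sampleParameter() {
        oldValue = parameters[0];
        double logMultiplier = SPAN * (random.nextDouble() - 0.5);
        parameters[0] = oldValue * Math.exp(logMultiplier);
        // A real model would recompute v, w, d and e from the new parameter here.
        return logMultiplier;
    }

    /** Puts back the value saved by the last call to sampleParameter(). */
    void restoreParameter() {
        parameters[0] = oldValue;
        // A real model would recompute (or restore) v, w, d and e here as well.
    }

    public static void main(String[] args) {
        ProposalSketch model = new ProposalSketch(1L);
        double logRatio = model.sampleParameter();
        System.out.println("proposed " + model.parameters[0] + ", log ratio " + logRatio);
        model.restoreParameter();
        System.out.println("restored " + model.parameters[0]);
    }
}
```

A symmetric proposal, such as adding a uniform offset, would return 0 instead; the multiplier keeps the parameter positive without any reflection step.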

String print()
This function returns a String reporting the current parameter values.

Color getColor(char c)
This function returns the background color of character c. It is used when printing alignments on the screen.

char mostLikely(double[] seq)
This function receives a Felsenstein's likelihood array and returns the most likely character. It is used when printing alignments, for example at the ancestral nodes of the tree.
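A minimal sketch of mostLikely() is an argmax over the likelihood array, mapped back to a character through the alphabet. It assumes that index i of the array corresponds to character i of the alphabet, as the field descriptions require. The four-letter alphabet and the tie-breaking rule (first maximum wins) are illustrative choices.

```java
/** Sketch: pick the character with the largest Felsenstein likelihood. */
class MostLikelySketch {

    static final char[] ALPHABET = {'a', 'c', 'g', 't'};   // hypothetical alphabet

    /** Returns the alphabet character at the index of the largest entry of seq. */
    static char mostLikely(double[] seq) {
        int best = 0;
        for (int i = 1; i < seq.length; i++) {
            if (seq[i] > seq[best]) {
                best = i;
            }
        }
        return ALPHABET[best];
    }

    public static void main(String[] args) {
        System.out.println(mostLikely(new double[] {0.1, 0.2, 0.6, 0.1}));
    }
}
```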

Developing additional postprocessing panels

[Description of the fields in Postprocess]
[Description of the methods of Postprocess]
[Examples]

A novel postprocessing class has to extend the abstract class Postprocess. Once such a descendant class has been developed, its compiled class should be copied into the statalign.postprocess.plugins sub-package. StatAlign automatically recognizes any novel postprocessing method. Each recognized method gets a tabbed panel that is added to the main frame. Recognized methods can also be selected to generate their own output file in the Output preferences pop-up window.

Description of the fields in `Postprocess`

boolean selected
True if plugin is selected in the menu (and thus a tab is created for the plugin in the main window that can be used to allow the user to change settings before MCMC start and to show runtime information afterwards).

boolean screenable
True if the plugin can generate a GUI. Not used in the current version; it is reserved for further development, in case the GUIs need to be switched on and off.

boolean active
True if plugin is active (must produce its output) either because other plugins depend on it or because it is selected.

boolean outputable
True if this class can generate an output.

boolean postprocessable
True if this class can generate a postprocess file.

boolean sampling
True if it writes into the log file.

boolean postprocessWrite
True if it writes a postprocess file.

String alignmentType
This String gives the alignment type in which alignments must be presented to the plugin.

FileWriter file
This is the log file writer. It is written to during the run and receives information from all postprocess plugins.

FileWriter outputFile
This is the output file writer, that is written by a specific postprocess plugin.

Mcmc mcmc
This is the current Mcmc object. Use this to access the current state of the Mcmc via mcmc.tree.

Description of the methods of `Postprocess`

String getTabName()
This function returns the name of the method. This name will be written on the tab of its panel.

Icon getIcon()
This function returns the icon of the method. This icon will appear on the tab of its panel.

JPanel getJPanel()
Returns the panel of the GUI. If necessary, developers can write their plugin's own GUI; we suggest that such a GUI be put into the statalign.postprocess.gui subpackage.

String getTip()
Returns the tip information (shown when the mouse cursor is moved over the label of the tabbed panel).

String[] getDependences()
Override this and return an array of the fully qualified class names of the plugins this plugin depends on.

void refToDependences(Postprocess[] plugins)
Override this to get access to instances of the plugins your plugin depends on. This function will be called by the PostprocessManager during its initialisation. The parameter plugins is an array of references to Postprocess objects in the order in which they are specified in getDependences(), or null if getDependences() returns null.

void beforeFirstSample()
Called before the MCMC starts. This is the first point at which the postprocess plugin can use PostprocessManager.mcmc to access the internal data structures.

void newStep()
Called whenever a new step is made. A typical MCMC run takes hundreds of thousands of steps, so override this function only if your code takes a negligible amount of time and does not use too much memory. It is used, for example, to draw the log-likelihood trace.

void newSample(int no, int total)
This function is called when we sample from the Markov chain. Parameter no is the number of the current sample, total is the number of the total samples.

void setSampling(boolean enabled)
This function switches on or off the sampling mode. The parameter enabled is set to true if samples are needed.

void afterLastSample()
This function is called after the MCMC run has finished.
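The order in which the callbacks above fire can be illustrated with the toy sketch below. Everything in it is a stand-in: PostprocessStandIn only mimics the four callback names from this section, and the run() driver is a simplified loop written for this example, not StatAlign's Mcmc or PostprocessManager. Whether sample numbers start at 0 or 1 in StatAlign is not stated here; the sketch assumes 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the postprocess callback order. */
class LifecycleSketch {

    /** Hypothetical stand-in that mimics four callbacks of Postprocess. */
    abstract static class PostprocessStandIn {
        void beforeFirstSample() {}
        void newStep() {}
        void newSample(int no, int total) {}
        void afterLastSample() {}
    }

    /** Example plugin: counts steps cheaply and records the other events. */
    static class EventRecorder extends PostprocessStandIn {
        final List<String> events = new ArrayList<String>();
        int steps;

        @Override void beforeFirstSample() { events.add("before"); }
        @Override void newStep() { steps++; }   // keep per-step work negligible
        @Override void newSample(int no, int total) { events.add("sample " + no + "/" + total); }
        @Override void afterLastSample() { events.add("after"); }
    }

    /** Toy driver: a fixed number of steps between consecutive samples. */
    static void run(PostprocessStandIn plugin, int totalSamples, int stepsPerSample) {
        plugin.beforeFirstSample();
        for (int no = 0; no < totalSamples; no++) {
            for (int s = 0; s < stepsPerSample; s++) {
                plugin.newStep();
            }
            plugin.newSample(no, totalSamples);
        }
        plugin.afterLastSample();
    }

    public static void main(String[] args) {
        EventRecorder recorder = new EventRecorder();
        run(recorder, 2, 3);
        System.out.println(recorder.events + ", steps = " + recorder.steps);
    }
}
```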

Examples

The MpdAlignment plugin is screenable, outputable and postprocessable. The CurrentAlignment plugin is screenable, outputable, but not postprocessable.
The CurrentAlignment and MpdAlignment plugins use the same GUI, the AlignmentGUI class.
The LogLikelihoodTrace plugin uses a LogLikelihoodTraceContainer class to store information about loglikelihoods and whether the loglikelihood value comes from a burn-in phase or after burn-in. The Mcmc class has a public boolean variable burnin which is true if the MCMC run is in the burn-in phase.
If you would like to develop a plugin that predicts secondary structures mapping one known structure to other sequences, you can create a plugin that depends on the CurrentAlignment plugin. CurrentAlignment has two public String arrays, allAlignment and leafAlignment from which you can read the multiple alignment of all sequences (both at the leaves of the tree and at the internal nodes) and the multiple alignment of sequences at the leaves.
You can use the beforeFirstSample() function to interact with the users and ask for an additional file from which the known secondary structures might be read.
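The structure-mapping idea from the example above can be sketched as follows. The code projects a known dot-bracket secondary structure from one aligned sequence onto another through their alignment rows. It assumes that the rows are plain gapped strings of equal length with '-' as the gap character (the exact format of the entries of leafAlignment is not specified here), that the structure is balanced and uses only '(', ')' and '.', and that a base pair is kept only if both of its columns are ungapped in the target. The class and method names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Sketch: map a dot-bracket structure from a reference row to a target row. */
class StructureProjectionSketch {

    /**
     * @param refRow       gapped alignment row of the sequence with known structure
     * @param refStructure dot-bracket structure over the ungapped reference
     * @param targetRow    gapped alignment row of the target sequence
     * @return dot-bracket structure over the ungapped target
     */
    static String project(String refRow, String refStructure, String targetRow) {
        int columns = refRow.length();
        int[] partner = new int[columns];   // paired alignment column, or -1
        Arrays.fill(partner, -1);
        Deque<Integer> open = new ArrayDeque<Integer>();
        int pos = 0;                        // position in refStructure
        for (int c = 0; c < columns; c++) {
            if (refRow.charAt(c) == '-') {
                continue;                   // gap columns carry no structure
            }
            char s = refStructure.charAt(pos++);
            if (s == '(') {
                open.push(c);
            } else if (s == ')') {
                int o = open.pop();
                partner[o] = c;
                partner[c] = o;
            }
        }
        StringBuilder out = new StringBuilder();
        for (int c = 0; c < columns; c++) {
            if (targetRow.charAt(c) == '-') {
                continue;
            }
            int p = partner[c];
            if (p >= 0 && targetRow.charAt(p) != '-') {
                out.append(p > c ? '(' : ')');   // pair survives in the target
            } else {
                out.append('.');                 // unpaired, or partner lost to a gap
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(project("GG-AAACC", "((...))", "G-UAAACC"));
    }
}
```

In a plugin that depends on CurrentAlignment, project() could be called from newSample() for each pair of rows, with the known structure read once in beforeFirstSample().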

Javadocs and source code

You can get further information about the structure of StatAlign's code and the description of classes, methods and variables at StatAlign's Javadoc page.

To browse or download the complete source code, please visit StatAlign's GitHub page.