This help page is for developers, and describes how to implement extensions to the StatAlign package, such as new substitution models or postprocessing plugins.
You do not need to understand these topics if you only want to use StatAlign. The user's manual can be found here.
StatAlign is an extendable Java program that implements Markov chain Monte Carlo (MCMC) sampling of evolutionary trees, sequence alignments and model parameters from a statistical sequence evolution model. Sequence insertions and deletions are described by the TKF92 (Thorne et al., 1992) indel model and substitutions are assumed to follow a continuous-time Markov process. The latter can be parameterised to varying degrees, thus giving rise to different substitution models, of which new ones can also be defined.
Additionally, StatAlign allows the alignment, tree and parameter samples to be processed in innovative ways and to this end provides an easy-to-use interface for postprocessing plugins. The possibilities range from visualising different dimensions of the state of the Markov chain to summarising alignment or tree samples into one consensus entity to running external applications such as RNA folding on the samples with the aim of incorporating alignment uncertainty into the analyses.
In StatAlign, adding functionality that falls into the above categories boils down to creating a single class in a specific plugins package, which extends the abstract superclass corresponding to the plugin type, and making the compiled class available in the classpath together with the StatAlign core package when running. StatAlign's auto-discovery mechanism ensures that the classpath is traversed to locate all available plugins of each type.
Below we describe in detail how to implement new plugins of different types.
StatAlign was written in Java 6 using the Eclipse framework. If you would like to start developing extensions for StatAlign, the first step is to download Eclipse. We recommend an Eclipse distribution that comes with EGit and Eclipse Marketplace support, such as Eclipse for RCP and RAP Developers.
Once you have a working Eclipse installation, use the EGit plugin's Git Repository Exploring Perspective to clone the StatAlign repository at https://github.com/statalign/statalign.git to your project directory. Please refer to EGit's documentation for the details.
Right-click the newly cloned repository in the list of repositories and choose
Import Projects... This will import StatAlign as a Java project,
and you can get straight into coding. Good luck!
Adding substitution models
A novel substitution model has to extend the abstract class SubstitutionModel. Once such a descendant class has been developed, its compiled class should be copied into the statalign.model.subst.plugins sub-package. StatAlign automatically recognizes any novel substitution model. Recognized models are listed in the Model menu and can be selected as the substitution model accompanying the insertion-deletion model.
double[][] v, w, double[] d
The rate matrix must be diagonalized and represented as a
product v d w, where d is a diagonal matrix.
v and w are two-dimensional arrays, and d
is a one-dimensional double array containing the diagonal values
(i.e., the eigenvalues). Array indices must agree with the
representation of characters: the entries at index 0 must
correspond to the character of the alphabet having code 0, and so on.
double[] e
The one-dimensional double array e contains the
equilibrium probabilities. Its indexing must be consistent with
the indexing of the arrays v, w and d.
char[] alphabet
The one-dimensional char array gives the list of possible characters
that the model accepts. It is used when writing NEXUS alignments.
double[] parameters
The one-dimensional double array contains the parameters of the model.
The methods that use it must know which parameter is stored
at which index.
String type
This String tells whether the model is a nucleic acid model or a protein model.
It is used when creating NEXUS alignments and when sorting models in the
Models menu. In the Models menu, models with type "protein" are
listed first, then models with type "nucleotide", and then the rest. Currently the
'rest' part is empty.
SubstitutionScore attachedScoringScheme
This SubstitutionScore specifies the corresponding
substitution score class. Note that this class will transform the
characters in the input sequences into arrays containing Felsenstein's
likelihoods. There are two substitution score classes available at the moment
in StatAlign. The Blosum62 class implements the BLOSUM 62 scoring
matrix and transforms non-ambiguous IUPAC one-letter amino acid codes into
Felsenstein's likelihood arrays. The DNAScore class implements a simple
scoring matrix for nucleic acids, and transforms ambiguous and non-ambiguous
IUPAC one-letter nucleic acid codes into Felsenstein's likelihood arrays.
Codon models are not supported directly in the current version of
StatAlign. A possible indirect way would be to first transform
coding DNA sequences into one-letter codon codes (three nucleic acids are
represented by a single character), and to implement both a
codon SubstitutionScore class that recognizes this one-letter code,
and a corresponding SubstitutionModel class.
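To illustrate why the rate matrix is stored in diagonalized form, the following minimal sketch computes transition probabilities for a branch of length t as v · exp(d t) · w. It assumes that w is the inverse of v, which the text above does not state explicitly. The class name DiagonalizedRates, the method transitionMatrix and the two-state toy model are hypothetical and not part of StatAlign; only the names v, w and d follow the field descriptions above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of StatAlign. It shows how a diagonalized
// rate matrix Q = v * diag(d) * w (assuming w is the inverse of v) yields
// transition probabilities P(t) = v * diag(exp(d * t)) * w.
class DiagonalizedRates {

    static double[][] transitionMatrix(double[][] v, double[] d, double[][] w, double t) {
        int n = d.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    // Only the eigenvalues are exponentiated; v and w stay fixed.
                    sum += v[i][k] * Math.exp(d[k] * t) * w[k][j];
                }
                p[i][j] = sum;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Toy two-state model with rate matrix Q = [[-1, 1], [1, -1]].
        double[][] v = {{1, 1}, {1, -1}};
        double[] d = {0, -2};
        double[][] w = {{0.5, 0.5}, {0.5, -0.5}};
        System.out.println(Arrays.deepToString(transitionMatrix(v, d, w, 0.1)));
    }
}
```

With this representation, changing the branch length only requires exponentiating the entries of d, which is presumably why the model keeps v, w and d instead of the raw rate matrix.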
double acceptable(RawSequences r)
This function decides whether the model can accept a set of sequences represented as
RawSequences. r.sequences is a String array that contains the input sequences.
This function might throw a RecognizingError; such an error is
handled by the MainFrame and yields a pop-up window with the error message.
If the jth character of the ith sequence cannot be recognized, the standard
way to construct the RecognizingError is:
throw new RecognizingError(getMenuName()+" cannot accept the sequences because it contains character '"+r.sequences[i].charAt(j)+"'!\n");
If the model can accept the sequences, the function returns a number between 0 and 1, depending on how well the sequences fit the model. For example, sequences containing only 'a's and 'c's might be either DNA or protein sequences, though it is more likely that they are DNA sequences.
String getMenuName()
This function should return a String containing the name of the model.
This name will appear in the Model menu.
double sampleParameter()
This function should propose a change in the parameters and should return
the logarithm of the ratio of the back-proposal and proposal probabilities.
The v, w, d matrices and the equilibrium array e must be updated
according to the proposed parameters. The old values of the parameters
must be stored in case the proposal is rejected.
void restoreParameter()
This function has to restore the old parameter values and
the v, w, d matrices and the equilibrium array e accordingly.
String print()
This function returns a String reporting the current parameter values.
Color getColor(char c)
This function returns the background color of character c. It is used when
printing alignments on the screen.
char mostLikely(double[] seq)
This function receives a Felsenstein likelihood array and returns
the most likely character. It is used when printing alignments, for example, at
ancestral nodes of the tree.
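The contract between sampleParameter() and restoreParameter() described above can be sketched as follows. This is a stand-alone illustration, not StatAlign code: the class ProposalSketch, the helper updateDerived(), the constant SPAN and the choice of a multiplier proposal are all assumptions made for the example, and the array d merely stands in for the derived v, w, d and e arrays of a real model.

```java
import java.util.Random;

// Hypothetical stand-alone sketch, not StatAlign code. It mirrors only the
// sampleParameter()/restoreParameter() contract described above.
class ProposalSketch {
    double[] parameters = {2.0};   // a single positive rate parameter (assumed)
    double[] d = new double[2];    // stands in for the derived v, w, d, e arrays
    private double oldParameter;   // backup used when a proposal is rejected
    private final Random random;
    private static final double SPAN = 0.5; // tuning width of the proposal (assumed)

    ProposalSketch(long seed) {
        random = new Random(seed);
        updateDerived();
    }

    // Recompute everything that depends on the parameters. A real model
    // would rebuild v, w, d and e here; this toy only fills d.
    private void updateDerived() {
        d[0] = 0.0;
        d[1] = -2.0 * parameters[0];
    }

    // Multiplier proposal: new = old * exp(SPAN * (u - 0.5)), u uniform on [0, 1).
    // For this proposal the log ratio of the back-proposal and proposal
    // densities equals log(new / old), which is the exponent itself.
    double sampleParameter() {
        oldParameter = parameters[0];
        double logMultiplier = SPAN * (random.nextDouble() - 0.5);
        parameters[0] = oldParameter * Math.exp(logMultiplier);
        updateDerived();
        return logMultiplier;
    }

    // Undo the last proposal and bring the derived arrays back in sync.
    void restoreParameter() {
        parameters[0] = oldParameter;
        updateDerived();
    }
}
```

A symmetric proposal would simply return 0 from sampleParameter(); the multiplier proposal is used here only because it makes the returned log ratio non-trivial.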
Developing additional postprocessing panels
A novel postprocessing class has to extend the abstract class Postprocess. Once such a descendant class has been developed, its compiled class should be copied into the statalign.postprocess.plugins sub-package. StatAlign automatically recognizes any novel postprocessing method. Each recognized method gets a tabbed panel, which is added to the main frame. Recognized methods can also be selected to generate their own output file in the Output preferences pop-up window.
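As a rough orientation, the life cycle of a postprocessing plugin can be sketched with a stand-alone example. Because the real Postprocess class is not available outside StatAlign, the sketch defines a reduced stand-in, PostprocessStandIn, that declares only some of the callbacks named on this page; the actual signatures, visibility modifiers and abstract methods of Postprocess may differ. The plugin SampleCounter and its log list are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for StatAlign's Postprocess class, reduced to a few
// of the callbacks named on this page so that the sketch compiles on its own.
abstract class PostprocessStandIn {
    abstract String getTabName();
    void beforeFirstSample() { }
    void newSample(int no, int total) { }
    void afterLastSample() { }
}

// Hypothetical plugin that counts the samples it has seen.
class SampleCounter extends PostprocessStandIn {
    final List<String> log = new ArrayList<String>();
    int samplesSeen;

    @Override
    String getTabName() {
        return "Sample counter";
    }

    @Override
    void beforeFirstSample() {
        // Reset state once, before the MCMC run starts.
        samplesSeen = 0;
        log.add("start");
    }

    @Override
    void newSample(int no, int total) {
        // Called once per sample drawn from the Markov chain.
        samplesSeen++;
        log.add("sample " + no + " of " + total);
    }

    @Override
    void afterLastSample() {
        // Summarise once the run has finished.
        log.add("done after " + samplesSeen + " samples");
    }
}
```

In a real plugin the class would extend Postprocess and live in the statalign.postprocess.plugins sub-package, and the callbacks would be invoked by StatAlign rather than by your own code.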
boolean selected
True if plugin is selected in the menu (and thus a tab is created for the plugin in
the main window that can be used to allow the user to change settings before MCMC
start and to show runtime information afterwards).
boolean screenable
True if it can generate a GUI. Not used in the current version; it is
reserved for future development in case one wants to switch the GUIs on and off.
boolean active
True if plugin is active (must produce its output) either because other plugins depend
on it or because it is selected.
boolean outputable
True if this class can generate an output.
boolean postprocessable
True if this class can generate a postprocess file.
boolean sampling
True if it writes into the log file.
boolean postprocessWrite
True if it writes a postprocess file.
String alignmentType
This string specifies the alignment type in which the alignment must be presented.
FileWriter file
This is the log file writer. It is written to during the run and receives information from
all postprocess plugins.
FileWriter outputFile
This is the output file writer, that is written by a specific postprocess plugin.
Mcmc mcmc
This is the current Mcmc object. Use it to access the current state of the MCMC
via mcmc.tree.
Description of the methods in Postprocess
String getTabName()
This function returns the name of the method. This will be
written on the tab of its panel.
Icon getIcon()
This function returns the icon of the method. This will
appear on the tab of its panel.
JPanel getJPanel()
Returns the panel of the GUI. If necessary, developers can
create their plugin's own GUI; we suggest that such a GUI be put into
the statalign.postprocess.gui subpackage.
String getTip()
Returns the tip information (shown when the mouse cursor is moved over the
label of the tabbed panel).
String[] getDependences()
Override this and return an array of fully qualified class names of the plugins
this plugin depends on.
void refToDependences(Postprocess[] plugins)
Override this to get access to instances of the plugins your plugin depends on.
This function will be called by the PostprocessManager during its initialisation.
The parameter plugins is an array of references to Postprocess objects in the order
they are specified in getDependences(), or null if getDependences() returns null.
void beforeFirstSample()
Called before the MCMC starts. This is the first time the postprocess plugin can
use PostprocessManager.mcmc to access internal data structures.
void newStep()
Called whenever a new step is made. A typical MCMC run takes hundreds of thousands of
steps, so override this function only if it takes a negligible amount of time and does not
use too much memory. We use it, for example, for drawing the log-likelihood trace.
void newSample(int no, int total)
This function is called when we sample from the Markov chain.
Parameter no is the number of the current sample, total is the
total number of samples.
void setSampling(boolean enabled)
This function switches the sampling mode on or off. The parameter
enabled is set to true if samples are needed.
void afterLastSample()
This function is called after the MCMC run has finished.
Examples
The MpdAlignment plugin is screenable, outputable and
postprocessable. The CurrentAlignment plugin is
screenable and outputable, but not postprocessable.
The CurrentAlignment and MpdAlignment plugins
use the same GUI, the AlignmentGUI class.
The LogLikelihoodTrace plugin uses a LogLikelihoodTraceContainer
class to store information about log-likelihoods and whether each
log-likelihood value comes from the burn-in phase or from after burn-in.
The Mcmc class has a public boolean variable burnin, which is true if the
MCMC run is in the burn-in phase.
If you would like to develop a plugin that predicts secondary structures by
mapping one known structure to other sequences, you can create a plugin that
depends on the CurrentAlignment plugin. CurrentAlignment
has two public String arrays, allAlignment and leafAlignment, from which you
can read the multiple alignment of all sequences (both at the leaves of the
tree and at the internal nodes) and the multiple alignment of the sequences
at the leaves.
You can use the beforeFirstSample() function to interact with the users and
ask for an additional file from which the known secondary structures might
be read.
Javadocs and source code
You can get further information about the structure of StatAlign's code and
the description of classes, methods and variables at StatAlign's Javadoc page.
To browse or download the complete source code, please visit StatAlign's GitHub page.