Minimum Message Length (MML) based aligner of protein (amino acid) sequences
Reference: D. Sumanaweera, L. Allison, & A. S. Konagurthu, Statistical compression of protein sequences and inference of marginal probability landscapes over competing alignments using finite state models and Dirichlet priors, Bioinformatics 35(14): i360-i369 (2019) [DOI link] [PDF]
(News: Dinithi won the ISCB's 2019 Ian Lawson Van Toch Memorial Award for Outstanding Student Paper [photo])

seqMMLigner Program


Source Code (version 1.2)

(md5sum: db2dd4753d7445842b20e2bcd80a3728 )
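The published checksum can be compared against a downloaded archive before building. The following is a minimal sketch, assuming the archive is saved as seqmmligner_1.2.tgz (the name used in the build instructions on this page) and that the md5sum utility is available; the helper name verify_md5 is hypothetical.

```shell
# Expected checksum as published on this page.
expected="db2dd4753d7445842b20e2bcd80a3728"

# verify_md5 FILE EXPECTED (hypothetical helper):
# succeeds only if the MD5 digest of FILE equals EXPECTED.
verify_md5() {
    actual=$(md5sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Assumed archive name; adjust it to match the downloaded file.
archive="seqmmligner_1.2.tgz"
if [ -f "$archive" ]; then
    if verify_md5 "$archive" "$expected"; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH" >&2
    fi
fi
```

A mismatch usually indicates an incomplete download or a different version of the archive.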
Instructions for running seqMMLigner

seqmmligner [sequence-1 fasta file] [sequence-2 fasta] [OPTION]...
Align two amino acid sequences, each given as a FASTA file (default operation).
--criterion [optimal/marginal/marginal-interactive]
optimal: (default) finds the optimal alignment (and infers the associated parameters) under the information-theoretic measure (i.e., finds the alignment that maximizes the joint probability of the alignment and the two input sequences)
marginal: finds the marginal probability (and infers the associated parameters) that the two input sequences are related, under the information-theoretic measure
marginal-interactive: same as marginal, but allows interactive exploration of the marginal probability landscape and probing of competing sequence alignments

--ivalue [SEQUENCE_ALIGNMENT_FILE]
Changes the default operation: no alignment is computed. Instead, the given SEQUENCE_ALIGNMENT_FILE (in FASTA format) is scored using the I-value measure.
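The relationship between the optimal and marginal criteria can be sketched as follows. The notation is an assumption made here for illustration: S and T denote the two input sequences, A an alignment of them, and I a message length in bits. The sketch assumes the standard Shannon correspondence between message length and probability, and that the marginal is obtained by summing the joint probability over all competing alignments; the inferred parameters are left implicit.

```latex
\begin{align*}
I(A, S, T) &= -\log_2 \Pr(A, S, T) \\
A^{*} &= \arg\max_{A} \Pr(A, S, T) = \arg\min_{A} I(A, S, T) \\
\Pr(S, T) &= \sum_{A} \Pr(A, S, T)
\end{align*}
```

Under these assumptions, the optimal criterion reports a single best alignment, while the marginal criterion accounts for every alignment that could relate the two sequences.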

Some example command line runs:
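Generating an optimal alignment using seqMMLigner:
./seqmmligner seq1.fa seq2.fa
Computing the marginal probability that the two sequences are related:
./seqmmligner seq1.fa seq2.fa --criterion marginal
Scoring an existing alignment using the I-value measure:
./seqmmligner --ivalue alignment.afasta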
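To try the first two runs end to end, the inputs can be prepared as two single-sequence FASTA files. The following is a minimal sketch: the two sequences are made-up placeholders rather than real proteins, and the script assumes the seqmmligner binary has been copied into the current directory. What the program prints is not shown here.

```shell
# Hypothetical example inputs: one amino acid sequence per FASTA file.
cat > seq1.fa <<'EOF'
>seq1 hypothetical example sequence
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
EOF

cat > seq2.fa <<'EOF'
>seq2 hypothetical example sequence
MKTAYIAKQRNISFVKAHFSRQLEDRLGLIEV
EOF

# Run only if the binary is present in the current directory (assumption).
if [ -x ./seqmmligner ]; then
    ./seqmmligner seq1.fa seq2.fa
    ./seqmmligner seq1.fa seq2.fa --criterion marginal
else
    echo "seqmmligner not found in the current directory; build it first" >&2
fi
```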
Instructions for building seqMMLigner (v1.2)

Dependencies: GNU Make (or equivalent) and a modern C++ compiler. seqMMLigner is known to build with g++ (GCC) >= 4.1.2. If these dependencies are met, follow these instructions:

  1. Download the source code from the link above.
  2. Extract the archive with: tar -zxf seqmmligner_1.2.tgz
  3. Type: cd seqmmligner_1.2/
  4. Build seqMMLigner with: make
  5. The built binary, seqmmligner, will appear in the bin/ subdirectory.
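Steps 2 to 5 above can be collected into one small script. This is a sketch only, assuming the archive has already been downloaded into the current directory under the name used in step 2; the function name build_seqmmligner is hypothetical.

```shell
# build_seqmmligner (hypothetical helper): extract, build, and locate the binary.
build_seqmmligner() {
    archive="seqmmligner_1.2.tgz"
    if [ ! -f "$archive" ]; then
        echo "missing $archive; download it first" >&2
        return 1
    fi
    # Extract the archive, build inside the source directory,
    # then list the binary expected in the bin/ subdirectory.
    tar -zxf "$archive" &&
    (cd seqmmligner_1.2/ && make) &&
    ls -l seqmmligner_1.2/bin/seqmmligner
}

build_seqmmligner || echo "build did not complete" >&2
```

Each step runs only if the previous one succeeded, so a failed extraction or compilation stops the script before the final listing.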
Copyright license

seqMMLigner is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

seqMMLigner is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with seqMMLigner.  If not, see
Bug reports

Please contact the following people for bug reports, web page errors, or questions:
  • Arun Konagurthu <arun DOT konagurthu AT monash DOT edu>
  • Dinithi Sumanaweera <dinithi DOT sumanaweera AT monash DOT edu>

Supplementary Material

Main and supplementary PDFs

main+sm (combined PDF)
Supporting data:

SCOP domain sequence pairs used to infer Dirichlet priors: (click here)

Distribution of #alignments vs sequence-distance parameter (i.e. n of PAM-n): (click here)

Inferred parameters of the Dirichlet priors for sequence-distance parameter n in [1,1000]: (click here)

A selection of marginal probability landscapes: (click here)

Benchmark statistics across the programs seqMMLigner, ClustalW, CONTRAlign, KAlign, MAFFT, MUSCLE, ProbCons, T-Coffee:
  • Human fungal mitochondrial proteins (remote ortholog) data set: (click here)

  • SABMark "Twilight" zone (twi) data set: (click here)