supermatcher

 

Function

Finds a match of a large sequence against one or more sequences

Description

This is a rough-and-ready local alignment program for large sequences. It first uses wordmatch to find all the exact word matches between the first sequence and a second sequence. It then scores the diagonals on which those matches lie and takes the highest-scoring diagonal as the centre of a Smith-Waterman type calculation, restricted to a band whose width is given by the user. Because only this narrow band around the diagonal is calculated, the results are approximate, but the saving in space allows much larger sequences to be aligned.
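The two stages described above can be illustrated with a minimal Python sketch. It is not the supermatcher source code. The function names best_diagonal and banded_sw_score are hypothetical, diagonals are ranked simply by counting exact word matches, and a single linear gap penalty with made-up match/mismatch scores stands in for the scoring matrix and the separate gap opening and extension penalties that the program takes.

```python
from collections import defaultdict


def best_diagonal(a, b, wordlen):
    """Return the diagonal (i - j) holding the most exact word matches, or None."""
    # Index every word of length `wordlen` in b by its start position.
    index = defaultdict(list)
    for j in range(len(b) - wordlen + 1):
        index[b[j:j + wordlen]].append(j)
    # Every word shared by a and b votes for the diagonal it lies on.
    votes = defaultdict(int)
    for i in range(len(a) - wordlen + 1):
        for j in index.get(a[i:i + wordlen], ()):
            votes[i - j] += 1
    if not votes:
        return None  # no word matches, so there is no starting point
    return max(votes, key=votes.get)


def banded_sw_score(a, b, diagonal, width, match=5, mismatch=-4, gap=-3):
    """Best local alignment score using only cells within `width` of `diagonal`.

    Assumption: a linear gap penalty and fixed match/mismatch scores,
    chosen only to keep the illustration short.
    """
    best = 0
    prev = {}  # scores of the previous row, keyed by column j (1-based)
    for i in range(1, len(a) + 1):
        cur = {}
        centre = i - diagonal  # column on the chosen diagonal in this row
        lo = max(1, centre - width)
        hi = min(len(b), centre + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,    # extend along the diagonal
                        prev.get(j, 0) + gap,      # gap in b
                        cur.get(j - 1, 0) + gap)   # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur  # only two rows of the band are held at any time
    return best


if __name__ == "__main__":
    seq_a = "ACGTACGTTTGACCA"
    seq_b = "GGG" + seq_a
    d = best_diagonal(seq_a, seq_b, wordlen=4)
    print(d, banded_sw_score(seq_a, seq_b, d, width=2))
```

Cells outside the band are treated as having a score of zero, which in a local alignment is the same as starting a new alignment there. This sketch returns only a score; producing the alignment itself would require keeping traceback information for every cell in the band.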

Usage

Here is a sample session with supermatcher.

% supermatcher tembl:ec\* tembl:eclac -word 50 -sbegin2 101 -send2 -101
Finds a match of a large sequence against one or more sequences
Gap opening penalty [10.0]: 3.0
Gap extension penalty [0.5]:
Output alignment [eclac.supermatcher]:

Command line arguments

   Mandatory qualifiers:
  [-seqa]              seqall     Sequence database USA
  [-seqb]              seqset     Sequence set USA
   -gapopen            float      Gap opening penalty
   -gapextend          float      Gap extension penalty
   -outfile            align      Output alignment file name

   Optional qualifiers:
   -datafile           matrixf    This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.
   -width              integer    Alignment width
   -wordlen            integer    word length for initial matching
   -errorfile          outfile    Error file to be written to

   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers   Description                    Allowed values                 Default
[-seqa]                Sequence database USA          Readable sequence(s)           Required
(Parameter 1)
[-seqb]                Sequence set USA               Readable sequences             Required
(Parameter 2)
-gapopen               Gap opening penalty            Number from 1.000 to 100.000   10.0 for any sequence type
-gapextend             Gap extension penalty          Number from 0.100 to 10.000    0.5 for any sequence type
-outfile               Output alignment file name     Alignment file

Optional qualifiers    Description                    Allowed values                 Default
-datafile              This is the scoring matrix     Comparison matrix file in      EBLOSUM62 for protein
                       file used when comparing       EMBOSS data path               EDNAFULL for DNA
                       sequences. By default it is
                       the file 'EBLOSUM62' (for
                       proteins) or the file
                       'EDNAFULL' (for nucleic
                       sequences). These files are
                       found in the 'data' directory
                       of the EMBOSS installation.
-width                 Alignment width                Any integer value              16
-wordlen               Word length for initial        Integer 3 or more              6
                       matching
-errorfile             Error file to be written to    Output file                    supermatcher.error

Advanced qualifiers    Description                    Allowed values                 Default
(none)

Input file format

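Two sequence USAs (Uniform Sequence Addresses). The first may refer to one or more sequences, and the second refers to a set of sequences.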

Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/AlignFormats.html for further information on alignment formats.

The output from the example follows:

########################################
# Program:  supermatcher
# Rundate:  Mon May 20 16:32:00 2002
# Report_file: eclac.supermatcher
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: ECLAC
# 2: ECLAC
# Matrix: EDNAFULL
# Gap_penalty: 3.0
# Extend_penalty: 0.5
#
# Length: 7277
# Identity:    7277/7277 (100.0%)
# Similarity:  7277/7277 (100.0%)
# Gaps:           0/7277 ( 0.0%)
# Score: 36385.0
#
#
#=======================================

ECLAC            101 atgtcgcagagtatgccggtgtctcttatcagaccgtttcccgcgtggtg    150
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            101 atgtcgcagagtatgccggtgtctcttatcagaccgtttcccgcgtggtg    150

ECLAC            151 aaccaggccagccacgtttctgcgaaaacgcgggaaaaagtggaagcggc    200
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            151 aaccaggccagccacgtttctgcgaaaacgcgggaaaaagtggaagcggc    200

ECLAC            201 gatggcggagctgaattacattcccaaccgcgtggcacaacaactggcgg    250
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            201 gatggcggagctgaattacattcccaaccgcgtggcacaacaactggcgg    250

ECLAC            251 gcaaacagtcgttgctgattggcgttgccacctccagtctggccctgcac    300
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            251 gcaaacagtcgttgctgattggcgttgccacctccagtctggccctgcac    300

ECLAC            301 gcgccgtcgcaaattgtcgcggcgattaaatctcgcgccgatcaactggg    350
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            301 gcgccgtcgcaaattgtcgcggcgattaaatctcgcgccgatcaactggg    350

ECLAC            351 tgccagcgtggtggtgtcgatggtagaacgaagcggcgtcgaagcctgta    400
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            351 tgccagcgtggtggtgtcgatggtagaacgaagcggcgtcgaagcctgta    400

ECLAC            401 aagcggcggtgcacaatcttctcgcgcaacgcgtcagtgggctgatcatt    450
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            401 aagcggcggtgcacaatcttctcgcgcaacgcgtcagtgggctgatcatt    450

ECLAC            451 aactatccgctggatgaccaggatgccattgctgtggaagctgcctgcac    500
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            451 aactatccgctggatgaccaggatgccattgctgtggaagctgcctgcac    500

ECLAC            501 taatgttccggcgttatttcttgatgtctctgaccagacacccatcaaca    550
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLAC            501 taatgttccggcgttatttcttgatgtctctgaccagacacccatcaaca    550


...........  etc. .............


The file 'supermatcher.error' will contain any errors that occurred while the program was running. One possible error is that wordmatch could not find any matches, so no suitable starting point was found for the Smith-Waterman calculation.

Data files

For protein sequences, EBLOSUM62 is used as the substitution matrix. For nucleotide sequences, EDNAFULL is used. Other matrices can be specified with -datafile.

Notes

The time this program takes to do an alignment depends very much on the word size. For short sequences a short word size (e.g. 4) can make it take a very long time. Large word sizes (e.g. 30) for sequences that are very similar give a very quick result. The default word length of 6 should give reasonably fast alignments.
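One plausible reason for this dependence is that shorter words produce many more exact word matches to process. The following Python sketch counts such matches for two word lengths. The function name count_word_hits is hypothetical and the counting is an illustration only, not a measurement of supermatcher itself.

```python
from collections import Counter


def count_word_hits(a, b, wordlen):
    """Count the pairs (i, j) for which a[i:i+wordlen] == b[j:j+wordlen]."""
    # Tally how often each word of length `wordlen` occurs in b.
    words_b = Counter(b[j:j + wordlen] for j in range(len(b) - wordlen + 1))
    # Each word in a matches every occurrence of the same word in b.
    return sum(words_b[a[i:i + wordlen]] for i in range(len(a) - wordlen + 1))


if __name__ == "__main__":
    seq = "ACGT" * 5  # a short, repetitive made-up sequence
    for wordlen in (4, 8):
        print(wordlen, count_word_hits(seq, seq, wordlen))
```

On repetitive or low-complexity sequences the number of hits grows quickly as the word length shrinks, which is consistent with the advice to use a larger word size for very similar sequences.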

Because it does a Smith & Waterman alignment (albeit in a narrow region around the diagonal shown to be the 'best' by a word match), this program can use huge amounts of memory if the sequences are large.

Because the alignment is made within a narrow area each side of the 'best' diagonal, if there are sufficient indels between the two sequences, then the path of the Smith & Waterman alignment can wander outside of this area. Making the width larger can avoid this problem, but you then use more memory.

The longer the sequences and the wider the specified alignment width, the more memory will be used.
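As a rough illustration of how sequence length and width interact, the Python sketch below counts the matrix cells that lie within a given width of the main diagonal. The function name band_cells is hypothetical, and the count is an assumption about what a banded calculation has to visit, not a statement of how much memory supermatcher allocates.

```python
def band_cells(len_a, len_b, width):
    """Count cells within `width` of the main diagonal of a len_a x len_b matrix."""
    total = 0
    for i in range(len_a):
        lo = max(0, i - width)          # first column inside the band
        hi = min(len_b - 1, i + width)  # last column inside the band
        if hi >= lo:
            total += hi - lo + 1
    return total


if __name__ == "__main__":
    n = 7277  # alignment length taken from the example output
    for width in (16, 160):
        print(width, band_cells(n, n, width), n * n)
```

The band grows roughly in proportion to the sequence length multiplied by (2 x width + 1), whereas the full matrix grows with the product of the two lengths.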

If the program terminates due to lack of memory you can try the following:

Run the UNIX command 'limit' to see whether your stack size or memory usage has been limited and, if so, run 'unlimit' (e.g. '% unlimit stacksize'). These are csh/tcsh commands; in sh or bash the equivalent is 'ulimit'.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name   Description
matcher        Finds the best local alignments between two sequences
seqmatchall    Does an all-against-all comparison of a set of sequences
water          Smith-Waterman local alignment
wordmatch      Finds all exact matches of a given size between 2 sequences

Author(s)

This application was written by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

Finished.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments