matcher

 

Function

Finds the best local alignments between two sequences

Description

matcher compares two sequences looking for local sequence similarities using a rigorous algorithm.

matcher is based on Bill Pearson's 'lalign' application, version 2.0u4 (Feb. 1996).

lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program, which is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. (1987) 197:723-728).

Like water, matcher is rigorous, but also very slow. The advantage of matcher is that it uses far less memory than water, so you are much less likely to run out of memory when aligning large sequences.

matcher can also report a specified number of alternative local alignments between the two sequences, whereas water reports only the single best match. By default one alignment is output; this can be increased to, for example, the 10 best alignments with the '-alternatives 10' command-line qualifier. In some cases, for example multidomain proteins or comparisons of cDNA with genomic DNA, there may be many interesting and significant alignments.
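
The memory difference between a full-matrix and a linear-space approach can be illustrated with a minimal Python sketch. This is not matcher's code and not the Huang and Miller "sim" algorithm, which also recovers the alignments themselves; it only computes the best local score while keeping two rows of the scoring matrix instead of the whole matrix. The function name best_local_score and the simple match/mismatch scores are assumptions made for the illustration, and a gap of k positions is assumed to cost gap_open + gap_extend * k.

```python
def best_local_score(a, b, match=5, mismatch=-4, gap_open=16, gap_extend=4):
    """Best local alignment score, using memory proportional to len(b).

    Illustrative only: a flat match/mismatch score replaces a real
    comparison matrix, and no alignment is reconstructed.
    """
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)   # best scores for the previous row of the matrix
    f = [neg] * (n + 1)      # best scores ending in a gap that runs down a column
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        e = neg              # best score ending in a gap that runs along the row
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor of 0.
            h = max(0, h_prev[j - 1] + s, e, f[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev = h_cur       # only the latest row is kept
    return best


print(best_local_score("ACGT", "ACGT"))  # 20: four matches at 5 each
```

A full-matrix implementation would store (len(a) + 1) * (len(b) + 1) cells; the sketch stores only a few arrays of length len(b) + 1, which is the kind of saving that makes large comparisons feasible.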

Usage

Here is a sample session with matcher.

% matcher tsw:hba_human tsw:hbb_human
Finds the best local alignments between two sequences
Output file [hba_human.matcher]: 

Here is an example to find the 10 best alignments:

% matcher tsw:hba_human tsw:hbb_human -alt 10
Finds the best local alignments between two sequences
Output file [hba_human.matcher]: hba_human.matcher10
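
For scripted use, the same command lines can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything; the helper name build_matcher_command is hypothetical, and it uses only qualifiers named on this page (-outfile, -alternatives, -gappenalty, -gaplength, -datafile, -aformat).

```python
def build_matcher_command(seq_a, seq_b, outfile, alternatives=None,
                          gappenalty=None, gaplength=None,
                          datafile=None, aformat=None):
    """Return an argument list for one matcher run (nothing is executed)."""
    if alternatives is not None and alternatives < 1:
        raise ValueError("-alternatives must be an integer of 1 or more")
    # The two sequences are given positionally, as in the sample session.
    cmd = ["matcher", seq_a, seq_b, "-outfile", outfile]
    optional = [
        ("-alternatives", alternatives),
        ("-gappenalty", gappenalty),
        ("-gaplength", gaplength),
        ("-datafile", datafile),
        ("-aformat", aformat),
    ]
    # Qualifiers left as None are omitted so matcher applies its defaults.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_matcher_command("tsw:hba_human", "tsw:hbb_human",
                            "hba_human.matcher10", alternatives=10)
print(" ".join(cmd))
```

On a machine where EMBOSS is installed and the tsw database is configured, the returned list could be passed to subprocess.run(cmd, check=True).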

Command line arguments

   Mandatory qualifiers:
  [-sequencea]         sequence   Sequence USA
  [-sequenceb]         sequence   Sequence USA
  [-outfile]           align      Output alignment file name

   Optional qualifiers:
   -datafile           matrix     This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.
   -alternatives       integer    This sets the number of alternative matches
                                  output. By default only the highest scoring
                                  alignment is shown. A value of 2 gives you
                                  other reasonable alignments. In some cases,
                                  for example multidomain proteins or cDNA and
                                  genomic DNA comparisons, there may be other
                                  interesting and significant alignments.
   -gappenalty         integer    The gap penalty is the score taken away when
                                  a gap is created. The best value depends on
                                  the choice of comparison matrix. The
                                  default value of 14 assumes you are using
                                  the EBLOSUM62 matrix for protein sequences,
                                  or a value of 16 and the EDNAFULL matrix for
                                  nucleotide sequences.
   -gaplength          integer    The gap length, or gap extension, penalty is
                                  added to the standard gap penalty for each
                                  base or residue in the gap. This is how long
                                  gaps are penalized. Usually you will expect
                                  a few long gaps rather than many short
                                  gaps, so the gap extension penalty should be
                                  lower than the gap penalty. An exception is
                                  where one or both sequences are single
                                  reads with possible sequencing errors in
                                  which case you would expect many single base
                                  gaps. You can get this result by setting
                                  the gap penalty to zero (or very low) and
                                  using the gap extension penalty to control
                                  gap scoring.

   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers:
  [-sequencea] (Parameter 1)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required
  [-sequenceb] (Parameter 2)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required
  [-outfile] (Parameter 3)
      Description:    Output alignment file name
      Allowed values: Alignment file

Optional qualifiers:
  -datafile
      Description:    This is the scoring matrix file used when comparing
                      sequences. By default it is the file 'EBLOSUM62' (for
                      proteins) or the file 'EDNAFULL' (for nucleic
                      sequences). These files are found in the 'data'
                      directory of the EMBOSS installation.
      Allowed values: Comparison matrix file in EMBOSS data path
      Default:        EBLOSUM62 for protein, EDNAFULL for DNA
  -alternatives
      Description:    This sets the number of alternative matches output. By
                      default only the highest scoring alignment is shown. A
                      value of 2 gives you other reasonable alignments. In
                      some cases, for example multidomain proteins or cDNA
                      and genomic DNA comparisons, there may be other
                      interesting and significant alignments.
      Allowed values: Integer 1 or more
      Default:        1
  -gappenalty
      Description:    The gap penalty is the score taken away when a gap is
                      created. The best value depends on the choice of
                      comparison matrix. The default value of 14 assumes you
                      are using the EBLOSUM62 matrix for protein sequences,
                      or a value of 16 and the EDNAFULL matrix for nucleotide
                      sequences.
      Allowed values: Positive integer
      Default:        14 for protein, 16 for nucleic
  -gaplength
      Description:    The gap length, or gap extension, penalty is added to
                      the standard gap penalty for each base or residue in
                      the gap. This is how long gaps are penalized. Usually
                      you will expect a few long gaps rather than many short
                      gaps, so the gap extension penalty should be lower than
                      the gap penalty. An exception is where one or both
                      sequences are single reads with possible sequencing
                      errors in which case you would expect many single base
                      gaps. You can get this result by setting the gap
                      penalty to zero (or very low) and using the gap
                      extension penalty to control gap scoring.
      Allowed values: Positive integer
      Default:        4 for any sequence

Advanced qualifiers: (none)
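
The interplay of -gappenalty and -gaplength can be made concrete with a little arithmetic. The Python sketch below assumes, following the description of -gaplength, that a gap covering n positions costs gappenalty + n * gaplength; the function name gap_cost is hypothetical and the formula is a reading of the text, not a statement about matcher's source code.

```python
def gap_cost(length, gappenalty=14, gaplength=4):
    """Score deducted for one gap of `length` positions (protein defaults)."""
    if length < 1:
        raise ValueError("a gap must cover at least one position")
    return gappenalty + gaplength * length


# With the protein defaults, one long gap is much cheaper than many short ones.
print(gap_cost(5), 5 * gap_cost(1))                              # 34 90

# With the gap penalty set to zero, only the total gap length matters.
print(gap_cost(5, gappenalty=0), 5 * gap_cost(1, gappenalty=0))  # 20 20
```

The second case corresponds to the single-read scenario in the -gaplength description, where many single-base gaps should not be penalized more heavily than one long gap.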

Input file format

Any two sequence USAs of the same type (DNA or protein).

Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/AlignFormats.html for further information on alignment formats.

Here is the output for the example:


########################################
# Program:  matcher
# Rundate:  Mon May 20 17:12:29 2002
# Report_file: hba_human.matcher
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 145
# Identity:      63/145 (43.4%)
# Similarity:    88/145 (60.7%)
# Gaps:           8/145 ( 5.5%)
# Score: 264
#
#
#=======================================


              10        20        30        40         50
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
       :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
             10          20        30        40        50

                    60        70        80        90
HBA_HU -----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
            :. .::.:::::  :.....::.:.. .....::.::. ::.:::
HBB_HU PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
               60        70        80        90       100

         100       110       120       130       140
HBA_HU VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
        ::.::.. :. .:: :.  :::: :.:. .: .:.:...:. ::
HBB_HU ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
              110       120       130       140

#---------------------------------------
#--------------------------------------- 

Here is the output for the example giving the 10 best alignments:


########################################
# Program:  matcher
# Rundate:  Mon May 20 17:12:40 2002
# Report_file: hba_human.matcher10
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 145
# Identity:      63/145 (43.4%)
# Similarity:    88/145 (60.7%)
# Gaps:           8/145 ( 5.5%)
# Score: 264
#
#
#=======================================


              10        20        30        40         50
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
       :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
             10          20        30        40        50

                    60        70        80        90
HBA_HU -----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
            :. .::.:::::  :.....::.:.. .....::.::. ::.:::
HBB_HU PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
               60        70        80        90       100

         100       110       120       130       140
HBA_HU VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
        ::.::.. :. .:: :.  :::: :.:. .: .:.:...:. ::
HBB_HU ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
              110       120       130       140
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 13
# Identity:       6/13 (46.2%)
# Similarity:     9/13 (69.2%)
# Gaps:           0/13 ( 0.0%)
# Score: 32
#
#
#=======================================


      60        70
HBA_HU KKVADALTNAVAH
       .::. ...::.::
HBB_HU QKVVAGVANALAH
              140
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14       
# Extend_penalty: 4
#
# Length: 18
# Identity:       7/18 (38.9%)
# Similarity:    10/18 (55.6%)
# Gaps:           0/18 ( 0.0%)
# Score: 28
#
#
#=======================================


      90       100
HBA_HU KLRVDPVNFKLLSHCLLV
       :..:: :. . :.. :.:
HBB_HU KVNVDEVGGEALGRLLVV
         20        30
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 10
# Identity:       6/10 (60.0%)
# Similarity:     8/10 (80.0%)
# Gaps:           0/10 ( 0.0%)
# Score: 23
#
#
#=======================================


      10
HBA_HU VKAAWGKVGA
       :.::. :: :
HBB_HU VQAAYQKVVA
         130
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 10
# Identity:       6/10 (60.0%)
# Similarity:     6/10 (60.0%)
# Gaps:           0/10 ( 0.0%)
# Score: 23
#
#
#=======================================


      80
HBA_HU LSALSDLHAH
       :.:.::  ::
HBB_HU LGAFSDGLAH
        70
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#          
# Length: 17
# Identity:       5/17 (29.4%)
# Similarity:     9/17 (52.9%)
# Gaps:           0/17 ( 0.0%)
# Score: 21
#
#
#=======================================


         80        90
HBA_HU PNALSALSDLHAHKLRV
       :.:. .   . ::  .:
HBB_HU PDAVMGNPKVKAHGKKV
               60
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 12
# Identity:       5/12 (41.7%)
# Similarity:     6/12 (50.0%)
# Gaps:           0/12 ( 0.0%)
# Score: 21
#
#
#=======================================


      30        40
HBA_HU ERMFLSFPTTKT
       .:.: ::   .:
HBB_HU QRFFESFGDLST
       40        50
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 8
# Identity:       4/8 (50.0%)
# Similarity:     6/8 (75.0%)
# Gaps:           0/8 ( 0.0%)
# Score: 20
#
#
#=======================================


          110
HBA_HU LLVTLAAH
       .:: . ::
HBB_HU VLVCVLAH
      110
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 12
# Identity:       6/12 (50.0%) 
# Similarity:     6/12 (50.0%)
# Gaps:           0/12 ( 0.0%)
# Score: 20
#
#
#=======================================


             120
HBA_HU HLPAEFTPAVHA
       ::  :   :: :
HBB_HU HLTPEEKSAVTA
              10
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 21
# Identity:       6/21 (28.6%)
# Similarity:     7/21 (33.3%)
# Gaps:           0/21 ( 0.0%)
# Score: 19
#
#
#=======================================


            10        20
HBA_HU PADKTNVKAAWGKVGAHAGEY
       :.. .  :.. : ..: : .:
HBB_HU PVQAAYQKVVAGVANALAHKY
          130       140

#---------------------------------------
#--------------------------------------- 
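
The summary lines in each alignment header are regular enough to be read by a script. The following Python sketch collects Length, Identity, Similarity, Gaps and Score for every alignment in a report. It assumes the default layout shown in the examples above (other -aformat styles would need different handling), and the name parse_matcher_report is hypothetical.

```python
import re

# Matches header lines such as "# Identity:      63/145 (43.4%)".
FIELD = re.compile(r"^#\s*(Length|Identity|Similarity|Gaps|Score):\s*(.+?)\s*$")


def parse_matcher_report(text):
    """Return one dict of summary values per alignment in the report."""
    alignments = []
    current = None
    for line in text.splitlines():
        if line.startswith("# Aligned_sequences:"):
            current = {}                 # a new alignment header begins
            alignments.append(current)
            continue
        m = FIELD.match(line)
        if m is None or current is None:
            continue
        key, value = m.group(1).lower(), m.group(2)
        if key in ("length", "score"):
            current[key] = int(value)    # integers in the sample output
        else:
            num, den = re.match(r"(\d+)/(\d+)", value).groups()
            current[key] = (int(num), int(den))
    return alignments


sample = """\
# Aligned_sequences: 2
# Length: 145
# Identity:      63/145 (43.4%)
# Similarity:    88/145 (60.7%)
# Gaps:           8/145 ( 5.5%)
# Score: 264
# Aligned_sequences: 2
# Length: 13
# Identity:       6/13 (46.2%)
# Similarity:     9/13 (69.2%)
# Gaps:           0/13 ( 0.0%)
# Score: 32
"""

for aln in parse_matcher_report(sample):
    print(aln["score"], aln["identity"])
```

Sorting or filtering the returned list by score is then a one-line operation, which is convenient when -alternatives produces many short, low-scoring matches.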

Data files

For protein sequences, EBLOSUM62 is used as the substitution matrix. For nucleotide sequences, EDNAFULL is used.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

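The directories are searched in the following order:

  . (the current directory)
  .embossdata (under the current directory)
  ~/ (the user's home directory)
  ~/.embossdata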
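
The lookup described in this section can be sketched in Python as follows. The sketch assumes the user locations are tried in the order in which the text names them (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory), with the directory given by EMBOSS_DATA as a final fallback; that last step and the function names are assumptions made for illustration, not a description of the EMBOSS library code.

```python
from pathlib import Path


def candidate_paths(name, cwd, home, emboss_data=None):
    """List the places a data file would be looked for, in order."""
    cwd, home = Path(cwd), Path(home)
    dirs = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    if emboss_data is not None:
        # Assumed fallback: the standard EMBOSS data directory.
        dirs.append(Path(emboss_data))
    return [d / name for d in dirs]


def find_data_file(name, cwd, home, emboss_data=None):
    """Return the first existing candidate, or None if there is none."""
    for path in candidate_paths(name, cwd, home, emboss_data):
        if path.is_file():
            return path
    return None


for p in candidate_paths("EBLOSUM62", ".", Path.home()):
    print(p)
```

Because the first match wins, a matrix file placed in the current directory would shadow one of the same name stored in the home directory.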

Notes

None.

References

  1. X. Huang and W. Miller (1991) Adv. Appl. Math. 12:337-357
  2. M. S. Waterman and M. Eggert (1987) J. Mol. Biol. 197:723-728

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 upon successful completion.

Known bugs

None.

See also

Program name   Description
seqmatchall    Does an all-against-all comparison of a set of sequences
supermatcher   Finds a match of a large sequence against one or more sequences
water          Smith-Waterman local alignment
wordmatch      Finds all exact matches of a given size between 2 sequences

water gives a single best rigorous local alignment, using memory of the order of the product of the lengths of the sequences to be aligned. If you want only the 'best' local alignment, use water. If you run out of memory or want several possible good alignments, use matcher.

Author(s)

This program was originally written by Bill Pearson as part of the FASTA package under the name 'lalign'.

This application was modified for inclusion in EMBOSS by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

 Completed 11th May 1999.
 Last modified 19th July 1999.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments