matcher
matcher is based on Bill Pearson's 'lalign' application, version 2.0u4 (Feb. 1996).
Lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program, which is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. 197:723-728).
Like water, matcher is rigorous, but also very slow. The advantage of matcher is that it uses far less memory than water, so you are much less likely to run out of memory when aligning large sequences.
matcher can also report a specified number of local alignments between the two sequences. (water reports only the single best match.) By default one alignment is output; to report, for example, the 10 best alignments, use the '-alternatives 10' command-line qualifier. In some cases, such as multidomain proteins or comparisons of cDNA with genomic DNA, there may be many interesting and significant alignments.
    % matcher tsw:hba_human tsw:hbb_human
    Finds the best local alignments between two sequences
    Output file [hba_human.matcher]:
Here is an example to find the 10 best alignments:
    % matcher tsw:hba_human tsw:hbb_human -alt 10
    Finds the best local alignments between two sequences
    Output file [hba_human.matcher]: hba_human.matcher10
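The two sessions above can also be driven from a script. The following is a minimal sketch, not part of EMBOSS: the helper name `matcher_command` is hypothetical, and it only builds the argument list from the qualifiers documented on this page without running anything.

```python
def matcher_command(seq_a, seq_b, outfile, alternatives=1):
    """Return an argv list for one matcher run (nothing is executed here).

    Hypothetical helper: it uses only the qualifiers listed on this page
    (two sequence USAs as parameters, -outfile, -alternatives).
    """
    if alternatives < 1:
        raise ValueError("-alternatives must be an integer of 1 or more")
    argv = ["matcher", seq_a, seq_b, "-outfile", outfile]
    if alternatives != 1:
        # Only add the qualifier when it differs from the default of 1.
        argv += ["-alternatives", str(alternatives)]
    return argv


cmd = matcher_command("tsw:hba_human", "tsw:hbb_human",
                      "hba_human.matcher10", alternatives=10)
print(" ".join(cmd))
```

To actually run the command, the list could be passed to `subprocess.run(cmd, check=True)`, assuming an EMBOSS installation with matcher on the PATH and the tsw database configured.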
    Mandatory qualifiers:
      [-sequencea]         sequence   Sequence USA
      [-sequenceb]         sequence   Sequence USA
      [-outfile]           align      Output alignment file name

    Optional qualifiers:
       -datafile           matrix     This is the scoring matrix file used when
                                      comparing sequences. By default it is the
                                      file 'EBLOSUM62' (for proteins) or the file
                                      'EDNAFULL' (for nucleic sequences). These
                                      files are found in the 'data' directory of
                                      the EMBOSS installation.
       -alternatives       integer    This sets the number of alternative matches
                                      output. By default only the highest scoring
                                      alignment is shown. A value of 2 gives you
                                      other reasonable alignments. In some cases,
                                      for example multidomain proteins or cDNA
                                      and genomic DNA comparisons, there may be
                                      other interesting and significant
                                      alignments.
       -gappenalty         integer    The gap penalty is the score taken away when
                                      a gap is created. The best value depends on
                                      the choice of comparison matrix. The default
                                      is 14 with the EBLOSUM62 matrix for protein
                                      sequences, or 16 with the EDNAFULL matrix
                                      for nucleotide sequences.
       -gaplength          integer    The gap length, or gap extension, penalty is
                                      added to the standard gap penalty for each
                                      base or residue in the gap. This is how long
                                      gaps are penalized. Usually you will expect
                                      a few long gaps rather than many short gaps,
                                      so the gap extension penalty should be lower
                                      than the gap penalty. An exception is where
                                      one or both sequences are single reads with
                                      possible sequencing errors in which case you
                                      would expect many single base gaps. You can
                                      get this result by setting the gap penalty
                                      to zero (or very low) and using the gap
                                      extension penalty to control gap scoring.

    Advanced qualifiers: (none)

    General qualifiers:
       -help               boolean    Report command line options. More
                                      information on associated and general
                                      qualifiers can be found with -help -verbose
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-sequencea] (Parameter 1) | Sequence USA | Readable sequence | Required |
[-sequenceb] (Parameter 2) | Sequence USA | Readable sequence | Required |
[-outfile] (Parameter 3) | Output alignment file name | Alignment file | |
Optional qualifiers | Description | Allowed values | Default |
-datafile | This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein, EDNAFULL for DNA |
-alternatives | This sets the number of alternative matches output. By default only the highest scoring alignment is shown. A value of 2 gives you other reasonable alignments. In some cases, for example multidomain proteins or cDNA and genomic DNA comparisons, there may be other interesting and significant alignments. | Integer 1 or more | 1 |
-gappenalty | The gap penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default is 14 with the EBLOSUM62 matrix for protein sequences, or 16 with the EDNAFULL matrix for nucleotide sequences. | Positive integer | 14 for protein, 16 for nucleic |
-gaplength | The gap length, or gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Positive integer | 4 for any sequence |
Advanced qualifiers | Description | Allowed values | Default |
(none) | | | |
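To make the interplay of -gappenalty and -gaplength concrete, here is a small sketch. It reads the description above literally, so a gap of n residues is assumed to cost gappenalty + n × gaplength; the function name `gap_cost` is hypothetical and the formula is an interpretation of the text, not code taken from matcher.

```python
def gap_cost(length, gappenalty=14, gaplength=4):
    """Score taken away for one gap of `length` residues.

    Assumption: the extension penalty is charged for every residue in the
    gap, on top of the one-off gap penalty (protein defaults shown).
    """
    if length < 1:
        raise ValueError("a gap contains at least one residue")
    return gappenalty + gaplength * length


# One long gap versus five single-residue gaps, with the protein defaults.
one_long = gap_cost(5)
five_short = 5 * gap_cost(1)
print(one_long, five_short)

# With the gap penalty set to zero, only the extension penalty matters,
# so many single-base gaps cost no more than one long gap.
one_long_reads = gap_cost(5, gappenalty=0)
five_short_reads = 5 * gap_cost(1, gappenalty=0)
print(one_long_reads, five_short_reads)
```

Under this reading the defaults favour a few long gaps (34 against 90), while a zero gap penalty treats both cases alike (20 against 20), which is the behaviour suggested for single reads with sequencing errors.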
The output is a standard EMBOSS alignment file.
The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.
The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs
The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score
See: http://www.uk.embnet.org/Software/EMBOSS/Themes/AlignFormats.html for further information on alignment formats.
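A script that passes -aformat through to matcher may want to check the name first. The sketch below simply encodes the two lists given above; the helper `classify_aformat` is hypothetical, and exact, case-sensitive matching is an assumption.

```python
# Format names copied from the two lists above.
MULTIPLE_FORMATS = ("unknown", "multiple", "simple", "fasta", "msf", "trace", "srs")
PAIRWISE_FORMATS = ("pair", "markx0", "markx1", "markx2", "markx3",
                    "markx10", "srspair", "score")


def classify_aformat(name):
    """Return 'pairwise' or 'multiple' for a known -aformat name."""
    if name in PAIRWISE_FORMATS:
        return "pairwise"
    if name in MULTIPLE_FORMATS:
        return "multiple"
    raise ValueError("unknown alignment format: " + name)


print(classify_aformat("markx0"))
print(classify_aformat("msf"))
```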
Here is the output for the example:
    ########################################
    # Program: matcher
    # Rundate: Mon May 20 17:12:29 2002
    # Report_file: hba_human.matcher
    ########################################

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 145
    # Identity: 63/145 (43.4%)
    # Similarity: 88/145 (60.7%)
    # Gaps: 8/145 ( 5.5%)
    # Score: 264
    #
    #
    #=======================================

    10 20 30 40 50
    HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
    :.: .:. : : :::: .. : :.::: :... .: :. .: : :::
    HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
    10 20 30 40 50

    60 70 80 90
    HBA_HU -----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
    :. .::.::::: :.....::.:.. .....::.::. ::.:::
    HBB_HU PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
    60 70 80 90 100

    100 110 120 130 140
    HBA_HU VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
    ::.::.. :. .:: :. :::: :.:. .: .:.:...:. ::
    HBB_HU ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
    110 120 130 140

    #---------------------------------------
    #---------------------------------------
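The summary lines in the report header are easy to read back with a script. The following sketch assumes the header layout shown in the example above; the name `parse_counts` is hypothetical. It extracts the Identity, Similarity and Gaps counts and recomputes the percentages from them.

```python
import re

# Header lines copied from the example report above.
HEADER = """\
# Length: 145
# Identity: 63/145 (43.4%)
# Similarity: 88/145 (60.7%)
# Gaps: 8/145 ( 5.5%)
# Score: 264
"""


def parse_counts(text):
    """Return {name: (count, alignment_length)} for the three ratio lines."""
    pattern = r"#\s*(Identity|Similarity|Gaps):\s*(\d+)/(\d+)"
    return {name: (int(count), int(total))
            for name, count, total in re.findall(pattern, text)}


stats = parse_counts(HEADER)
# Recompute each percentage from the raw counts, to one decimal place.
percent = {name: round(100 * count / total, 1)
           for name, (count, total) in stats.items()}
print(percent)
```

The recomputed values agree with the percentages printed in the report (43.4, 60.7 and 5.5).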
Here is the output for the example giving the 10 best alignments:
    ########################################
    # Program: matcher
    # Rundate: Mon May 20 17:12:40 2002
    # Report_file: hba_human.matcher10
    ########################################

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 145
    # Identity: 63/145 (43.4%)
    # Similarity: 88/145 (60.7%)
    # Gaps: 8/145 ( 5.5%)
    # Score: 264
    #
    #
    #=======================================

    10 20 30 40 50
    HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
    :.: .:. : : :::: .. : :.::: :... .: :. .: : :::
    HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
    10 20 30 40 50

    60 70 80 90
    HBA_HU -----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
    :. .::.::::: :.....::.:.. .....::.::. ::.:::
    HBB_HU PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
    60 70 80 90 100

    100 110 120 130 140
    HBA_HU VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
    ::.::.. :. .:: :. :::: :.:. .: .:.:...:. ::
    HBB_HU ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
    110 120 130 140

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 13
    # Identity: 6/13 (46.2%)
    # Similarity: 9/13 (69.2%)
    # Gaps: 0/13 ( 0.0%)
    # Score: 32
    #
    #
    #=======================================

    60 70
    HBA_HU KKVADALTNAVAH
    .::. ...::.::
    HBB_HU QKVVAGVANALAH
    140

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 18
    # Identity: 7/18 (38.9%)
    # Similarity: 10/18 (55.6%)
    # Gaps: 0/18 ( 0.0%)
    # Score: 28
    #
    #
    #=======================================

    90 100
    HBA_HU KLRVDPVNFKLLSHCLLV
    :..:: :. . :.. :.:
    HBB_HU KVNVDEVGGEALGRLLVV
    20 30

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 10
    # Identity: 6/10 (60.0%)
    # Similarity: 8/10 (80.0%)
    # Gaps: 0/10 ( 0.0%)
    # Score: 23
    #
    #
    #=======================================

    10
    HBA_HU VKAAWGKVGA
    :.::. :: :
    HBB_HU VQAAYQKVVA
    130

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 10
    # Identity: 6/10 (60.0%)
    # Similarity: 6/10 (60.0%)
    # Gaps: 0/10 ( 0.0%)
    # Score: 23
    #
    #
    #=======================================

    80
    HBA_HU LSALSDLHAH
    :.:.:: ::
    HBB_HU LGAFSDGLAH
    70

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 17
    # Identity: 5/17 (29.4%)
    # Similarity: 9/17 (52.9%)
    # Gaps: 0/17 ( 0.0%)
    # Score: 21
    #
    #
    #=======================================

    80 90
    HBA_HU PNALSALSDLHAHKLRV
    :.:. . . :: .:
    HBB_HU PDAVMGNPKVKAHGKKV
    60

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 12
    # Identity: 5/12 (41.7%)
    # Similarity: 6/12 (50.0%)
    # Gaps: 0/12 ( 0.0%)
    # Score: 21
    #
    #
    #=======================================

    30 40
    HBA_HU ERMFLSFPTTKT
    .:.: :: .:
    HBB_HU QRFFESFGDLST
    40 50

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 8
    # Identity: 4/8 (50.0%)
    # Similarity: 6/8 (75.0%)
    # Gaps: 0/8 ( 0.0%)
    # Score: 20
    #
    #
    #=======================================

    110
    HBA_HU LLVTLAAH
    .:: . ::
    HBB_HU VLVCVLAH
    110

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 12
    # Identity: 6/12 (50.0%)
    # Similarity: 6/12 (50.0%)
    # Gaps: 0/12 ( 0.0%)
    # Score: 20
    #
    #
    #=======================================

    120
    HBA_HU HLPAEFTPAVHA
    :: : :: :
    HBB_HU HLTPEEKSAVTA
    10

    #=======================================
    #
    # Aligned_sequences: 2
    # 1: HBA_HUMAN
    # 2: HBB_HUMAN
    # Matrix: EBLOSUM62
    # Gap_penalty: 14
    # Extend_penalty: 4
    #
    # Length: 21
    # Identity: 6/21 (28.6%)
    # Similarity: 7/21 (33.3%)
    # Gaps: 0/21 ( 0.0%)
    # Score: 19
    #
    #
    #=======================================

    10 20
    HBA_HU PADKTNVKAAWGKVGAHAGEY
    :.. . :.. : ..: : .:
    HBB_HU PVQAAYQKVVAGVANALAHKY
    130 140

    #---------------------------------------
    #---------------------------------------
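When several alignments are requested, each one carries its own Length and Score lines, so a report can be summarised without parsing the alignments themselves. The sketch below is an assumption-laden illustration: the helper `alignment_summaries` is hypothetical, and the sample text is a trimmed excerpt of the first two alignments from the report above.

```python
import re

# Trimmed excerpt: only the Length and Score lines of the first two
# alignments in the 10-alignment report are reproduced here.
REPORT_EXCERPT = """\
# Length: 145
# Score: 264
# Length: 13
# Score: 32
"""


def alignment_summaries(report):
    """Return one (length, score) tuple per alignment, in report order."""
    lengths = re.findall(r"#\s*Length:\s*(\d+)", report)
    scores = re.findall(r"#\s*Score:\s*([\d.]+)", report)
    return [(int(length), float(score))
            for length, score in zip(lengths, scores)]


summaries = alignment_summaries(REPORT_EXCERPT)
print(summaries)
```

Applied to the full report above, the scores read 264, 32, 28, 23, 23, 21, 21, 20, 20 and 19, so the alternatives are listed from the best-scoring alignment downwards.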
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by EMBOSS environment variable EMBOSS_DATA.
Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".
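The directories are searched in the following order: the current directory ("."), ".embossdata" under the current directory, the user's home directory ("~/"), and ".embossdata" under the home directory.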
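As an illustration of the lookup described above, the sketch below lists the places a data file such as EBLOSUM62 might be looked for. The order used (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory, then the EMBOSS_DATA directory last) is an assumption drawn from the description here, and `candidate_paths` is a hypothetical helper, not an EMBOSS function.

```python
import posixpath


def candidate_paths(filename, cwd, home, emboss_data=None):
    """List candidate locations for a data file, user directories first.

    Assumed order: cwd, cwd/.embossdata, home, home/.embossdata, and
    finally the standard data directory named by EMBOSS_DATA, if given.
    """
    dirs = [
        cwd,
        posixpath.join(cwd, ".embossdata"),
        home,
        posixpath.join(home, ".embossdata"),
    ]
    if emboss_data:
        dirs.append(emboss_data)
    return [posixpath.join(d, filename) for d in dirs]


# Example directories are placeholders, not real installation paths.
paths = candidate_paths("EBLOSUM62", cwd="/work/project",
                        home="/home/user", emboss_data="/opt/emboss/data")
for path in paths:
    print(path)
```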
Program name | Description |
---|---|
seqmatchall | Does an all-against-all comparison of a set of sequences |
supermatcher | Finds a match of a large sequence against one or more sequences |
water | Smith-Waterman local alignment |
wordmatch | Finds all exact matches of a given size between 2 sequences |
water gives a single best rigorous local alignment, using memory of the order of the product of the lengths of the sequences to be aligned. If you want the single 'best' local alignment, use water. If you run out of memory or want several possible good alignments, use matcher.
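A rough back-of-envelope comparison shows why the memory difference matters for large sequences. This sketch only compares orders of growth: water is taken to need storage proportional to the product of the two lengths, as stated above, and matcher, being based on a linear-space algorithm, storage proportional to their sum. Constant factors and bytes per cell are deliberately left out, and the sequence lengths are hypothetical.

```python
def quadratic_cells(len_a, len_b):
    """Order-of-magnitude storage for a full alignment matrix."""
    return len_a * len_b


def linear_cells(len_a, len_b):
    """Order-of-magnitude storage for a linear-space method."""
    return len_a + len_b


# Two hypothetical sequences of 100,000 residues each.
len_a = len_b = 100_000
full = quadratic_cells(len_a, len_b)
linear = linear_cells(len_a, len_b)
print(full, linear, full // linear)
```

For these lengths the full matrix has 10,000,000,000 cells against 200,000 for the linear-space estimate, a factor of 50,000.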
This application was modified for inclusion in EMBOSS by Ian Longden (il@sanger.ac.uk), Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
Completed 11th May 1999. Last modified 19th July 1999.