wordmatch

 

Function

Finds all exact matches of a given size between 2 sequences

Description

Finds all exact matches of a given minimum size between 2 sequences, displaying the positions of each match in both sequences and the match length.

This program takes two sequences and finds regions where they are identical. These regions are reported in the output file and, optionally, in GFF (General Feature Format) files.

It will not find identical regions shorter than the specified word size (-wordsize).
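The behaviour described above can be illustrated with a minimal Python sketch. It is not the wordmatch implementation: the function name exact_matches, the word-index approach, the sort order and the 1-based start positions are all assumptions made for illustration.

```python
from collections import defaultdict


def exact_matches(a, b, wordsize=4):
    """Return maximal exact matches as (length, a_start, b_start).

    Start positions are 1-based (an assumption for this sketch).
    """
    if wordsize < 2:
        raise ValueError("wordsize must be 2 or more")

    # Index every word of length `wordsize` in b by its 0-based offset.
    index = defaultdict(list)
    for j in range(len(b) - wordsize + 1):
        index[b[j:j + wordsize]].append(j)

    matches = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], ()):
            # Skip seeds that lie inside a longer match starting earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the seed to the right while the residues agree.
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((length, i + 1, j + 1))

    # Longest matches first, then by position (hypothetical ordering).
    matches.sort(key=lambda m: (-m[0], m[1], m[2]))
    return matches
```

With a word size of 4, a shared region of 3 residues is never reported, because no word of length 4 can seed it.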

Usage

Here is a sample session with wordmatch.

% wordmatch tsw:hba_human tsw:hbb_human
Finds all exact matches of a given size between 2 sequences
Word size [4]:
Output alignment [hba_human.wordmatch]: 

Command line arguments

   Mandatory qualifiers:
  [-asequence]         sequence   Sequence USA
  [-bsequence]         sequence   Sequence USA
   -wordsize           integer    Word size
  [-outfile]           align      Output alignment file name

   Optional qualifiers: (none)
   Advanced qualifiers:
   -afeatout           featout    File for output of normal tab delimited GFF
                                  features
   -bfeatout           featout    File for output of normal tab delimited GFF
                                  features

   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


   Qualifier        Description                   Allowed values            Default

   Mandatory qualifiers
   [-asequence]     Sequence USA                  Readable sequence         Required
   (Parameter 1)
   [-bsequence]     Sequence USA                  Readable sequence         Required
   (Parameter 2)
   -wordsize        Word size                     Integer 2 or more         4
   [-outfile]       Output alignment file name    Alignment file
   (Parameter 3)

   Optional qualifiers
   (none)

   Advanced qualifiers
   -afeatout        File for output of normal     Writeable feature table   unknown.gff
                    tab delimited GFF features
   -bfeatout        File for output of normal     Writeable feature table   unknown.gff
                    tab delimited GFF features
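
For use from a script, the qualifiers listed above can be assembled into a command line. The following Python sketch is illustrative only: the helper names wordmatch_command and run_wordmatch are hypothetical, and it assumes that wordmatch is on the PATH and that supplying every mandatory qualifier is enough to avoid the interactive prompts.

```python
import subprocess


def wordmatch_command(aseq, bseq, outfile, wordsize=4):
    """Build an argument list using only the qualifiers documented above."""
    if wordsize < 2:
        # The qualifier table gives "Integer 2 or more" for -wordsize.
        raise ValueError("wordsize must be 2 or more")
    return [
        "wordmatch",
        "-asequence", aseq,
        "-bsequence", bseq,
        "-wordsize", str(wordsize),
        "-outfile", outfile,
    ]


def run_wordmatch(aseq, bseq, outfile, wordsize=4):
    """Run wordmatch; a non-zero exit status raises CalledProcessError."""
    subprocess.run(wordmatch_command(aseq, bseq, outfile, wordsize), check=True)
```

Calling run_wordmatch("tsw:hba_human", "tsw:hbb_human", "hba_human.wordmatch") would correspond to the sample session, with the default word size of 4.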

Input file format

Any two sequence USAs of the same type (DNA or protein).

Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/AlignFormats.html for further information on alignment formats.

The file produced in the above example is:


########################################
# Program:  wordmatch
# Rundate:  Mon May 20 16:36:46 2002
# Report_file: hba_human.wordmatch
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
#=======================================

     5 HBA_HUMAN             58..62       HBB_HUMAN             63..67
     4 HBA_HUMAN             14..17       HBB_HUMAN             15..18
     4 HBA_HUMAN            116..119      HBB_HUMAN            121..124

#---------------------------------------
#--------------------------------------- 

The normal 'report' header is output. It contains the details of the program run and the input sequences.

Each data line describes one identical region in five columns separated by spaces or TAB characters: the length of the match, the name of the first sequence, the start and end positions of the match in the first sequence, the name of the second sequence, and the start and end positions of the match in the second sequence. Positions are written as start..end.
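
The five-column layout can be read back with a short parser. This Python sketch assumes the default output style shown in the example above; the names Match and parse_wordmatch are hypothetical, and other -aformat styles are not handled.

```python
import re
from typing import List, NamedTuple


class Match(NamedTuple):
    length: int
    a_name: str
    a_start: int
    a_end: int
    b_name: str
    b_start: int
    b_end: int


# A position column looks like "58..62".
RANGE = re.compile(r"(\d+)\.\.(\d+)")


def parse_wordmatch(text: str) -> List[Match]:
    """Extract the data lines from wordmatch output, skipping the header."""
    matches = []
    for line in text.splitlines():
        fields = line.split()
        # Header and separator lines start with '#'; data lines have 5 columns.
        if len(fields) != 5 or fields[0].startswith("#"):
            continue
        a_range = RANGE.fullmatch(fields[2])
        b_range = RANGE.fullmatch(fields[4])
        if not (fields[0].isdigit() and a_range and b_range):
            continue
        matches.append(Match(
            int(fields[0]),
            fields[1], int(a_range.group(1)), int(a_range.group(2)),
            fields[3], int(b_range.group(1)), int(b_range.group(2)),
        ))
    return matches
```

In the example output, each reported length equals end - start + 1 for both sequences, which gives a simple consistency check on the parsed values.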

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 if successful.

Known bugs

None.

See also

Program name   Description
matcher        Finds the best local alignments between two sequences
seqmatchall    Does an all-against-all comparison of a set of sequences
supermatcher   Finds a match of a large sequence against one or more sequences
water          Smith-Waterman local alignment

Author(s)

This application was written by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

Completed 27th November 1998.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments