fuzznuc
Patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence.
fuzznuc intelligently selects the optimum searching algorithm to use, depending on the complexity of the search pattern specified.
    % fuzznuc
    Input sequence: embl:hhtetra
    Search pattern: AAGCTT
    Number of mismatches [0]:
    Output file [hhtetra.fuzznuc]:
    Mandatory qualifiers:
      [-sequence]          seqall     Sequence database USA
       -pattern            string     The standard IUPAC one-letter codes for the
                                      nucleotides are used.
                                      The symbol `n' is used for a position where
                                      any nucleotide is accepted.
                                      Ambiguities are indicated by listing the
                                      acceptable nucleotides for a given position,
                                      between square parentheses `[ ]'. For
                                      example: [ACG] stands for A or C or G.
                                      Ambiguities are also indicated by listing
                                      between a pair of curly brackets `{ }' the
                                      nucleotides that are not accepted at a given
                                      position. For example: {AG} stands for any
                                      nucleotides except A and G.
                                      Each element in a pattern is separated from
                                      its neighbor by a `-'. (Optional in
                                      fuzznuc).
                                      Repetition of an element of the pattern can
                                      be indicated by following that element with
                                      a numerical value or a numerical range
                                      between parenthesis. Examples: N(3)
                                      corresponds to N-N-N, N(2,4) corresponds to
                                      N-N or N-N-N or N-N-N-N.
                                      When a pattern is restricted to either the
                                      5' or 3' end of a sequence, that pattern
                                      either starts with a `<' symbol or
                                      respectively ends with a `>' symbol.
                                      A period ends the pattern. (Optional in
                                      fuzznuc).
                                      For example, [CG](5)TG{A}N(1,5)C
       -mismatch           integer    Number of mismatches
      [-outfile]           report     Output report file name

    Optional qualifiers: (none)
    Advanced qualifiers:
       -complement         boolean    Search complementary strand

    General qualifiers:
       -help               boolean    Report command line options. More
                                      information on associated and general
                                      qualifiers can be found with -help -verbose
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-sequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
-pattern | The standard IUPAC one-letter codes for the nucleotides are used. The symbol `n' is used for a position where any nucleotide is accepted. Ambiguities are indicated by listing the acceptable nucleotides for a given position, between square parentheses `[ ]'. For example: [ACG] stands for A or C or G. Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the nucleotides that are not accepted at a given position. For example: {AG} stands for any nucleotides except A and G. Each element in a pattern is separated from its neighbor by a `-'. (Optional in fuzznuc). Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parenthesis. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N. When a pattern is restricted to either the 5' or 3' end of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol. A period ends the pattern. (Optional in fuzznuc). For example, [CG](5)TG{A}N(1,5)C | Any string is accepted | An empty string is accepted |
-mismatch | Number of mismatches | Integer 0 or more | 0 |
[-outfile] (Parameter 2) | Output report file name | Report file | |
Optional qualifiers | Description | Allowed values | Default |
(none) | | | |
Advanced qualifiers | Description | Allowed values | Default |
-complement | Search complementary strand | Yes/No | No |
The PROSITE pattern definition from the PROSITE documentation (amended to refer to nucleic acid sequences, not proteins) follows.
For example, in the EMBL entry ECLAC you can look for the pattern:
[CG](5)TG{A}N(1,5)C
This searches for "C or G" 5 times, followed by T and G, then anything except A, then any base (1 to 5 times) before a C.
You can use ambiguity codes in nucleic acid searches, but not within [] or {}, because each code is expanded to its bracketed counterpart. For example, "s" is expanded to "[GC]", so [S] would be expanded to [[GC]], which is illegal.
Note that X is reserved for proteins. For nucleic acids, use N to refer to any base.
The search is case-independent, so 'AAA' matches 'aaa'.
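The pattern rules above can be sketched as a translation into an ordinary regular expression. This is only an illustration of the syntax, not how fuzznuc itself searches: the function name `pattern_to_regex` and the `IUPAC` table layout are hypothetical, the mismatch option is not handled, and the sequence is assumed to contain only A, C, G and T.

```python
import re

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# One pattern element: [ACG], {AG}, or a single letter.
ELEMENT = re.compile(r"\[([ACGT]+)\]|\{([ACGT]+)\}|([A-Z])")
# Optional repeat count after an element: (3) or (2,4).
REPEAT = re.compile(r"\((\d+)(?:,(\d+))?\)")


def pattern_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style nucleotide pattern into a regex string."""
    text = pattern.strip().upper().rstrip(".").replace("-", "")
    at_start = text.startswith("<")   # pattern restricted to the 5' end
    at_end = text.endswith(">")       # pattern restricted to the 3' end
    if at_start:
        text = text[1:]
    if at_end:
        text = text[:-1]

    parts, pos = [], 0
    while pos < len(text):
        m = ELEMENT.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at: {text[pos:]!r}")
        accepted, rejected, code = m.groups()
        if accepted:                      # [ACG]: any listed base
            bases = accepted
        elif rejected:                    # {AG}: any base except those listed
            bases = "".join(b for b in "ACGT" if b not in rejected)
            if not bases:
                raise ValueError("pattern element excludes every base")
        elif code in IUPAC:               # plain base or ambiguity code
            bases = IUPAC[code]
        else:                             # e.g. X, which is reserved for proteins
            raise ValueError(f"{code!r} is not a nucleotide code")
        piece = bases if len(bases) == 1 else f"[{bases}]"
        pos = m.end()

        r = REPEAT.match(text, pos)
        if r:
            low, high = r.groups()
            piece += f"{{{low}}}" if high is None else f"{{{low},{high}}}"
            pos = r.end()
        parts.append(piece)

    return ("^" if at_start else "") + "".join(parts) + ("$" if at_end else "")


print(pattern_to_regex("[CG](5)TG{A}N(1,5)C"))  # [CG]{5}TG[CGT][ACGT]{1,5}C
```

Compiling the result with `re.IGNORECASE` mirrors the case-independent search. Because ambiguity codes are only accepted outside brackets in this sketch, `[S]` raises a `ValueError`, which corresponds to the restriction described above.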
The output is a standard EMBOSS report file.
The results can be output in one of several styles by using the command-line qualifier -rformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq
See: http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for further information on report formats.
By default fuzznuc writes a 'seqtable' report file.
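As a rough illustration of choosing a report format when fuzznuc is run from a script, the sketch below only assembles the argument list and does not run the program. The helper name `fuzznuc_argv` is hypothetical. The qualifiers and the format names are the ones listed in this document.

```python
import shlex

# Report format names as listed above for -rformat.
REPORT_FORMATS = {
    "embl", "genbank", "gff", "pir", "swiss", "trace", "listfile",
    "dbmotif", "diffseq", "excel", "feattable", "motif", "regions",
    "seqtable", "simple", "srs", "table", "tagseq",
}


def fuzznuc_argv(sequence, pattern, outfile, mismatch=0, rformat=None):
    """Build an argument list for fuzznuc (hypothetical helper)."""
    if mismatch < 0:
        raise ValueError("mismatch must be 0 or more")
    argv = [
        "fuzznuc",
        "-sequence", sequence,
        "-pattern", pattern,
        "-mismatch", str(mismatch),
        "-outfile", outfile,
    ]
    if rformat is not None:
        if rformat not in REPORT_FORMATS:
            raise ValueError(f"unknown report format: {rformat!r}")
        argv += ["-rformat", rformat]
    return argv


argv = fuzznuc_argv("embl:hhtetra", "AAGCTT", "hhtetra.fuzznuc", rformat="gff")
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run` on a system where EMBOSS is installed. Leaving `rformat` unset keeps the default report format.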
The output from the above example is:
    ########################################
    # Program: fuzznuc
    # Rundate: Thu Apr 11 13:34:06 2002
    # Report_file: stdout
    ########################################

    #=======================================
    #
    # Sequence: HHTETRA from: 1 to: 1272
    # HitCount: 2
    #
    # Pattern: aagctt
    # Mismatch: 0
    # Complement: No
    #
    #=======================================

      Start     End Mismatch Sequence
          1       6        . aagctt
       1267    1272        . aagctt

    #---------------------------------------
    #---------------------------------------
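To show how the Start, End and Mismatch columns relate to the sequence, here is a minimal sketch of a mismatch-tolerant scan for a plain, fixed-length pattern such as AAGCTT. It is a naive sliding-window comparison written for illustration, not the algorithm fuzznuc selects. The function name `find_hits` is hypothetical, and ambiguity codes, repeats and the complementary strand are not handled.

```python
def find_hits(sequence, pattern, max_mismatch=0):
    """Return (start, end, mismatches, matched_text) for each hit.

    Coordinates are 1-based and inclusive, as in the report above.
    """
    seq, pat = sequence.lower(), pattern.lower()   # case-independent search
    hits = []
    for i in range(len(seq) - len(pat) + 1):
        window = seq[i:i + len(pat)]
        # Count positions where the window differs from the pattern.
        mismatches = sum(1 for a, b in zip(window, pat) if a != b)
        if mismatches <= max_mismatch:
            hits.append((i + 1, i + len(pat), mismatches, window))
    return hits


# Toy sequence (not HHTETRA) with the pattern at both ends.
for start, end, mismatches, text in find_hits("AAGCTTccccAAGCTT", "aagctt"):
    print(start, end, mismatches, text)
```

In the report above a hit with no mismatches is shown with `.` in the Mismatch column, whereas this sketch returns the count 0.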
Program name | Description |
---|---|
dreg | Regular expression search of a nucleotide sequence |
fuzztran | Protein pattern search after translation |
marscan | Finds MAR/SAR sites in nucleic sequences |
Other EMBOSS programs also let you search for regular expression patterns, but they may be harder to use if you have never worked with regular expressions before: