etandem

 

Function

Looks for tandem repeats in a nucleotide sequence

Description

etandem looks for tandem repeats in a nucleotide sequence. It is normally run after equicktandem has identified potential repeat sizes. For each repeat region it calculates a consensus and reports a score equal to the number of matches to the consensus minus the number of mismatches.

Input sequences are converted into ACGT or N (so ambiguity codes are ignored).
The score is +1 for a match, -1 for a mismatch.
The first copy of a repeat is ignored.
The highest score is kept for each start position and repeat size.
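The scoring rules above can be sketched in Python. This is a minimal illustration and not etandem's actual implementation: the function names normalise and score_repeat are hypothetical, the consensus is passed in rather than calculated, and an N is assumed never to match the consensus.

```python
def normalise(seq):
    """Map every base to A, C, G or T; anything else (ambiguity codes) becomes N."""
    return "".join(b if b in "ACGT" else "N" for b in seq.upper())


def score_repeat(seq, start, size, copies, consensus):
    """Score `copies` tandem copies of length `size` from 0-based `start`.

    +1 for each base that matches the consensus, -1 for each mismatch.
    The first copy is skipped, as described above.
    Assumption: an N never matches the consensus.
    """
    seq = normalise(seq)
    consensus = normalise(consensus)
    score = 0
    for copy in range(1, copies):  # copy 0 (the first copy) is ignored
        offset = start + copy * size
        for i in range(size):
            base = seq[offset + i]
            score += 1 if base != "N" and base == consensus[i] else -1
    return score


perfect = "acccta" * 5
one_mismatch = "acccta" * 3 + "acgcta" + "acccta"
print(score_repeat(perfect, 0, 6, 5, "acccta"))       # 24
print(score_repeat(one_mismatch, 0, 6, 5, "acccta"))  # 22
```

Five perfect copies of a 6-base repeat score 24 (four scored copies of 6 bases), and a single mismatch lowers that to 22.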

The lowest score to be reported is set by the threshold score, which can be changed on the command line with the -threshold qualifier; the default is 20. For a perfect repeat, the score is the length of the repeat region excluding the first copy. Reduce the threshold score a little if you wish to allow mismatches. Each mismatch scores -1 instead of +1, so it lowers the score by 2 compared with a perfect match of the same number of bases.
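The arithmetic behind the threshold can be written out as a short Python sketch. The helper names expected_score and is_reported are hypothetical, and the sketch assumes that a repeat is reported when its score is greater than or equal to the threshold.

```python
def expected_score(size, copies, mismatches=0):
    """Score for `copies` copies of a `size`-base repeat with `mismatches` mismatches.

    The first copy is not scored, and each mismatch costs 2
    (it scores -1 where a match would have scored +1).
    """
    scored_bases = size * (copies - 1)
    return scored_bases - 2 * mismatches


def is_reported(size, copies, mismatches=0, threshold=20):
    """Assumption: a hit is reported when its score is >= the threshold."""
    return expected_score(size, copies, mismatches) >= threshold


print(expected_score(6, 5))             # 24
print(is_reported(6, 4))                # False (score 18)
print(is_reported(6, 5, mismatches=2))  # True  (score 20)
print(is_reported(6, 5, mismatches=3))  # False (score 18)
```

With the default threshold of 20, a perfect 6-base repeat needs at least five copies to be reported, and five copies can tolerate at most two mismatches.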

Running with a wide range of repeat sizes is inefficient. That is why equicktandem was written - to give a rapid estimate of the major repeat sizes.

Usage

Here is a sample session with etandem. The input sequence is the human herpesvirus tandem repeat.

% etandem
Input sequence: embl:hhtetra
Output file [hhtetra.tan]: 
Minimum repeat size [10]: 6
Maximum repeat size [6]: 
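The same session can be run without prompts by giving every value as a qualifier. The Python sketch below only builds the argument list; the helper name etandem_command is hypothetical, the qualifiers and their allowed values are those documented in the "Command line arguments" section, and the commented-out subprocess call assumes that etandem is installed and on the PATH.

```python
def etandem_command(sequence, outfile, minrepeat=10, maxrepeat=None, threshold=20):
    """Build an etandem argument list, checking the documented value ranges."""
    if minrepeat < 2:
        raise ValueError("-minrepeat must be 2 or higher")
    if maxrepeat is None:
        maxrepeat = minrepeat  # documented default: same as -minrepeat
    if maxrepeat < minrepeat:
        raise ValueError("-maxrepeat must be the same as -minrepeat or higher")
    return [
        "etandem",
        "-sequence", sequence,
        "-outfile", outfile,
        "-minrepeat", str(minrepeat),
        "-maxrepeat", str(maxrepeat),
        "-threshold", str(threshold),
    ]


cmd = etandem_command("embl:hhtetra", "hhtetra.tan", minrepeat=6)
print(" ".join(cmd))
# To run it (requires an EMBOSS installation):
# import subprocess
# subprocess.run(cmd, check=True)
```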

Command line arguments

   Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
   -minrepeat          integer    Minimum repeat size
   -maxrepeat          integer    Maximum repeat size
  [-outfile]           report     Output report file name

   Optional qualifiers: (none)
   Advanced qualifiers:
   -threshold          integer    Threshold score
   -mismatch           boolean    Allow N as a mismatch
   -uniform            boolean    Allow uniform consensus
   -origfile           outfile    Output file name

   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


   Qualifier                  Description               Allowed values                          Default

   Mandatory qualifiers:
   [-sequence] (Parameter 1)  Sequence USA              Readable sequence                       Required
   -minrepeat                 Minimum repeat size       Integer, 2 or higher                    10
   -maxrepeat                 Maximum repeat size       Integer, same as -minrepeat or higher   Same as -minrepeat
   [-outfile] (Parameter 2)   Output report file name   Report file

   Optional qualifiers: (none)

   Advanced qualifiers:
   -threshold                 Threshold score           Any integer value                       20
   -mismatch                  Allow N as a mismatch     Yes/No                                  No
   -uniform                   Allow uniform consensus   Yes/No                                  No
   -origfile                  Output file name          Output file                             <sequence>.etandem

Input file format

The input for etandem is a nucleotide sequence.

Output file format

The output is a standard EMBOSS report file.

The results can be output in one of several styles by using the command-line qualifier -rformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for further information on report formats.

By default etandem writes a 'table' report file.

The output from the above example is:


########################################
# Program: etandem
# Rundate: Thu Apr 11 13:31:10 2002
# Report_file: stdout
########################################

#=======================================
#
# Sequence: HHTETRA     from: 1   to: 1272
# HitCount: 5
#
# Threshold: 20
# Minrepeat: 6
# Maxrepeat: 6
# Mismatch: No
# Uniform: No
#
#=======================================

  Start     End   Score   Size  Count Identity Consensus
    793     936     120      6     24     93.8 acccta
    283     420      90      6     23     84.8 taaccc
    432     485      38      6      9     90.7 ccctaa
    494     529      26      6      6     94.4 ccctaa
    568     597      24      6      5    100.0 aaccct

#---------------------------------------
#---------------------------------------
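The hit rows of a 'table' report like the one above can be read with a few lines of Python. This is a sketch under stated assumptions: the names Hit, parse_table_report and mismatches are hypothetical, the parser expects exactly the seven columns shown above (other -rformat styles would need different handling), and the mismatch count is derived from the scoring rules in the Description by assuming that Count is a whole number of copies.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int
    end: int
    score: int
    size: int
    count: int
    identity: float
    consensus: str


def parse_table_report(text):
    """Return the hit rows of a 'table' report, skipping comments and the header."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Hit rows have seven columns and begin with a numeric start position.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        start, end, score, size, count = (int(f) for f in fields[:5])
        hits.append(Hit(start, end, score, size, count, float(fields[5]), fields[6]))
    return hits


def mismatches(hit):
    """Mismatches implied by the score: scored bases minus score, halved."""
    scored_bases = hit.size * (hit.count - 1)  # the first copy is not scored
    return (scored_bases - hit.score) // 2


report = """\
# Sequence: HHTETRA     from: 1   to: 1272
  Start     End   Score   Size  Count Identity Consensus
    793     936     120      6     24     93.8 acccta
    568     597      24      6      5    100.0 aaccct
"""
for hit in parse_table_report(report):
    print(hit.start, hit.end, hit.consensus, mismatches(hit))
```

For the two rows used here, the derived mismatch counts are 9 and 0, which agree with the reported identities of 93.8% (135 of 144 bases) and 100.0%.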

Data files

Notes

Running with a wide range of repeat sizes is inefficient. That is why equicktandem was written - to give a rapid estimate of the major repeat sizes.

References

None.

Warnings

None.

Diagnostics

None.

Exit status

None.

Known bugs

None.

See also

Program name   Description
einverted      Finds DNA inverted repeats
equicktandem   Finds tandem repeats
palindrome     Looks for inverted repeats in a nucleotide sequence

Authors

This program was originally written by Richard Durbin at the Sanger Centre.

This application was modified for inclusion in EMBOSS by Peter Rice (pmr@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

Priority

Completed 25 May 1999

Target

etandem is aimed at automated repeat identification in genomic sequence but can also be used by general users.

Comments