getorf
The ORFs can be defined as regions of a specified minimum size between STOP codons or between START and STOP codons.
The ORFs can be output as the nucleotide sequence or as the translation.
The program can also output the region around the START codon, the initial STOP codon, or the ending STOP codon of an ORF, for those analysing the properties of these regions.
The START and STOP codons are defined in the Genetic Code tables. A suitable Genetic Code table can be selected for the organism you are investigating.
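The first ORF definition above (a region of a minimum size between STOP codons) can be illustrated with a short Python sketch. This is not getorf's own implementation: the function name `orfs_between_stops` is hypothetical, only the three forward frames are scanned, and the STOP codons are hard-coded to those of the standard code rather than read from a Genetic Code table.

```python
# Assumption: the three STOP codons of the standard code. getorf itself
# takes START and STOP codons from the selected Genetic Code table.
STOPS = {"TAA", "TAG", "TGA"}


def orfs_between_stops(seq, minsize=30):
    """Return (start, end) pairs, 1-based and inclusive, for every
    stop-free region of at least `minsize` nucleotides in the three
    forward reading frames. Hypothetical helper for illustration only."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        region_start = frame              # 0-based start of current region
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOPS:
                length = pos - region_start
                if length > 0 and length >= minsize:
                    found.append((region_start + 1, pos))
                region_start = pos + 3    # next region begins after the STOP
            pos += 3
        # Region running up to the last complete codon of the sequence.
        length = pos - region_start
        if length > 0 and length >= minsize:
            found.append((region_start + 1, pos))
    return found


print(orfs_between_stops("ATGAAATAAGGGCCC", minsize=6))
```

Raising `minsize` filters out short regions in the same way as the `-minsize` qualifier; reverse-sense ORFs and START-to-STOP ORFs are deliberately left out of the sketch.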
% getorf -minsize 300
Input sequence: embl:eclaci
Output sequence [eclaci.orf]:
Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
  [-outseq]            seqoutall  Output sequence(s) USA

Optional qualifiers:
   -table              menu       Code to use
   -minsize            integer    Minimum nucleotide size of ORF to report
   -find               menu       This is a small menu of possible output
                                  options. The first four options are to
                                  select either the protein translation or the
                                  original nucleic acid sequence of the open
                                  reading frame. There are two possible
                                  definitions of an open reading frame: it can
                                  either be a region that is free of STOP
                                  codons or a region that begins with a START
                                  codon and ends with a STOP codon. The last
                                  three options are probably only of interest
                                  to people who wish to investigate the
                                  statistical properties of the regions around
                                  potential START or STOP codons. The last
                                  option assumes that ORF lengths are
                                  calculated between two STOP codons.

Advanced qualifiers:
   -[no]methionine     boolean    START codons at the beginning of protein
                                  products will usually code for Methionine,
                                  despite what the codon will code for when it
                                  is internal to a protein. This qualifier
                                  sets all such START codons to code for
                                  Methionine by default.
   -circular           boolean    Is the sequence circular
   -[no]reverse        boolean    Set this to be false if you do not wish to
                                  find ORFs in the reverse complement of the
                                  sequence.
   -flanking           integer    If you have chosen one of the options of the
                                  type of sequence to find that gives the
                                  flanking sequence around a STOP or START
                                  codon, this allows you to set the number of
                                  nucleotides either side of that codon to
                                  output. If the region of flanking
                                  nucleotides crosses the start or end of the
                                  sequence, no output is given for this codon.

General qualifiers:
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
| Qualifier | Description | Allowed values | Default |
|---|---|---|---|
| Mandatory qualifiers | | | |
| [-sequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
| [-outseq] (Parameter 2) | Output sequence(s) USA | Writeable sequence(s) | <sequence>.format |
| Optional qualifiers | | | |
| -table | Code to use | | 0 |
| -minsize | Minimum nucleotide size of ORF to report | Any integer value | 30 |
| -find | This is a small menu of possible output options. The first four options are to select either the protein translation or the original nucleic acid sequence of the open reading frame. There are two possible definitions of an open reading frame: it can either be a region that is free of STOP codons or a region that begins with a START codon and ends with a STOP codon. The last three options are probably only of interest to people who wish to investigate the statistical properties of the regions around potential START or STOP codons. The last option assumes that ORF lengths are calculated between two STOP codons. | | 0 |
| Advanced qualifiers | | | |
| -[no]methionine | START codons at the beginning of protein products will usually code for Methionine, despite what the codon will code for when it is internal to a protein. This qualifier sets all such START codons to code for Methionine by default. | Yes/No | Yes |
| -circular | Is the sequence circular | Yes/No | No |
| -[no]reverse | Set this to be false if you do not wish to find ORFs in the reverse complement of the sequence. | Yes/No | Yes |
| -flanking | If you have chosen one of the options of the type of sequence to find that gives the flanking sequence around a STOP or START codon, this allows you to set the number of nucleotides either side of that codon to output. If the region of flanking nucleotides crosses the start or end of the sequence, no output is given for this codon. | Any integer value | 100 |
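The behaviour described for -flanking can be sketched in Python as follows. The function name `flanking_region` is hypothetical, and the sketch assumes that the output consists of the codon itself plus the requested number of nucleotides on each side; the only behaviour taken from the description above is that nothing is returned when the flanking region would cross either end of the sequence.

```python
def flanking_region(seq, codon_start, flanking=100):
    """Return the codon at 0-based index `codon_start` together with
    `flanking` nucleotides on either side, or None when that region would
    cross the start or end of the sequence.
    Assumption: the codon itself is part of the output."""
    begin = codon_start - flanking
    end = codon_start + 3 + flanking
    if begin < 0 or end > len(seq):
        return None                      # region crosses a sequence end
    return seq[begin:end]


seq = "AACCATGGGTT"
print(flanking_region(seq, 4, flanking=2))   # region fits inside the sequence
print(flanking_region(seq, 4, flanking=5))   # region would cross the start
```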
The results from the example run are:
>ECLACI_1 [735 - 1112] E. coli laci gene (codes for the lac repressor).
GHRSHCDAGCQRSDGAGRNARHYRVRAARWCGYLGSGIRRYRRQLMLYPAVNHHQTGFSP
AGANQRGPLAATLSGPGGEGQSAVARLTGEKKNHPGAQYANRLSPRVGRFINAAGTTGFP
TGKRAV
>ECLACI_2 [1 - 1110] E. coli laci gene (codes for the lac repressor).
PEESQFRVVNVKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN
RVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAA
VHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSHEDGTRLG
VEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSAMSGFQQTM
QMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSCYIPPSTTIK
QDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA
RQVSRLESGQ*
>ECLACI_3 [465 - 49] E. coli laci gene (codes for the lac repressor).
RRNISAGSFHSNGILVIQRIVNDQPTDALREKIVHRRFTGFDAASFYHRHHHAGTQLIGA
RFNRRDNLRRRVQGQTGGGNANQQRLFARQLLCHAVGNVIQLRHRRFHFFPRFRRNVAGL
VHHAGNGLIRDTGILCDIV
All output ORF sequences are written to the specified output file.
The name of the ORF sequences is constructed from the name of the input sequence with an underscore character ('_') and a unique ordinal number of the ORF found appended. The description of the output ORF sequence is constructed from the description of the input sequence with the start and end positions of the ORF prepended.
The unique number appended to the name is used only to create new, unique sequence names. It does not imply anything about the order, position or sense-strand of the ORFs.
If the ORF has been found in the reverse sense, then the start position will be larger than the end position. The numbering uses the forward-sense positions, but read in the reverse sense. For example, >ECLACI_3 [465 - 49] in the output above is a reverse-sense ORF running from position 465 to 49.
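The naming and numbering conventions described above can be sketched in Python. Both function names are hypothetical, and the coordinate mapping assumes that a reverse-sense ORF is first located on the reverse complement with 1-based positions; the sequence length of 1000 in the usage lines is an illustrative value, not a property of the example sequence.

```python
def orf_header(name, description, ordinal, start, end):
    """Build the output header: the input name with '_<ordinal>' appended,
    and the input description with '[start - end]' prepended."""
    return f">{name}_{ordinal} [{start} - {end}] {description}"


def reverse_to_forward(seq_len, rc_start, rc_end):
    """Map 1-based positions on the reverse complement to forward-sense
    positions, so a reverse-sense ORF is reported as start > end.
    Assumption: this is how the reported positions are derived."""
    return seq_len - rc_start + 1, seq_len - rc_end + 1


print(orf_header("ECLACI", "E. coli laci gene (codes for the lac repressor).",
                 3, 465, 49))
# Hypothetical 1000-nucleotide sequence with an ORF at 1..10 on the
# reverse complement.
print(reverse_to_forward(1000, 1, 10))
```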
The default file EGC.0 is the 'Standard Code' with the rarely used alternate START codons omitted; it has only the normal 'AUG' START codon. The 'Standard Code' with the rarely used alternate START codons included is Genetic Code file EGC.1.
Users will sometimes wish to customise a Genetic Code file. To do this, use the program embossdata.
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
To see the available EMBOSS data files, run:
% embossdata -showall
To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:
% embossdata -fetch -file Exxx.dat
Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".
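The directories are searched in the following order: the current directory, the ".embossdata" subdirectory of the current directory, the user's home directory, and the ".embossdata" subdirectory of the home directory.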
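A lookup of this kind can be sketched in Python. The function name `find_data_file` is hypothetical, and two things are assumptions rather than documented behaviour: that the user directories are tried in the order in which they are described above, and that the directory named by EMBOSS_DATA is consulted last as a fallback.

```python
import os
from pathlib import Path


def find_data_file(filename, cwd=None, home=None):
    """Return the first existing copy of `filename`, or None.
    Assumed search order: current directory, ./.embossdata,
    home directory, ~/.embossdata, then $EMBOSS_DATA if it is set."""
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    candidates = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    emboss_data = os.environ.get("EMBOSS_DATA")
    if emboss_data:
        candidates.append(Path(emboss_data))   # assumed fallback location
    for directory in candidates:
        path = directory / filename
        if path.is_file():
            return path
    return None


print(find_data_file("EGC.0"))
```

The `cwd` and `home` parameters exist only to make the sketch easy to try against scratch directories.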
The Genetic Code data files are based on the NCBI genetic code tables. Their names and descriptions are:
The format of these files is very simple.
It consists of several lines of optional comments, each starting with a '#' character.
These are followed by the line 'Genetic Code [n]', where 'n' is the number of the genetic code file.
This is followed by the description of the code and then by five lines giving the IUPAC one-letter code of the translated amino acid, the start codons (indicated by an 'M') and the three bases of the codon, lined up one on top of the other.
For example:
------------------------------------------------------------------------------
# Genetic Code Table
#
# Obtained from: http://www.ncbi.nlm.nih.gov/collab/FT/genetic_codes.html
# and: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
#
# Differs from Genetic Code [1] only in that the initiation sites have been
# changed to only 'AUG'

Genetic Code [0]
Standard

AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
------------------------------------------------------------------------------
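Because the amino acid, start and base lines are aligned column by column, a file in this format can be read by walking the columns in parallel. The following Python sketch shows one way to do that; the function name `parse_genetic_code` is hypothetical and the parser handles only the layout shown in the example above.

```python
EGC0 = """\
# Genetic Code Table
Genetic Code [0]
Standard
AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_genetic_code(text):
    """Return (translation, starts): a codon -> amino acid dict and the
    set of START codons. Hypothetical helper for illustration only."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip comments and any line that is not of the form 'key = value'.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    # Column i of Base1/Base2/Base3 spells the i-th codon.
    codons = ["".join(bases) for bases in
              zip(fields["Base1"], fields["Base2"], fields["Base3"])]
    translation = dict(zip(codons, fields["AAs"]))
    starts = {c for c, flag in zip(codons, fields["Starts"]) if flag == "M"}
    return translation, starts


translation, starts = parse_genetic_code(EGC0)
print(len(translation), sorted(starts), translation["TAA"])
```

In the resulting table STOP codons are those translated as '*', and for this file the only START codon is ATG, matching the description of EGC.0 given earlier.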
Program name | Description |
---|---|
marscan | Finds MAR/SAR sites in nucleic sequences |
plotorf | Plot potential open reading frames |
showorf | Pretty output of DNA translations |
wobble | Wobble base plot |