sigscan
% sigscan

       Mandatory qualifiers:
      [-sigin]             infile     Name of signature file (input)
       -database           seqall     Name of swissprot sequence database to search
       -targetf            infile     Name of validation (input)
       -thresh             integer    Minimum length (residues) of overlap required
                                      for two hits with the same code to be
                                      counted as the same hit.
       -sub                matrixf    This is the scoring matrix file used when
                                      comparing sequences.
       -gapo               float      The gap insertion penalty is the score taken
                                      away when a gap is created. The best value
                                      depends on the choice of comparison matrix.
                                      The default value assumes you are using the
                                      EBLOSUM62 matrix for protein sequences, and
                                      the EDNAMAT matrix for nucleotide sequences.
       -gape               float      The gap extension, penalty is added to the
                                      standard gap penalty for each base or
                                      residue in the gap. This is how long gaps
                                      are penalized. Usually you will expect a few
                                      long gaps rather than many short gaps, so
                                      the gap extension penalty should be lower
                                      than the gap penalty. An exception is where
                                      one or both sequences are single reads with
                                      possible sequencing errors in which case you
                                      would expect many single base gaps. You can
                                      get this result by setting the gap open
                                      penalty to zero (or very low) and using the
                                      gap extension penalty to control gap
                                      scoring.
       -nterm              menu       Select number
       -nhits              integer    Number of hits to output
      [-hitsf]             outfile    Name of signature hits file (output)
      [-alignf]            outfile    Name of signature alignments file (output)

       Optional qualifiers: (none)
       Advanced qualifiers: (none)
       General qualifiers:
       -help               boolean    Report command line options. More
                                      information on associated and general
                                      qualifiers can be found with -help -verbose

Qualifier | Description | Allowed values | Default |
---|---|---|---|
Mandatory qualifiers | | | |
[-sigin] (Parameter 1) | Name of signature file (input) | Input file | test.sig |
-database | Name of swissprot sequence database to search | Readable sequence(s) | ./test.seq |
-targetf | Name of validation (input) | Input file | test.valid.in |
-thresh | Minimum length (residues) of overlap required for two hits with the same code to be counted as the same hit. | Any integer value | 20 |
-sub | This is the scoring matrix file used when comparing sequences. | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
-gapo | The gap insertion penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAMAT matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence |
-gape | The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence |
-nterm | Select number | | 1 |
-nhits | Number of hits to output | Any integer value | 100 |
[-hitsf] (Parameter 2) | Name of signature hits file (output) | Output file | test.sighits |
[-alignf] (Parameter 3) | Name of signature alignments file (output) | Output file | test.sigalign |
Optional qualifiers | (none) | | |
Advanced qualifiers | (none) | | |
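The -thresh rule above (two hits with the same code count as one hit when they overlap by at least the given number of residues) can be illustrated with a minimal Python sketch. This is not sigscan's own code: the tuple layout `(code, start, end)`, the inclusive residue coordinates, the `>=` comparison and the function names are all assumptions made for illustration.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Number of residues shared by two inclusive ranges (0 if they are disjoint)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def same_hit(hit_a, hit_b, thresh=20):
    """Assumed rule: same code AND overlap of at least `thresh` residues.

    Each hit is a hypothetical (code, start, end) tuple with inclusive
    residue numbers; 20 mirrors the documented default of -thresh.
    """
    code_a, start_a, end_a = hit_a
    code_b, start_b, end_b = hit_b
    if code_a != code_b:
        return False
    return overlap_length(start_a, end_a, start_b, end_b) >= thresh


# Residues 31..50 are shared, i.e. exactly 20 residues of overlap.
print(same_hit(("P1", 1, 50), ("P1", 31, 80)))
```

Raising `thresh` makes the merge stricter, so more overlapping hits are reported separately.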
Example excerpt from a signature hits file:

    DE   Results of signature search
    XX
    CL   All alpha proteins
    XX
    FO   Globin-like
    XX
    SF   Globin-like
    XX
    FA   Globins
    XX
    XX
    HI   1   1RBPDFG  1  TRUE     TRUE     234  0.0001
    HI   2   1GFT35J  3  TRUE     TRUE     234  0.0008
    HI   3   1KJUFGH  1  TRUE     TRUE     224  0.0108
    HI   4   1GYU15R  2  CLOSE    TRUE     220  0.1876
    HI   5   1LKI89O  2  CLOSE    TRUE     203  0.6787
    HI   6   1QRTY58  1  TRUE     TRUE     199  0.9978
    HI   7   2IOM78G  1  FALSE    FALSE    198  1.0844
    HI   8   1SZR234  1  CLOSE    TRUE     198  1.4343
    HI   9   3PONI57  1  DISTANT  FALSE    197  2.8849
    HI   10  1PHDJBS  3  CLOSE    TRUE     190  2.9872
    HI   11  1HIOHDW  1  UNKNOWN  UNKNOWN  160  5.8676
    HI   12  199976T  1  CLOSE    TRUE     140  8.8346
    XX
    //

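As a rough illustration of how the HI records in the excerpt above could be read, the following Python sketch splits each record into its eight whitespace-separated tokens. It assumes one record per line. The meaning of the individual columns is not stated in this section, so the field names (`number`, `code`, `col3`, `class1`, `class2`, `col6`, `col7`) are placeholders invented for this example, not names used by sigscan.

```python
from typing import NamedTuple


class HitRecord(NamedTuple):
    # Placeholder field names: the column meanings are assumptions.
    number: int
    code: str
    col3: str
    class1: str
    class2: str
    col6: float
    col7: float


def parse_hi_line(line):
    """Parse one 'HI' record into a HitRecord; raise ValueError if malformed."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "HI":
        raise ValueError(f"not a well-formed HI record: {line!r}")
    return HitRecord(
        number=int(tokens[1]),
        code=tokens[2],
        col3=tokens[3],
        class1=tokens[4],
        class2=tokens[5],
        col6=float(tokens[6]),
        col7=float(tokens[7]),
    )


def parse_hits(text):
    """Collect every HI record from the text of a hits file, ignoring other records."""
    return [parse_hi_line(line) for line in text.splitlines()
            if line.split()[:1] == ["HI"]]


sample = "DE   Results of signature search\nXX\nHI   1   1RBPDFG  1  TRUE  TRUE  234  0.0001\n//"
for record in parse_hits(sample):
    print(record.code, record.class1, record.class2)
```

Records other than HI (DE, CL, FO, SF, FA, XX, //) are skipped here; a fuller reader would keep them as header metadata.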
In the signature alignment file:
(1) The DE, CL, FO, SF, FA, XX and // records have the same meaning as in the hits file (above).
(2) Other lines contain either a fragment of protein sequence preceded by an accession number, or a fragment of the alignment of a signature to that protein sequence (signature positions are marked with a '*'). The two numbers on either side of the sequence are the begin and end residue numbers for that line.
Example excerpt from a signature alignment file:

    DE   Results of signature search
    XX
    CL   Alpha and beta proteins (a/b)
    XX
    FO   alpha/beta-Hydrolases
    XX
    SF   alpha/beta-Hydrolases
    XX
    FA   Acetylcholinesterase-like
    XX
    OPSD_HUMAN  1   MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMF  45
    SIGNATURE   -   ---------*------------*---------------*------
    OPSD_XENLA  1   MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMF  45
    SIGNATURE   -   --------*-------------*----------------*-----
    XX
    OPSD_HUMAN  46  LLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGG  90
    SIGNATURE   -   --------------*--------------*------------*--
    OPSD_XENLA  46  LLILLGLPINFMTLFVTIQHKKLRTPLNYILLNLVFANHFMVLCG  90
    SIGNATURE   -   --------------*--------------*------------*--
    XX
    OPSD_HUMAN  91  FTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIER  135
    SIGNATURE   -   ---------*--*--------------------------**----
    OPSD_XENLA  91  FTVTMYTSMHGYFIFGPTGCYIEGFFATLGGEVALWSLVVLAVER  135
    SIGNATURE   -   ---------*----*-------------------------**---
    XX
    //

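Because signature positions are marked with '*' directly beneath the sequence fragment, the residues matched by the signature can be recovered by walking the two strings in parallel. The Python sketch below shows the idea on a short made-up fragment; the function name and its arguments are hypothetical, and it assumes the sequence fragment and the SIGNATURE string have already been extracted from their lines and are of equal length.

```python
def signature_residues(sequence, signature, start=1):
    """Return (residue_number, residue) for each '*' in the signature string.

    `start` is the begin residue number printed to the left of the
    sequence fragment on that line of the alignment file.
    """
    if len(sequence) != len(signature):
        raise ValueError("sequence and signature fragments differ in length")
    return [(start + offset, sequence[offset])
            for offset, mark in enumerate(signature)
            if mark == "*"]


# Made-up ten-residue fragment, for illustration only.
print(signature_residues("MNGTEGPNFY", "--*----*--", start=1))
```

For a later block of the alignment, pass that line's begin number as `start` so the reported positions stay in the coordinates of the full protein.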
Definition of classes of hit
The primary classification is an objective definition of the hit and has one of the following values:
TRAIN - The sequence was included in the original alignment from which the signature was generated.
PSIBLAST - A protein detected by PSI-BLAST (see psiblasts.c) as a homologue of at least one of the proteins in the family from which the signature was derived. Such proteins are identified by the 'PSIBLAST' record in the scop families file.
OTHER - A true member of the family but not a homologue as detected by PSI-BLAST. Such proteins may have been found from the literature and manually added to the scop families file, or may have been detected by the EMBOSS program swissparse (see swissparse.c). They are identified in the scop families file by the 'OTHER' record.
CROSS - A protein which is homologous to a protein of the same fold as, but a different family from, the proteins from which the signature was derived.
FALSE - A homologue of a protein whose fold differs from that of the family of the signature.
UNKNOWN - The protein is not known to be CROSS, FALSE or a true hit (TRAIN, PSIBLAST or OTHER).
The secondary classification is provided for convenience and takes its value as follows:
Hits of TRAIN, PSIBLAST and OTHER classification are all listed as TRUE.
Hits of CROSS, FALSE or UNKNOWN objective classification are listed as CROSS, FALSE or UNKNOWN respectively.
The subjective column allows for hand-annotation of the hits files, so that proteins of UNKNOWN objective classification can be re-classified by a human expert as TRUE, FALSE or CROSS, or otherwise left as UNKNOWN, for the purpose of generating signature performance plots with the EMBOSS application sigplot.
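The mapping from primary (objective) to secondary classification described above is small enough to write down directly. The Python sketch below restates that rule; the function and constant names are hypothetical and are not taken from the sigscan source.

```python
# Primary classes that count as true family members.
TRUE_CLASSES = {"TRAIN", "PSIBLAST", "OTHER"}
# Primary classes that are carried over unchanged.
PASS_THROUGH = {"CROSS", "FALSE", "UNKNOWN"}


def secondary_classification(primary):
    """Map a primary (objective) classification to its secondary value."""
    if primary in TRUE_CLASSES:
        return "TRUE"
    if primary in PASS_THROUGH:
        return primary
    raise ValueError(f"unrecognised primary classification: {primary!r}")


for primary in ("TRAIN", "PSIBLAST", "OTHER", "CROSS", "FALSE", "UNKNOWN"):
    print(primary, "->", secondary_classification(primary))
```

Rejecting unrecognised values, rather than silently treating them as UNKNOWN, is a choice made for this sketch so that typos in hand-annotated files surface early.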
Important - In the case where a signature file is generated by hand, it is essential that the gap data given is listed in order of increasing gap size.
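A hand-written signature file can be checked for this ordering before it is used. The Python sketch below assumes the gap sizes for one signature position have already been read into a list of integers (the signature file layout itself is not described here) and that "increasing" means strictly increasing, so that no gap size is listed twice; both points are assumptions, as is the function name.

```python
def gaps_in_increasing_order(gap_sizes):
    """True if every gap size is strictly larger than the one before it."""
    return all(earlier < later
               for earlier, later in zip(gap_sizes, gap_sizes[1:]))


print(gaps_in_increasing_order([0, 1, 3, 7]))   # ordered as required
print(gaps_in_increasing_order([0, 3, 1, 7]))   # 3 is listed before 1
```

If the check fails, sorting the gap sizes together with their associated data restores the required order.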
Program name | Description |
---|---|
contacts | Reads coordinate files and writes files of intra-chain residue-residue contact data |
dichet | Parse dictionary of heterogen groups |
hmmgen | Generates a hidden Markov model for each alignment in a directory |
interface | Reads coordinate files and writes files of inter-chain residue-residue contact data |
profgen | Generates various profiles for each alignment in a directory |
psiblasts | Runs PSI-BLAST given scopalign alignments |
scopalign | Generate alignments for families in a scop classification file by using STAMP |
scoprep | Reorder scop classification file so that the representative structure of each family is given first |
scopreso | Removes low resolution domains from a scop classification file |
seqalign | Generate extended alignments for families in a scop families file by using CLUSTALW with seed alignments |
seqsearch | Generate files of hits for families in a scop classification file by using PSI-BLAST with seed alignments |
seqsort | Reads multiple files of hits and writes a non-ambiguous file of hits (scop families file) plus a validation file |
seqwords | Generate file of hits for scop families by searching swissprot with keywords |
siggen | Generates a sparse protein signature from an alignment and residue contact data |