siggen

 

Function

Generates a sparse protein signature from an alignment and residue contact data

Description

siggen parses a multiple structure alignment generated by the EMBOSS application scopalign and corresponding files of residue contact data generated by the EMBOSS application contacts and generates a protein signature of a specified sparsity.

Each position in the alignment is scored using one scoring scheme or any combination of up to three. A signature of 10% sparsity, for example, includes data from the top-scoring 10% of alignment positions.
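
The selection step can be illustrated with a minimal Python sketch. It is not taken from the siggen source: the function name, the per-position score list, the rounding rule and the tie-breaking rule are all assumptions made for illustration.

```python
import math

def select_sparse_positions(scores, sparsity_percent):
    """Return indices of the top-scoring alignment positions, in alignment order.

    scores: one combined score per alignment position (hypothetical input).
    sparsity_percent: e.g. 10 keeps the top 10% of positions.
    """
    if not 0 < sparsity_percent <= 100:
        raise ValueError("sparsity_percent must be in (0, 100]")
    # Assumption: round up, so a non-empty alignment keeps at least one position.
    n_keep = math.ceil(len(scores) * sparsity_percent / 100)
    # Rank by score, highest first; ties go to the earlier position (arbitrary choice).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:n_keep])

scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.0]
selected = select_sparse_positions(scores, 20)
print(selected)
```

With ten scored positions and 20% sparsity, the two highest-scoring positions are kept and reported in alignment order.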

The resulting protein signature file is used by the application sigscan to find examples of the signature in other proteins.

Signatures

Signatures extend the concept of the motif as a tool for characterizing protein families. They consist of a set of N key residue positions (A1, A2 ... An), each preceded by a gap (G), thus G1A1G2A2...GnAn. Both a residue and a gap can be variable. A signature is matched to a protein sequence and scored using a dynamic programming algorithm which permits variability in gap distance and residue type. Generating a signature involves identifying residues associated with points of contact in interactions between secondary structure elements. A raw signature consists of a set of positions with potential key structural roles sampled from a sequence alignment constructed with reference to this contact data. Raw signatures are refined by sampling different gap-residue pairs until the specificity of a signature for the family cannot be further improved.
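
To make the G1A1G2A2...GnAn layout concrete, the following toy Python sketch matches a signature to a sequence with dynamic programming. It is not the algorithm used by sigscan. It assumes that a gap is the number of residues between consecutive key residues (or before the first key residue, counted from the start of the sequence), that only the listed residues and gap sizes are allowed, and that the match score is the sum of per-residue scores. The function name and the data layout are hypothetical.

```python
NEG_INF = float("-inf")

def best_match_score(signature, sequence):
    """Best total score for placing every signature position on the sequence.

    signature: list of (residue_scores, allowed_gaps) pairs, N-terminal first.
    Returns -inf when no placement satisfies the gap and residue constraints.
    """
    prev = None  # prev[j]: best score with the previous position on index j
    for residue_scores, allowed_gaps in signature:
        cur = [NEG_INF] * len(sequence)
        for j, residue in enumerate(sequence):
            if residue not in residue_scores:
                continue
            if prev is None:
                # First position: the gap is counted from the sequence start.
                reachable = 0.0 if j in allowed_gaps else NEG_INF
            else:
                # Later positions: try each allowed gap back to the previous one.
                reachable = max(
                    (prev[j - g - 1] for g in allowed_gaps if j - g - 1 >= 0),
                    default=NEG_INF,
                )
            if reachable > NEG_INF:
                cur[j] = reachable + residue_scores[residue]
        prev = cur
    return max(prev, default=NEG_INF) if prev is not None else NEG_INF

signature = [
    ({"A": 2, "V": 1, "L": 4}, {1, 2}),  # G1 in {1, 2}, A1 in {A, V, L}
    ({"F": 1, "Y": 5}, {2, 3}),          # G2 in {2, 3}, A2 in {F, Y}
]
print(best_match_score(signature, "GLKKYAF"))
```

In this example L at index 1 and Y at index 4 are separated by two residues, which is an allowed gap, so both positions can be placed. A real implementation would also have to account for window sizes and substitution scores, which this sketch ignores.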

Usage

Here is a sample session with siggen:

% siggen
Generates a sparse protein signature
Location of alignment files for input [./]: ./jontest
Extension of alignment files for input [.align]:
Location of contact files for input [./]: ./jontest
Extension of contact files [.con]:
% sparsity of signature [10]:
Generate a randomized signature [N]:
Substitution matrix to be used [./EBLOSUM62]:
Score alignment on basis of residue conservation [Y]:
Score alignment on basis of number of contacts [Y]:
Score alignment on basis of conservation of contacts [Y]: N
Score alignment on a combined measure of number and conservation of contacts [N]:
Ignore alignment postitions with post_similar value of 0 [Y]:
Name of signature file for output [sig.sig]:

Command line arguments

   Mandatory qualifiers (* if not always prompted):
  [-algpath]           string     Location of scop structure-based sequence
                                  alignment files (input)
  [-algextn]           string     Extension of alignment files
   -sparsity           integer    % sparsity of signature
*  -seqoption          menu       Select number
*  -datafile           matrixf    This is the scoring matrix file used when
                                  comparing sequences.
*  -conoption          menu       Select number
*  -filtercon          boolean    Ignore alignment positions making less than
                                  a threshold number of contacts
*  -conthresh          integer    Threshold contact number
*  -conpath            string     Location of contact files (input)
*  -conextn            string     Extension of contact files
*  -cpdbpath           string     Location of domain coordinate files (embl
                                  format input)
*  -cpdbextn           string     Extension of coordinate files
*  -filterpsim         boolean    Ignore alignment postitions with
                                  post_similar value of 0
  [-sigpath]           string     Location of signature files (output)
  [-sigextn]           string     Extension of signature files

   Optional qualifiers: (none)
   Advanced qualifiers:
   -randomise          boolean    Generate a randomised signature

   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers                                                Allowed values              Default
[-algpath]                Location of scop structure-based          Any string is accepted      ./
(Parameter 1)             sequence alignment files (input)
[-algextn]                Extension of alignment files              Any string is accepted      .salign
(Parameter 2)
-sparsity                 % sparsity of signature                   Any integer value           10
-seqoption                Select number                             1 (Substitution matrix)     3
                                                                    2 (Residue class)
                                                                    3 (None)
-datafile                 This is the scoring matrix file used      Comparison matrix file in   EBLOSUM62
                          when comparing sequences.                 EMBOSS data path
-conoption                Select number                             1 (Number)                  4
                                                                    2 (Conservation)
                                                                    3 (Number and conservation)
                                                                    4 (None)
-filtercon                Ignore alignment positions making less    Yes/No                      No
                          than a threshold number of contacts
-conthresh                Threshold contact number                  Any integer value           10
-conpath                  Location of contact files (input)         Any string is accepted      ./
-conextn                  Extension of contact files                Any string is accepted      .con
-cpdbpath                 Location of domain coordinate files       Any string is accepted      ./
                          (embl format input)
-cpdbextn                 Extension of coordinate files             Any string is accepted      .pxyz
-filterpsim               Ignore alignment postitions with          Yes/No                      No
                          post_similar value of 0
[-sigpath]                Location of signature files (output)      Any string is accepted      ./
(Parameter 3)
[-sigextn]                Extension of signature files              Any string is accepted      .sig
(Parameter 4)

Optional qualifiers                                                 Allowed values              Default
(none)

Advanced qualifiers                                                 Allowed values              Default
-randomise                Generate a randomised signature           Yes/No                      No

Input file format

siggen reads a multiple structure alignment generated by the EMBOSS application scopalign and the corresponding files of residue contact data generated by the EMBOSS application contacts.

Output file format

The output file (Figure 1) uses the following records. The four SCOP classification records are taken from the alignment input file:

  1. CL - Domain class. It is identical to the text given after 'Class' in the scop classification file (see documentation for the EMBOSS application scope).
  2. FO - Domain fold. It is identical to the text given after 'Fold' in the scop classification file (see scope documentation).
  3. SF - Domain superfamily. It is identical to the text given after 'Superfamily' in the scop classification file (see scope documentation).
  4. FA - Domain family. It is identical to the text given after 'Family' in the scop classification file (see scope documentation).
  5. NP - Number of signature positions.
  6. NN - Signature position number. The number given in brackets after this record indicates the start of the data for the relevant signature position.
  7. IN - Informative line about the signature position. The number of different amino acid residues seen for this position is given after 'NRES', the number of different gap sizes after 'NGAP', and the window size after 'WSIZ'. When a signature is aligned to a protein sequence, the permissible gaps between two signature positions are determined by the empirical gaps and the window size for the C-terminal position (see sigscan.c). Two rows of data for the empirical residues and gaps are then given:
  8. AA - The identifier of a residue seen in this position and the frequency of its occurrence, delimited by ';'.
  9. GA - The size of a gap seen in this position and the frequency of its occurrence, delimited by ';'.
  10. // - used to delimit data for each signature. The last line of a file always contains '//' only.

Example excerpt from an output signature file:


CL   All beta proteins
XX
FO   Lipocalins
XX
SF   Lipocalins
XX
FA   Fatty acid binding protein-like
XX
NP   2
XX
NN   [1] 
XX
IN   NRES 3 ; NGAP 2 ; WSIZ 2  
XX
AA   A ; 2
AA   V ; 1
AA   L ; 4
XX
GA   1 ; 5
GA   2 ; 2
XX
NN   [2] 
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 5  
XX
AA   F ; 1
AA   Y ; 5
XX
GA   12 ; 3
GA   10 ; 2
XX
//
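
The record layout above can be read with a few lines of Python. The following is a minimal sketch written against the excerpt only. The function name and the dictionary layout are assumptions, and the real files may contain details that the excerpt does not show.

```python
import re

def parse_signature(text):
    """Parse one signature entry into a plain dictionary (hypothetical layout)."""
    sig = {"positions": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "XX":
            continue  # XX lines only separate records
        if line == "//":
            break  # end of this signature
        code, _, value = line.partition(" ")
        value = value.strip()
        if code in ("CL", "FO", "SF", "FA"):
            sig[code] = value  # SCOP classification text
        elif code == "NP":
            sig["NP"] = int(value)
        elif code == "NN":
            # A new signature position starts here, e.g. "[1]".
            sig["positions"].append(
                {"number": int(value.strip("[]")), "AA": [], "GA": []}
            )
        elif code == "IN":
            fields = re.findall(r"(NRES|NGAP|WSIZ)\s+(\d+)", value)
            sig["positions"][-1].update({k: int(v) for k, v in fields})
        elif code == "AA":
            residue, frequency = (part.strip() for part in value.split(";"))
            sig["positions"][-1]["AA"].append((residue, int(frequency)))
        elif code == "GA":
            size, frequency = (part.strip() for part in value.split(";"))
            sig["positions"][-1]["GA"].append((int(size), int(frequency)))
    return sig

example = """\
CL   All beta proteins
XX
FO   Lipocalins
XX
SF   Lipocalins
XX
FA   Fatty acid binding protein-like
XX
NP   2
XX
NN   [1]
XX
IN   NRES 3 ; NGAP 2 ; WSIZ 2
XX
AA   A ; 2
AA   V ; 1
AA   L ; 4
XX
GA   1 ; 5
GA   2 ; 2
XX
NN   [2]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 5
XX
AA   F ; 1
AA   Y ; 5
XX
GA   12 ; 3
GA   10 ; 2
XX
//
"""
sig = parse_signature(example)
print(sig["FA"], len(sig["positions"]))
```

Unknown record codes are silently skipped in this sketch. A stricter reader might prefer to raise an error instead.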

Important

  1. If a signature file is written by hand, the gap data must be listed in order of increasing gap size.
  2. In the current implementation, window size records always have the value of 0. These should be changed manually unless a very rigid pattern is required. A future implementation will provide a range of methods for generating values of window size depending upon the alignment (window size is identified by the WSIZ record in the signature output file).
  3. Siggen presumes that standard SCOP domain identifiers are given in the input alignment if the id is 7 characters long and the first character is a 'd' or 'D'. In this case the contact data for that chain will be parsed. Otherwise contact data for chain 1 will be parsed.
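
Points 1 and 3 above can be expressed as two small Python helpers. This is an illustrative sketch, not code from siggen. The function names are hypothetical and the example identifier is made up.

```python
def looks_like_scop_domain_id(identifier):
    """Rule from point 3: 7 characters long and starting with 'd' or 'D'."""
    return len(identifier) == 7 and identifier[0] in "dD"

def sort_gap_records(gap_records):
    """Point 1: return (gap_size, frequency) pairs in order of increasing gap size."""
    return sorted(gap_records, key=lambda record: record[0])

print(looks_like_scop_domain_id("d1abc__"))   # made-up identifier
print(sort_gap_records([(12, 3), (10, 2)]))
```

Sorting GA records this way before writing a hand-edited signature file is one way to satisfy the ordering requirement.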

Data files

siggen reads in a protein residue comparison matrix. By default, this is EBLOSUM62.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

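The directories are searched in the following order:

  1. . (your current directory)
  2. .embossdata (under your current directory)
  3. ~/ (your home directory)
  4. ~/.embossdata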

Notes

None.

References

Ison JC, Blades MJ, Bleasby AJ, Daniel SC, Parish JH "Key residues approach to the definition of protein families and analysis of sparse family signatures" (2000) PROTEINS: Structure, Function and Genetics 40:330-341

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name  Description
contacts      Reads coordinate files and writes files of intra-chain residue-residue contact data
dichet        Parse dictionary of heterogen groups
hmmgen        Generates a hidden Markov model for each alignment in a directory
interface     Reads coordinate files and writes files of inter-chain residue-residue contact data
profgen       Generates various profiles for each alignment in a directory
psiblasts     Runs PSI-BLAST given scopalign alignments
scopalign     Generate alignments for families in a scop classification file by using STAMP
scoprep       Reorder scop classification file so that the representative structure of each family is given first
scopreso      Removes low resolution domains from a scop classification file
seqalign      Generate extended alignments for families in a scop families file by using CLUSTALW with seed alignments
seqsearch     Generate files of hits for families in a scop classification file by using PSI-BLAST with seed alignments
seqsort       Reads multiple files of hits and writes a non-ambiguous file of hits (scop families file) plus a validation file
seqwords      Generate file of hits for scop families by searching swissprot with keywords
sigscan       Scans a signature against swissprot and writes a signature hits file

Author(s)

This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)

History

Written (June 2001) - Jon Ison.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments