domainer

 

Function

Reads protein coordinate files and writes domain coordinate files

Description

Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. A knowledge of these relationships is crucial to our understanding of the evolution of proteins and of development. It will also play an important role in the analysis of the sequence data that is being produced by worldwide genome projects.

The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB).

domainer reads a SCOP classification file in EMBL-like format, generated by the EMBOSS applications scope or nrscope, and clean protein coordinate files in EMBL-like format, generated by the coorde application (not currently in EMBOSS; email Jon Ison, jison@hgmp.mrc.ac.uk). For each domain in the SCOP classification file, domainer writes clean domain coordinate files in both EMBL-like and PDB formats. Each output file contains coordinates for a single SCOP domain. Where multiple models were determined, the data in the domain files correspond to the first model. In the rare cases where a domain comprises more than one chain, the data are presented as belonging to a single chain (i.e. a single sequence, chain identifier, etc. is given).

Usage

Here is a sample session with domainer:


% domainer
Build domain coordinate files
Name of scop file for input (embl-like format) [Escop.dat]: /data/scop/Escop.dat
Location of coordinate files for input (embl-like format) [./]: /data/cpdb/
Location of coordinate files for output (embl-like format) [./]:
Extension of coordinate files (embl-like format) [.pxyz]:
Location of coordinate files for output (pdb format) [./]:
Extension of coordinate files (pdb format) [.ent]:
Name of log file for the embl-like format build [domainer.log1]: log.1
Name of log file for the pdb format build [domainer.log2]: log.2
D3SDHA_
D3SDHB_
D3HBIA_
D3HBIB_
D4SDHA_
D4SDHB_
D4HBIA_
D4HBIB_
D5HBIA_
D5HBIB_
D7HBIA_
D7HBIB_

Command line arguments

   Mandatory qualifiers:
   -scop               infile     Name of scop classification file (embl
                                  format input)
  [-cpdb]              string     Location of protein coordinate files (embl
                                  format input)
  [-cpdbextn]          string     Extension of coordinate files (embl format)
  [-cpdbscop]          string     Location of domain coordinate files (embl
                                  format output)
  [-pdbscop]           string     Location of domain coordinate files (pdb
                                  format output)
  [-pdbextn]           string     Extension of coordinate files (pdb format)
   -cpdberrf           outfile    Name of log file for the embl format build
   -pdberrf            outfile    Name of log file for the pdb format build

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers       Description                                               Allowed values          Default
-scop                      Name of scop classification file (embl format input)      Input file              Escop.dat
[-cpdb] (Parameter 1)      Location of protein coordinate files (embl format input)  Any string is accepted  ./
[-cpdbextn] (Parameter 2)  Extension of coordinate files (embl format)               Any string is accepted  .pxyz
[-cpdbscop] (Parameter 3)  Location of domain coordinate files (embl format output)  Any string is accepted  ./
[-pdbscop] (Parameter 4)   Location of domain coordinate files (pdb format output)   Any string is accepted  ./
[-pdbextn] (Parameter 5)   Extension of coordinate files (pdb format)                Any string is accepted  .ent
-cpdberrf                  Name of log file for the embl format build                Output file             domainer_embl.log
-pdberrf                   Name of log file for the pdb format build                 Output file             domainer_pdb.log

Optional qualifiers        (none)
Advanced qualifiers        (none)
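
For scripted runs, the qualifiers listed above can be given on the command line instead of being entered at the prompts. The Python sketch below only assembles such a command and does not run it. The helper name build_domainer_command is hypothetical, the default values are copied from the table above, and the sketch assumes that domainer is on the PATH and accepts the usual EMBOSS "-qualifier value" form.

```python
import shlex


def build_domainer_command(scop, cpdb="./", cpdbextn=".pxyz", cpdbscop="./",
                           pdbscop="./", pdbextn=".ent",
                           cpdberrf="domainer_embl.log",
                           pdberrf="domainer_pdb.log"):
    """Assemble an argument list for a non-interactive domainer run.

    Hypothetical helper: qualifier names and defaults are copied from the
    documentation above; nothing is executed here.
    """
    return [
        "domainer",
        "-scop", scop,           # SCOP classification file (EMBL-like input)
        "-cpdb", cpdb,           # directory of clean protein coordinate files
        "-cpdbextn", cpdbextn,   # extension of EMBL-like coordinate files
        "-cpdbscop", cpdbscop,   # output directory, EMBL-like domain files
        "-pdbscop", pdbscop,     # output directory, PDB-format domain files
        "-pdbextn", pdbextn,     # extension of PDB-format coordinate files
        "-cpdberrf", cpdberrf,   # log file for the EMBL-like build
        "-pdberrf", pdberrf,     # log file for the PDB build
    ]


# Same choices as the sample session, expressed as one command line.
cmd = build_domainer_command("/data/scop/Escop.dat", cpdb="/data/cpdb/",
                             cpdberrf="log.1", pdberrf="log.2")
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) once the qualifiers have been checked against the installed version with "domainer -help".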

Input file format

The EMBL-like format used for the input clean protein data (and the output domain format) uses the following records:

(1) ID - Either the 4-character PDB identifier code (for clean protein coordinate files) or the 7-character domain identifier code taken from scop (for domain coordinate files; see documentation for the EMBOSS application scope for further info.)

(2) DE - compound information. Text from the COMPND records of the original pdb file is given.

(3) OS - protein source information. Text from the SOURCE records of the original pdb file is given.

(4) EX - experimental information. The text 'nmr_or_model' (for nuclear magnetic resonance and model structures) or 'xray' (for structures determined by X-ray crystallography) appears as appropriate after the text 'METHOD'. The resolution of X-ray structures, or '0' for structures of type 'nmr_or_model', is given after 'RESO'. The number of models and the number of polypeptide chains are given after 'NMOD' and 'NCHA' respectively. For domain coordinate files, 1 is always given for both. Following the EX record, the file has a section containing CN, IN and SQ records (see below) for each chain.

(5) CN - chain number. The number given in brackets after this record indicates the start of a section of chain-specific data.

(6) IN - chain-specific data. The character given after ID is the PDB chain identifier (a '.' is given in cases where a chain identifier was not specified in the pdb file or, for domain coordinate files, where the domain comprises more than one chain). The number of amino acid residues in the chain (or in the chains that make up a domain) is given after NR. The numbers of atoms in heterogens and water molecules are given after NH and NW respectively. Domain coordinate files do not include coordinates for these groups, so a value of 0 is always given.

(7) SQ - protein sequence. The number of residues is given before AA on the first line. The protein sequence is given on subsequent lines.

(8) CO - coordinate data. The columns of the records are as follows.

  1. CO is always given.
  2. Model number (always 1 for domain coordinate files).
  3. Chain number (always 1 for domain coordinate files).
  4. Either P (a protein atom), H (a heterogen atom) or W (an atom in a water molecule).
  5. Position of the residue in the protein sequence given in the SQ record (for protein atoms) or a sequential count of the atoms (for heterogens and water).
  6. Residue number according to the original pdb file, or a sequential count of the atoms (for heterogens and water).
  7. Single character amino acid code or a '.' (for heterogens and water).
  8. 3-character residue identifier code.
  9. Atom type.
  10. The x orthogonal coordinate.
  11. The y orthogonal coordinate.
  12. The z orthogonal coordinate.
  13. Occupancy.
  14. Temperature factor.

(9) XX - Used for spacing.

(10) // - Given on the last line of the file only.
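
The 14 columns of a CO record described above can be read with a simple whitespace split. The following Python sketch illustrates this; the names CoRecord and parse_co_line are hypothetical, and the sketch assumes that fields are separated by whitespace and never contain embedded spaces. The residue number from the original pdb file is kept as a string, since PDB residue numbers can carry an insertion code.

```python
from typing import NamedTuple


class CoRecord(NamedTuple):
    """One CO line; field order follows columns 2-14 described above."""
    model: int          # model number
    chain: int          # chain number
    group: str          # P (protein), H (heterogen) or W (water)
    seq_pos: int        # position in the SQ sequence, or atom count
    pdb_resnum: str     # residue number from the original pdb file
    aa1: str            # single-character amino acid code, or '.'
    res3: str           # 3-character residue identifier
    atom: str           # atom type
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float


def parse_co_line(line):
    """Parse one CO record, assuming whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 14 or fields[0] != "CO":
        raise ValueError("not a 14-column CO record: %r" % line)
    (_, model, chain, group, seq_pos, pdb_resnum,
     aa1, res3, atom, x, y, z, occ, temp) = fields
    return CoRecord(int(model), int(chain), group, int(seq_pos), pdb_resnum,
                    aa1, res3, atom, float(x), float(y), float(z),
                    float(occ), float(temp))


rec = parse_co_line(
    "CO   1    1    P    1     1     V    VAL    N      7.155   17.725"
    " 4.424     1.00    37.82")
print(rec.res3, rec.atom, rec.x, rec.y, rec.z)
```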

Output file format

The PDB format is explained at:
http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html
domainer writes the following records in PDB format:

(1) HEADER - bibliographic information. The text 'CLEANED-UP PDB FILE FOR SCOP DOMAIN XXXXXXX' is always given (where XXXXXXX is a 7-character domain identifier code).

(2) TITLE - bibliographic information. The text ' THIS FILE IS MISSING MOST RECORDS FROM THE ORIGINAL PDB FILE' is always given.

(3) COMPND - compound information. The COMPND records from the original pdb file are given.

(4) SOURCE - protein source information. The SOURCE records from the original PDB file are given.

(5) REMARK - remark records. Remark records are used for spacing. One REMARK line containing the protein resolution is always given.

(6) SEQRES - protein sequence.

(7) ATOM - atomic coordinates.

(8) TER - indicates the end of a chain.

The following is an example of an excerpt from an output clean domain coordinate file (PDB format):


HEADER     CLEANED-UP PDB FILE FOR SCOP DOMAIN D1HBBA_
TITLE      THIS FILE IS MISSING MOST RECORDS FROM THE ORIGINAL PDB FILE
COMPND     HEMOGLOBIN A (DEOXY, LOW SALT, 100MM CL)
SOURCE     HUMAN (HOMO SAPIENS)
REMARK
REMARK     RESOLUTION. 1.90  ANGSTROMS.
REMARK
SEQRES   1 A  141  VAL LEU SER PRO ALA ASP LYS THR ASN VAL LYS ALA ALA
SEQRES   2 A  141  TRP GLY LYS VAL GLY ALA HIS ALA GLY GLU TYR GLY ALA
SEQRES   3 A  141  GLU ALA LEU GLU ARG MET PHE LEU SER PHE PRO THR THR
SEQRES   4 A  141  LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER
SEQRES   5 A  141  ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA
SEQRES   6 A  141  LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN
SEQRES   7 A  141  ALA LEU SER ALA LEU SER ASP LEU HIS ALA HIS LYS LEU
SEQRES   8 A  141  ARG VAL ASP PRO VAL ASN PHE LYS LEU LEU SER HIS CYS
SEQRES   9 A  141  LEU LEU VAL THR LEU ALA ALA HIS LEU PRO ALA GLU PHE
SEQRES  10 A  141  THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA
SEQRES  11 A  141  SER VAL SER THR VAL LEU THR SER LYS TYR ARG
ATOM      1  N   VAL A   1       7.155  17.725   4.424  1.00 37.82           N
ATOM      2  CA  VAL A   1       7.854  18.800   3.718  1.00 35.10           C
ATOM      3  C   VAL A   1       9.366  18.565   3.754  1.00 31.92           C
ATOM      4  O   VAL A   1       9.861  17.961   4.721  1.00 35.01           O
ATOM      5  CB  VAL A   1       7.529  20.168   4.360  1.00 47.63           C
ATOM      6  CG1 VAL A   1       7.806  21.300   3.369  1.00 62.84           C
ATOM      7  CG2 VAL A   1       6.136  20.244   4.936  1.00 54.85           C
ATOM      8  N   LEU A   2      10.032  19.062   2.731  1.00 27.38           N
ATOM      9  CA  LEU A   2      11.496  18.967   2.657  1.00 23.24           C
ATOM     10  C   LEU A   2      12.077  20.110   3.496  1.00 22.99           C
ATOM     11  O   LEU A   2      11.672  21.259   3.289  1.00 25.22           O
ATOM     12  CB  LEU A   2      11.924  19.005   1.204  1.00 18.04           C
ATOM     13  CG  LEU A   2      11.563  17.855   0.286  1.00 17.80           C
ATOM     14  CD1 LEU A   2      12.166  18.109  -1.097  1.00 20.08           C
ATOM     15  CD2 LEU A   2      12.116  16.542   0.839  1.00 13.84           C
ATOM     16  N   SER A   3      12.979  19.784   4.391  1.00 22.22           N
ATOM     17  CA  SER A   3      13.652  20.792   5.257  1.00 20.53           C
ATOM     18  C   SER A   3      14.871  21.318   4.505  1.00 18.31           C
ATOM     19  O   SER A   3      15.273  20.709   3.496  1.00 17.73           O
ATOM     20  CB  SER A   3      14.084  20.042   6.534  1.00 17.61           C
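
As a rough consistency check, the SEQRES records of a PDB-format domain file can be converted to a one-letter sequence and compared with the SQ record of the corresponding EMBL-like file. The Python sketch below is one way to do that; the function name seqres_to_sequence is hypothetical. It assumes the standard PDB layout in which residue names start at column 20 of each SEQRES line, and it maps any residue name outside the 20 standard amino acids to 'X'.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequence(lines):
    """Return the one-letter sequence encoded by the SEQRES lines."""
    residues = []
    for line in lines:
        if line.startswith("SEQRES"):
            # Residue names occupy columns 20 onwards (index 19 in Python).
            residues.extend(line[19:].split())
    # Unknown residue names are reported as 'X' (an assumption of this sketch).
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)


example = [
    "SEQRES   1 A  141  VAL LEU SER PRO ALA ASP LYS THR ASN VAL LYS ALA ALA",
    "SEQRES  11 A  141  SER VAL SER THR VAL LEU THR SER LYS TYR ARG",
]
print(seqres_to_sequence(example))
```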

domainer also writes the clean domain coordinate files in EMBL-like format. This format is described in the Input file format section of this document, because the same format is used for the input clean protein data and the output clean domain data.

The following is an example of an excerpt from an output clean domain coordinate file (EMBL-like format):


ID   D1HBBA_
XX
DE   Co-ordinates for SCOP domain D1HBBA_
XX
OS   See Escop.dat for domain classification
XX
EX   METHOD xray; RESO 1.90; NMOD 1; NCHA 1;
XX
CN   [1]
XX
IN   ID A; NR 141; NH 0; NW 0;
XX
SQ   SEQUENCE   141 AA;  15127 MW;  5EC7DB1E CRC32;
     VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK
     KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA
     VHASLDKFLA SVSTVLTSKY R
XX
CO   1    1    P    1     1     V    VAL    N      7.155   17.725 4.424     1.00    37.82
CO   1    1    P    1     1     V    VAL    CA     7.854   18.800 3.718     1.00    35.10
CO   1    1    P    1     1     V    VAL    C      9.366   18.565 3.754     1.00    31.92
CO   1    1    P    1     1     V    VAL    O      9.861   17.961 4.721     1.00    35.01
CO   1    1    P    1     1     V    VAL    CB     7.529   20.168 4.360     1.00    47.63
CO   1    1    P    1     1     V    VAL    CG1    7.806   21.300 3.369     1.00    62.84
CO   1    1    P    1     1     V    VAL    CG2    6.136   20.244 4.936     1.00    54.85
CO   1    1    P    2     2     L    LEU    N     10.032   19.062 2.731     1.00    27.38
CO   1    1    P    2     2     L    LEU    CA    11.496   18.967 2.657     1.00    23.24
CO   1    1    P    2     2     L    LEU    C     12.077   20.110 3.496     1.00    22.99
CO   1    1    P    2     2     L    LEU    O     11.672   21.259 3.289     1.00    25.22

domainer generates a log file, an excerpt of which is shown below. If there is a problem in processing a domain, three lines containing the record '//', the domain identifier code and an error message respectively are written. The text 'WARN filename not found' is given in cases where a clean coordinate file could not be found. 'ERROR filename file read error' or 'ERROR filename file write error' will be reported when an error was encountered during a file read or write respectively. Various other error messages may also be given (in case of difficulty email Jon Ison, jison@hgmp.mrc.ac.uk).


//
DS002__
WARN  Could not open for reading cpdb file s002.pxyz
//
DS003__
WARN  Could not open for reading cpdb file s003.pxyz
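
The three-line layout described above ('//', domain identifier code, message) makes the log files easy to summarise after a large build. The following Python sketch collects the entries into tuples. The function name parse_domainer_log is hypothetical, and the sketch assumes that the first word of the message is its severity (for example WARN or ERROR), as in the excerpt.

```python
def parse_domainer_log(text):
    """Return a list of (domain_id, level, detail) tuples from a log."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # A '//' line starts a new entry.
            current = []
            blocks.append(current)
        elif current is not None:
            current.append(line)

    entries = []
    for block in blocks:
        if len(block) < 2:
            continue  # incomplete entry: no message after the identifier
        domain_id = block[0]
        message = " ".join(block[1:])
        # Assumption: the first word of the message is the severity.
        level, _, detail = message.partition(" ")
        entries.append((domain_id, level, detail.strip()))
    return entries


log_text = """//
DS002__
WARN  Could not open for reading cpdb file s002.pxyz
//
DS003__
WARN  Could not open for reading cpdb file s003.pxyz
"""
for domain_id, level, detail in parse_domainer_log(log_text):
    print(domain_id, level, detail)
```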

Data files

The ready-made input and output data for domainer may be downloaded from the HGMP:

EMBL-like format clean protein coordinate files
EMBL-like format clean domain coordinate files
PDB-format clean domain coordinate files

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name    Description
aaindexextract  Extract data from AAINDEX
cutgextract     Extract data from CUTG
funky           Reads clean coordinate files and writes file of protein-heterogen contact data
groups          Removes redundant hits from a scop families file
hetparse        Converts raw dictionary of heterogen groups to a file in embl-like format
nrscope         Converts redundant EMBL-format SCOP file to non-redundant one
pdbparse        Parses pdb files and writes cleaned-up protein coordinate files
pdbtosp         Convert raw swissprot:pdb equivalence file to embl-like format
printsextract   Extract data from PRINTS
prosextract     Builds the PROSITE motif database for patmatmotifs to search
rebaseextract   Extract data from REBASE
scope           Convert raw scop classification file to embl-like format
scopnr          Removes redundant domains from a scop classification file
scopparse       Converts raw scop classification files to a file in embl-like format
scopseqs        Adds pdb and swissprot sequence records to a scop classification file
tfextract       Extract data from TRANSFAC

Author(s)

This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)

History

Written (Jan 2001) - Jon Ison.

Target users

This program is intended to be run by EMBOSS site maintainers or those responsible for setting up and maintaining protein 3D structural data for use by others.

Comments