EMBOSS: patmatmotifs

 

Function

Search a PROSITE motif database with a protein sequence

Description

patmatmotifs takes a protein sequence and compares it to the PROSITE database of motifs.

For a description of PROSITE, we can do no better than to quote the PROSITE user's documentation:

PROSITE is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs.

In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. To paraphrase Orwell, in Animal Farm, we can say that "some regions of a protein sequence are more equal than others" !

The use of protein sequence patterns (or motifs) to determine the function(s) of proteins is becoming very rapidly one of the essential tools of sequence analysis. This reality has been recognized by many authors, as it can be illustrated from the following citations from two of the most well known experts of protein sequence analysis, R.F. Doolittle and A.M. Lesk:

      "There are  many short  sequences  that  are  often  (but  not  always)
      diagnostics of certain binding properties or active sites. These can be
      set into a small subcollection and searched against your sequence (1)".

      "In some  cases, the structure and function of an unknown protein which
      is too  distantly related  to any  protein of known structure to detect
      its affinity  by overall  sequence alignment  may be  identified by its
      possession of  a particular  cluster of  residues types classified as a
      motifs. The  motifs, or  templates, or  fingerprints, arise  because of
      particular  requirements  of  binding  sites  that  impose  very  tight
      constraint on the evolution of portions of a protein sequence (2)."

The home web page of PROSITE is: http://www.expasy.ch/prosite/

It is common to find that a search of the PROSITE database against a protein sequence will report many matches to the short motifs that indicate post-translational modification sites, such as glycosylation, myristylation and phosphorylation sites. These matches are often unwanted, so they are not reported by default. You can turn reporting of these short motifs on by giving the '-noprune' option on the command line.
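As a rough illustration of what pruning means, the Python sketch below filters a list of motif names against the six simple sites named in the -[no]prune qualifier description. It is not the program's own code: the helper name `prune_simple_motifs` and the case-insensitive comparison are assumptions made for this example.

```python
# Names of the simple sites that are skipped when pruning is on,
# as listed in the -[no]prune qualifier description.
SIMPLE_MOTIFS = frozenset({
    "myristyl",
    "asn_glycosylation",
    "camp_phospho_site",
    "pkc_phospho_site",
    "ck2_phospho_site",
    "tyr_phospho_site",
})


def prune_simple_motifs(motif_names, prune=True):
    """Return motif names, dropping the simple sites when prune is true.

    Hypothetical helper. The comparison ignores case because the example
    report shows motif names in upper case (an assumption of this sketch).
    """
    if not prune:
        return list(motif_names)
    return [name for name in motif_names if name.lower() not in SIMPLE_MOTIFS]


names = ["11S_SEED_STORAGE", "ASN_GLYCOSYLATION", "PKC_PHOSPHO_SITE"]
print(prune_simple_motifs(names))               # default behaviour, like -prune
print(prune_simple_motifs(names, prune=False))  # like -noprune
```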

Your EMBOSS administrator must have set up the local EMBOSS PROSITE database using the utility 'prosextract' before this program will run.

Usage

Here is a sample session with patmatmotifs.

% patmatmotifs -full
Matching Prosite Motif Database to a single sequence.
Input sequence: sw:12s1_arath
Output file [12s1_arath.patmatmotifs]: 

Command line arguments

   Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
  [-outfile]           report     Output report file name

   Optional qualifiers:
   -full               boolean    Provide full documentation for matching
                                  patterns
   -[no]prune          boolean    Ignore simple patterns. If this is true then
                                  these simple post-translational
                                  modification sites are not reported:
                                  myristyl, asn_glycosylation,
                                  camp_phospho_site, pkc_phospho_site,
                                  ck2_phospho_site, and tyr_phospho_site.

   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Qualifier                  Description                                Allowed values     Default

Mandatory qualifiers
[-sequence] (Parameter 1)  Sequence USA                               Readable sequence  Required
[-outfile] (Parameter 2)   Output report file name                    Report file

Optional qualifiers
-full                      Provide full documentation for matching    Yes/No             No
                           patterns
-[no]prune                 Ignore simple patterns. If this is true    Yes/No             Yes
                           then these simple post-translational
                           modification sites are not reported:
                           myristyl, asn_glycosylation,
                           camp_phospho_site, pkc_phospho_site,
                           ck2_phospho_site, and tyr_phospho_site.

Advanced qualifiers
(none)
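
The qualifiers listed above can be assembled into a command line from a script. The following Python sketch only builds the argument list and does not run the program. The helper name `build_patmatmotifs_command` is hypothetical; the qualifier spellings are those given in this document.

```python
def build_patmatmotifs_command(sequence, outfile, full=False, prune=True):
    """Build an argument list for patmatmotifs (hypothetical helper)."""
    command = ["patmatmotifs", "-sequence", sequence, "-outfile", outfile]
    if full:
        command.append("-full")      # full documentation for matching patterns
    if not prune:
        command.append("-noprune")   # also report the simple modification sites
    return command


# Example: the sample session's input, with the simple sites reported as well.
print(" ".join(build_patmatmotifs_command(
    "sw:12s1_arath", "12s1_arath.patmatmotifs", full=True, prune=False)))
```

Keeping the arguments as a list, rather than one string, lets the result be handed to a process launcher without any shell quoting.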

Input file format

A protein sequence USA.

Output file format

The output is a standard EMBOSS report file.

The results can be output in one of several styles by using the command-line qualifier -rformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for further information on report formats.

By default patmatmotifs writes a 'dbmotif' report file.
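
In the example output in this section, each hit in a 'dbmotif' report appears as a group of "Key = value" lines (Length, Start, End, Motif). The Python sketch below collects those groups into dictionaries. It assumes the four lines always appear in that order, which is inferred from the single example here and not from a format specification; the function name `parse_dbmotif_hits` is hypothetical.

```python
import re

# Matches the hit lines only; header lines start with '#' and are skipped.
HIT_LINE = re.compile(r"^(Length|Start|End|Motif)\s*=\s*(.+)$")

SAMPLE = """\
# HitCount: 1
Length = 23
Start = position 282 of sequence
End = position 304 of sequence

Motif = 11S_SEED_STORAGE
"""


def parse_dbmotif_hits(report_text):
    """Return one dict per hit with keys length, start, end and motif."""
    hits = []
    current = {}
    for line in report_text.splitlines():
        found = HIT_LINE.match(line.strip())
        if not found:
            continue
        key, value = found.group(1), found.group(2).strip()
        if key == "Length":
            current = {"length": int(value)}   # a Length line opens a new hit
        elif key in ("Start", "End"):
            # Value looks like "position 282 of sequence"; keep the number.
            current[key.lower()] = int(re.search(r"\d+", value).group())
        else:
            current["motif"] = value           # a Motif line closes the hit
            hits.append(current)
            current = {}
    return hits


for hit in parse_dbmotif_hits(SAMPLE):
    print(hit)
```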

The output from the above example follows:


########################################
# Program: patmatmotifs
# Rundate: Thu Apr 11 13:53:51 2002
# Report_file: 12s1_arath.patmatmotifs
########################################

#=======================================
#
# Sequence: 12S1_ARATH     from: 1   to: 472
# HitCount: 1
#
# Full: Yes
# Prune: Yes
# Data_file: /packages/emboss_dev/gwilliam/emboss/emboss/emboss/data/PROSITE/prosite.lines
#
#=======================================

Length = 23
Start = position 282 of sequence
End = position 304 of sequence
 

Motif = 11S_SEED_STORAGE

HGRHGNGLEETICSARCTDNLDDPSRADVYKPQ
     |                     |
   282                     304


#---------------------------------------
#
# Motif: 11S_SEED_STORAGE
# Count: 1
#
# **********************************************
# * 11-S plant seed storage proteins signature *
# **********************************************
#
# Plant seed storage proteins, whose  principal function appears to be the major
# nitrogen  source for the developing plant,  can be classified, on the basis of
# their structure, into different families.  11-S are non-glycosylated  proteins
# which form hexameric structures [1,2].  Each of the subunits in the hexamer is
# itself composed of an acidic and a basic chain derived from a single precursor
# and linked  by a  disulfide bond.   This  structure is  shown in the following
# representation.
#
#                    +-------------------------+
#                    |                         |
#         xxxxxxxxxxxCxxxxxxxxxxxxxxxxxxxxxxNGxCxxxxxxxxxxxxxxxxxxxxxxx
#                                           *********
#         <------Acidic-subunit-------------><-----Basic-subunit------>
#         <-----------------About-480-to-500-residues----------------->
#
# 'C': conserved cysteine involved in a disulfide bond.
# '*': position of the pattern.
#
# Proteins that belong to the 11-S family are: pea and broad bean legumins, rape
# cruciferin, rice glutelins,  cotton beta-globulins, soybean glycinins, pumpkin
# 11-S globulin, oat globulin, sunflower helianthinin G3, etc.
#
# As a signature  pattern  for  this  family of proteins we used the region that
# includes the  conserved  cleavage  site between  the acidic and basic subunits
# (Asn-Gly) and a  proximal cysteine residue which is involved in the interchain
# disulfide bond.
#
# -Consensus pattern: N-G-x-[DE](2)-x-[LIVMF]-C-[ST]-x(11,12)-[PAG]-D
#                     [C is involved in a disulfide bond]
# -Sequences known to belong to this class detected by the pattern: ALL.
# -Other sequence(s) detected in SWISS-PROT: NONE.
# -Last update: June 1994 / Pattern and text revised.
#
# [ 1] Hayashi M., Mori H., Nishimura M., Akazawa T., Hara-Nishimura I.
#      Eur. J. Biochem. 172:627-632(1988).
# [ 2] Shotwell M.A., Afonso C., Davies E., Chesnut R.S., Larkins B.A.
#      Plant Physiol. 87:698-704(1988).
#
# ***************
#
#
#---------------------------------------
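
The consensus pattern in the documentation above uses PROSITE pattern syntax. To make the notation concrete, the Python sketch below translates such a pattern into a regular expression and applies it to the sequence fragment shown in the report. This illustrates the syntax only and is not how patmatmotifs is implemented; the function name `prosite_to_regex` is hypothetical, and the translation is simplified (for example, it does not handle anchors written inside square brackets).

```python
import re

# One pattern element, optionally followed by a repeat count: (n) or (n,m).
ELEMENT = re.compile(r"(?P<core>[^()]+)(?:\((?P<count>\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Translate PROSITE pattern syntax into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # '<' and '>' anchor a pattern to the N- or C-terminus.
        start = "^" if element.startswith("<") else ""
        end = "$" if element.endswith(">") else ""
        match = ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core = match.group("core")
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        count = match.group("count")
        repeat = "{" + count + "}" if count else ""
        parts.append(start + core + repeat + end)
    return "".join(parts)


PATTERN = "N-G-x-[DE](2)-x-[LIVMF]-C-[ST]-x(11,12)-[PAG]-D"
FRAGMENT = "HGRHGNGLEETICSARCTDNLDDPSRADVYKPQ"  # residues shown in the report

regex = prosite_to_regex(PATTERN)
hit = re.search(regex, FRAGMENT)
print(regex)
print(hit.group(), len(hit.group()))
```

The matched substring is 23 residues long, which agrees with the Length value in the report.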

Data files

The data and documentation from the PROSITE files are read automatically. These files must first have been generated and formatted by running prosextract.

Notes

This program is only useful if prosextract has been run beforehand.

References

If you want to refer to PROSITE in a publication you can do so by citing:

Bairoch A., Bucher P., Hofmann K. The PROSITE database, its status in 1997. Nucleic Acids Res. 25:217-221(1997).

Other references:

  1. Bairoch, A., Bucher P. (1994) PROSITE: recent developments. Nucleic Acids Research, Vol 22, No.17 3583-3589.
  2. Bairoch, A., (1992) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research, Vol 20, Supplement, 2013-2018.
  3. Peek, J., O'Reilly, T., Loukides, M., (1997) Unix Power Tools, 2nd Edition.
  4. Doolittle R.F. (In) Of URFs and ORFs: a primer on how to analyze derived amino acid sequences., University Science Books, Mill Valley, California, (1986).
  5. Lesk A.M. (In) Computational Molecular Biology, Lesk A.M., Ed., pp17-26, Oxford University Press, Oxford (1988).

Warnings

Your EMBOSS administrator must have set up the local EMBOSS PROSITE database using the utility 'prosextract' before this program will run.

Diagnostic Error Messages

The error message:

"Either EMBOSS_DATA undefined or PROSEXTRACT needs running"

indicates that your local EMBOSS administrator has not yet correctly set up the local EMBOSS PROSITE database using the utility 'prosextract'.

Exit status

It always exits with status 0.
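
Because the exit status is always 0, a calling script cannot rely on it to detect a missing PROSITE setup. One possible approach, sketched below in Python, is to look for the diagnostic message in the program's output. The sketch assumes the message appears on standard output or standard error, which this document does not state; the function names are hypothetical.

```python
import subprocess

# Diagnostic text quoted in the "Diagnostic Error Messages" section.
SETUP_ERROR = "Either EMBOSS_DATA undefined or PROSEXTRACT needs running"


def has_setup_error(output_text):
    """Return True if the captured output contains the setup diagnostic."""
    return SETUP_ERROR in output_text


def run_patmatmotifs(sequence, outfile):
    """Run patmatmotifs and raise if the PROSITE data looks unconfigured.

    Assumption: the diagnostic is written to stdout or stderr.
    """
    result = subprocess.run(
        ["patmatmotifs", "-sequence", sequence, "-outfile", outfile],
        stdin=subprocess.DEVNULL,   # never wait for interactive prompts
        capture_output=True,
        text=True,
        check=False,                # the exit status carries no information
    )
    if has_setup_error(result.stdout + result.stderr):
        raise RuntimeError("PROSITE data not set up; ask for prosextract to be run")
    return result
```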

Known bugs

None.

See also

Program name    Description
antigenic       Finds antigenic sites in proteins
digest          Protein proteolytic enzyme or reagent cleavage digest
fuzzpro         Protein pattern search
fuzztran        Protein pattern search after translation
helixturnhelix  Report nucleic acid binding motifs
oddcomp         Finds protein sequence regions with a biased composition
patmatdb        Search a protein sequence with a motif
pepcoil         Predicts coiled coil regions
preg            Regular expression search of a protein sequence
pscan           Scans proteins using PRINTS
sigcleave       Reports protein signal cleavage sites

Author(s)

This application was written by Sinead O'Leary (soleary@hgmp.mrc.ac.uk)

History

Completed May 13 1999.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments