EMBOSS: patmatmotifs
For a description of PROSITE, we can do no better than to quote the PROSITE user's documentation:
PROSITE is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs.
In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. To paraphrase Orwell, in Animal Farm, we can say that "some regions of a protein sequence are more equal than others" !
The use of protein sequence patterns (or motifs) to determine the function(s) of proteins is becoming very rapidly one of the essential tools of sequence analysis. This reality has been recognized by many authors, as it can be illustrated from the following citations from two of the most well known experts of protein sequence analysis, R.F. Doolittle and A.M. Lesk:
"There are many short sequences that are often (but not always) diagnostics of certain binding properties or active sites. These can be set into a small subcollection and searched against your sequence (1)". "In some cases, the structure and function of an unknown protein which is too distantly related to any protein of known structure to detect its affinity by overall sequence alignment may be identified by its possession of a particular cluster of residues types classified as a motifs. The motifs, or templates, or fingerprints, arise because of particular requirements of binding sites that impose very tight constraint on the evolution of portions of a protein sequence (2)."
The home web page of PROSITE is: http://www.expasy.ch/prosite/
It is common to find that a search of the PROSITE database against a protein sequence will report many matches to the short motifs that are indicative of post-translational modification sites, such as glycosylation, myristoylation and phosphorylation sites. These matches are often unwanted, so they are not reported by default. You can turn reporting of these short motifs on by giving the '-noprune' option on the command line.
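As a rough illustration of what pruning does, the following Python sketch filters a list of motif names against the simple patterns named in the -[no]prune qualifier description. It is not how patmatmotifs is implemented. The helper name `prune_hits`, the example hit list and the case-insensitive comparison are assumptions made for illustration only.

```python
# Names of the simple post-translational modification patterns that the
# -[no]prune qualifier description lists as suppressed.
SIMPLE_PATTERNS = frozenset({
    "myristyl",
    "asn_glycosylation",
    "camp_phospho_site",
    "pkc_phospho_site",
    "ck2_phospho_site",
    "tyr_phospho_site",
})


def prune_hits(motif_names, prune=True):
    """Return motif names, dropping the simple patterns when prune is True.

    Hypothetical helper: names are compared case-insensitively, which is
    an assumption rather than documented patmatmotifs behaviour.
    """
    if not prune:
        return list(motif_names)
    return [name for name in motif_names if name.lower() not in SIMPLE_PATTERNS]


# Made-up hit list, for illustration only.
hits = ["PKC_PHOSPHO_SITE", "11S_SEED_STORAGE", "MYRISTYL"]
print(prune_hits(hits))               # default behaviour (-prune)
print(prune_hits(hits, prune=False))  # equivalent of -noprune
```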
Your EMBOSS administrator must have set up the local EMBOSS PROSITE database using the utility 'prosextract' before this program will run.
% patmatmotifs -full
Matching Prosite Motif Database to a single sequence.
Input sequence: sw:12s1_arath
Output file [12s1_arath.patmatmotifs]:
Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
  [-outfile]           report     Output report file name

Optional qualifiers:
   -full               boolean    Provide full documentation for matching patterns
   -[no]prune          boolean    Ignore simple patterns. If this is true then these simple post-translational modification sites are not reported: myristyl, asn_glycosylation, camp_phospho_site, pkc_phospho_site, ck2_phospho_site, and tyr_phospho_site.

Advanced qualifiers: (none)

General qualifiers:
   -help               boolean    Report command line options. More information on associated and general qualifiers can be found with -help -verbose
Mandatory qualifiers | | Allowed values | Default |
---|---|---|---|
[-sequence] (Parameter 1) | Sequence USA | Readable sequence | Required |
[-outfile] (Parameter 2) | Output report file name | Report file | |
Optional qualifiers | | Allowed values | Default |
-full | Provide full documentation for matching patterns | Yes/No | No |
-[no]prune | Ignore simple patterns. If this is true then these simple post-translational modification sites are not reported: myristyl, asn_glycosylation, camp_phospho_site, pkc_phospho_site, ck2_phospho_site, and tyr_phospho_site. | Yes/No | Yes |
Advanced qualifiers | | Allowed values | Default |
(none) | | | |
The output is a standard EMBOSS report file.
The results can be output in one of several styles by using the command-line qualifier -rformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq
See: http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for further information on report formats.
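When patmatmotifs is driven from a script, the qualifiers described above can be assembled programmatically. The Python sketch below builds an argument list using only qualifiers documented on this page (-sequence, -outfile, -full, -noprune, -rformat) and rejects report format names that are not in the list above. The helper name `build_command` is hypothetical, and the sketch only constructs the command; running it requires a local EMBOSS installation.

```python
# Report format names as listed in this documentation.
REPORT_FORMATS = (
    "embl", "genbank", "gff", "pir", "swiss", "trace", "listfile",
    "dbmotif", "diffseq", "excel", "feattable", "motif", "regions",
    "seqtable", "simple", "srs", "table", "tagseq",
)


def build_command(sequence, outfile, full=False, prune=True, rformat=None):
    """Assemble a patmatmotifs argument list (hypothetical helper)."""
    if rformat is not None and rformat not in REPORT_FORMATS:
        raise ValueError(f"unknown report format: {rformat!r}")
    cmd = ["patmatmotifs", "-sequence", sequence, "-outfile", outfile]
    if full:
        cmd.append("-full")          # full documentation for matching patterns
    if not prune:
        cmd.append("-noprune")       # also report the simple patterns
    if rformat is not None:
        cmd += ["-rformat", rformat]  # override the default report format
    return cmd


cmd = build_command("sw:12s1_arath", "12s1_arath.patmatmotifs",
                    full=True, rformat="gff")
print(" ".join(cmd))

# To run the command (needs EMBOSS installed and a PROSITE data set):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.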
By default patmatmotifs writes a 'dbmotif' report file.
The output from the above example follows:
########################################
# Program: patmatmotifs
# Rundate: Thu Apr 11 13:53:51 2002
# Report_file: 12s1_arath.patmatmotifs
########################################

#=======================================
#
# Sequence: 12S1_ARATH from: 1 to: 472
# HitCount: 1
#
# Full: Yes
# Prune: Yes
# Data_file: /packages/emboss_dev/gwilliam/emboss/emboss/emboss/data/PROSITE/prosite.lines
#
#=======================================

Length = 23
Start = position 282 of sequence
End = position 304 of sequence

Motif = 11S_SEED_STORAGE

HGRHGNGLEETICSARCTDNLDDPSRADVYKPQ
     |                     |
   282                     304

#---------------------------------------
#
# Motif: 11S_SEED_STORAGE
# Count: 1
#
# **********************************************
# * 11-S plant seed storage proteins signature *
# **********************************************
#
# Plant seed storage proteins, whose principal function appears to be the major
# nitrogen source for the developing plant, can be classified, on the basis of
# their structure, into different families. 11-S are non-glycosylated proteins
# which form hexameric structures [1,2]. Each of the subunits in the hexamer is
# itself composed of an acidic and a basic chain derived from a single precursor
# and linked by a disulfide bond. This structure is shown in the following
# representation.
#
#                     +-------------------------+
#                     |                         |
#          xxxxxxxxxxxCxxxxxxxxxxxxxxxxxxxxxxNGxCxxxxxxxxxxxxxxxxxxxxxxx
#                                            *********
#          <------Acidic-subunit-------------><-----Basic-subunit------>
#          <-----------------About-480-to-500-residues----------------->
#
# 'C': conserved cysteine involved in a disulfide bond.
# '*': position of the pattern.
#
# Proteins that belong to the 11-S family are: pea and broad bean legumins, rape
# cruciferin, rice glutelins, cotton beta-globulins, soybean glycinins, pumpkin
# 11-S globulin, oat globulin, sunflower helianthinin G3, etc.
#
# As a signature pattern for this family of proteins we used the region that
# includes the conserved cleavage site between the acidic and basic subunits
# (Asn-Gly) and a proximal cysteine residue which is involved in the interchain
# disulfide bond.
#
# -Consensus pattern: N-G-x-[DE](2)-x-[LIVMF]-C-[ST]-x(11,12)-[PAG]-D
#                     [C is involved in a disulfide bond]
# -Sequences known to belong to this class detected by the pattern: ALL.
# -Other sequence(s) detected in SWISS-PROT: NONE.
# -Last update: June 1994 / Pattern and text revised.
#
# [ 1] Hayashi M., Mori H., Nishimura M., Akazawa T., Hara-Nishimura I.
#      Eur. J. Biochem. 172:627-632(1988).
# [ 2] Shotwell M.A., Afonso C., Davies E., Chesnut R.S., Larkins B.A.
#      Plant Physiol. 87:698-704(1988).
#
# ***************
#
#
#---------------------------------------
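The consensus pattern in the report is written in PROSITE pattern syntax. To show how such a pattern relates to the reported hit, the Python sketch below translates it into a regular expression and applies it to the sequence fragment printed in the report. The helper `prosite_to_regex` is hypothetical and handles only a subset of the syntax (`x`, `[..]`, `{..}`, repeat counts, and the `<` and `>` anchors). The fragment start position of 277 is an assumption: it supposes the report shows five flanking residues before the hit at position 282. This is not the matching code used by patmatmotifs.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression.

    Hypothetical helper covering only: x, [..], {..}, (n), (n,m), < and >.
    """
    out = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # x stands for any residue
            core = "."
        elif core.startswith("{"):       # {..} means any residue except these
            core = "[^" + core[1:-1] + "]"
        count = "{" + repeat + "}" if repeat else ""
        out.append(prefix + core + count + suffix)
    return "".join(out)


PATTERN = "N-G-x-[DE](2)-x-[LIVMF]-C-[ST]-x(11,12)-[PAG]-D"
FRAGMENT = "HGRHGNGLEETICSARCTDNLDDPSRADVYKPQ"  # as printed in the report
FRAGMENT_START = 277  # assumption: five flanking residues precede position 282

regex = prosite_to_regex(PATTERN)
hit = re.search(regex, FRAGMENT)
start = FRAGMENT_START + hit.start()   # 1-based position of first matched residue
end = FRAGMENT_START + hit.end() - 1   # 1-based position of last matched residue
print(regex)
print(hit.group(), start, end, end - start + 1)
```

Under that assumption the computed start, end and length agree with the Start, End and Length values given in the report.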
Bairoch A., Bucher P., Hofmann K. The PROSITE database, its status in 1997. Nucleic Acids Res. 25:217-221(1997).
Other references:
"Either EMBOSS_DATA undefined or PROSEXTRACT needs running"
indicates that your local EMBOSS administrator has not yet correctly set up the local EMBOSS PROSITE database using the utility 'prosextract'.
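A script that wraps patmatmotifs may want to check for this condition before launching the program. The Python sketch below is one possible pre-flight check. It assumes the extracted data live in a `PROSITE/prosite.lines` file beneath the directory named by EMBOSS_DATA, a layout inferred from the Data_file path in the sample report; real installations may locate their data differently. The helper name `check_prosite_data` is hypothetical.

```python
import os
from pathlib import Path


def check_prosite_data(environ=None):
    """Return a list of problems found; an empty list means the check passed.

    Hypothetical helper. The PROSITE/prosite.lines location is an assumption
    based on the Data_file path shown in the example report.
    """
    environ = os.environ if environ is None else environ
    data_dir = environ.get("EMBOSS_DATA")
    if not data_dir:
        return ["EMBOSS_DATA is not set"]
    lines_file = Path(data_dir) / "PROSITE" / "prosite.lines"
    if not lines_file.is_file():
        return [f"{lines_file} not found; prosextract may need running"]
    return []


for problem in check_prosite_data():
    print("warning:", problem)
```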
Program name | Description |
---|---|
antigenic | Finds antigenic sites in proteins |
digest | Protein proteolytic enzyme or reagent cleavage digest |
fuzzpro | Protein pattern search |
fuzztran | Protein pattern search after translation |
helixturnhelix | Report nucleic acid binding motifs |
oddcomp | Finds protein sequence regions with a biased composition |
patmatdb | Search a protein sequence with a motif |
pepcoil | Predicts coiled coil regions |
preg | Regular expression search of a protein sequence |
pscan | Scans proteins using PRINTS |
sigcleave | Reports protein signal cleavage sites |