pepstats

 

Function

Protein statistics

Description

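pepstats outputs a report of simple protein sequence information including:

  - Molecular weight
  - Number of residues
  - Average residue weight
  - Charge
  - Isoelectric point
  - For each type of amino acid: number, molar percent, DayhoffStat
  - For each physico-chemical class of amino acid: number, molar percent
  - Probability of protein expression in E. coli inclusion bodies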

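DayhoffStat is the amino acid's molar percent divided by its Dayhoff statistic. The Dayhoff statistic is the amino acid's relative occurrence per 1000 aa, normalised to 100 by rls@ebi.ac.uk (original work from 1993). In the example output further down, Ala has a molar percent of 12.222 and a DayhoffStat of 1.421, which corresponds to a Dayhoff statistic of about 8.6.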
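
The relationship between the Mole% and DayhoffStat columns can be checked with a few lines of Python. This is a minimal sketch, not the pepstats source: the function names are hypothetical, and the reference frequencies are assumptions back-calculated from the LACI_ECOLI example output in this document (the authoritative values live in Edayhoff.freq). With those values, the example output is consistent with DayhoffStat = molar percent / Dayhoff statistic.

```python
def mole_percent(count: int, total: int) -> float:
    """Molar percent of one residue type in a sequence of `total` residues."""
    return 100.0 * count / total


def dayhoff_stat(count: int, total: int, dayhoff_freq: float) -> float:
    """Molar percent divided by the reference (Dayhoff) frequency.

    Returns 0.0 when no reference frequency is available, which matches
    the 0.000 shown for B, X and Z in the example output.
    """
    if dayhoff_freq == 0:
        return 0.0
    return mole_percent(count, total) / dayhoff_freq


# ASSUMPTION: reference frequencies inferred from the example output,
# not copied from Edayhoff.freq.
ASSUMED_DAYHOFF = {"A": 8.6, "C": 2.9, "D": 5.5, "E": 6.0}

# Residue counts taken from the LACI_ECOLI example output (360 residues).
SAMPLE_COUNTS = {"A": 44, "C": 3, "D": 17, "E": 15}
SAMPLE_LENGTH = 360

for aa, count in SAMPLE_COUNTS.items():
    pct = mole_percent(count, SAMPLE_LENGTH)
    stat = dayhoff_stat(count, SAMPLE_LENGTH, ASSUMED_DAYHOFF[aa])
    print(f"{aa}  {count:3d}  {pct:7.3f}  {stat:6.3f}")
```

A DayhoffStat above 1 means the residue is more common in the sequence than in the reference composition; below 1 means it is rarer.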

The probability of expression in inclusion bodies is sometimes referred to as a type of solubility measure. However, a recombinant protein expressed in Escherichia coli can end up either soluble in the cytosol or insoluble in inclusion bodies, and a prediction from the Harrison model that a protein will probably be expressed in inclusion bodies does not mean that it cannot be obtained in soluble form. For example, the Thermotoga maritima cell division protein FtsA with a C-terminal His-tag has a 58% Harrison probability of being expressed in inclusion bodies, yet there was plenty of soluble protein in the E. coli cytosol (F. van den Ent and J. Lowe, EMBO J. 19, 5300-5307, 2000). Whether a protein is expressed in inclusion bodies depends not only on its sequence but also on many other factors, such as the E. coli strain, incubation temperature, type of expression vector, strength of promoter and medium.

Usage

Here is a sample session with pepstats.

% pepstats
Protein statistics
Input sequence: sw:laci_ecoli
Output file [laci_ecoli.pepstats]:

Command line arguments

   Mandatory qualifiers:
  [-sequencea]         sequence   Sequence USA
   -outfile            outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers:
   -[no]termini        boolean    Include charge at N and C terminus
   -aadata             string     Molecular weight data for amino acids

   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers                                                Allowed values          Default
[-sequencea] (Parameter 1)   Sequence USA                           Readable sequence       Required
-outfile                     Output file name                       Output file             <sequence>.pepstats

Optional qualifiers                                                 Allowed values          Default
(none)

Advanced qualifiers                                                 Allowed values          Default
-[no]termini                 Include charge at N and C terminus     Yes/No                  Yes
-aadata                      Molecular weight data for amino acids  Any string is accepted  Eamino.dat
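
For scripted use, the qualifiers listed above can be assembled into a non-interactive command line. The following is a minimal sketch: the helper name build_pepstats_command is hypothetical, and the qualifier spellings (-sequencea, -outfile, -notermini, -aadata) are taken from this document, so they are assumed to match the installed version.

```python
import shlex
from typing import List, Optional


def build_pepstats_command(
    sequence: str,
    outfile: str,
    termini: bool = True,
    aadata: Optional[str] = None,
) -> List[str]:
    """Return an argument list for running pepstats without prompts."""
    cmd = ["pepstats", "-sequencea", sequence, "-outfile", outfile]
    if not termini:
        # -termini is the documented default, so only the negated
        # form needs to be passed explicitly.
        cmd.append("-notermini")
    if aadata is not None:
        cmd += ["-aadata", aadata]
    return cmd


cmd = build_pepstats_command("sw:laci_ecoli", "laci_ecoli.pepstats")
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where pepstats is installed; building a list rather than a single string avoids shell quoting problems with unusual file names.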

Input file format

Any normal protein sequence USA (Uniform Sequence Address).

Output file format

Here is the output from the example run:


PEPSTATS of LACI_ECOLI from 1 to 360

Molecular weight = 38563.97             Residues = 360
Average Residue Weight  = 107.122       Charge   = 1.5
Isoelectric Point = 6.8820
Improbability of expression in inclusion bodies = 0.670

Residue         Number          Mole%           DayhoffStat
A = Ala         44              12.222          1.421
B = Asx         0               0.000           0.000
C = Cys         3               0.833           0.287
D = Asp         17              4.722           0.859
E = Glu         15              4.167           0.694
F = Phe         4               1.111           0.309
G = Gly         22              6.111           0.728
H = His         7               1.944           0.972
I = Ile         18              5.000           1.111
K = Lys         11              3.056           0.463
L = Leu         40              11.111          1.502
M = Met         10              2.778           1.634
N = Asn         12              3.333           0.775
P = Pro         14              3.889           0.748
Q = Gln         28              7.778           1.994
R = Arg         19              5.278           1.077
S = Ser         33              9.167           1.310
T = Thr         19              5.278           0.865
V = Val         34              9.444           1.431
W = Trp         2               0.556           0.427
X = Xaa         0               0.000           0.000
Y = Tyr         8               2.222           0.654
Z = Glx         0               0.000           0.000

Property        Residues                Number          Mole%
Tiny            (A+C+G+S+T)             121             33.611
Small           (A+B+C+D+G+N+P+S+T+V)   198             55.000
Aliphatic       (I+L+V)                 92              25.556
Aromatic        (F+H+W+Y)               21               5.833
Non-polar       (A+C+F+G+I+L+M+P+V+W+Y) 199             55.278
Polar           (D+E+H+K+N+Q+R+S+T+Z)   161             44.722
Charged         (B+D+E+H+K+R+Z)         69              19.167
Basic           (H+K+R)                 37              10.278
Acidic          (B+D+E+Z)               32               8.889   
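
The Property table is a set of sums over the Residue table. The sketch below recomputes it from the residue counts of the example run; the names PROPERTY_CLASSES, SAMPLE_COUNTS and property_table are hypothetical, and the class memberships are copied from the Residues column above rather than from the pepstats source.

```python
# Class memberships as printed in the "Residues" column of the report.
PROPERTY_CLASSES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "ILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}

# Residue counts from the LACI_ECOLI example output.
SAMPLE_COUNTS = {
    "A": 44, "B": 0, "C": 3, "D": 17, "E": 15, "F": 4, "G": 22, "H": 7,
    "I": 18, "K": 11, "L": 40, "M": 10, "N": 12, "P": 14, "Q": 28,
    "R": 19, "S": 33, "T": 19, "V": 34, "W": 2, "X": 0, "Y": 8, "Z": 0,
}


def property_table(counts):
    """Map each property class to (number of residues, mole percent)."""
    total = sum(counts.values())
    table = {}
    for name, members in PROPERTY_CLASSES.items():
        number = sum(counts.get(aa, 0) for aa in members)
        table[name] = (number, 100.0 * number / total)
    return table


for name, (number, pct) in property_table(SAMPLE_COUNTS).items():
    print(f"{name:<12}{number:>6}{pct:>10.3f}")
```

Because the classes overlap (for example, His is counted as aromatic, polar, charged and basic), the Number column does not add up to the sequence length.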

Data files

The Dayhoff statistic is read from the EMBOSS data file 'Edayhoff.freq'. You can inspect and modify this file by copying it into your current directory with the command: 'embossdata -fetch'.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

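The directories are searched in the following order:

  - . (your current directory)
  - .embossdata (under your current directory)
  - ~/ (your home directory)
  - ~/.embossdata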
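
A first-match lookup over the user directories can be sketched as follows. This is an illustration, not EMBOSS code: find_data_file and candidate_dirs are hypothetical names, the order used is current directory, its ".embossdata" subdirectory, home directory, then its ".embossdata" subdirectory, and falling back to the directory named by EMBOSS_DATA last is an assumption.

```python
import os
from pathlib import Path


def candidate_dirs(cwd: Path, home: Path):
    """User directories in the assumed search order."""
    return [cwd, cwd / ".embossdata", home, home / ".embossdata"]


def find_data_file(name, cwd=None, home=None):
    """Return the first existing copy of `name`, or None if there is none."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    dirs = candidate_dirs(cwd, home)
    # ASSUMPTION: the standard data directory is consulted after the
    # user directories, so a local copy overrides the distributed one.
    emboss_data = os.environ.get("EMBOSS_DATA")
    if emboss_data:
        dirs.append(Path(emboss_data))
    for directory in dirs:
        path = directory / name
        if path.is_file():
            return path
    return None


print(find_data_file("Edayhoff.freq"))
```

Under this scheme, a modified Edayhoff.freq placed in the current directory takes precedence over one in the home directory.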

Notes

None.

References

  1. Roger G. Harrison, "Expression of soluble heterologous proteins via fusion with NusA protein", inNovations 11, June 2000, pp. 4-7.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name    Description
backtranseq     Back translate a protein sequence
charge          Protein charge plot
checktrans      Reports STOP codons and ORF statistics of a protein sequence
compseq         Counts the composition of dimer/trimer/etc words in a sequence
emowse          Protein identification by mass spectrometry
freak           Residue/base frequency table or plot
iep             Calculates the isoelectric point of a protein
mwcontam        Shows molwts that match across a set of files
mwfilter        Filter noisy molwts from mass spec output
octanol         Displays protein hydropathy
pepinfo         Plots simple amino acid properties in parallel
pepwindow       Displays protein hydropathy
pepwindowall    Displays protein hydropathy of a set of sequences

Author(s)

This application was written by Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

History

Written (1999) - Alan Bleasby

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments