compseq
compseq can read in the result of a previous compseq analysis and use it to set the expected frequencies of the subsequences.
Unless you tell 'compseq' otherwise, it expects each word to be equally likely. The 'Expected' frequency of any dimer in a nucleotide sequence is therefore 1/16, which is simply the inverse of the number of possible dimers (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT).
Similarly, the 'Expected' frequency of any trimer is 1/64, etc.
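The default expectation described above can be written as a one-line calculation. The following is a minimal sketch, not compseq's own code; the function name `uniform_expected_frequency` and the `alphabet_size` parameter are illustrative assumptions.

```python
def uniform_expected_frequency(word_size: int, alphabet_size: int = 4) -> float:
    """Expected frequency of any one word when all words are equally likely.

    alphabet_size=4 corresponds to the nucleotides A, C, G and T.
    """
    if word_size < 1:
        raise ValueError("word_size must be at least 1")
    # There are alphabet_size ** word_size possible words of this length.
    return 1.0 / (alphabet_size ** word_size)


print(uniform_expected_frequency(2))  # dimers: 1/16
print(uniform_expected_frequency(3))  # trimers: 1/64
```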
This is not the case in real sequences, where there is a bias in favour of some words.
compseq cannot guess what the 'Expected' frequencies should be. You can, however, supply them by giving compseq the output of a previous compseq run on another set of sequences.
To do this, take a set of sequences that is representative of the type of sequence you expect and run compseq on it to obtain your expected word frequencies.
Then run compseq on the sequences you wish to investigate, giving it the expected frequencies established above. You specify the file of expected frequencies with '-infile filename' on the command line.
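The two-step procedure above amounts to computing word frequencies on a reference set, then dividing the frequencies observed in the query set by them. The sketch below illustrates that idea only; the function names and the toy sequences are hypothetical, and skipping words with no expected frequency is an assumption rather than compseq's documented behaviour.

```python
from collections import Counter


def word_frequencies(sequences, word_size):
    """Count overlapping words over all sequences and normalise to frequencies."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - word_size + 1):
            counts[seq[i:i + word_size]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def observed_over_expected(observed, expected):
    """Obs/Exp ratio per word; words with no expected frequency are skipped."""
    return {
        word: freq / expected[word]
        for word, freq in observed.items()
        if expected.get(word, 0) > 0
    }


# Hypothetical data: a reference set and a query set.
reference = ["ACGTACGTAC", "TTGACCA"]
query = ["ACGTTT"]

expected = word_frequencies(reference, 2)   # step 1: the representative set
observed = word_frequencies(query, 2)       # step 2: the sequences of interest
for word, ratio in sorted(observed_over_expected(observed, expected).items()):
    print(word, round(ratio, 3))
```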
To count the frequencies of dinucleotides in a file:

    % compseq embl:hsfau 2 result3.comp

To count the frequencies of hexanucleotides, without outputting the results of hexanucleotides that do not occur in the sequence:

    % compseq embl:hsfau 6 result6.comp -nozero

To count the frequencies of trinucleotides in frame 2 of a sequence and use a previously prepared compseq output to show the expected frequencies:

    % compseq embl:hsfau 3 result3.comp -frame 2 -in prev.comp
    Mandatory qualifiers:
      [-sequence]      seqall     Sequence database USA
      [-word]          integer    This is the size of word (n-mer) to count.
                                  Thus if you want to count codon frequencies,
                                  you should enter 3 here.
      [-outfile]       outfile    This is the results file.

    Optional qualifiers (* if not always prompted):
       -infile         infile     This is a file previously produced by
                                  'compseq' that can be used to set the
                                  expected frequencies of words in this
                                  analysis. The word size in the current run
                                  must be the same as the one in this results
                                  file. Obviously, you should use a file
                                  produced from protein sequences if you are
                                  counting protein sequence word frequencies,
                                  and you must use one made from nucleotide
                                  frequencies if you are analysing a
                                  nucleotide sequence.
       -frame          integer    The normal behaviour of 'compseq' is to
                                  count the frequencies of all words that
                                  occur by moving a window of length 'word' up
                                  by one each time. This option allows you to
                                  move the window up by the length of the word
                                  each time, skipping over the intervening
                                  words. You can count only those words that
                                  occur in a single frame of the word by
                                  setting this value to a number other than
                                  zero. If you set it to 1 it will only count
                                  the words in frame 1, 2 will only count the
                                  words in frame 2 and so on.
    *  -[no]ignorebz   boolean    The amino acid code B represents Asparagine
                                  or Aspartic acid and the code Z represents
                                  Glutamine or Glutamic acid. These are not
                                  commonly used codes and you may wish not to
                                  count words containing them, just noting
                                  them in the count of 'Other' words.
    *  -reverse        boolean    Set this to be true if you also wish to
                                  count words in the reverse complement of a
                                  nucleic sequence.
       -[no]zerocount  boolean    You can make the output results file much
                                  smaller if you do not display the words with
                                  a zero count.

    Advanced qualifiers: (none)

    General qualifiers:
       -help           boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Mandatory qualifiers | | | |
[-sequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
[-word] (Parameter 2) | This is the size of word (n-mer) to count. Thus if you want to count codon frequencies, you should enter 3 here. | Integer from 1 to 20 | 2 |
[-outfile] (Parameter 3) | This is the results file. | Output file | <sequence>.compseq |
Optional qualifiers | | | |
-infile | This is a file previously produced by 'compseq' that can be used to set the expected frequencies of words in this analysis. The word size in the current run must be the same as the one in this results file. You should use a file produced from protein sequences if you are counting protein sequence word frequencies, and you must use one made from nucleotide frequencies if you are analysing a nucleotide sequence. | Input file | Required |
-frame | The normal behaviour of 'compseq' is to count the frequencies of all words that occur by moving a window of length 'word' up by one each time. This option allows you to move the window up by the length of the word each time, skipping over the intervening words. You can count only those words that occur in a single frame of the word by setting this value to a number other than zero. If you set it to 1 it will only count the words in frame 1, 2 will only count the words in frame 2 and so on. | Integer 0 or more | 0 |
-[no]ignorebz | The amino acid code B represents Asparagine or Aspartic acid and the code Z represents Glutamine or Glutamic acid. These are not commonly used codes and you may wish not to count words containing them, just noting them in the count of 'Other' words. | Yes/No | Yes |
-reverse | Set this to be true if you also wish to count words in the reverse complement of a nucleic sequence. | Yes/No | No |
-[no]zerocount | You can make the output results file much smaller if you do not display the words with a zero count. | Yes/No | Yes |
Advanced qualifiers | | | |
(none) | | | |
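
The difference between the default sliding window, a single frame, and counting on the reverse complement can be illustrated with a short sketch. This is not compseq's implementation: it assumes that frame N starts at the N-th base (1-based) and then steps by the word length, and the names `count_words` and `reverse_complement` are hypothetical.

```python
from collections import Counter

# Complement table for unambiguous nucleotides only.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a nucleotide sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_words(seq, word_size, frame=0):
    """Count words in seq.

    frame == 0: move the window up by one base each time (all words).
    frame >= 1: start at that 1-based position and move up by word_size,
                so only words in that frame are counted (an assumption
                about how frames are numbered).
    """
    seq = seq.upper()
    if frame == 0:
        start, step = 0, 1
    else:
        start, step = frame - 1, word_size
    counts = Counter()
    for i in range(start, len(seq) - word_size + 1, step):
        counts[seq[i:i + word_size]] += 1
    return counts


seq = "ATGCCC"
print(count_words(seq, 3))            # every overlapping trimer
print(count_words(seq, 3, frame=1))   # trimers in frame 1 only
# Counting both strands, as a rough analogue of -reverse:
print(count_words(seq, 3) + count_words(reverse_complement(seq), 3))
```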
Header information and comments are preceded by a '#' character at the start of the line.
The Word size and the Total count are then given on separate lines.
The headers of the columns of results are preceded by a '#'.
The results columns are: the sub-sequence word, the observed count, the observed frequency, the expected frequency (which is read from the input file if one is given; otherwise it is the simple inverse of the number of words of the specified size that can be constructed), and the ratio of the observed to the expected frequency.
After a blank line at the end, the result for 'Other' words is given. This is the number of words whose sequence contains IUPAC ambiguity codes or other unusual characters.
Example:

    #
    # Output from 'compseq'
    #
    # The Expected frequencies are taken from the file: jjj.composition
    #
    # The input sequences are:
    #   jjj

    Word size       2
    Total count     196

    #
    # Word  Obs Count  Obs Frequency  Exp Frequency  Obs/Exp Frequency
    #
    AA      0          0.0000000      0.0000000      10000000000.0000000
    AC      18         0.0918367      0.0918367      1.0000004
    AG      8          0.0408163      0.0408163      1.0000007
    AT      12         0.0612245      0.0612245      0.9999998
    CA      3          0.0153061      0.0153061      1.0000015
    CC      1          0.0051020      0.0051020      1.0000080
    CG      16         0.0816327      0.0816327      0.9999994
    CT      15         0.0765306      0.0765306      1.0000002
    GA      16         0.0816327      0.0816327      0.9999994
    GC      13         0.0663265      0.0663265      1.0000005
    GG      5          0.0255102      0.0255102      1.0000002
    GT      18         0.0918367      0.0918367      1.0000004
    TA      19         0.0969388      0.0969388      0.9999997
    TC      4          0.0204082      0.0204082      0.9999982
    TG      22         0.1122449      0.1122449      1.0000000
    TT      5          0.0255102      0.0255102      1.0000002

    Other   21         0.0255102      0.1071429      0.2380951

The input data file format is exactly the same as the output file format.
compseq expects to read in a previous output file of this program. An error is produced if the word size of the current compseq job differs from that of the output file being read in.
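Because the input and output formats are identical, a small reader for the format described above is enough to reuse a previous result. The sketch below is illustrative only: it assumes whitespace-separated columns laid out as in the example, and it assumes that the observed frequency column of the previous run supplies the expected frequencies. The names `parse_compseq` and `expected_from_previous` are hypothetical.

```python
def parse_compseq(text):
    """Parse compseq-style output into (word_size, total_count, rows).

    rows maps each word to (obs_count, obs_freq, exp_freq, obs_exp_ratio).
    Lines starting with '#' and blank lines are ignored.
    """
    word_size = None
    total_count = None
    rows = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if line.startswith("Word size"):
            word_size = int(fields[-1])
        elif line.startswith("Total count"):
            total_count = int(fields[-1])
        elif len(fields) == 5:
            word, count, obs, exp, ratio = fields
            rows[word] = (int(count), float(obs), float(exp), float(ratio))
    return word_size, total_count, rows


def expected_from_previous(text, current_word_size):
    """Return {word: expected frequency}, rejecting a word-size mismatch."""
    word_size, _total, rows = parse_compseq(text)
    if word_size != current_word_size:
        raise ValueError(
            f"word size {current_word_size} does not match file ({word_size})"
        )
    # Assumption: the previous run's observed frequencies become the
    # expected frequencies; the 'Other' row is left out.
    return {w: r[1] for w, r in rows.items() if w != "Other"}
```

Raising an exception on a mismatch mirrors the error described above, although the wording of the real error message is not reproduced here.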
You can produce very large output files if you choose large values of wordsize.
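To get a feel for how quickly the output grows, the number of possible words can be tabulated against the word size. This is a rough sketch that assumes one output row per possible word when zero counts are displayed and a four-letter nucleotide alphabet; `possible_words` is a hypothetical helper.

```python
def possible_words(word_size, alphabet_size=4):
    """Number of distinct words of the given size over the alphabet."""
    return alphabet_size ** word_size


# Rows to expect when zero-count words are also written out.
for k in (2, 6, 10, 12):
    print(f"word size {k:2d}: {possible_words(k):,} possible words")
```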
If you run out of memory, you may see the program crash with a generic error message that is specific to your machine's operating system, but is probably a warning about writing to memory that the program does not own (e.g. "Segmentation fault" on a Solaris machine).
This is not a bug; it is a consequence of the way this program grabs large amounts of memory.
Program name | Description |
---|---|
backtranseq | Back translate a protein sequence |
banana | Bending and curvature plot in B-DNA |
btwisted | Calculates the twisting in a B-DNA sequence |
chaos | Create a chaos game representation plot for a sequence |
charge | Protein charge plot |
checktrans | Reports STOP codons and ORF statistics of a protein sequence |
dan | Calculates DNA RNA/DNA melting temperature |
emowse | Protein identification by mass spectrometry |
freak | Residue/base frequency table or plot |
iep | Calculates the isoelectric point of a protein |
isochore | Plots isochores in large DNA sequences |
mwcontam | Shows molwts that match across a set of files |
mwfilter | Filter noisy molwts from mass spec output |
octanol | Displays protein hydropathy |
pepinfo | Plots simple amino acid properties in parallel |
pepstats | Protein statistics |
pepwindow | Displays protein hydropathy |
pepwindowall | Displays protein hydropathy of a set of sequences |
wordcount | Counts words of a specified size in a DNA sequence |