wordcount

 

Function

Counts words of a specified size in a DNA sequence

Description

Displays all the words of the specified length, each with the number of
times it occurs in the sequence.
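
The counting described above can be sketched in a few lines of Python. This is an illustrative sketch, not the program's actual source: the function name count_words is hypothetical, and it assumes that words are counted in overlapping windows (each position in the sequence starts a new word) and that case is ignored.

```python
from collections import Counter


def count_words(sequence: str, wordsize: int = 4) -> list[tuple[str, int]]:
    """Count every overlapping word of length `wordsize` in `sequence`.

    Assumption: windows overlap, so a sequence of length n yields
    n - wordsize + 1 words in total.
    """
    if wordsize < 2:
        raise ValueError("wordsize must be 2 or more")
    seq = sequence.lower()
    # Slide a window of `wordsize` characters along the sequence.
    counts = Counter(seq[i:i + wordsize] for i in range(len(seq) - wordsize + 1))
    # most_common() returns (word, count) pairs, highest count first.
    return counts.most_common()


if __name__ == "__main__":
    for word, n in count_words("acgtacg", wordsize=3):
        print(f"{word}\t{n}")
```

Under the overlapping-window assumption, the counts for a sequence always sum to its length minus wordsize plus one, which is a quick sanity check on any output file.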

Usage

Here is a sample session with wordcount.

% wordcount embl:rnu68037 -wordsize=3
Counts words of a specified size in a DNA sequence
Output file [rnu68037.wordcount]:

Command line arguments

   Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
   -wordsize           integer    Word size
   -outfile            outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


   Mandatory qualifiers                          Allowed values      Default
   [-sequence] (Parameter 1)  Sequence USA       Readable sequence   Required
   -wordsize                  Word size          Integer 2 or more   4
   -outfile                   Output file name   Output file         <sequence>.wordcount
   Optional qualifiers                           Allowed values      Default
   (none)
   Advanced qualifiers                           Allowed values      Default
   (none)
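
For use from a script, all three mandatory qualifiers can be given on the command line so that nothing has to be entered at a prompt. The Python sketch below assembles such a command line in the same form as the sample session. The helper names build_wordcount_command and run_wordcount are hypothetical, and the sketch assumes the wordcount executable is on the PATH.

```python
import subprocess


def build_wordcount_command(sequence: str, wordsize: int, outfile: str) -> list[str]:
    """Return an argument list for wordcount using the documented qualifiers."""
    if wordsize < 2:
        # The qualifier table above allows only integers of 2 or more.
        raise ValueError("wordsize must be 2 or more")
    return [
        "wordcount",
        sequence,                  # [-sequence] is parameter 1
        f"-wordsize={wordsize}",
        f"-outfile={outfile}",
    ]


def run_wordcount(sequence: str, wordsize: int, outfile: str) -> None:
    """Run wordcount; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_wordcount_command(sequence, wordsize, outfile), check=True)


if __name__ == "__main__":
    print(" ".join(build_wordcount_command("embl:rnu68037", 3, "rnu68037.wordcount")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with USAs that contain characters such as ":".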

Input file format

Any sequence specified by a USA (Uniform Sequence Address).

Output file format

The output file produced by the example above is:


ctg     54
tgg     53
gcc     53
ggc     51
cgc     47
gct     47
gtg     40
tgc     39
cct     38
gcg     36
cca     29
ggg     26
cag     25
ctt     25
tcc     25
ggt     24
ccc     24
ctc     23
tgt     23
gca     22
cgt     22
ccg     22
cac     22
agc     21
acg     19
ttg     19
cgg     19
tcg     18
ttc     17
cat     17
agg     17
act     16
gtc     16
gag     16
aac     15
gga     14
atc     14
tct     14
tca     13
cta     13
atg     12
gtt     11
acc     11
gta     11
aca     10
tac     10
tga     10
caa     10
gac     9
agt     9
tag     9
ttt     8
cga     7
gat     6
taa     6
tat     5
aga     5
gaa     4
aat     3
ata     3
tta     3
att     3
aag     2
aaa     1

The file consists of two columns separated by spaces or TAB characters.

The first column lists the words of size wordsize. The second column gives the number of times each word occurs in the input sequence. In the example above, the lines are ordered by descending count.
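
Because the format is just two whitespace-separated columns, the file is easy to read back into a script. The following Python sketch shows one way to do so. The function name parse_wordcount is hypothetical, and skipping lines that do not have exactly two fields is an assumption made for robustness, not documented behaviour.

```python
def parse_wordcount(text: str) -> dict[str, int]:
    """Parse wordcount output text into a {word: count} mapping."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()  # splits on runs of spaces or TABs
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        word, count = fields
        counts[word] = int(count)
    return counts


if __name__ == "__main__":
    # First three lines of the example output above.
    sample = "ctg     54\ntgg     53\ngcc     53\n"
    print(parse_wordcount(sample))
```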

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 if successful.

Known bugs

None.

See also

Program name   Description
banana         Bending and curvature plot in B-DNA
btwisted       Calculates the twisting in a B-DNA sequence
chaos          Create a chaos game representation plot for a sequence
compseq        Counts the composition of dimer/trimer/etc words in a sequence
dan            Calculates DNA RNA/DNA melting temperature
freak          Residue/base frequency table or plot
isochore       Plots isochores in large DNA sequences

Author(s)

This application was written by Ian Longden (il@sanger.ac.uk), Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

Completed 27th November 1998.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments