Class Bio::FastaFormat
In: lib/bio/db/fasta.rb
Parent: DB

Handles a FASTA-formatted entry, such as:

  >id and/or some comments                    <== definition line
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC
  ATGCATGCATGC

The leading '>' can be omitted, and a trailing '>' is removed automatically.

Examples

  fasta_string = <<END_OF_STRING
  >gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
  MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
  VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
  NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
  IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
  INRISARRAAIHPYFQES
  END_OF_STRING

  f = Bio::FastaFormat.new(fasta_string)

  f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
  # "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
  # "VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
  # "NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
  # "IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
  # "INRISARRAAIHPYFQES\n"

Methods related to the name of the sequence

A wider range of methods for working with FASTA definition lines is available in Bio::FastaDefline, which is reached through the FastaFormat#identifiers method.

  f.entry_id #=> "gi|398365175"
  f.definition #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
  f.identifiers #=> Bio::FastaDefline instance
  f.accession #=> "NP_009718"
  f.accessions #=> ["NP_009718"]
  f.acc_version #=> "NP_009718.3"
  f.comment #=> nil
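To make the relationship between these values easier to see, here is a minimal plain-Ruby sketch that derives entry_id, acc_version and accession from a definition line. The helper name parse_simple_defline is hypothetical, and the sketch assumes the first word is either a plain ID or a run of "tag|value" pairs; Bio::FastaDefline handles many more identifier layouts than this.

```ruby
# Hypothetical helper, not BioRuby's FastaDefline. Assumption: the first word of
# the definition line is either a plain ID or a run of "tag|value" pairs.
def parse_simple_defline(definition)
  first_word = definition.split(/\s+/, 2).first.to_s
  pairs = first_word.split('|').each_slice(2).select { |pair| pair.size == 2 }
  acc_version = pairs.to_h['ref']
  {
    # First "tag|value" pair, or the whole first word when there are no pairs.
    entry_id: pairs.empty? ? first_word : pairs.first.join('|'),
    acc_version: acc_version,
    # The accession is the versioned accession without its ".N" suffix.
    accession: acc_version && acc_version.sub(/\.\d+\z/, '')
  }
end

p parse_simple_defline('gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]')
p parse_simple_defline('abc 123 456')
```

For the two definition lines used on this page, the sketch yields the same entry_id, acc_version and accession values as the examples.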

Methods related to the actual sequence

  f.seq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
  f.data #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
  f.length #=> 298
  f.aaseq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
  f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
  f.aalen #=> 298
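The composition hash shown above is a per-residue count, listed in order of first occurrence. A minimal plain-Ruby sketch of that idea follows; residue_composition is a hypothetical helper and is not how Bio::Sequence implements composition.

```ruby
# Hypothetical helper, not the Bio::Sequence implementation.
# Counts how often each character occurs in a sequence string.
def residue_composition(seq)
  counts = Hash.new(0)
  seq.each_char { |residue| counts[residue] += 1 }
  counts
end

p residue_composition('ASDF')
p residue_composition('AAGA')
```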

A less structured fasta entry

  f = Bio::FastaFormat.new(">abc 123 456\nASDF")
  f.entry #=> ">abc 123 456\nASDF\n"

  f.entry_id #=> "abc"
  f.definition #=> "abc 123 456"
  f.comment #=> nil
  f.accession #=> nil
  f.accessions #=> []
  f.acc_version #=> nil

  f.seq #=> "ASDF"
  f.data #=> "\nASDF\n"
  f.length #=> 4
  f.aaseq #=> "ASDF"
  f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
  f.aalen #=> 4

Methods

aalen   aaseq   acc_version   accession   accessions   blast   comment   entry   entry_id   fasta   gi   identifiers   length   locus   nalen   naseq   new   query   seq   to_biosequence   to_s   to_seq  

Constants

DELIMITER = RS = "\n>"   Entry delimiter in flatfile text.
DELIMITER_OVERRUN = 1   (Integer) excess read size included in DELIMITER.
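One way to picture how a delimiter with an overrun could be used: a reader fetches text up to and including "\n>", and the final character (the '>') actually belongs to the next entry. The sketch below is an assumption about how such a reader might work, written in plain Ruby; read_fasta_chunks and the two constants are hypothetical local names, and this is not the Bio::FlatFile implementation.

```ruby
require 'stringio'

# Local copies of the values documented above, so the sketch runs without BioRuby.
ENTRY_DELIMITER = "\n>"
ENTRY_OVERRUN = 1

# Hypothetical reader, not Bio::FlatFile. Splits an IO into one string per entry.
def read_fasta_chunks(io)
  chunks = []
  carry = ''
  while (raw = io.gets(ENTRY_DELIMITER))
    chunk = carry + raw
    if chunk.end_with?(ENTRY_DELIMITER)
      # The last ENTRY_OVERRUN characters belong to the following entry.
      carry = chunk[-ENTRY_OVERRUN, ENTRY_OVERRUN]
      chunk = chunk[0...-ENTRY_OVERRUN]
    else
      carry = ''
    end
    chunks << chunk
  end
  chunks
end

p read_fasta_chunks(StringIO.new(">a\nAT\n>b\nGC\n>c\nTT\n"))
```

Carrying the overrun forward means every chunk after the first still starts with its own '>'.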

Attributes

data  [RW]  The sequence lines in text.
definition  [RW]  The comment line of the FASTA formatted data.
entry_overrun  [R] 

Public Class methods

new(str)

Stores the definition (comment) line and the sequence data from one entry of a FASTA-formatted string. If the argument contains more than one entry, only the first entry is used.

[Source]

    # File lib/bio/db/fasta.rb, line 131
    def initialize(str)
      @definition = str[/.*/].sub(/^>/, '').strip       # 1st line
      @data = str.sub(/.*/, '')                         # rests
      @data.sub!(/^>.*/m, '')   # remove trailing entries for sure
      @entry_overrun = $&
    end
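The constructor's steps can be tried without BioRuby. The sketch below repeats the same regular expressions in a stand-alone function; split_first_entry is a hypothetical name, and returning a hash is a choice made only for this illustration.

```ruby
# Hypothetical helper, not part of BioRuby: repeats the constructor's steps.
def split_first_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, leading '>' optional
  data = str.sub(/.*/, '')                    # everything after the first line
  overrun = data[/^>.*/m]                     # text from the next '>' line onward
  data = data.sub(/^>.*/m, '')                # keep only the first entry's lines
  { definition: definition, data: data, entry_overrun: overrun }
end

# Two entries: only the first is kept, the rest ends up in :entry_overrun.
p split_first_entry(">seq1 first\nATGC\nATGC\n>seq2 second\nGGCC\n")

# No leading '>' and a stray trailing '>': both are tolerated.
p split_first_entry("seq1 first\nATGC\n>")
```

This also shows why a missing leading '>' and a trailing '>' are both harmless: the first is stripped only if present, and the second is treated as the start of a following entry.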

Public Instance methods

aalen()

Returns the length of the sequence as a Bio::Sequence::AA.

[Source]

    # File lib/bio/db/fasta.rb, line 221
    def aalen
      self.aaseq.length
    end

aaseq()

Returns the sequence as a Bio::Sequence::AA.

[Source]

    # File lib/bio/db/fasta.rb, line 216
    def aaseq
      Sequence::AA.new(seq)
    end

acc_version()

Returns the accession number with version.

[Source]

    # File lib/bio/db/fasta.rb, line 277
    def acc_version
      identifiers.acc_version
    end

accession()

Returns an accession number.

[Source]

    # File lib/bio/db/fasta.rb, line 265
    def accession
      identifiers.accession
    end

accessions()

Parses the FASTA defline (using the identifiers method) and returns the accession numbers as an array of strings.

[Source]

    # File lib/bio/db/fasta.rb, line 272
    def accessions
      identifiers.accessions
    end
blast(factory)

Alias for query

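comment()

Returns the comments found in the sequence data (lines beginning with '#') as a Hash that maps the sequence position of each comment to its text, or nil when the sequence data does not begin with a comment line.

[Source]

    # File lib/bio/db/fasta.rb, line 195
    def comment
      seq
      @comment
    end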

entry()

Returns the stored entry as FASTA-formatted text (same as to_s).

[Source]

    # File lib/bio/db/fasta.rb, line 139
    def entry
      @entry = ">#{@definition}\n#{@data.strip}\n"
    end

entry_id()

Parses the FASTA defline (using the identifiers method) and returns a possibly unique identifier as a string.

[Source]

    # File lib/bio/db/fasta.rb, line 251
    def entry_id
      identifiers.entry_id
    end
fasta(factory)

Alias for query

gi()

Parses the FASTA defline (using the identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has more than one such ID, only the first is returned. Returns a string or nil.

[Source]

    # File lib/bio/db/fasta.rb, line 260
    def gi
      identifiers.gi
    end

identifiers()

Parses the FASTA defline and extracts its IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or ":"-separated IDs. Returns a Bio::FastaDefline instance.

[Source]

    # File lib/bio/db/fasta.rb, line 241
    def identifiers
      unless defined?(@ids) then
        @ids = FastaDefline.new(@definition)
      end
      @ids
    end

length()

Returns the sequence length.

[Source]

    # File lib/bio/db/fasta.rb, line 201
    def length
      seq.length
    end

locus()

Returns the locus.

[Source]

    # File lib/bio/db/fasta.rb, line 282
    def locus
      identifiers.locus
    end

nalen()

Returns the length of the sequence as a Bio::Sequence::NA.

[Source]

    # File lib/bio/db/fasta.rb, line 211
    def nalen
      self.naseq.length
    end

naseq()

Returns the sequence as a Bio::Sequence::NA.

[Source]

    # File lib/bio/db/fasta.rb, line 206
    def naseq
      Sequence::NA.new(seq)
    end

query(factory)

Runs a FASTA or BLAST search using a Bio::Fasta or Bio::Blast factory object.

  #!/usr/bin/env ruby
  require 'bio'

  factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
  flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
  flatfile.each do |entry|
    p entry.definition
    result = entry.fasta(factory)
    result.each do |hit|
      print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
      p hit.lap_at
    end
  end

[Source]

    # File lib/bio/db/fasta.rb, line 162
    def query(factory)
      factory.query(entry)
    end
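Because query only calls factory.query with the entry text, any object that responds to query can stand in for a factory. The following plain-Ruby sketch illustrates that delegation and the fasta and blast aliases; SketchEntry and EchoFactory are hypothetical stand-ins, not BioRuby classes.

```ruby
# Hypothetical stand-ins; neither class is part of BioRuby.
class SketchEntry
  def initialize(text)
    @text = text
  end

  # FASTA-formatted text of this entry.
  def entry
    @text
  end

  # Hand the entry text to whatever factory object is supplied.
  def query(factory)
    factory.query(entry)
  end
  alias fasta query
  alias blast query
end

# A fake factory that reports the first line it was asked to search with.
class EchoFactory
  def query(text)
    "searched #{text.lines.first.chomp}"
  end
end

puts SketchEntry.new(">abc 123 456\nASDF\n").fasta(EchoFactory.new)
```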

seq()

Returns the sequence lines joined into a single Bio::Sequence::Generic, with whitespace and digits removed. If the sequence data begins with a '#' comment line, comment lines are collected separately and made available through the comment method.

[Source]

    # File lib/bio/db/fasta.rb, line 169
    def seq
      unless defined?(@seq)
        unless /\A\s*^\#/ =~ @data then
          @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
        else
          a = @data.split(/(^\#.*$)/)
          i = 0
          cmnt = {}
          s = []
          a.each do |x|
            if /^# ?(.*)$/ =~ x then
              cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
            else
              x.tr!(" \t\r\n0-9", '') # lazy clean up
              i += x.length
              s << x
            end
          end
          @comment = cmnt
          @seq = Bio::Sequence::Generic.new(s.join(''))
        end
      end
      @seq
    end
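The clean-up and comment handling in the listing above can be explored with a stand-alone sketch. clean_sequence is a hypothetical helper that follows the same two branches but returns plain strings instead of Bio::Sequence::Generic objects, so it runs without BioRuby.

```ruby
# Hypothetical helper, not part of BioRuby: follows the two branches of #seq.
# Returns [sequence_string, comments], where comments is nil or a Hash that
# maps the sequence position of each comment to its text.
def clean_sequence(data)
  # No leading '#' line: just drop whitespace and digits.
  return [data.tr(" \t\r\n0-9", ''), nil] unless data =~ /\A\s*^\#/

  comments = {}
  residues = +''
  data.split(/(^\#.*$)/).each do |chunk|
    if chunk =~ /^# ?(.*)$/
      text = Regexp.last_match(1)
      position = residues.length
      if comments[position]
        comments[position] << "\n" << text  # consecutive comment lines are joined
      else
        comments[position] = text
      end
    else
      residues << chunk.tr(" \t\r\n0-9", '')
    end
  end
  [residues, comments]
end

p clean_sequence("\nAT GC\n12 GG\n")
p clean_sequence("\n# note one\nATGC 10\nATGC\n# note two\nGG\n")
```

Note that, as in the listing, comment lines are only recognised when the data starts with one; otherwise the whole text goes through the simple clean-up.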

to_biosequence()

Returns the sequence as a Bio::Sequence object.

Note: For efficiency, modifying the returned Bio::Sequence object may (but does not always) also change the sequence or definition in this FastaFormat object.

[Source]

    # File lib/bio/db/fasta.rb, line 232
    def to_biosequence
      Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
    end
to_s()

Alias for entry

to_seq()

Alias for to_biosequence
