Class | Bio::FastaFormat |
In: | lib/bio/db/fasta.rb |
Parent: | DB |
Treats a FASTA formatted entry, such as:
  >id and/or some comments                    <== definition line
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC
  ATGCATGCATGC

The leading '>' can be omitted, and a trailing '>' is removed automatically.
  fasta_string = <<END_OF_STRING
  >gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
  MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
  VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
  NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
  IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
  INRISARRAAIHPYFQES
  END_OF_STRING

  f = Bio::FastaFormat.new(fasta_string)

  f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
  # MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
  # VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
  # NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
  # IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
  # INRISARRAAIHPYFQES"
A larger range of methods for dealing with Fasta definition lines can be found in FastaDefline, accessed through the FastaFormat#identifiers method.
  f.entry_id    #=> "gi|398365175"
  f.definition  #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
  f.identifiers #=> Bio::FastaDefline instance
  f.accession   #=> "NP_009718"
  f.accessions  #=> ["NP_009718"]
  f.acc_version #=> "NP_009718.3"
  f.comment     #=> nil
  f.seq    #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
  f.data   #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
  f.length #=> 298
  f.aaseq  #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
  f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
  f.aalen  #=> 298
  f.entry       #=> ">abc 123 456\nASDF"
  f.entry_id    #=> "abc"
  f.definition  #=> "abc 123 456"
  f.comment     #=> nil
  f.accession   #=> nil
  f.accessions  #=> []
  f.acc_version #=> nil
  f.seq         #=> "ASDF"
  f.data        #=> "\nASDF\n"
  f.length      #=> 4
  f.aaseq       #=> "ASDF"
  f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
  f.aalen       #=> 4
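The composition hashes shown in the examples map each residue to the number of times it occurs. A minimal plain-Ruby sketch of that counting, independent of BioRuby, is given here; the `composition` helper is a hypothetical stand-in and is not the library's implementation.

```ruby
# Hypothetical helper (not part of BioRuby): counts how often each
# character occurs in a sequence string.
def composition(seq)
  seq.each_char.with_object(Hash.new(0)) do |residue, counts|
    counts[residue] += 1
  end
end

composition('ASDF') # {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
```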
DELIMITER | = | RS = "\n>" | Entry delimiter in flatfile text. | |
DELIMITER_OVERRUN | = | 1 | (Integer) excess read size included in DELIMITER. |
data | [RW] | The sequence lines in text. |
definition | [RW] | The comment line of the FASTA formatted data. |
entry_overrun | [R] |
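To illustrate how DELIMITER and DELIMITER_OVERRUN fit together, here is a rough sketch of a reader that cuts a stream into entries. Reading up to "\n>" consumes the '>' that starts the next entry, so one character (the overrun) is handed back to the following chunk. The `each_fasta_chunk` helper and its keyword arguments are hypothetical; the actual Bio::FlatFile logic is not shown on this page and may differ.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): yields one FASTA entry at a time.
# The defaults mirror DELIMITER ("\n>") and DELIMITER_OVERRUN (1).
def each_fasta_chunk(io, delimiter: "\n>", overrun: 1)
  return enum_for(:each_fasta_chunk, io, delimiter: delimiter, overrun: overrun) unless block_given?

  carry = ''
  while (chunk = io.gets(delimiter))
    chunk = carry + chunk
    if chunk.end_with?(delimiter)
      # The last `overrun` characters (the '>') belong to the next entry.
      carry = chunk.slice!(-overrun, overrun)
    else
      carry = ''
    end
    yield chunk
  end
end

io = StringIO.new(">a\nAC\n>b\nGG\n")
each_fasta_chunk(io) { |entry| p entry }
# prints ">a\nAC\n" and then ">b\nGG\n"
```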
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
  # File lib/bio/db/fasta.rb, line 131
  def initialize(str)
    @definition = str[/.*/].sub(/^>/, '').strip  # first line, without the leading '>'
    @data = str.sub(/.*/, '')                    # the rest
    @data.sub!(/^>.*/m, '')                      # remove any trailing entries
    @entry_overrun = $&                          # the removed text, or nil
  end
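The parsing steps of the constructor can be tried out without the library. The sketch below repeats the same regular expressions in a hypothetical stand-alone function, `split_fasta_entry`, and returns the three values the constructor stores. It is an illustration of the logic above, not a replacement for Bio::FastaFormat.new.

```ruby
# Hypothetical stand-alone version of the parsing steps in initialize.
def split_fasta_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, without the leading '>'
  data = str.sub(/.*/, '')                    # everything after the first line
  overrun = data[/^>.*/m]                     # any following entries (nil if none)
  data = data.sub(/^>.*/m, '')                # keep only the first entry's data
  { definition: definition, data: data, entry_overrun: overrun }
end

entry = split_fasta_entry(">abc 123 456\nASDF\n>second\nGGCC\n")
entry[:definition]    # "abc 123 456"
entry[:data]          # "\nASDF\n"
entry[:entry_overrun] # ">second\nGGCC\n"
```

Because only the first match of `/^>.*/m` is removed and it runs to the end of the string, everything from the second '>' line onward ends up in the overrun, which is why only the first entry is used.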
Returns the sequence as a Bio::Sequence::AA object.

  # File lib/bio/db/fasta.rb, line 216
  def aaseq
    Sequence::AA.new(seq)
  end
Parses the FASTA defline (using the identifiers method) and returns the accession numbers as an array of strings.

  # File lib/bio/db/fasta.rb, line 272
  def accessions
    identifiers.accessions
  end
Returns the comments parsed from '#' lines in the sequence data, or nil if none were parsed.

  # File lib/bio/db/fasta.rb, line 195
  def comment
    seq        # parsing the sequence populates @comment
    @comment
  end
Parses the FASTA defline (using the identifiers method) and returns a possibly unique identifier as a string.

  # File lib/bio/db/fasta.rb, line 251
  def entry_id
    identifiers.entry_id
  end
Parses the FASTA defline (using the identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has more than one such ID, only the first is returned. It returns a string or nil.

  # File lib/bio/db/fasta.rb, line 260
  def gi
    identifiers.gi
  end
Parses the FASTA defline and extracts the IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or ":"-separated IDs. It returns a Bio::FastaDefline instance.

  # File lib/bio/db/fasta.rb, line 241
  def identifiers
    unless defined?(@ids) then
      @ids = FastaDefline.new(@definition)
    end
    @ids
  end
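As a rough picture of what splitting an NCBI-style defline into IDs involves, the following sketch separates the identifier part from the description and pairs up tag and value fields. The `split_defline` helper is hypothetical and deliberately naive: it assumes every identifier is a simple tag|value pair, which does not hold for all NCBI formats, and it is not how Bio::FastaDefline is implemented.

```ruby
# Hypothetical, simplified defline splitter; Bio::FastaDefline handles
# many more formats than this sketch does.
def split_defline(definition)
  id_part, description = definition.split(/\s+/, 2)
  fields = id_part.to_s.split('|')
  # Assumes identifiers come as tag|value pairs,
  # e.g. gi|398365175|ref|NP_009718.3
  ids = fields.each_slice(2).select { |pair| pair.size == 2 }.to_h
  { ids: ids, description: description }
end

parsed = split_defline('gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]')
parsed[:ids]                           # {"gi"=>"398365175", "ref"=>"NP_009718.3"}
parsed[:ids]['ref'].sub(/\.\d+\z/, '') # "NP_009718" (accession without version)
```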
Returns the length of the sequence, treated as a Bio::Sequence::NA.

  # File lib/bio/db/fasta.rb, line 211
  def nalen
    self.naseq.length
  end
Returns the sequence as a Bio::Sequence::NA object.

  # File lib/bio/db/fasta.rb, line 206
  def naseq
    Sequence::NA.new(seq)
  end
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
  #!/usr/bin/env ruby
  require 'bio'

  factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
  flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
  flatfile.each do |entry|
    p entry.definition
    result = entry.fasta(factory)
    result.each do |hit|
      print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
      p hit.lap_at
    end
  end

  # File lib/bio/db/fasta.rb, line 162
  def query(factory)
    factory.query(entry)
  end
Returns the sequence lines joined into a single Bio::Sequence::Generic string, with whitespace and digits removed. If the data begins with a '#' line, lines beginning with '#' are treated as comments and collected separately.

  # File lib/bio/db/fasta.rb, line 169
  def seq
    unless defined?(@seq)
      unless /\A\s*^\#/ =~ @data then
        @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
      else
        a = @data.split(/(^\#.*$)/)
        i = 0
        cmnt = {}
        s = []
        a.each do |x|
          if /^# ?(.*)$/ =~ x then
            cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
          else
            x.tr!(" \t\r\n0-9", '') # lazy clean up
            i += x.length
            s << x
          end
        end
        @comment = cmnt
        @seq = Bio::Sequence::Generic.new(s.join(''))
      end
    end
    @seq
  end
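The clean-up performed by seq can be reproduced in plain Ruby to see what ends up in the sequence and in the comments. The sketch below follows the same two branches: data without a leading '#' line is simply stripped of whitespace and digits, while data with one is split so that each comment is keyed by the sequence offset at which it appeared. The `clean_fasta_data` function is a hypothetical stand-in that returns plain strings rather than Bio::Sequence::Generic objects.

```ruby
# Hypothetical stand-alone version of the clean-up done by seq.
# Returns [sequence, comments]; comments is nil when the data does not
# begin with a '#' line.
def clean_fasta_data(data)
  strip = " \t\r\n0-9" # characters removed from sequence lines
  return [data.tr(strip, ''), nil] unless data =~ /\A\s*^#/

  comments = {}
  offset = 0
  pieces = []
  data.split(/(^#.*$)/).each do |chunk|
    if chunk =~ /^# ?(.*)$/
      text = Regexp.last_match(1)
      # Comments at the same offset are joined with a newline.
      comments[offset] = comments[offset] ? "#{comments[offset]}\n#{text}" : text
    else
      cleaned = chunk.tr(strip, '')
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

clean_fasta_data("\nACGT 10\nTTGA 20\n")
# ["ACGTTTGA", nil]
clean_fasta_data("\n# first comment\nACGT 10\nTTGA 20\n# second\nGG\n")
# ["ACGTTTGAGG", {0=>"first comment", 8=>"second"}]
```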
Returns sequence as a Bio::Sequence object.
Note: For efficiency, the returned Bio::Sequence object may share data with this FastaFormat object, so modifying it might (but will not always) change the sequence or definition held here.

  # File lib/bio/db/fasta.rb, line 232
  def to_biosequence
    Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
  end