merger

 

Function

Merge two overlapping nucleic acid sequences

Description

This joins two overlapping nucleic acid sequences into one merged sequence.

It uses a global alignment algorithm (Needleman & Wunsch) to align the sequences optimally and then builds the merged sequence from the alignment. Where the two sequences disagree at an aligned position, the base included in the merged sequence is taken from the sequence with the better local sequence quality score. The following heuristic is used to find the sequence quality score:

If one of the bases is an 'N', then the other sequence's base is used. Otherwise:

A window around the disputed base is used to find the local quality score. The window size is increased from 5 to 10 to 20 bases, stopping as soon as there is a clear decision on the best choice. If there is still no best choice with a window of 20, then the base in the first sequence is used.

To calculate the quality of a window of a sequence around a base:

N.B. This heavily discriminates against the unreliable bases at the ends of sequence reads.
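
The heuristic above can be sketched in Python. This is an illustration only, not merger's source code. The per-base point values are not given here, so the sketch assumes 2 points for an unambiguous base (A, C, G, T, U), 1 point for an IUPAC ambiguity code, and 0 points for an N or for a position off the end of the sequence. It also assumes that the window extends the stated number of bases on each side of the disputed base and excludes the base itself. The function names are hypothetical.

```python
UNAMBIGUOUS = set("ACGTU")
AMBIGUOUS = set("MRWSYKVHDB")


def base_points(base):
    """Assumed scoring: 2 for A/C/G/T/U, 1 for ambiguity codes, 0 otherwise (e.g. N)."""
    b = base.upper()
    if b in UNAMBIGUOUS:
        return 2
    if b in AMBIGUOUS:
        return 1
    return 0


def window_quality(seq, pos, window):
    """Mean points over `window` positions on each side of seq[pos].

    Positions that fall off either end score 0 but still count towards
    the window length, so bases near the ends of a read get a lower score.
    """
    total = 0
    for i in range(pos - window, pos + window + 1):
        if i == pos:
            continue  # assumption: the disputed base itself is not scored
        if 0 <= i < len(seq):
            total += base_points(seq[i])
    return total / (2 * window)


def choose_base(seq_a, pos_a, seq_b, pos_b):
    """Pick the base to keep at a mismatch between seq_a[pos_a] and seq_b[pos_b]."""
    a, b = seq_a[pos_a], seq_b[pos_b]
    # An N always loses to the other sequence's base.
    if a.upper() == "N":
        return b
    if b.upper() == "N":
        return a
    # Widen the window until one sequence is clearly better.
    for window in (5, 10, 20):
        qa = window_quality(seq_a, pos_a, window)
        qb = window_quality(seq_b, pos_b, window)
        if qa > qb:
            return a
        if qb > qa:
            return b
    # No clear decision after a window of 20: keep the first sequence's base.
    return a
```

Under these assumptions, a disputed base two positions from the start of one read loses to a disputed base in the middle of the other read, because half of its window lies off the end and scores nothing.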

This program was originally written to aid in the reconstruction of mRNA sequences that had been sequenced from both ends as a 5' and 3' EST (cDNA), e.g. joining two reads produced by primer-walking sequencing.

Care should be taken to reverse-complement one of the sequences (e.g. using the qualifier '-sreverse2') if this is required to get them both in the correct orientation.
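
For readers unsure what putting a sequence into the opposite orientation involves, the following Python sketch shows a reverse-complement operation on a nucleotide string. It only illustrates the concept. It is not how merger or the '-sreverse2' qualifier is implemented, and the function name is hypothetical.

```python
# Complement table covering the IUPAC nucleotide codes in both cases.
# U is complemented to A; N and the symmetric codes W and S map to themselves.
_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence string."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`, and applying the function twice to a DNA sequence returns the original string.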

Because it uses a Needleman & Wunsch alignment, the memory required may exceed the memory available when attempting to merge large (cosmid-sized or greater) sequences.
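
As a rough illustration of why memory becomes the limit, a full dynamic-programming matrix holds one cell per pair of positions, so its size grows with the product of the two sequence lengths. The Python sketch below is back-of-envelope arithmetic only. The 4 bytes per cell and the use of a single matrix are illustrative assumptions, not figures taken from merger's implementation, and the function name is hypothetical.

```python
def nw_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Estimate the memory for one full (len_a + 1) x (len_b + 1) score matrix.

    bytes_per_cell is an assumed value; the real program may store more
    than one matrix or use a different cell size.
    """
    return (len_a + 1) * (len_b + 1) * bytes_per_cell
```

Under these assumptions, two 500-base ESTs need about 1 MB, while two 40,000-base cosmid-sized sequences need roughly 6.4 GB.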

The default gap open and gap extension penalties are set higher than is usual (50 and 5, respectively). These values were determined experimentally to give the best results with a set of poor-quality EST test sequences.

Usage

Here is a sample session with merger.

% merger
Merge two overlapping nucleic acid sequences
Input sequence: tembl:eclacy
Second sequence: tembl:eclaca
Output sequence [eclacy.fasta]: 
Output alignment [eclacy.out2]:
                                                                   

Typically, one of the sequences will need to be reverse-complemented to put it into the correct orientation to make it join. For example:

% merger file1.seq file2.seq -sreverse2 -outseq merged.seq -outfile stdout

Command line arguments

   Mandatory qualifiers:
  [-seqa]              sequence   Sequence USA
  [-seqb]              sequence   Sequence USA
  [-outseq]            seqout     Output sequence USA
  [-outfile]           align      Output alignment and explanation

   Optional qualifiers:
   -datafile           matrixf    This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.
   -gapopen            float      Gap opening penalty
   -gapextend          float      Gap extension penalty

   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers      Description                       Allowed values                Default
[-seqa] (Parameter 1)     Sequence USA                      Readable sequence             Required
[-seqb] (Parameter 2)     Sequence USA                      Readable sequence             Required
[-outseq] (Parameter 3)   Output sequence USA               Writeable sequence            <sequence>.format
[-outfile] (Parameter 4)  Output alignment and explanation  Alignment file

Optional qualifiers       Description                       Allowed values                Default
-datafile                 This is the scoring matrix file   Comparison matrix file in     EBLOSUM62 for protein,
                          used when comparing sequences.    EMBOSS data path              EDNAFULL for DNA
                          By default it is the file
                          'EBLOSUM62' (for proteins) or
                          the file 'EDNAFULL' (for nucleic
                          sequences). These files are
                          found in the 'data' directory of
                          the EMBOSS installation.
-gapopen                  Gap opening penalty               Number from 1.000 to 100.000  50.0
-gapextend                Gap extension penalty             Number from 0.100 to 10.000   5

Advanced qualifiers       Description                       Allowed values                Default
(none)

Input file format

Output file format

The output sequence file contains the joined sequence, by default in FASTA format. Where there is a mismatch in the alignment, the chosen base is written to the output sequence in uppercase.
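
Because the chosen bases are marked by case, the positions where a mismatch was resolved can be recovered from the output sequence file. The Python sketch below shows one way to do this. It assumes that all other bases in the output are lowercase (as in the sample alignment in this document), that the file is in FASTA format, and that only the first record is of interest. The function name is hypothetical.

```python
def mismatch_positions(fasta_text):
    """Return (position, base) pairs for uppercase bases in the first FASTA record.

    Positions are 1-based. Assumes that every base not chosen at a
    mismatch is written in lowercase.
    """
    seq_parts = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # stop at the start of a second record
            seen_header = True
            continue
        seq_parts.append(line)
    sequence = "".join(seq_parts)
    return [(i, base) for i, base in enumerate(sequence, start=1) if base.isupper()]
```

A typical use would be to read the file named by -outseq into a string and pass it to the function, for example `mismatch_positions(open("merged.seq").read())`.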

The output alignment file is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/AlignFormats.html for further information on alignment formats.

The output report file contains descriptions of the positions where there is a mismatch in the alignment and shows the alignment. Where there is a mismatch in the alignment, the chosen base is written in uppercase.

########################################
# Program:  merger
# Rundate:  Mon May 20 16:17:43 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: ECLACY
# 2: ECLACA
# Matrix: EDNAFULL
# Gap_penalty: 50.0
# Extend_penalty: 5.0
#
# Length: 3173
# Identity:     159/3173 ( 5.0%)
# Similarity:   159/3173 ( 5.0%)
# Gaps:        3014/3173 (95.0%)
# Score: 795.0
#
#
#=======================================

ECLACY             1 ttccagctgagcgccggtcgctaccattaccagttggtctggtgtcaaaa     50

ECLACA             1                                                         0
 

.................... until ......................


ECLACY          1301 cgcttagcggccccggcccgctttccctgctgcgtcgtcaggtgaatgaa   1350
                                                              |||||||||
ECLACA             1                                          gtgaatgaa      9

ECLACY          1351 gtcgcttaagcaatcaatgtcggatgcggcgcgacgcttatccgaccaac   1400
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLACA            10 gtcgcttaagcaatcaatgtcggatgcggcgcgacgcttatccgaccaac     59

ECLACY          1401 atatcataacggagtgatcgcattgaacatgccaatgaccgaaagaataa   1450
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLACA            60 atatcataacggagtgatcgcattgaacatgccaatgaccgaaagaataa    109

ECLACY          1451 gagcaggcaagctatttaccgatatgtgcgaaggcttaccggaaaaaaga   1500
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
ECLACA           110 gagcaggcaagctatttaccgatatgtgcgaaggcttaccggaaaaaaga    159
 

.................... until ......................

 
#---------------------------------------
#
# ECLACY position base          ECLACA position base    Using
#
#
#---------------------------------------         

Data files

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It exits with a status of 0.

Known bugs

None.

See also

Program name  Description
cons          Creates a consensus from multiple alignments
megamerger    Merge two large overlapping nucleic acid sequences

Author(s)

This application was written by Gary Williams (gwilliam@hgmp.mrc.ac.uk)

History

Written (Gary Williams) 1999

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments