Chapter 26. UniSyn Waveform Generator

Table of Contents
Overview
Functions

Overview

UniSyn is a pitch-synchronous concatenative waveform synthesizer. UniSyn supports a number of signal processing techniques and offers a reasonable level of low-level access, allowing new algorithms to be added easily.

In the simplest sense, UniSyn takes a set of waveforms and concatenates them into a single larger waveform. In this process, the duration and pitch of the original can be modified by signal processing. The accompanying flow chart shows the basic operation of the UniSyn synthesizer.

Input

Previous modules must have filled the following relations for UniSyn to work:

Unit. A list relation in which each item represents a unit taken from a speech database. Each item has two features, sig and coefs. Sig is an EST_Wave object containing the waveform for the unit. Coefs is an EST_Track object containing the signal processing coefficients for the unit.

Different signal processing techniques have different requirements for these features. For example, in the standard residual excited LPC, sig contains the residual and coefs contains the LPC coefficients. In a pure time domain technique, sig contains the speech waveform and coefs has 0 channels - however coefs must always be present as its time array contains the pitchmarks for the unit.

The units can be of arbitrary size - often they are diphones but in non-uniform unit synthesis they can be sub-phones, phones, words or phrases.

Segment. Segment is a list relation that contains linguistic units of arbitrary size. Although Segment often comprises phones, it can also comprise words, syllables or any other unit. The Segment relation's only purpose in UniSyn is to define a duration mapping between the concatenated waveform and the synthetic waveform. It does this by the use of two features, source_end and target_end. Source_end marks where a segment ends in the concatenated units. Target_end marks the point in the synthesized waveform where this segment should end.

In this way, these two variables control how much longer or shorter a portion of speech will be in the final utterance than in the concatenated version.

F0. A relation containing a single item, having a feature "f0" whose value is an F0 track. This track is the target F0 contour.
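The three input relations above can be summarised in a minimal sketch. The class and field names mirror the features described in this section, but they are illustrative Python analogues, not the actual EST C++ types:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Unit:
    """One unit from the speech database (illustrative, not EST code)."""
    sig: np.ndarray          # waveform samples (EST_Wave analogue); the LPC
                             # residual in residual-excited LPC, or raw speech
                             # in a pure time-domain technique
    coefs: np.ndarray        # frames x channels coefficients (EST_Track
                             # analogue); may have 0 channels in a pure
                             # time-domain technique
    pitchmarks: np.ndarray   # time (seconds) of each frame in coefs

@dataclass
class Segment:
    """Linguistic unit defining the duration mapping."""
    source_end: float        # where this segment ends in the concatenated units
    target_end: float        # where it should end in the synthetic waveform
```

Note that even in the zero-channel case the pitchmark times must be present, since they define the frame positions for all later processing.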

Concatenation

The Concatenation module takes the separate units in the Unit stream and combines them into a single item in a new stream called SourceCoef. SourceCoef has a single item, and that item contains all the pitchmarks, coefficients and windowed frames for the whole utterance.

First, a new set of source pitchmarks is created by concatenating the pitchmarks of the units. If signal processing coefficients are present, they are copied into the channels corresponding to the time positions in the track. This information is stored in a single track, kept as a feature called coefs in SourceCoef's single item.

Secondly, using the pitchmark positions in the units, the waveforms in each unit are windowed into a series of separate frames. These are stored as a vector of waveforms in a feature called frame in the single item in SourceCoef.
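A rough sketch of these two steps, assuming each unit is a plain dictionary holding a waveform and its pitchmark times. The windowing here uses Hanning windows spanning the two local pitch periods around each pitchmark, which is one common choice; this is an illustration of the idea, not the Festival implementation:

```python
import numpy as np

def concatenate_units(units, sample_rate=16000):
    """Concatenate unit pitchmarks onto one timeline and cut a
    pitch-synchronous, Hanning-windowed frame at each pitchmark."""
    all_pitchmarks, frames = [], []
    offset = 0.0
    for unit in units:
        pm = np.asarray(unit["pitchmarks"])
        all_pitchmarks.extend(pm + offset)          # shift onto global timeline
        samples = np.asarray(unit["sig"], dtype=float)
        pm_samp = np.round(pm * sample_rate).astype(int)
        for i, centre in enumerate(pm_samp):
            # frame runs from the previous pitchmark to the next one,
            # i.e. roughly two local pitch periods
            left = pm_samp[i - 1] if i > 0 else 0
            right = pm_samp[i + 1] if i + 1 < len(pm_samp) else len(samples) - 1
            start, end = max(left, 0), min(right, len(samples) - 1)
            frames.append(samples[start:end + 1] * np.hanning(end - start + 1))
        offset += pm[-1] if len(pm) else 0.0
    return np.array(all_pitchmarks), frames
```

The returned pitchmark array plays the role of SourceCoef's coefs time axis, and the frame list plays the role of its frame feature.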

Target Pitchmarks

The SourceCoef relation serves as a set of source pitchmarks, that is, an indication of the position of every frame in the concatenated source speech. The TargetCoef relation, on the other hand, serves as a set of target pitchmarks, specifying how many frames of speech should be in the output and where they should be positioned.

The function F0_to_pitchmarks takes the target F0 contour and constructs a zero-channel track containing the time positions of the target pitchmarks. This is then kept in a feature called coefs in a single item in the TargetCoef relation.
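The idea behind this conversion can be sketched as follows: walk along the target F0 contour, stepping forward by one pitch period (1/F0) per mark, so that higher F0 yields more closely spaced pitchmarks. This is an illustration of the principle, not the Festival source:

```python
import numpy as np

def f0_to_pitchmarks(f0_times, f0_values, default_f0=100.0):
    """Generate target pitchmark times from an F0 contour by stepping
    one pitch period (1/F0) at a time (illustrative sketch)."""
    pitchmarks = []
    t = f0_times[0]
    while t < f0_times[-1] - 1e-9:                   # epsilon guards float drift
        f0 = np.interp(t, f0_times, f0_values)       # linearly interpolated F0
        if f0 <= 0:
            f0 = default_f0                          # fallback for unvoiced regions
        t += 1.0 / f0
        pitchmarks.append(t)
    return np.array(pitchmarks)
```

For a constant 100 Hz contour this produces pitchmarks 10 ms apart, as expected.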

Mapping

Given a set of source pitchmarks (from SourceCoef), target pitchmarks (from TargetCoef) and source and target end points (from the source_end and target_end features on items in the Segment relation), it is possible to calculate a mapping between the source and target pitchmarks. This mapping simply states, for every pitchmark in the target, which pitchmark in the source should be used to create the speech at that point.

As such, the mapping is simply a vector of integers with one cell for every frame in the target pitchmarks, whose value is an index into the source set of pitchmarks. Frames are effectively duplicated to stretch a portion of speech and skipped to shorten it.
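A minimal sketch of such a mapping, assuming the segments are given as ascending (source_end, target_end) pairs and that, under the piecewise-linear time warp they define, each target pitchmark picks the nearest source pitchmark (illustrative, not the actual UniSyn code):

```python
import numpy as np

def make_mapping(source_pm, target_pm, segments):
    """For every target pitchmark, choose a source pitchmark index.
    segments: list of (source_end, target_end) pairs, ascending in time."""
    mapping = []
    src_start, tgt_start = 0.0, 0.0
    seg = 0
    for t in target_pm:
        # advance to the segment containing this target time
        while seg + 1 < len(segments) and t > segments[seg][1]:
            src_start, tgt_start = segments[seg]
            seg += 1
        src_end, tgt_end = segments[seg]
        # map target time linearly into source time within this segment
        scale = (src_end - src_start) / max(tgt_end - tgt_start, 1e-9)
        s = src_start + (t - tgt_start) * scale
        # nearest source pitchmark: indices repeat when stretching
        # and are skipped when compressing
        mapping.append(int(np.argmin(np.abs(source_pm - s))))
    return np.array(mapping)
```

For example, a single segment with source_end 0.1 s and target_end 0.2 s doubles the duration, so each source frame index appears roughly twice in the mapping.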

Signal Processing

Using the mapping, the signal processing module takes the selected frames from SourceCoef, positions them according to the target pitchmarks in TargetCoef, and combines them by signal processing into a single output waveform.
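For the pure time-domain case, this final step amounts to overlap-adding the mapped, windowed frames at the target pitchmark positions. A sketch under that assumption (residual-excited LPC would additionally filter the result through the LPC coefficients):

```python
import numpy as np

def overlap_add(frames, mapping, target_pm, sample_rate=16000):
    """Place the mapped source frame at each target pitchmark and sum
    the overlapping windowed frames (time-domain sketch)."""
    n_out = int(target_pm[-1] * sample_rate) + len(max(frames, key=len))
    out = np.zeros(n_out)
    for tgt_idx, src_idx in enumerate(mapping):
        frame = frames[src_idx]
        centre = int(target_pm[tgt_idx] * sample_rate)   # frame centre sample
        start = max(centre - len(frame) // 2, 0)
        end = min(start + len(frame), n_out)
        out[start:end] += frame[:end - start]            # overlap-add
    return out
```

Because the frames were windowed at analysis time, the overlapping windows sum to an approximately smooth waveform at the new pitchmark spacing.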