[sdiy] Speech synthesis chip

jbv jbv.silences at club-internet.fr
Wed Sep 17 00:43:32 CEST 2003


Richard,

Thanks for your suggestions. Actually I already knew all the
techniques you describe, and I think that each one would produce
a different result, would have a different "sound quality"...

For instance, for a pure "singing voice" synthesis, additive synthesis
can be an attractive solution (realtime control of amp & freq of a set of
partials is quite feasible with a DSP), although I've used the CHANT
program and I know that things aren't so straightforward...
OTOH for vocal sound processing, the use of formants makes more
sense...
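To make the additive idea concrete, here is a minimal sketch (not any particular engine's code) of summing partials with per-partial amplitude and frequency; the partial values are illustrative only, loosely inspired by a vowel spectrum, and in a real singing-voice synth both parameters would be updated continuously:

```python
import math

def additive_voice(partials, sr=8000, dur=0.1):
    """Sum sinusoidal partials, each with its own amplitude and
    frequency.  In a real-time engine both parameters would be
    control signals rather than constants."""
    n = int(sr * dur)
    out = []
    for i in range(n):
        t = i / sr
        out.append(sum(a * math.sin(2 * math.pi * f * t)
                       for a, f in partials))
    return out

# Crude vowel-like spectrum: fundamental plus a few harmonics
# (amplitudes and frequencies are purely illustrative).
tone = additive_voice([(1.0, 110), (0.7, 660), (0.5, 770), (0.3, 1100)])
```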

BTW I'm not sure that "storing/sequencing the control signals created
by the analysis bank in a vocoder will probably give better results" :
I have designed / built a 30-band analog frequency analyzer that converts
each channel output into a MIDI ctrl signal to be stored on a sequencer,
and the data I get are somewhat different & less accurate than with
a phase vocoder... Of course, this is mainly due to the different
analysis resolution, and this is the point: using VC-BPFs as formants
gives more degrees of freedom in parameter settings than a vocoder:
central frequency, bandwidth & Q can be set / modified independently
and smoothly for each band...
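The independence of those three parameters is easy to show digitally. Below is a sketch (my own illustration, not a specific product's design) of a formant bank built from RBJ-style bandpass biquads, where centre frequency and Q are set separately per section; the formant values are illustrative /a/-like numbers:

```python
import math

def bandpass_biquad(fc, q, sr=8000):
    """RBJ constant-peak-gain bandpass coefficients.
    Centre frequency fc and Q are independent inputs, which is
    exactly the freedom described above for a VC-BPF formant."""
    w0 = 2 * math.pi * fc / sr
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    return [b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0]

def filter_signal(coeffs, x):
    """Direct-form I biquad."""
    b0, b1, b2, a1, a2 = coeffs
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for s in x:
        out = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x1, x2, y1, y2 = s, x1, out, y1
        y.append(out)
    return y

# Three fixed formants (illustrative values); a formant bank is just
# the sum of independently tuned bandpass sections.
formants = [(700, 8), (1220, 10), (2600, 12)]
imp = [1.0] + [0.0] * 399
banks = [filter_signal(bandpass_biquad(fc, q), imp) for fc, q in formants]
voiced = [sum(ch) for ch in zip(*banks)]
```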

Hence my original question: what is the best analog design for
a VC-BPF to be used as a formant? At first glance the CEM3350
looks like the ideal choice (because all parameters are VC and also
for the small PCB room), but it's an obsolete (and expensive) part...
So, any better suggestion? A design around the 13700 or CA3280
perhaps?
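For what it's worth, the classic OTA-based answer is the two-integrator-loop state-variable filter, which is also what a 13700-based design would typically be. Here is a digital Chamberlin-SVF sketch of the same topology (an illustration of the structure, not a substitute for the analog design): the integrator gain sets centre frequency and the damping term 1/Q sets bandwidth, each independently, just as the two OTA bias currents would:

```python
import math

def svf_bandpass(x, fc, q, sr=8000):
    """Chamberlin state-variable filter, bandpass output.
    The two-integrator loop mirrors the analog SVF topology
    (e.g. two OTA integrators): f sets centre frequency and
    damp = 1/Q sets bandwidth, independently of each other."""
    f = 2 * math.sin(math.pi * fc / sr)   # integrator gain
    damp = 1.0 / q                        # damping = 1/Q
    low = band = 0.0
    out = []
    for s in x:
        low += f * band
        high = s - low - damp * band
        band += f * high
        out.append(band)
    return out

# Impulse response of one formant section: a decaying resonance.
ring = svf_bandpass([1.0] + [0.0] * 299, 700, 5)
```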

Any analog wizard on this list?

JB


>Sure...
>I actually spent some time wondering what would be the best
>choice of design for the VC BP filter bank (especially coz I'm
>not an analog wizard)... I finally decided that the first step
>was to check if the concept would work, and decided to prototype
>that thing with a set of CEM3350 chips.

You'll find that real formants are only approximately like simple bandpass
filters. When you speak, the resonances occur in different parts of the
mouth/nasal cavity at different times, and morphs between these resonances
can contribute as much to the character of the sound as the location of the
resonances.

A more fertile approach would be to sample various people talking, chop out
the phonemes and cross-synthesise those digitally with an input signal to
create a talking result.

There's also a technology called Linear Predictive Coding (LPC) which is
used to model filter characteristics in speech-related systems.
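As a rough illustration of what LPC does (this is the textbook autocorrelation method with Levinson-Durbin recursion, not any particular codec's implementation), the recursion fits an all-pole filter to a frame of speech, and those poles are the vocal-tract resonances:

```python
import math

def lpc(signal, order):
    """Autocorrelation-method LPC via Levinson-Durbin recursion:
    fits an all-pole model a(z) so that a[0]*x[n] + a[1]*x[n-1] + ...
    has minimum energy.  The poles of 1/a(z) model the vocal tract."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                     # reflection coefficient
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1 - k * k)                 # residual energy shrinks
    return a, err

# Fit a 2nd-order model to a decaying 500 Hz resonance; the predictor
# should recover a pole pair near that frequency.
sr = 8000
x = [math.exp(-0.002 * i) * math.sin(2 * math.pi * 500 * i / sr)
     for i in range(400)]
coeffs, residual = lpc(x, 2)
```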

Alternatively just use a vocoder - it's much quicker. ;-)

Seriously, storing/sequencing the control signals created by the analysis
bank in a vocoder will probably give you better results than any other
analogue technique.
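A sketch of what one analysis channel produces (my own illustration of the idea, assuming a rectify-and-smooth envelope follower): the slow envelope is the control signal you would store or sequence, and quantising it to 7-bit MIDI CC values, as in a MIDI-sequencer setup, is one reason sequenced data can look coarser than phase-vocoder analysis:

```python
import math

def envelope_follower(band_signal, sr=8000, lp_hz=30):
    """One vocoder analysis channel: rectify the bandpass output and
    smooth it with a one-pole low-pass.  The resulting slow envelope
    is the control signal to store/sequence."""
    coef = math.exp(-2 * math.pi * lp_hz / sr)
    env, out = 0.0, []
    for s in band_signal:
        env = coef * env + (1 - coef) * abs(s)
        out.append(env)
    return out

def to_midi_cc(env, peak):
    """Quantise an envelope to 7-bit MIDI controller values (0-127),
    as when recording the bank's outputs into a MIDI sequencer."""
    return [min(127, int(127 * e / peak)) for e in env]

# A steady band signal settles to a steady envelope.
env = envelope_follower([1.0] * 500)
```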

Richard

