Back in the day, I had a couple of "voice boxes" for various old computers. I had a Cheetah Sweet Talker for my ZX Spectrum, and later on, a Blue Alpha Voicebox for the SAM Coupe. Both were great fun to play around with, but if you wanted to include the speech in a game, the person loading the game would need one of these units attached to their computer for the speech to work.
These voice boxes were most definitely a commercial failure, especially with regard to commercial software development to support them, but they were good fun in their own right. On rare occasions, games bypassed the hardware add-ons and included a very poorly sampled sound which, despite the hardware limitations of the time, was still very cool.
So, I've found out that the ZX Spectrum NEXT can play recorded sounds, but with a big difference: the sounds are better quality (higher resolution) WAV files, played direct from BASIC.
So, here starts a new project: to re-create those add-on voice boxes, but in software, so that generated speech can be used in games or whatever your programming project fancies. I'm hoping for a slightly more polished sound, and a variation on my voice, with Maya providing the allophones for an alternative version.
Before I get started, you should know that early speech on computers sounds very robotic, as it's made up of generated or sampled sounds called allophones that generally lack any intonation (the way the pitch of your voice rises and falls as you speak). It's a far cry from today's synthesised voices, such as Alexa, that speak nearly perfect "human". Compare it more with the robotic voice of the late Dr Stephen Hawking.
I'm not after perfection, but I do want it to sound as natural as possible within the constraints of a simple set of allophones. I also don't want to sample whole words. That would definitely create much more fluid speech, but then I'd need a huge library of words; I may as well just speak out the sentences, and where's the fun in that? I simply want to recreate a simple allophone speech system that works directly from BASIC, ideally creating a character voice library with different allophone sets for each character (including robots). I'll leave a word library for another project.
How do allophones work?
It doesn't seem that long ago that Maya was learning to read using phonics. She's progressed exceptionally well, with an actual reading age of 9 years, 7 months at time of writing (she's 7 years 4 months old). She would split a word up into its component sounds: CAT = Ca Ah T (or, writing the phonetics correctly, /k/, /æ/ and /t/).
Now, this is great for someone learning to read: it exaggerates the sound, helping the learning process, but if we went around saying "Ca-Ah-T", it would sound a bit odd. Here's where the allophones come into play. Strictly speaking, an allophone is one of the ways a phoneme is actually pronounced in a given context; for this project, think of them as phonemes that are often combined to make the speech more fluid, taking the pause out where it doesn't need to be.
If you break the word CAT up into its spoken (not written) sounds, we get Ca with a short T at the end, not Ca Ah T as with the phonemes. We don't split the C and A; it's a single sound when spoken.
Thanks to a computer chip (the SP0256) that was used to power computerised synthesised speech, there's a full list of allophones at https://www.cpcwiki.eu/index.php/SP0256_Allophones [1], but there's no reason why the principles couldn't be applied to other languages, with additional allophones created as required. I can't see that the Sweet Talker or Voicebox could ever speak Zulu or Xhosa with their distinctive clicks. Well, now, if this works, that would be possible!
So, for our cat, we would have CA /KK1/ and T /TT1/. We'd need to play those two allophones one after the other to make our synthesised word. At face value, it looks quite straightforward.
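As a rough sketch of the idea (in Python rather than NextBASIC, purely for illustration), a word can be stored as a list of allophone labels, and a sentence expanded into one flat list to be played in order. The tiny word list, the `PAUSE` label and the function name are all made up for this example; the labels for CAT are simply the ones used above.

```python
# Hypothetical word list: each word maps to the allophone labels to play, in order.
WORDS = {
    "cat": ["KK1", "TT1"],  # labels as used in the text above
}

PAUSE = "PAUSE"  # made-up label for a short silence between words


def to_allophones(sentence):
    """Expand a sentence into one flat list of allophone labels."""
    sequence = []
    for word in sentence.lower().split():
        if word not in WORDS:
            raise KeyError(f"no allophones listed for {word!r}")
        if sequence:
            # Separate this word from the previous one with a pause.
            sequence.append(PAUSE)
        sequence.extend(WORDS[word])
    return sequence


print(to_allophones("cat cat"))
```

Keeping the speech as a plain list of labels means the playback side only ever has to do one thing: look up each label and play its sample.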
Just before we get started, I should stress that I am completely aware that accents can affect how the allophones are interpreted. "American English" would have the A more stretched out, heavier R's and more of a D than a T at the end, and of course, within the USA, there are regional accents that create further subtle differences. I'm aiming for my own style of English, which would probably be described as well-spoken British English, but I would love to eventually play around with this to have different regional British English accents, and foreign English accents. Anyone remember the BBC comedy 'Allo 'Allo in the '80s?
Getting started
Simple approach to get started and test the system:
- Record a few allophones as WAVs, and get them into the NEXT.
- Use the BASIC commands to play them together as a word.
If this works, we can then build out the allophone library with a fuller set of sampled sounds.
Finally, we'll make a sample program to do some talking. Trying to figure out what allophones are needed for standard words will be a programming challenge in itself, so for the scope of this proof-of-concept project, we'll just run off a list of allophones to create a few sentences of synthesised spoken text.
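Before anything goes near the NEXT, the "play them together as a word" step can be tried out on a PC. The sketch below, using Python's standard `wave` module, joins a list of clips into a single WAV held in memory. It makes several assumptions: the sine tones are only stand-ins for real recordings, the 8000 Hz 16-bit mono format is an arbitrary choice rather than anything the NEXT requires, and the `LIBRARY` dictionary and function names are hypothetical.

```python
import io
import math
import struct
import wave

RATE = 8000  # assumed sample rate in Hz, not a NEXT requirement


def tone(freq_hz, seconds):
    """Stand-in for a recorded allophone: raw 16-bit mono samples of a sine tone."""
    count = int(RATE * seconds)
    samples = (
        int(12000 * math.sin(2 * math.pi * freq_hz * i / RATE))
        for i in range(count)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


# Hypothetical library: allophone label -> raw sample bytes.
LIBRARY = {"KK1": tone(440, 0.10), "TT1": tone(880, 0.05)}


def build_word(labels):
    """Join the clips for the given labels into one WAV file held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)   # mono
        out.setsampwidth(2)   # 16-bit samples
        out.setframerate(RATE)
        for label in labels:
            out.writeframes(LIBRARY[label])
    return buffer.getvalue()


cat_wav = build_word(["KK1", "TT1"])
print(len(cat_wav), "bytes")
```

Writing `cat_wav` out to a file gives something that can be auditioned on the PC first, which should make it easier to hear whether the joins between allophones sound natural.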
Further investigation
- Consider different accents or voices for the sampled allophones to give characters in a game individual identities.
- Sample different tones to add intonation to give your synthesised voice a bit more emotion. Could this be done by running the samples through software to alter the pitch?
- Look at allophone usage, and see where there's a common combination of allophones within words. New combined allophones could be substituted for better clarity.
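For the last point, one simple way to spot common combinations would be to count adjacent pairs of allophones across a word list. This is only a sketch in Python: the three "words" below are made-up label sequences for illustration, not real transcriptions, and `pair_counts` is a hypothetical helper.

```python
from collections import Counter

# Made-up allophone sequences, one list per word, purely for illustration.
WORDS = [
    ["KK1", "TT1"],
    ["SS", "KK1", "TT1"],
    ["KK1", "TT1", "SS"],
]


def pair_counts(words):
    """Count how often each adjacent pair of allophones occurs within words."""
    counts = Counter()
    for labels in words:
        # zip(labels, labels[1:]) yields each label with the one that follows it.
        counts.update(zip(labels, labels[1:]))
    return counts


for pair, n in pair_counts(WORDS).most_common(2):
    print(pair, n)
```

Pairs that come out on top would be the first candidates for recording as a single combined sample.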
[1] SP0256 Allophones List, CPC EU Wiki