My synthetic vocalist: Dreamtonics Synthesizer V

I find it strange to be able to say that I’ve now created several songs that use a synthetic vocalist. This is a somewhat weird concept, but it’s right at the bleeding edge of music technology. We’ve had voice synthesis for years – I remember using a Texas Instruments “Speak & Spell” when I was small in the 1970s, and it’s gradually got better ever since. The first time I ever heard a computer trying to sing (I’m not counting HAL singing “Daisy, Daisy” in “2001”) was in a Mac OS app called VocalWriter, released in 1998, which automated the parameter tweaking of Apple’s stock voice synthesis engine, altering pitch and timing well enough for it to sing arbitrary songs from text input. It still sounded like a computer though. A much better “robot singer”, released in 2004, was Vocaloid, but even then, it still sounded like a computer. A Japanese software singer called UTAU, created in 2008, was released under an open source license, and this (apparently) formed the basis of Dreamtonics’ Synthesizer V (SV), which is what I’ve been using. SV finally crosses the threshold of having people believe it’s a real singer.

The entry of my song in the 2024 Fedivision song contest sparked quite a bit of interest. I posted a thread about it on Mastodon, and I wanted to preserve that here too. One commenter said “I thought it was a real person 😅” – which is of course the whole point of the exercise!

SV works standalone, or as a plugin for digital audio workstations (DAWs) such as Apple’s Logic Pro, or Steinberg’s Cubase, and is used much like using any other software instrument. It doesn’t sing automatically; you have to input pitch, timing, and words. Words are split into phonemes via a dictionary, and you can split or extend them across notes, all manually.
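To make the word-to-phoneme step a bit more concrete, here is a minimal Python sketch of the idea. It is not Synthesizer V’s API or its dictionary format; the `Note` class, the `PHONEME_DICT` table, and both function names are made up for illustration. The only real data in it is the mapping of “we’re” to “w”, “iy”, “r” that SV shows for the first word of my song.

```python
from dataclasses import dataclass

# Toy pronunciation dictionary (hypothetical format, not SV's own).
# Only the "we're" entry is taken from the post; add others as needed.
PHONEME_DICT = {
    "we're": ["w", "iy", "r"],
}


@dataclass
class Note:
    pitch: int      # MIDI note number
    start: float    # position in beats
    length: float   # duration in beats
    lyric: str      # the word typed into the note block


def phonemes_for(note, dictionary=PHONEME_DICT):
    """Look up the phonemes for a note's lyric in the dictionary."""
    # Normalise case and curly apostrophes before the lookup.
    word = note.lyric.lower().replace("\u2019", "'")
    if word not in dictionary:
        raise KeyError(f"no dictionary entry for {word!r}")
    return list(dictionary[word])


def phoneme_durations(note, weights=None):
    """Share the note's length between its phonemes.

    `weights` gives the relative duration of each phoneme; equal
    weights are assumed when none are supplied.
    """
    phones = phonemes_for(note)
    if weights is None:
        weights = [1.0] * len(phones)
    if len(weights) != len(phones):
        raise ValueError("need exactly one weight per phoneme")
    total = sum(weights)
    return [(p, note.length * w / total) for p, w in zip(phones, weights)]
```

Calling `phoneme_durations(Note(64, 0.0, 1.0, "We’re"), weights=[1, 2, 1])` stretches the vowel to half of the note and gives a quarter each to the consonants either side, which is roughly the kind of adjustment you make by hand in SV.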

Synthesizer V’s piano roll editor

In this “piano roll” editor you can see the original words inside each green note block, the phonemes they map to above each note, an audio waveform display below, and the white pitch curve (which can be redrawn manually) that SV has generated from the note and word inputs. You would never guess that’s what singing pitch looks like!

For each note, you have control over emphasis and duration of each phoneme within a word, as well as vibrato on the whole note. This shot shows the controls for the three phonemes in the first word, “we’re”, which are “w”, “iy”, “r”:

The SV parameters available for an individual note, here made up of three separate phonemes

This note information is then passed onto the voice itself. The voice is loaded into SV as an external database resource (Dreamtonics sells numerous voice databases); I have the one called “Solaria”. Solaria is modelled on a real person: singer Emma Rowley; it’s not an invented female voice that some faceless LLM might create from stolen resources. You have a great deal of control over the voice, with lots of style options (here showing the “soft” and “airy” modes activated). Different voice databases can have different axes of variation like these; for example a male voice might have a “growly” slider:

Synthesizer V’s voice parameters panel

There are lots of other parameters, the most interesting being tension (how stressed it sounds, from harsh and scratchy, to soft and smooth), and breathiness (literally air and breath noise). The gender slider (how woke is that??) is more of a harmonic bias between chipmunk and Groot tones, but the Solaria voice sounds a bit childish at 0, so I’ve biased it in the “male” direction.

The voice parameters can’t be varied over time, but you can have multiple subtracks within the SV editor, each with different settings, including level and pan, all of which turn up pre-mixed as a single (stereo) channel in your DAW’s track:

Multiple tracks in the SV editor

In my Fedivision song, I used one subtrack for verses, and another for chorus, the chorus one using less breathiness and trading “soft” mode for some “passionate” to make it sound sharper and clearer.

This is still all quite manually controlled though – just like a piano doesn’t play things by itself, you need to drive this vocalist in the right way to make it sound right.

Since the AI boom, numerous other ways of getting synthetic singing have appeared. For example, complete song generation by Udio is very impressive, but it’s hard to make it do exactly what you intended; a bit like using ChatGPT. Audimee has a much more useful offering – re-singing existing vocal lines in a different voice. This is great for making harmonies and shifting styles, but only really works well if you already have a good vocal line to start with – and that happens to be something that SV is very good at creating. I’ve only played a little with Audimee; it’s very impressive, but lacks the expressive abilities of SV; voices have little variation in style, emotion, and emphasis, and as a result seem a little flat when used for more than a couple of bars at a time. Dreamtonics have a new product called Vocoflex that promises to do the same kind of thing as Audimee, but in real time.

All this is just progress; we will no doubt see incremental improvements and occasional revolutions, and I look forward to being able to play with it all!

Federation – my Fedivision Song Contest entry

I happened across the Fedivision Song Contest on Mastodon. I love things like this, though I’ve never before felt in a position to enter such a thing – but here I am. So here’s my effort. The song is called “Federation”, right on topic. Hit play below:

From around 1990 (yes, before the web existed!), I frequented Usenet newsgroups like rec.music.synth, and the people there (some from Team Metlay, including Nick Rothwell) were very helpful when I was trying to build synthesisers, samplers, and effect processors as part of my degree course. The same people organised a CD compilation called “Musenet 1992”. I was intrigued by the practical logistics involved – there were version control problems, and lots of physical mailing of floppies going on; a CD-ROM burner cost thousands, so they needed to raise funds to get a real CD pressed. I paid whatever they were asking at the time (which I recall involved using telnet to cdbaby, one of the first online stores, ever), and a couple of months later, I received my double CD.

Listening to it now, I’m still impressed by the quality of some of the entries, in an era that pre-dated digital recording technology. I also love the more loopy entries, especially Mark Wheadon’s “One more hack”, which remains topical.

The Fedivision Song Contest is in much the same vein, though one key difference is that there is actually a theme – the fediverse itself.

In case you’re unfamiliar with it, the fediverse is an umbrella term for services that are (or can be) self-hosted, and connected to other similar instances through a set of common federated communication protocols. It’s frequently held up as a more democratic alternative to monolithic social networks like Facebook and Twitter. It has parallels with the rise of interconnected bulletin boards in the 1980s – little islands of civilisation (or maybe not!) talking to each other, eventually coalescing into what we now think of as the internet. The fediverse is a far more ambitious, bigger, faster, more dynamic return to that ideal. Instead of an individual, a university, or a government toeing the line of some faceless corporate monstrosity (that would be you, Facebook), each of these entities can set up their own instance of, for example, Mastodon (a bit like Twitter, but without the evil dickhead in charge), manage it exactly as they deem appropriate, and connect it to the myriad other Mastodon instances so they can all talk to one another: you know, social networking in its true meaning.

Anyway, such is the romance of the fediverse that it’s been busy building its own culture, hence the appearance of this fedi-friendly song contest four years ago.

Federation: the song

I wanted to have a strong minor/major contrast to reflect pessimism in the current state of social networks, and the shiny, naïve optimism of the fediverse, so the verses are sad laments, but the chorus reflects hope. I took inspiration from what I was listening to at the time, which happened to be Yello’s 2009 album “Touch”, in particular the track “You better hide”. I’ve liked Yello for decades (I hope to be as cool as Dieter Meier when I’m that age!), especially their affection for atmospheric sub bass, synths, percussion, and trumpets. I often find I’m listening to a song and think “I could write something like this”, start out copying it a fair bit, but then it gains a life of its own and heads off in unexpected directions. You can hear that in this song, where the intro section is quite Yello-ish, but then seems to have made other plans.

Verse

We’re all in this together,
at least I like to think that that’s so.
It’s getting harder to build bridges
over the sea of trolls below.
We’re feeling more like castaways
on our lonely little islands in the streams,
throwing messages in bottles into rising tides
of thoughtless indifference.

Chorus

The future lies in federation,
forging friendships from afar.
Turning islands into nations into continents;
it’s up to us to raise the bar.

The future lies in federation,
forging friendships from afar.
We need to choose our neighbours wisely, break the monolith;
It’s time to aim right for the stars.

Verse

The billionaire moderator,
the kind of guy that you don’t want to know,
bows down to the kleptocrats
and you know he won’t let it go.
We’re building ’cross countries, near and far
a place to call home, to belong.
It’s a slow exodus, the beginning of something,
work back to where we went wrong.

Chorus

The future lies in federation,
forging friendships from afar.
Turning islands into nations into continents;
it’s up to us to raise the bar.

The future lies in federation,
forging friendships from afar.
We need to choose our neighbours wisely, break the monolith;
It’s time to aim right for the stars.

As usual, this song was built in Apple’s amazing Logic Pro X. I wanted to make sure I only used instruments that I could set up on my new MacBook M3 Pro (music software licences are notoriously strict and DRM-ridden), so it’s mostly using stock instruments, which to be fair are great. There are no audio recordings at all – everything is synthesised. The bass and big synth pad are Logic’s Retro Synth. Drums are Logic’s Drummer with the Speakeasy brush kit. The twinkly metallic chords are from Alchemy, trumpet from Studio Horns, and there’s a little Korg Wavestation for the high chimes. EQs, compressors, delays, and reverbs are stock Logic plugins.

The jewel in the crown is of course Dreamtonics Synthesizer V Studio Pro (SV) with the Solaria voice database, which sings the lead and backing vocals in a way that I never could. Many of the tracks I hear using SV are quite mechanical, doing the equivalent of quantising everything with robotic efficiency, but you can spot that, so I’ve gone to some lengths to push things away from rigid timing, trying to make it sound more natural, especially at this slow 95 bpm tempo. The timing is a little tricky as the drums swing a bit (how can you not, with a brush kit?), and it was difficult to avoid having the bass sounding slow and laggy if it wasn’t swinging the same way. I just love using SV for backing vocals, as you might be able to tell.
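For anyone wondering what “pushing things away from rigid timing” means in numbers, here is a rough Python sketch of the idea. To be clear, I did all of this by hand and by ear in the SV and Logic editors, not with code; the function names, the swing amount, and the jitter range below are all assumptions picked purely for illustration. The only figure taken from the song is the 95 bpm tempo.

```python
import random

BPM = 95                          # tempo of the song
SECONDS_PER_BEAT = 60.0 / BPM     # roughly 0.63 s per beat


def swing_offset(beat_pos, swing=0.1):
    """Return the delay (in beats) for a note at `beat_pos`.

    Only off-beat eighth notes (positions ending in .5) are delayed;
    everything else is left on the grid.
    """
    frac = beat_pos % 1.0
    is_offbeat_eighth = abs(frac - 0.5) < 1e-9
    return swing if is_offbeat_eighth else 0.0


def humanise(beats, swing=0.1, jitter=0.02, seed=None):
    """Apply swing plus a small random nudge to quantised note starts.

    `beats` are note start positions in beats. `jitter` is the largest
    random shift, in beats, applied either side of the swung position.
    """
    rng = random.Random(seed)
    result = []
    for b in beats:
        shifted = b + swing_offset(b, swing) + rng.uniform(-jitter, jitter)
        result.append(max(0.0, shifted))  # never start before beat zero
    return result


def beats_to_seconds(beats):
    """Convert beat positions to seconds at the song tempo."""
    return [b * SECONDS_PER_BEAT for b in beats]
```

The point of applying the same `swing` value to every part is the bass problem described above: if the drums swing and the bass sits on the straight grid, the bass sounds late or early against them, so both need to lean the same way before any random variation is added on top.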

The lyrics are somewhat earnest, worthy, and naïvely optimistic, and squarely aimed at the aspirations of the fediverse – we can all hope, right? The mentions of “islands in the streams”, and “messages in bottles” are sort-of deliberate, and you can even see a reference to “bridge over troubled water” if you squint a bit. I felt compelled to include a bit of abuse for you-know-who. I’m particularly pleased with managing to squeeze in “break the monolith”, which is something of a theme in fediverse development, though not related to Martin Fowler’s treatise. The excessive alliteration in the chorus was almost entirely accidental, honestly.

I had a play with passing an SV vocal line into Audimee, one of these new AI services doing freaky things with LLMs; in this case the service re-sings vocal tracks using different voices. The results are pretty amazing, but it doesn’t preserve the timbre of the original, so for example in this song, it can’t reproduce the switch from the breathiness of the verses to the stronger clarity in the chorus. That said, it is really believable. It won’t improve the actual singing, though, as the results are more like altering treble and bass: imagine having a knob you could turn to switch singers while retaining the exact pitch and timing of the original, however good or bad they might be. Feeding in generated vocals from SV works really well (they’re obviously super-clean “recordings”), and the output sounds very natural but lacks the variation that SV provides, so I didn’t use it – but maybe next time.

Anyway, I hope you enjoyed listening to my song.

Update: In the final results, this song placed equal 11th (out of 72 entries), with 24 votes. I’m looking forward to doing better next year!