
AutoConvert

Audio Processing and Indexing

This page contains audio samples generated by our voice style transfer framework, created as the final project for the Audio Processing and Indexing 2021 course at Leiden University. In this project we investigate voice style transfer using the AutoVC framework.

The tables below contain conversion samples between speakers from the VCTK dataset. We define the following elements:

Vocoders:

- WaveNet: the autoregressive vocoder used in the original AutoVC implementation
- Griffin-Lim: a classical spectrogram-inversion algorithm
- Multiband MelGAN: a fast, GAN-based neural vocoder

AutoVC Models:

- AutoVC: the pretrained model provided by the authors
- New AutoVC: our retrained model, trained on the full VCTK dataset with longer audio samples and MelGAN-compatible mel spectrograms

The baseline model, provided by the authors, is AutoVC + WaveNet.

We observe that AutoVC + WaveNet performs well on short samples with seen speakers. However, its conversion times are very long, and its quality degrades sharply when the target speaker is unseen (sample p225 → Wouter, AutoVC + WaveNet). Performance on longer audio samples is also poor (see the p226 source samples with AutoVC + WaveNet under Longer Samples), because the model was trained on samples of circa 2 seconds.

A solution to the 2-second limitation is the introduction of chunking, which divides the audio sample into chunks of circa 2 seconds (see p226 → Wouter - AutoVC + WaveNet (chunking)), although this can introduce some 'choppiness' when a cut falls in the middle of a sentence and context is lost.
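As an illustration, a minimal chunking step could look like the sketch below. The non-overlapping 2-second cuts, the placeholder file name, and the `convert()` wrapper around the conversion pipeline are assumptions for illustration, not the exact implementation in the repository.

```python
import numpy as np
import librosa

def chunk_audio(wav, sr, chunk_seconds=2.0):
    """Split a waveform into consecutive, non-overlapping ~2 s chunks."""
    chunk_len = int(chunk_seconds * sr)
    return [wav[i:i + chunk_len] for i in range(0, len(wav), chunk_len)]

# Hypothetical usage: convert each chunk independently and concatenate the results.
wav, sr = librosa.load("p226_long.wav", sr=16000)   # placeholder file name
chunks = chunk_audio(wav, sr)
# converted = np.concatenate([convert(chunk) for chunk in chunks])  # convert() = assumed AutoVC + vocoder pipeline
```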

A first solution to the slow conversion time was to replace the WaveNet vocoder with the Griffin-Lim algorithm. This significantly improved conversion speed, at the cost of audio quality. We then aimed to recover audio quality by introducing the Multiband MelGAN model.
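For reference, vocoding a mel spectrogram with Griffin-Lim can be done entirely with librosa. The sketch below uses assumed spectrogram parameters (FFT size, hop length, 80 mel bands) and a placeholder file name, not necessarily the settings used in our pipeline.

```python
import librosa

# Assumed spectrogram settings; the project's actual parameters may differ.
sr, n_fft, hop_length, n_mels = 16000, 1024, 256, 80

wav, _ = librosa.load("source.wav", sr=sr)   # placeholder file name
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=n_mels)

# In the full pipeline the mel spectrogram would first pass through AutoVC;
# here we only show the vocoding step: mel -> linear spectrogram -> waveform.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
audio = librosa.griffinlim(linear, hop_length=hop_length)
```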

The New AutoVC + MelGAN model was retrained on the full VCTK dataset, using the mel-spectrogram format expected by MelGAN and longer audio samples. This improves audio quality significantly compared to the Griffin-Lim algorithm, while reducing vocoder processing time by 99.96% compared to (AutoVC +) WaveNet.
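The speed-up comes from the fact that a GAN vocoder generates the whole waveform in a single, non-autoregressive forward pass, whereas WaveNet decodes sample by sample. A hedged sketch of the vocoding step is shown below; the file name, loading method, and mel-tensor shape are placeholders, not the project's actual setup.

```python
import time
import torch

# `melgan` stands in for an already-trained Multiband MelGAN generator;
# the file name and loading method are assumptions for illustration.
melgan = torch.jit.load("melgan_generator.pt")
melgan.eval()

mel = torch.randn(1, 80, 400)        # (batch, n_mels, frames); the shape is an assumption

with torch.no_grad():
    start = time.perf_counter()
    audio = melgan(mel).squeeze()    # one forward pass instead of sample-by-sample decoding
    print(f"vocoder time: {time.perf_counter() - start:.3f} s")
```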

We encourage readers to try the tool provided in the GitHub repository to generate their own samples.

Short Samples

| Source Speaker | Target Speaker | Results (conversion time) |
| --- | --- | --- |
| p225 (Female) | p225 (Female) | AutoVC + WaveNet (Baseline) (320.76 s), AutoVC + Griffin-Lim (1.19 s), AutoVC + MelGAN (1.07 s), New AutoVC + MelGAN (0.80 s) |
| p225 (Female) | p226 (Male) | AutoVC + WaveNet (306.57 s), New AutoVC + MelGAN (1.08 s) |
| p225 (Female) | Wouter (Male) - Unseen | AutoVC + WaveNet (313.75 s), AutoVC + MelGAN (1.31 s) |

Longer Samples

With longer input samples, the vanilla AutoVC model’s output was scrambled after a few seconds. This issue was solved by dividing the input audio into chunks and processing those sequentially.

The vanilla AutoVC + WaveNet model also performed very poorly in zero-shot scenarios with unseen targets (see p226 → Wouter), because the model was trained on relatively short audio samples of circa 2 seconds.

The New AutoVC + MelGAN model was trained on the full VCTK dataset using longer audio samples, which eliminated the need for chunking and made the model perform much better for unseen target speakers.

| Source Speaker | Target Speaker | Results (conversion time) |
| --- | --- | --- |
| p226 (Male) | p225 (Female) | AutoVC + WaveNet (no chunking) (1039.64 s), AutoVC + WaveNet (chunking) (1040.37 s), New AutoVC + MelGAN (2.10 s) |
| p226 (Male) | Wouter (Male) - Unseen | AutoVC + WaveNet (905.23 s), New AutoVC + MelGAN (1.92 s) |

Unseen Source

Although the style of the unseen source speaker seems to transfer quite well with the AutoVC + WaveNet model in these two examples, this comes at the cost of clarity. We also suspect this model is overfitted to these target speakers, as it was trained on relatively few speakers (about 40), two of which were p225 and p226.

For unseen source speakers, the AutoVC + MelGAN combination manages to generate decent samples in a fraction of the time needed by the WaveNet vocoder, although the results are still far from perfect.

Some target speakers seem to work better than others; this might be related to how closely the source speaker resembles the target speaker.

| Source Speaker | Target Speaker | Results (conversion time) |
| --- | --- | --- |
| Wouter (Male) | p225 (Female) | AutoVC + WaveNet (362.44 s), New AutoVC + MelGAN (1.390 s) |
| Wouter (Male) | p226 (Male) | AutoVC + WaveNet (369.64 s), New AutoVC + MelGAN (1.152 s) |