Creating Voicebanks in Z Idol Talker

Creating your own voicebanks for Z Idol Talker is easy! Here is a brief guide on voicebank creation; however, I recommend checking the official Piper documentation in case anything has changed on that front since this was written. If you already have a Piper model recorded, skip to Step 5 for information on formatting the voicebank for use in Z Idol Talker.

Step 1: Recording

To start out, record 30 minutes to an hour of you/your voice actor speaking in ONE of the supported languages (multi-lingual models are not supported). The more data you record, the better (for reference, first-party voicebanks currently contain 20 minutes at best). Recordings should be either 16000 Hz or 22050 Hz, but if you get this wrong, the notebooks can resample the audio for you. You can record multiple speakers for a single language, but I currently advise against it, as multi-speaker models are usually lower quality than two separate voicebanks. Save each sentence as its own file, and give it a name you can reference easily.

As an example, our voicebanks are recorded in OREMO, with each file named after the sentence spoken, with all punctuation removed. This file-naming method may not work in OREMO specifically for languages other than Japanese and English. For reasons explained in Step 2, you may also find it beneficial to name each file with a number and keep the plain text in a comment file.

Step 2: Labeling

Labeling Piper models is pretty simple: create a basic .txt document that contains transcriptions of what is said in each audio file, with appropriate pronunciation (we’ve been referring to these as hummus internally; if you see the word hummus anywhere, it’s referring to this) (and if you see crumbus anywhere, that’s what we’re calling the reclists). Each line of this file is saved in the following format: wavs/[FILENAME].wav|[TRANSCRIPTION]. For multi-speaker models, save like this: wavs/[FILENAME].wav|speaker[NUMBER OF SPEAKER]|[TRANSCRIPTION], with [NUMBER OF SPEAKER] starting at 1 and counting up.
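As a made-up illustration (the filenames and sentences below are hypothetical), a single-speaker label file would look like this:

```
wavs/1.wav|hello everyone nice to meet you
wavs/2.wav|the weather is lovely today
wavs/3.wav|thank you for listening
```

A multi-speaker line would instead read wavs/1.wav|speaker1|hello everyone nice to meet you.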

The easiest way to go about transcription is to use tools that append text onto the start and end of each line, such as the site Pinetools (I’m thinking of making a tool to make this easier if you use the method we used for most voicebanks). If you named your files after numbers, that site can also increment what it appends for you, making the process way simpler: simply append the following to the start of each line: wavs/%N%.wav|

Below is an example of a transcription file with numbered wavs, and how you could generate it instantly from a plain sentence comment list using the above methods.

Step 3: Training

Use the following Colab notebook from the Piper repository to train your voicebank. An hour of training will eat about 4 credits based on my experience. As the notebook is subject to change, this section contains mostly advice and warnings about things that may go wrong.
One of the biggest things to note is that some languages DO NOT have pretrained models, and certain languages train with a different language code than the pre-trained models use. By adding the following cell, you can train from a checkpoint you link to via Google Drive instead of the ones Piper provides you. You can link to any model checkpoint regardless of language and then train in the language you want. This is especially useful because some languages only have CC SA models as pre-training options, which you may want to avoid if you do not want to license your model under Creative Commons Share-Alike.

#@markdown # <font color="ffc800"> **4.5 Pretrain a different model**
#@markdown ### Use if you don't want to use one of the default ones:
modelurl = "" #@param {type:"string"}
!gdown -q "{modelurl}" -O "/content/pretrained.ckpt" --fuzzy

Here is a link to all of the pre-trained models you can use with Piper. SOME OF THESE ARE SHARE-ALIKE MODELS. To hear how a model sounds, and see the license it’s distributed under, check out the model sample page. Z Vocal Project is working on our own corpus for English training to ensure the process is fully ethical, but it’s a ways away. (Shoot us an email if you’d like to donate some talking samples to us.)

Here is a link to the espeak language codes. Not all of these are compatible with Z Idol Talker/Piper, but if a language is not officially supported, it may be worth looking into. For example, if you want to try training a Japanese voicebank, modify Step 3’s cell as shown below and select Japanese when training. (That said, for Japanese you are better off training the voicebank as a Spanish voicebank with romaji inputs.) (As I will mention later, you can achieve cross-lingual synthesis, but Japanese is still awful.)

Step 4: Export

Before exporting, test your voicebank as a checkpoint using the following notebook; export to ONNX will take some time, and you’ll probably want to ensure your voicebank sounds the way you want it to first.

Once you’re sure you want to export, use the following notebook to export to ONNX. When finished, remember to save to your Google Drive, and then download the files to your computer from there.

Step 5: Z Idol Talker Formatting

The following is the BARE MINIMUM that must be present in your voicebank’s folder for it to work. If any of these files are missing from a folder inside your voicebank folder, the program will not operate until this is fixed. The files needed are:

  • banner.png
    • A 400×100 px image containing the banner for your character
    • If you would like your voicebank to have lip-flap, provide a variant named banner_open where the character’s mouth is open
  • icon.png
    • A 100×100 px image containing the icon for your character in voicebank selection
  • character.txt
    • Crucial set-up information for your voicebank (more in the next paragraph)
  • license.txt
    • The license for your voicebank
  • The .onnx file of your model
  • The .json file of your model
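If you want to sanity-check a folder before launching the program, the list above can be verified with a short script (a hypothetical helper; Z Idol Talker itself does not ship this):

```python
from pathlib import Path

# Files every voicebank folder must contain, per the list above.
REQUIRED = ["banner.png", "icon.png", "character.txt", "license.txt"]

def missing_files(folder):
    """Return the required files that are absent from a voicebank folder."""
    folder = Path(folder)
    missing = [name for name in REQUIRED if not (folder / name).exists()]
    # The model files can have any name, so just check the extensions exist.
    if not list(folder.glob("*.onnx")):
        missing.append("<model>.onnx")
    if not list(folder.glob("*.json")):
        missing.append("<model>.json")
    return missing
```

An empty return value means the bare minimum is present.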

character.txt should contain this and nothing more. Failure to fill out a field correctly can result in the program erroring:

NAME=Your Voicebank's Name
LANG=The language your voicebank speaks in
COMPANY=The company/creator behind the voicebank
TYPE=Mainly gender, which affects the color of some things, however there are special colors and also literally anything can go here.
COLOR=r,g,b(The RGB values of the color of your voicebank; this tints every UI element you do not customize)(the notes in parentheses are not part of the file)
MULTIPLEVBS=false(Whether or not the voicebank is multi-speaker.)
Bio=Your character's bio.
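As an illustration, a filled-out character.txt might look like this (every value below is made up, not taken from a real voicebank):

```
NAME=Dunder French
LANG=French
COMPANY=Z Vocal Project
TYPE=Male
COLOR=128,64,200
MULTIPLEVBS=false
Bio=A laid-back example character who loves corn.
```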

With all these files in hand, you can now launch Z Idol Talker and start using your voicebank.

Step 5.1: UI Customization

Customizing the UI is as simple as placing an image with the right dimensions and the right file name in your voicebank’s folder. A template with all the info you need is available on the download page. (If it has not been updated by the time you read this, be aware that bg2.png is not real.)

To explain some things that may not be immediately clear: logo.png replaces your voicebank’s name in the panel. mystery.png is an image that goes over the voicebank’s banner, a plate of corn by default. It’s just a fun thing you can modify if you’d like, with the intention being to let your character hold an item.

Step 5.2: Idolgotchi Customization

Idolgotchi is a feature that lets users feed their voicebanks. As a voicebank creator, you can customize the foods and/or the dialogue for each food.
To modify the foods available, create an 80×80 PNG file with your food and title it either “Sweet.png”, “Sour.png”, “Bitter.png”, “Salty.png”, or “Savory.png”. These names match the order the foods appear in, and the default dialogue spoken if no custom dialogue is set. We currently only support these five foods, but may expand the system if there is demand.

To modify the dialogue, create a file named food.txt with a line of dialogue for each food: line 1 is Sweet, line 2 is Sour, line 3 is Bitter, line 4 is Salty, and line 5 is Savory. These lines can be entire paragraphs, provided they stay on a single line. There are probably some limitations, but we have not encountered them yet. If you make this file, make sure you write all 5 lines or the program will crash.
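For example, a complete food.txt with one line per food in the required order (all of this dialogue is hypothetical):

```
You always know my favorite sweets!
Sour... but I kind of like it.
Bitter food builds character, right?
A little salty, a lot delicious.
Mmm, savory. Thank you!
```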

Step 5.3: Achieving Crosslingual

Piper/Z Idol Talker somewhat supports cross-lingual synthesis. Not all languages are compatible, and there WILL be an accent. Do not bank on your voicebank being compatible with every language; try experimenting with voicebanks native to the language you plan to record, and see how they sound with their language modified.

Here are the espeak language codes; the second column is what you’re looking to use. These language codes cover both languages and accents.

To modify the language of a voicebank, open its .json file and change the espeak voice variable, along with the language code variable. For example, if we change Dunder French into a Korean or Japanese voicebank, it looks like this. Keep in mind you will need to reload the voicebank after modifying the json.
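If you’d rather script the edit than do it by hand, here is a minimal sketch. I’m assuming the espeak voice lives at config["espeak"]["voice"] in the model’s .json, matching the “espeak voice variable” mentioned above; check your own file, as the exact key names may differ, and remember the language code field may need the same treatment:

```python
import json

def set_espeak_voice(config_path, new_code):
    """Point a voicebank's .json at a different espeak voice code.

    Assumes the config stores the voice at config["espeak"]["voice"];
    verify this against your own model's file before relying on it.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config.setdefault("espeak", {})["voice"] = new_code
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)
```

For example, set_espeak_voice("DunderFrench.json", "ko") would switch a (hypothetical) French bank to the Korean espeak voice; reload the voicebank afterwards.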

Here is an example of Dunder French speaking in a few cross languages, some are better than others. (I’m choosing Dunder French because he has the least data of any voicebank)

Dunder French with original language code “Si peut-être tu sens que je me suis réveillé avec toi, souris simplement”
Dunder French with Korean language code “가 너와 함께 일어났다고 느끼면 그냥 웃어”
Dunder French with English language code “If perhaps you feel I woke up with you just smile”
Dunder French with K’iche’ language code “Wa xaan sientes in desperté ta wéetel, chéen sonríe” (I do not know if this is the right language)
Dunder French with Welsh language code “Os efallai eich bod yn teimlo fy mod wedi deffro gyda chi yn unig gwenu”
Dunder French with Japanese language code “もし あなた と いっしょ に め が さめた と かんじたら、 ただ わらって ください”

To expand on Japanese a bit more: espeak cannot parse kanji, so you will need to write all your inputs entirely in hiragana. As recommended earlier, for Japanese voicebanks I highly suggest training the voicebank on Spanish with romaji and using that instead of cross-lingual. Put as simply as possible, here’s a native Spanish voicebank speaking romaji, and then with cross-lingual. (I unfortunately deleted the native-JP Spanish ones I had; I will update this guide if I make a new one.)

Unnamed Spanish Voicebank with Spanish language code “moshi anata to issho ni me ga sameta to kanjitara、 tada waratte kudasai”
Unnamed Spanish Voicebank with Japanese language code “もし あなた と いっしょ に め が さめた と かんじたら、 ただ わらって ください”

As you can hopefully see, espeak is just really bad with Japanese. It’s like this for native JP banks too, I swear.

Now you’re finished!

Unless I have left out something major, I’ve told you everything you need to know about voicebank creation. I will make a video guide soonish maybe.