Model didn't import: ljspeech-jets-onnx

FYI, we have this in our backlog to fix; it is tracked as issue 27.

Hey there - a fix for ConvT 1D support was included in 1.1.0… sorry I forgot to post here until now! Check it out and let us know if you see any improvement. I will mark this thread resolved for now, though.

I’m just getting started and trying to use the same model but I’m getting a KeyNotFoundException when selecting it.
This does not happen with the sample models.

I just tried to import the model locally. Unfortunately, it seems to contain an “If” operator.
The current version of Sentis does not support this operator. You can see a list of all the operators we do and don’t support here:
https://docs.unity3d.com/Packages/com.unity.sentis@1.1/manual/supported-operators.html
I believe the other errors you are seeing just follow on from the model failing to import. We will aim to make our debug errors clearer (issue 114 internally).

Thanks for the feedback and link.
Would you also suggest this tool, “Inspect ONNX Model Operators”, to see which operators are used in an ONNX model?

Yes, we also recommend https://netron.app/ for examining .onnx models outside of Sentis.
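
For a scripted check alongside Netron, here is a minimal Python sketch that counts the operator types in a model and flags the ones missing from a supported list. It assumes the `onnx` package is installed for the loading step, and it only looks at top-level nodes, not at subgraphs nested inside operators such as If. The helper names and the `SUPPORTED` set are made up for illustration; the real list is on the supported-operators page linked above.

```python
from collections import Counter
from types import SimpleNamespace


def count_op_types(nodes):
    """Count how often each operator type appears in a sequence of graph nodes."""
    return Counter(node.op_type for node in nodes)


def unsupported_ops(nodes, supported):
    """Return the sorted operator types that are not in the `supported` set."""
    return sorted(set(count_op_types(nodes)) - set(supported))


def load_top_level_nodes(path):
    """Load an .onnx file and return its top-level graph nodes (needs `pip install onnx`)."""
    import onnx  # imported lazily so the helpers above work without the package
    return list(onnx.load(path).graph.node)


if __name__ == "__main__":
    # Stand-in nodes for illustration; swap in load_top_level_nodes("model.onnx").
    nodes = [SimpleNamespace(op_type=t) for t in ("Conv", "Tanh", "If", "If")]
    # Placeholder set, NOT the real Sentis list.
    SUPPORTED = {"Conv", "Tanh"}
    print(count_op_types(nodes))
    print(unsupported_ops(nodes, SUPPORTED))
```

Anything reported by `unsupported_ops` is a candidate for removal or for a custom layer before importing.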

Side note: you actually get the correct error when it’s imported at the beginning.

Take a look at the end of the model; you can see what the If operators are doing.

It’s super trivial behaviour. I’d just remove them from the model directly, either by modifying the ONNX file or by doing that in Sentis.
You can follow the CustomLayer sample and either make the If a no-op or reproduce the shape logic it’s doing.
You can also choose to remove those last 10 nodes entirely; they are not that useful for inference.

Thanks for this information and the push in this direction.

I saw that they are at the end, and for a second I was wondering if I could remove them. But as I had never worked with this, I thought it couldn’t be that “simple”.

With 10 nodes you mean everything after the Conv, right?

Yes, I mean remove everything after the Tanh layer.
More details:

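// Assumption: newLayers is a list of the same layer type as model.layers.
var newLayers = new List<Layers.Layer>();
foreach (var layer in model.layers)
{
    // Skip the layers that should be dropped from the model.
    if (layer.name == "...")
        continue;
    newLayers.Add(layer);
}
model.layers = newLayers;
// Point the model outputs at the last layer that is kept.
model.outputs = new [] {...};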

Thanks for your help, it was very valuable for getting started! I was able to get it working and can create an output.wav, but it’s kind of gibberish. This may be due to missing tokenization of the input. However, I think there is more wrong.

This might be interesting for the Muse team: I created a lot of the code with the help of Muse and documented the input and output.

My results are documented at https://github.com/mrwellmann/Sentis-TTS-Test.git (it will stay private as long as we are in closed beta, but I can give you access if there is interest). The C# files, the output.wav, and the readme.md can also be found here: https://drive.google.com/drive/folders/1eSQ7TTuDR0AIIszq8MkPLSS1_j48BwXw?usp=sharing

You are on the right track. Yes, the output will sound wrong unless you put the right input in. The input is a list of phoneme IDs, so you will need a way to convert text to a list of phonemes. I’m not sure if I will share my C# code at this time, but I can confirm that the model works in Sentis and will sound like it does in the above video. (Also, I don’t want to spoil your fun in working it out for yourself!) If you put in the wrong input, it should sound like a garbled voice. If the output just sounds like noise, then something bigger is wrong.

P.S. I listened to the wav and can confirm this is the correct output but wrong input. Good luck. :+1:

Thanks for the feedback! This gives me a new push :smile:
So basically I have to find, recreate, or import a tokenizer like this one: GitHub - neuml/ttstokenizer: Tokenizer for Text to Speech (TTS) models

Yeah. (Or something that does the same thing.) In fact, if I remember rightly, you can run that Python code and it will give you the input IDs you need.

BTW, here it is linked up to a character and a (very small and bad) LLM, all running on the PC:

It’s my version of the ORB demo :laughing::alien:

It would be interesting to do some audio post-processing to get different pitch voices etc.

Edit: BTW, the lip-syncing is not an AI model; it’s just one of many lipsync plugins out there.

Thanks for the video :slightly_smiling_face:

I went down the road of using the Python scripting package and actually got it working with the neuml ttstokenizer.

The output for “Hello World! I wish I cloud speak.” looks like “[26 2 8 34 16 38 8 5 72 32 16 12 35 32 10 8 42 5 6 17 27 10 33]”

But it still sounds wrong: output_v2.wav - Google Drive
Can you confirm that the output looks right and that I just have some wrong settings or something?

You are probably using the wrong list of tokens. Take a look in here:

This gives you a list of the tokens used. I think the list is somewhere else too, but I forget where.
Try using this: GitHub - neuml/ttstokenizer: Tokenizer for Text to Speech (TTS) models
but with tokens set to this list, e.g. tokens = ['<blank>','<unk>','AH0','N',...]
I think it should work. At least see whether you get different values for the tokens.
P.S. You might want to speed up the wav to 44 kHz.

Hi MrWellmann,

I started today with Sentis and as I am interested in TTS your repo was a nice starting point. A big thanks for making it public.

Not sure if you are still interested in the issue you had with the tokenizer? Anyway, the gibberish bug was caused by replacing the empty string with a “0” instead of removing it from the array after splitting the string. By replacing it you added another token, which got interpreted as… whatever?? I used the changes suggested by Bearman and changed these lines in your TextToSpeech class:

        // Convert input text to tensor (needs `using System;` and `using System.Linq;`).
        var tokenizedOutput = tokenizerRunner.ExecuteTokenizer(inputText);
        // Drop the empty strings left by the split instead of replacing them with "0".
        int[] inputValues = tokenizedOutput
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(int.Parse)
            .ToArray();

Works fine now.
Thanks again for sharing your code. It is a great learning resource :+1:
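
For anyone doing this cleanup on the Python side before handing the IDs to Unity, here is a minimal sketch of the same idea. The function name and the bracketed sample string are assumptions for illustration, modelled on the tokenizer output quoted earlier in the thread.

```python
def parse_token_ids(text):
    """Turn a string such as "[26 2  8 34 ]" into a list of ints, dropping empty pieces."""
    cleaned = text.strip().strip("[]")
    # str.split() with no argument collapses runs of whitespace, so no empty strings appear.
    return [int(piece) for piece in cleaned.split()]


if __name__ == "__main__":
    raw = "[26 2  8 34 ]"
    # Splitting on a single space keeps empty strings; turning those into "0" adds bogus tokens.
    print(raw.strip("[]").split(" "))
    print(parse_token_ids(raw))
```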

Hi Pepe-Hoschi, thanks for your fix. I’ve been very quiet as there are a lot of other things taking my time. I’m still interested and will implement your suggestion as soon as I find some tinkering time.

If you are looking for a ready-to-deploy offline TTS solution, you might want to try Overtone: Overtone - Realistic AI Offline Text to Speech (TTS) | Generative AI | Unity Asset Store. I’m not affiliated with them and did not even buy it, but my research 3 months ago found it to be the best easy-to-implement offline solution.

Hi, a tip if you don’t want to use Python to do the tokenization into phonemes: just download a dictionary of all the words already tokenized into phonemes.
Some links are given here:
That’s what I used for my example. It’s only about 30 MB, and you could probably compress it further.

That gives you most words. Unfortunately it is not perfect, as a word like “bow” can have two different pronunciations.
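
As a rough sketch of the dictionary approach, the Python below parses lines of the form `WORD  PH1 PH2 ...` and maps words to phoneme IDs. Several things here are assumptions rather than facts from this thread: the line format (including alternates marked like `BOW(1)` and `;;;` comment lines), the helper names, and the idea that a phoneme’s ID is its index in the token list. The sample lines and the short token list are for illustration only.

```python
import re


def parse_pronunciations(lines):
    """Parse 'WORD  PH1 PH2 ...' lines into {word: [[phonemes], ...]}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        head, *phonemes = line.split()
        # Assumption: alternate pronunciations are marked like "BOW(1)".
        word = re.sub(r"\(\d+\)$", "", head).lower()
        entries.setdefault(word, []).append(phonemes)
    return entries


def words_to_ids(words, entries, tokens, unknown="<unk>"):
    """Map words to phoneme IDs, taking the first listed pronunciation of each word."""
    # Assumption: a phoneme's ID is its position in the token list.
    index = {token: i for i, token in enumerate(tokens)}
    ids = []
    for word in words:
        phonemes = entries.get(word.lower(), [[unknown]])[0]
        ids.extend(index.get(p, index[unknown]) for p in phonemes)
    return ids


if __name__ == "__main__":
    sample = ["BOW  B AW1", "BOW(1)  B OW1", "AN  AE1 N"]
    tokens = ["<blank>", "<unk>", "B", "AW1", "OW1", "AE1", "N"]
    entries = parse_pronunciations(sample)
    print(entries["bow"])  # both pronunciations are kept
    print(words_to_ids(["bow", "an"], entries, tokens))
```

Always taking the first pronunciation is the simplest choice; picking the right one for words like “bow” needs context that a plain lookup does not have.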

Then you can try to see if it works on a mobile phone.

Hello, we now have this model and sample code on Hugging Face. Feel free to try it out, suggest improvements, etc.
