Hi, I’m trying to do speech to text using this model.
Unity shows the output of this model as logits index : 2598.
Would you be able to guide me on how to use this output to obtain a string result?
I’m using Unity 6000.0.23f1 and Sentis 2.1.0.
I know it seems like a lot of work, but it boils down to reading from a dictionary. You can check out our whisper/phi2 example on Hugging Face for help.
Let me know if you have any questions!
Ah yes, I see that there is a vocab.json, which is basically the dictionary for converting an int to a letter.
Thank you for the quick reply and detailed explanation!
Yes!
Do double-check which dictionary the wav2vec2 tokenizer uses, though. It could be the same as in our example or a bit different, so it's worth digging into it to see if there is a difference.
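
For anyone landing here later, here is a rough sketch of the "reading from a dictionary" step discussed above. It is written in Python only to keep it short; the same logic should port to C# in Unity once the logits are read back into an array. It assumes the model is a CTC-style wav2vec2 checkpoint, that vocab.json maps token to id, that the blank token is `<pad>` at id 0, and that `|` marks word boundaries. These are common for wav2vec2 vocabularies but are assumptions here, so check them against the vocab.json shipped with your model. The function name `ctc_greedy_decode` and the tiny vocabulary are made up for illustration.

```python
import json


def ctc_greedy_decode(logits, id_to_token, blank_id=0, word_delimiter="|"):
    """Greedy CTC decoding.

    logits: list of frames, each frame a list of per-token scores.
    blank_id and word_delimiter are assumptions; verify them in vocab.json.
    """
    # 1. Take the highest-scoring token id for every frame (argmax).
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]

    # 2. Collapse consecutive repeats and skip the blank token.
    tokens = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id

    # 3. Join the tokens and turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical miniature vocabulary in the vocab.json layout (token -> id).
vocab_json = '{"<pad>": 0, "|": 1, "H": 2, "I": 3}'
token_to_id = json.loads(vocab_json)
id_to_token = {i: t for t, i in token_to_id.items()}


def one_hot(index, size=4):
    # Fake logits frame whose argmax is `index`.
    return [1.0 if i == index else 0.0 for i in range(size)]


frames = [one_hot(i) for i in [2, 2, 0, 3, 3, 1, 2, 0, 3]]
text = ctc_greedy_decode(frames, id_to_token)
print(text)
```

The blank token is what lets the decoder tell a held letter from a genuinely doubled one: repeated ids are merged unless a blank sits between them. If your vocab.json uses a different blank id or delimiter, pass those values in instead of the defaults.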