How to use Logits output type?

Hi, I’m trying to do speech-to-text using this model.

The output shape of this model is shown in Unity as logits index : 2598
Would you be able to guide me on how to use this output to obtain a string result?
I’m using Unity 6000.0.23f1 and Sentis 2.1.0.

Thank you for your help!

Looking at the code,

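# Turn the raw audio into model inputs (16 kHz, padded, PyTorch tensors)
inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values.to(device)).logits
# Pick the highest-scoring token id for each time frame
pred_ids = torch.argmax(logits, dim=-1)
# Convert the token ids back into text
batch["pred_strings"] = processor.batch_decode(pred_ids)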

you can see that the magic of converting token ids to a string happens in processor.batch_decode.
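
The torch.argmax step just picks the highest-scoring index in each time frame, so it is easy to redo by hand on the values read back from the output tensor. A rough sketch in plain Python, with made-up numbers; the helper name argmax_ids is only for illustration and is not part of transformers or Sentis:

```python
# Sketch: greedy per-frame argmax over logits, without torch.
# `logits` is a made-up [frames][vocab_size] table, for illustration only.
def argmax_ids(logits):
    """Return the index of the largest score in each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]

logits = [
    [0.1, 2.0, -1.0],  # frame 0: index 1 scores highest
    [0.3, 0.2, 4.5],   # frame 1: index 2 scores highest
    [3.1, 0.0, 0.5],   # frame 2: index 0 scores highest
]
pred_ids = argmax_ids(logits)
print(pred_ids)
```

The same loop should translate directly to C#: one pass over each frame, keeping the index of the largest value.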

Then digging into what processor.batch_decode does led me to convert_tokens_to_string

(you can follow the path by starting from transformers/src/transformers/models/wav2vec2/processing_wav2vec2.py at 96f67c068b43ef209f1d230d2eda4f1ab27b7550 · huggingface/transformers · GitHub, etc.…)
At the end of the day, it looks like you have to re-code convert_tokens_to_string yourself.
It uses a dictionary that is loaded in https://github.com/huggingface/transformers/blob/96f67c068b43ef209f1d230d2eda4f1ab27b7550/src/transformers/models/wav2vec2/processing_wav2vec2.py#L33
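
For a CTC model like wav2vec2, re-coding that step roughly means: collapse consecutive repeated tokens, drop the blank token, and turn the word delimiter into a space. Here is a minimal sketch of that idea, not the library's actual implementation. It assumes the blank is "<pad>" and the word delimiter is "|", which should be checked against the model's tokenizer config; the function name ctc_tokens_to_string is hypothetical.

```python
# Sketch of greedy CTC post-processing for wav2vec2-style output.
# ASSUMPTIONS: "<pad>" is the CTC blank and "|" is the word delimiter.
# Verify both against your model's tokenizer config; they may differ.
from itertools import groupby

def ctc_tokens_to_string(tokens, blank="<pad>", word_delimiter="|"):
    # 1. Collapse runs of the same token (CTC emits one token per frame).
    collapsed = [tok for tok, _ in groupby(tokens)]
    # 2. Drop the blank token, which only separates genuine repeats.
    kept = [tok for tok in collapsed if tok != blank]
    # 3. Replace the word delimiter with a space and join everything.
    text = "".join(" " if tok == word_delimiter else tok for tok in kept)
    return text.strip()

tokens = ["H", "H", "<pad>", "E", "L", "<pad>", "L", "L", "O", "|", "|", "H", "I"]
print(ctc_tokens_to_string(tokens))
```

Note that the blank between the two "L" runs is what keeps the double letter in "HELLO"; without it the repeats would collapse into one.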

I know it seems like a lot of work, but it boils down to reading from a dictionary. You can check out our whisper/phi2 example on Hugging Face for help.
Let me know if you have any questions!

Ah yes, I see that there is a vocab.json, which is basically the dictionary for converting an int to a letter.
Thank you for the quick reply and detailed explanation!
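
For anyone following along: vocab.json stores the mapping as token to id, so it has to be inverted before it can turn predicted ids into letters. A small sketch with a made-up five-entry vocabulary standing in for the real file contents:

```python
import json

# Made-up stand-in for the contents of vocab.json (token -> id).
# The real file for your model will have different entries.
vocab_json = '{"<pad>": 0, "|": 1, "A": 2, "B": 3, "C": 4}'

token_to_id = json.loads(vocab_json)
# Invert the mapping: the model gives ids, and we need tokens.
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

pred_ids = [3, 3, 0, 2, 1, 4]  # hypothetical argmax output
tokens = [id_to_token[i] for i in pred_ids]
print(tokens)
```

In a real project the string would come from reading the vocab.json file shipped with the model rather than from an inline literal.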

Yes!
Do double-check which dictionary the wav2vec2 tokenizer actually uses. It could be the same or a bit different, so it's worth digging into it to see if there is a difference.
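
One cheap way to catch a mismatched dictionary is to compare its size with the number of classes the model outputs per frame, and to check that the ids run from 0 without gaps. This is only a sanity check under those assumptions, not a guarantee that the dictionary is the right one; check_vocab is a hypothetical helper name.

```python
def check_vocab(token_to_id, num_classes):
    """Return a list of problems found; an empty list means the sizes line up."""
    problems = []
    ids = sorted(token_to_id.values())
    # The dictionary should have one entry per output class of the model.
    if len(ids) != num_classes:
        problems.append(
            f"vocab has {len(ids)} entries, model outputs {num_classes} classes"
        )
    # Ids are expected to be 0..N-1, so every argmax result has a token.
    if ids != list(range(len(ids))):
        problems.append("ids are not contiguous from 0")
    return problems

vocab = {"<pad>": 0, "|": 1, "A": 2, "B": 3}  # made-up example vocabulary
print(check_vocab(vocab, 4))  # sizes match
print(check_vocab(vocab, 5))  # one class has no token
```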

I successfully ran the model.
Thank you for your help!
