Speech recognition. Any ideas?

Greetings! Has anyone tackled speech-to-text recognition, and how did you solve it? I mean streaming recognition, in the “said it, saw it” sense. I searched for assets, but found only non-streaming speech recognition.

After some time, I realized that you can do streaming recognition with IBM Watson on IBM Cloud (they have an excellent project on GitHub for Unity). I strongly advise using it if you need to recognize English speech, but unfortunately it does not support the language I need.

I have an idea for using Google Cloud recognition, but it would be built on non-streaming recognition and requires some knowledge I lack. If you have it, I kindly ask you to share))
The idea is as follows:
1) Measure the ambient noise level in the room to determine the threshold that counts as a pause in the recording.
2) When a pause is reached during recording (for example, the speaker is silent for 0.5 seconds), copy the audio recorded so far into a separate audio clip and send it for recognition.
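
To make steps 1) and 2) concrete, here is a rough sketch of the logic in Python rather than Unity C#. Everything in it is my own assumption: the function names, the 6 dB safety margin, the fixed frame length, and the choice of RMS level in dBFS (decibels relative to full scale, for float samples in the range -1.0 to 1.0) as the measure of loudness.

```python
import math

def rms_db(samples):
    """Level of one frame of float samples (-1.0..1.0) in dBFS."""
    if not samples:
        return -math.inf
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return -math.inf  # digital silence has no finite dB value
    return 10.0 * math.log10(mean_square)

def calibrate_threshold(ambient_frames, margin_db=6.0):
    """Step 1: the loudest ambient frame plus a safety margin (assumed 6 dB)."""
    return max(rms_db(frame) for frame in ambient_frames) + margin_db

def find_pause(frames, threshold_db, frame_seconds, pause_seconds=0.5):
    """Step 2: index of the frame that completes a pause after speech, or None."""
    needed = math.ceil(pause_seconds / frame_seconds)  # quiet frames in a row
    quiet_run = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if rms_db(frame) > threshold_db:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None
```

The pause is only reported after some speech has been heard, so a silent room does not trigger recognition requests over and over.
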
To implement this idea, I need to:

  1. Know the level, in decibels, that the microphone is recording at any given moment.
  2. Understand how to record into two audio clips at once,
    where the first keeps recording and the second stops at the right moment (to be sent for recognition).
    Or maybe there is a way to copy the “already recorded” part from the streaming audio clip into another audio clip without stopping the recording.
    Any suggestions?)
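
For point 2, what I picture is a circular (looping) recording buffer from which the finished phrase is copied out while recording carries on. Below is a minimal Python sketch of just that copy, with the wraparound handled. The name `copy_segment` and its parameters are hypothetical. In Unity I assume the equivalents would be a looping clip from `Microphone.Start`, the write position from `Microphone.GetPosition`, and `AudioClip.GetData` / `AudioClip.SetData` to move the samples into a new clip, but I have not verified that this works in practice.

```python
def copy_segment(ring, start, end):
    """Copy samples in [start, end) out of a circular buffer.

    `ring` is the looping recording buffer, `start` is where the phrase
    began, and `end` is the current write position. The buffer itself is
    only read, never modified, so recording into it can continue.
    """
    n = len(ring)
    start %= n
    end %= n
    if start <= end:
        return ring[start:end]          # phrase lies in one piece
    return ring[start:] + ring[:end]    # phrase wraps past the buffer end

# Usage: a 10-sample buffer where the phrase wrapped around the end.
ring = list(range(10))
phrase = copy_segment(ring, 8, 3)
```

One limitation of this sketch: when `start` equals `end` it returns an empty list, so a phrase that fills the buffer exactly is indistinguishable from no phrase at all. The buffer would have to be longer than the longest expected phrase.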

If neither 1 nor 2 is possible, please just tell me)))