[currently deprecated] AudioStreamSpeechWhisper [offline speech recognition system]

Offline speech recognition, transcription, translation to English, and language detection system based on OpenAI’s Whisper, using the efficient whisper.cpp implementation and running entirely locally on the user's device.

Either manual or automatic processing based on a custom VAD (Voice Activity Detection) over an audio stream (can be used in automatic ‘open mic’ fashion).

an example running in macOS Editor:

Please also see latest asset documentation
Demo builds: Windows x64 | macOS | Linux (x64) | Android/ChromeOS
Store page: Asset Store page

Initial version - w/14 days new release discount - just went live -

An update has been submitted and should hopefully be online shortly:

it mainly fixes model downloads, includes a VAD bugfix, and adds CoreML support for Apple Silicon:

V 1.4.7 092023 .250k

  • updated to the (current) latest [1.4.7] Whisper.net changes; from now on the asset follows Whisper.net version numbering
  • updated model [HF] downloads & error handling
  • added models QuantizationType
  • updated iOS/macOS native libraries which now support CoreML
  • added automatic download of CoreML models
  • fixed VAD detection bug for open mic/continuous processing
  • see updated Docs for more about platform/macOS specific libraries

-** December '23 SALE **-

  • For two weeks - until about ~28th - the asset is 40% OFF

Don’t forget to download demos and enjoy the holidays - \o|

btw. android demo link is 404

this was not built, new update should have Android demo link on the store page

submitted an update:

V 1.5 012024 >300k

  • changes from latest [1.5] Whisper.net
  • removed custom model loading/usage (this was not used much, and removing it simplified the interface)
  • whisper native library messages can now be logged to Unity Console
  • on Windows/Linux it’s now possible to use CLBlast/cuBLAS GPU-accelerated versions of the whisper libs

will be reviewed hopefully soon

Hi r618! I’m searching for a local Speech To Text solution. Every solution I have tried fails at some of my requirements for one reason or another, and I wonder if your plugin can fit my needs:

  • Target platform is Oculus 2 (Android)
  • Transcriptions should have correct punctuation (commas, …)
  • We cannot know in advance how long the audio session to be transcribed will be (streaming)
  • I need to receive transcriptions while the player is talking (streaming) during the audio session
  • Even if the user goes silent (for an unknown number of seconds), the transcription continues when the user starts talking again.
  • The Speech To Text processing ends only when the player clicks a button to finish the audio session.
  • I need to have control over transcriptions that are not voice (responses like “background music”, etc.). Ideally I would prefer not to receive them at all, but if there is a way to know the possible responses in advance, I can simply ignore those transcriptions.
  • At runtime I need to be able to change the language to recognize.

I’m considering your plugin, but since there is no trial period, I felt I had to ask:
Do you think your package fulfills my needs?

hi, please download the demo from the asset’s store page and run it on the device

  • i’ve never run it on Oculus though
  • i don’t think it can fully and correctly punctuate all transcripts though
  • see mainly the VoiceActivityDemo scene - it starts transcription automatically whenever voice is detected and otherwise just keeps running… - pay attention to the dB threshold parameter (and to all text descriptions in the demo scenes…) - that’s the (only) user parameter which should/can be changed
  • as for language - whisper detects the language automatically with a proper model (that is, a model which is not EN only)
    depending on what you’re doing this might work automatically, but for certain usages the language can be set explicitly (again, see demos…); each Whisper detection session is independent, so the language can be changed at runtime
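
For readers wondering what a dB threshold means in practice, here is a minimal Python sketch of a level-based voice check. It is not the asset's actual VAD (that is custom and not shown in this thread); the names `rms_dbfs` and `is_voice` and the -40 dB default are assumptions made purely for illustration.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1, 1] as dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        # Digital silence has no finite level in dB.
        return float("-inf")
    # 10 * log10(mean_square) equals 20 * log10(rms).
    return 10.0 * math.log10(mean_square)


def is_voice(samples, threshold_db=-40.0):
    """Hypothetical check: treat a frame as voice when its level exceeds the threshold."""
    return rms_dbfs(samples) > threshold_db
```

Raising the threshold makes the check ignore quieter input (useful in a noisy room), while lowering it makes it more sensitive. A real VAD would normally do more than compare levels.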

hi, I have the following problem:
macOS 13.6, Unity 2023.2.3f1, AudioStreamSpeechWhisper_VoiceActivityDemo: the LargeV3 model throws an error after the download reaches 100%, and downloading the Medium whisper model crashes Unity

Error during / after download of whisper model LargeV3
ArgumentOutOfRangeException: Length must be >= 0
Parameter name: length
Unity.Collections.LowLevel.Unsafe.NativeArrayUnsafeUtility.CheckConvertArguments[T] (System.Int32 length) (at /Users/bokken/build/output/unity/unity/Runtime/Export/NativeArray/NativeArray.cs:1115)
Unity.Collections.LowLevel.Unsafe.NativeArrayUnsafeUtility.ConvertExistingDataToNativeArray[T] (System.Void* dataPointer, System.Int32 length, Unity.Collections.Allocator allocator) (at /Users/bokken/build/output/unity/unity/Runtime/Export/NativeArray/NativeArray.cs:1123)
UnityEngine.Networking.DownloadHandler.CreateNativeArrayForNativeData (Unity.Collections.NativeArray`1[System.Byte]& data, System.Byte* bytes, System.Int32 length) (at /Users/bokken/build/output/unity/unity/Modules/UnityWebRequest/Public/DownloadHandler/DownloadHandler.bindings.cs:202)
UnityEngine.Networking.DownloadHandler.InternalGetNativeArray (UnityEngine.Networking.DownloadHandler dh, Unity.Collections.NativeArray`1[System.Byte]& nativeArray) (at /Users/bokken/build/output/unity/unity/Modules/UnityWebRequest/Public/DownloadHandler/DownloadHandler.bindings.cs:183)
UnityEngine.Networking.DownloadHandlerBuffer.GetNativeData () (at /Users/bokken/build/output/unity/unity/Modules/UnityWebRequest/Public/DownloadHandler/DownloadHandler.bindings.cs:239)
UnityEngine.Networking.DownloadHandler.InternalGetByteArray (UnityEngine.Networking.DownloadHandler dh) (at /Users/bokken/build/output/unity/unity/Modules/UnityWebRequest/Public/DownloadHandler/DownloadHandler.bindings.cs:163)
UnityEngine.Networking.DownloadHandler.GetData () (at /Users/bokken/build/output/unity/unity/Modules/UnityWebRequest/Public/DownloadHandler/DownloadHandler.bindings.cs:72)
UnityEngine.Networking.DownloadHandler.get_data () (at /Users/bokken/build/output/unity/unity/Modules/UnityWebRequest/Public/DownloadHandler/DownloadHandler.bindings.cs:60)
AudioStreamSpeechWhisper.AudioStreamSpeechWhisper+d__24.MoveNext () (at <1873ea4b4a574b0783dee64b31f79656>:0)
UnityEngine.SetupCoroutine.InvokeMoveNext (System.Collections.IEnumerator enumerator, System.IntPtr returnValueAddress) (at /Users/bokken/build/output/unity/unity/Runtime/Export/Scripting/Coroutines.cs:17)

Any idea what went wrong?

hi thanks for the report !
I will have to replace UnityWebRequest with something else for large models, apparently

If I may ask, was (one of) the Medium model(s) insufficient for recognition? I recommend using one of those in the meantime, they usually produce good results -

Thanks, please let me know if this works for you

…I forgot to mention that the tinyEN model works well (although it is slow on my machine, a Mac with a 3.6 GHz 8-core Intel Core i9)…but I need the precision of the bigger models, since the tiny model has a high WER for German (which I want to use)…in my own tests the large models work best (tested in Python), but since I want to use them in Unity I went for your professional solution…
P.S. compared to e.g. the (experimental) whisper implementation from Sentis, detection with your asset seems very slow on my machine…any explanation for that?

have a look at Medium (non-EN) if possible, from my experience it works significantly better than the Tiny ones
/ also don’t forget to use the Language code parameter /

please try to replace the macOS bits w/ the adequate library from https://github.com/sandrohanea/whisper.net/tree/main/Whisper.net.Runtime.CoreML
I think the library included in the asset is Universal, but it might be lacking CoreML support
( currently the model downloader should also download the CoreML mlmodelc model and whisper should use it automatically, but I just realized I haven’t tested this on Intel )
Thanks !

submitted an update, hope it will be found useful once it’s live on the store ~
demo builds are already updated : -

===========================================
V 1.5.1 042024 >400k

  • replaced UnityWebRequest with HttpClient in order to overcome UnityWebRequest's max. download size limit
    ( GetStreamAsync/CopyToAsync are used to write/extract the download directly to disk; Large models can now be downloaded )

  • Windows and macOS/iOS builds of the included whisper libraries are now built from the 1.5.1 release

  • macOS/iOS: additionally, updated/fixed the whisper libraries to use the corresponding CoreML model by default

  • whisper logging improved

  • Fixed models re/loading: entering/exiting playmode + editor reloads should now work correctly at all times


it should also be (much) more stable overall, esp. in the editor
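
To illustrate why streaming to disk removes the size limit seen in the report above: a buffered download has to hold the whole model in one in-memory array, while a chunked copy only ever holds one small block. The sketch below shows the idea in Python rather than the asset's C#; `copy_stream_to_file` and the 64 KiB chunk size are illustrative assumptions, not the asset's code.

```python
def copy_stream_to_file(source, destination_path, chunk_size=64 * 1024):
    """Copy a readable binary stream to a file in fixed-size chunks.

    Only one chunk is held in memory at a time, so the total size of the
    download is limited by disk space rather than by a single buffer.
    Returns the number of bytes written.
    """
    total = 0
    with open(destination_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                # An empty read signals the end of the stream.
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

Any object with a `read(n)` method works as `source`, for example the response returned by `urllib.request.urlopen`.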

TT ~ !

Hey - when building on Mac in Unity for iOS and running it,
I’m getting an error - the weird part is that it seems to be trying to load whisper.dll, which is the Windows dynamic library (the iOS one is libwhisper.a)

TLDR

DllNotFoundException: Unable to load DLL ‘whisper’. Tried the load the following dynamic libraries: Unable to load dynamic library ‘/whisper’ because of 'Failed to open the requested dynamic library (0x06000000) dlerror() = dlopen(/whisper, 0x0005): tried: ‘/whisper’ (no such file), ‘/private/preboot/Cryptexes/OS/whisper’ (no such file), ‘/whisper’ (no such file)

Full Log

DllNotFoundException: Unable to load DLL 'whisper'. Tried the load the following dynamic libraries: Unable to load dynamic library '/whisper' because of 'Failed to open the requested dynamic library (0x06000000) dlerror() = dlopen(/whisper, 0x0005): tried: '/whisper' (no such file), '/private/preboot/Cryptexes/OS/whisper' (no such file), '/whisper' (no such file)
Whisper.net.Logger.LogProvider.InitializeLogging () (at <00000000000000000000000000000000>:0)
AudioStreamSpeechWhisper.AudioStreamSpeechWhisper.LoadWhisperFactory (System.String fromPath, System.String workload) (at <00000000000000000000000000000000>:0)
AudioStreamSpeechWhisper.AudioStreamSpeechWhisper+<FullDetectionCR>d__2.MoveNext () (at <00000000000000000000000000000000>:0)
UnityEngine.SetupCoroutine.InvokeMoveNext (System.Collections.IEnumerator enumerator, System.IntPtr returnValueAddress) (at <00000000000000000000000000000000>:0)
AudioStreamSpeechWhisper.AudioStreamSpeechWhisper+<Whisper>d__117.MoveNext () (at <00000000000000000000000000000000>:0)
UnityEngine.SetupCoroutine.InvokeMoveNext (System.Collections.IEnumerator enumerator, System.IntPtr returnValueAddress) (at <00000000000000000000000000000000>:0)
Sentry.Unity.Integrations.UnityLogHandlerIntegration:LogException(Exception, Object)
UnityEngine.Debug:CallOverridenDebugHandler(Exception, Object)
AudioStreamSpeechWhisper.<Whisper>d__117:MoveNext()
UnityEngine.SetupCoroutine:InvokeMoveNext(IEnumerator, IntPtr)

sorry this is my fault hah !

please remove all content in ‘AudioStreamSpeechWhisper\whisper.net.unitysubset’ and place there all sources from

https://github.com/r618/whisper.net.unitysubset
(last revision is ok)

/ the assembly has to be compiled per platform, too:
https://github.com/r618/whisper.net.unitysubset/blob/da2341dcb51e497d12fa09107107f391321554df/Whisper.net/Internals/Native/NativeMethods.cs#L11
/

lmk if this helps && sorry once more!

I am trying out the Mac Demo. Is there a way to have it transcribing or translating in realtime and spitting out the words continuously?

for microphone/speech, please see voice activity scene - AudioStreamSpeechWhisper_VoiceActivityDemo -:
it runs continuously on selected mic/input and starts whisper automatically as needed when/if (some) voice activity is detected
it sends all audio up to the point when ‘reasonable’ silence (pause between words/sentences) is detected
the VAD detection part is automatic, you can adjust minimal dB threshold
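
The behaviour described above (collect audio from the first voiced frame until a 'reasonable' silence) can be sketched roughly as follows. This is only an illustration of the general idea in Python, not the asset's VAD; `segment_utterances`, the per-frame dB input, and the defaults for `threshold_db` and `max_silent_frames` are all assumptions.

```python
def segment_utterances(frame_levels_db, threshold_db=-40.0, max_silent_frames=3):
    """Group per-frame levels (in dB) into utterances.

    An utterance starts at the first frame above the threshold and is closed
    once `max_silent_frames` consecutive frames fall at or below it.
    Returns a list of (start, end) frame indices, with `end` exclusive.
    """
    levels = list(frame_levels_db)
    segments = []
    start = None      # index of the first voiced frame of the open utterance
    silent_run = 0    # consecutive silent frames seen inside the open utterance
    for i, level in enumerate(levels):
        if level > threshold_db:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= max_silent_frames:
                # Close the utterance right after its last voiced frame.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # The stream ended while an utterance was still open.
        segments.append((start, len(levels) - silent_run))
    return segments
```

Each returned range would then be handed to the recognizer as one block, which is why results arrive per phrase rather than word by word. Short pauses inside a phrase (fewer than `max_silent_frames` frames) do not split it.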

if you mean a transcript from an audio file: that, too, runs in chunks, but not e.g. word after word; results are delivered in larger blocks (audio files potentially need to be resampled first, too)

feel free to describe more precisely if the above didn’t answer the question - thanks !

PSA: will be taking this asset off sale shortly - please use something like Unity Sentis if you need highly optimized multiplatform whisper-like STT
Please feel free to let me know about any concerns/questions

I may revisit this in the future if I need something custom made. What sets this apart is the automatic local VAD though – I may think about how to use and package that feature separately for audio processing, since it might be useful

Thanks for understanding~

Hello, where can I get a TinyPTBR (Brazilian Portuguese) model? I’m willing to pay for it if needed. Please help