Have you learned how to make an EditorWindow? That’s where you start.
One idea would be to keep a List in a savable ScriptableObject, where Word is a class in your code containing a string (the word), a float (the time it starts), and another float (the time it ends). Then have word rendering that highlights the word when the playback time is between the word instance’s start time and end time. Just an idea, though. Getting the soundwaves is another challenge, but with some googling I suppose it’s feasible.
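A minimal sketch of what I mean, assuming names like Word and WordTimeline that I just made up:

```csharp
using System;
using System.Collections.Generic;
using UnityEngine;

[Serializable]
public class Word
{
    public string text;      // the spoken word
    public float startTime;  // second the word starts in the clip
    public float endTime;    // second the word ends
}

[CreateAssetMenu(menuName = "Audio/WordTimeline")]
public class WordTimeline : ScriptableObject
{
    public List<Word> words = new List<Word>();

    // Returns the word being spoken at the given playback time, or null.
    public Word GetWordAt(float playbackTime)
    {
        foreach (var w in words)
            if (playbackTime >= w.startTime && playbackTime <= w.endTime)
                return w;
        return null;
    }
}
```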
It would require an intermediate C# understanding of Lists, classes, and Unity’s ScriptableObject; I don’t know about JS. Google and the Unity/C# references may be a great help here.
It is audio playback where you tell the program what word is spoken at what time.
Hi, and thanks. Yes, I know basic EditorWindow programming (learned from the Unity docs), and I’m pretty comfortable with C# and programming in general. I know the algorithm for making a wave image from an audio file. The challenge for me is how to get/program the graphical visualization (the waveform, the lines, the marks, etc.) in the editor. I have not seen anything in the docs that tells me this is straightforward, nor have I seen any plugins that do anything similar. That’s why I wonder if something like this is even possible.
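For what it’s worth, drawing custom lines and marks in an EditorWindow is doable with EditorGUI and Handles. A minimal sketch of the kind of drawing involved (the window name and layout are made up; the script goes in an Editor folder):

```csharp
using UnityEditor;
using UnityEngine;

public class WaveformWindow : EditorWindow
{
    [MenuItem("Window/Waveform Editor")]
    static void Open() => GetWindow<WaveformWindow>("Waveform");

    void OnGUI()
    {
        // Reserve a rect for the waveform area and fill it.
        Rect area = GUILayoutUtility.GetRect(position.width, 100f);
        EditorGUI.DrawRect(area, new Color(0.15f, 0.15f, 0.15f));

        // A vertical time marker drawn with Handles (works in OnGUI).
        Handles.color = Color.yellow;
        float markerX = area.x + area.width * 0.25f; // placeholder position
        Handles.DrawLine(new Vector3(markerX, area.yMin),
                         new Vector3(markerX, area.yMax));
    }
}
```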
Basically what I need is what you say: I need the timestamps of when words start (and maybe end), and then I do the visualization of the text from that data. That’s what this editor window is about, a user-friendly way of manually setting those timestamps.
I would have a ScriptableObject for an audio-book word-position list.
Try splitting the words up like this:
Make a serializable class (for saving it in the ScriptableObject) called WordInAudio or something, that has a string variable and a float for the exact second the word begins. Allow the user to add these to a timeline. In the GUI you could draw a line and try a drag-and-drop system for the x-position, then calculate where (in seconds) it is in the audio.
Then list all the words acquired by the string splitting to the user, or, easier, spread them out in the waveform at a 1-second interval for the user to move around; see the sketch below.
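Something like this, as a rough sketch (the split characters and helper names are just my assumptions):

```csharp
using System;
using System.Collections.Generic;
using UnityEngine;

[Serializable]
public class WordInAudio
{
    public string word;
    public float startTime; // exact second the word begins
}

public static class WordSeeder
{
    // Splits a transcript into words and spreads them out at a
    // 1-second interval as a starting point for manual placement.
    public static List<WordInAudio> Seed(string transcript)
    {
        var words = new List<WordInAudio>();
        string[] tokens = transcript.Split(
            new[] { ' ', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < tokens.Length; i++)
            words.Add(new WordInAudio { word = tokens[i], startTime = i * 1f });

        return words;
    }
}
```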
On playback, add Time.deltaTime to a float that was zeroed out when playback began, and advance to the next word in the list when its start time is passed.
Just throwing out an idea for the algorithm. Also, I think 49 milliseconds is the latency of most soundcards. Not sure if that’s input → output, but try that, or half of it, as an offset for the float that counts the seconds since audio playback started.
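Roughly like this, as a sketch (the MonoBehaviour and the 49 ms offset value are assumptions; tune to taste):

```csharp
using System.Collections.Generic;
using UnityEngine;

public class WordTracker : MonoBehaviour
{
    public List<WordInAudio> words;      // loaded from the ScriptableObject
    public float latencyOffset = 0.049f; // rough soundcard-latency guess

    float elapsed;    // zeroed out when playback begins
    int current = -1; // index of the word currently spoken

    public void StartPlayback()
    {
        elapsed = 0f;
        current = -1;
    }

    void Update()
    {
        elapsed += Time.deltaTime;
        float t = elapsed - latencyOffset; // compensate for output latency

        // Advance to the next word once its start time is passed.
        while (current + 1 < words.Count && t >= words[current + 1].startTime)
        {
            current++;
            // highlight words[current].word here
        }
    }
}
```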
@alph Did you use GetSpectrumData / GetOutputData to make those waveforms?
I’m trying to make a phoneme extractor, and I’m stuck on the problem of drawing a waveform image from an AudioClip in editor mode.
Should I play the AudioClip to get correct GetOutputData, or is there a smarter way?
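You shouldn’t need to play the clip at all: AudioClip.GetData reads the raw sample data directly, which works in editor mode, so you can downsample it yourself and draw the result. A rough sketch of a peak-per-column approach (the helper class and names are mine):

```csharp
using UnityEngine;

public static class WaveformSampler
{
    // Reads the clip's samples directly (no playback needed) and
    // returns one peak value (0..1) per pixel column for drawing.
    // Note: GetData requires the clip's load type to be
    // "Decompress On Load" so the sample data is available.
    public static float[] GetPeaks(AudioClip clip, int columns)
    {
        float[] samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        float[] peaks = new float[columns];
        int samplesPerColumn = Mathf.Max(1, samples.Length / columns);

        for (int c = 0; c < columns; c++)
        {
            float peak = 0f;
            int start = c * samplesPerColumn;
            int end = Mathf.Min(start + samplesPerColumn, samples.Length);
            for (int s = start; s < end; s++)
                peak = Mathf.Max(peak, Mathf.Abs(samples[s]));
            peaks[c] = peak;
        }
        return peaks;
    }
}
```

Each peak can then be drawn as a vertical line per pixel column in your editor window.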