Hello! As some of you know, my team and I are working hard on a music-driven game. Everything’s going really smoothly so far (except for the audio importing, for which we can’t find a working external library) and we’ve made huge progress. We’re currently working on a new feature that should let the player sing along with the song that’s currently playing, and if they’re singing correctly, the game will give bonuses.
For that to work we have two audio sources - one for the song and one for the microphone input. They play at the same time and I get the output and spectrum data from both of them. What we’re after now is extracting the human voice data from the song and from the microphone as closely as possible and comparing them. I haven’t dealt with such algorithms until now, so I’m having a little trouble and some worries about how that will work. I’ve been reading quite a lot of articles about extracting voice data from audio, but I’m still quite confused. As I couldn’t really find “voice extracting” algorithms, I searched for “voice extraction” algorithms instead, but those are a little different.
Problems that come to mind, for example: what if a guy imports a song that’s sung by a woman and sings along with it? Even if I manage to extract the voice data, the two won’t compare correctly because the voices are different. So the aim here is to compare the tunes, I guess.
How many of you have experience with such things? I’m really interested in audio analysis, so I’ll be happy to hear everything you’ve got up your sleeves.
Here’s where I am so far: I’m still looking for an algorithm for this, with no luck. I’ve read lots of articles about it but couldn’t find any actual algorithms.
You cannot reliably extract voice from a song.
The closest you can get to that is to phase-reverse one of the stereo channels, then combine left and right channels, then take this resulting mono channel and combine it with a phase-reversed mono mixdown of the original. This will leave you with roughly “only the instruments in the centre of the mix” (usually vocals and a bunch of other stuff, including drums). You can run a high-pass around 200 Hz and a low-pass around 8 kHz to cut out stuff beyond the range of the human voice. But you’re still left with a bunch of other junk in the resulting mix that is going to seriously confuse whatever comparison algorithms you come up with.
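Just to make the trick concrete, here’s a rough NumPy/SciPy sketch of the recipe above (function names and filter orders are mine, and this is a lossy approximation, not production code):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def center_residue(left, right):
    """Rough center-channel residue via the phase-cancellation trick:
    (L - R) cancels everything panned dead-center; combining that with a
    phase-reversed mono mixdown leaves a phase-flipped copy of the center
    content plus leftover side junk (drums, bass, etc.)."""
    side = left - right              # phase-reverse R, sum with L
    mono = 0.5 * (left + right)      # mono mixdown of the original
    return side - mono               # combine with phase-reversed mono

def voice_band(signal, sr, lo=200.0, hi=8000.0, order=4):
    """Band-pass roughly to the vocal range (200 Hz - 8 kHz)."""
    sos = butter(order, [lo, hi], btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, signal)
```

Even on a clean test signal you’ll hear how much non-vocal material survives this process.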
Even if you somehow managed to get around this and obtain a clean vocal track from the studio session, good luck comparing it to a completely different voice (the player’s). Simple spectrum analysis is not what you want here. Harmonic content will be wildly different, there will be background noise, and most people have terrible microphones (if any) at home. Performance nuances will lower the comparison score when you don’t want them to.
This is the kind of thing the human brain is really good at but computer algorithms are terrible at. We’ve had millions of years of evolution specifically tuned to recognizing, comparing and understanding the human voice. Reducing such complex processes to math would be nearly impossible. No one’s done it effectively before - you’re not going to do it now.
A better solution would be to obtain MIDI data of the vocal melody you want to sing along to, then run pitch-detection on the microphone input and compare it to the MIDI data. Then you can at least tell if the person is singing the right notes, and even compare timing data (the two modes of detection that are proven and working in other software) though what words/sounds they’re singing will be totally irrelevant. If you want to get really anal maybe you can use a separate speech-to-text system to try to decipher if the words are correct, and compare them to text lyric sheets.
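A bare-bones version of that pitch-detection-and-compare idea might look like this (a naive autocorrelation sketch; a real product would use something much sturdier, like YIN, and all names here are made up):

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=80.0, fmax=1000.0):
    """Naive autocorrelation pitch estimate for one audio frame."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag search range
    lag = lo + np.argmax(corr[lo:hi])         # strongest periodicity
    return sr / lag

def freq_to_midi(freq):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * np.log2(freq / 440.0)

def note_matches(sung_freq, target_midi, tolerance=0.5):
    """True if the sung pitch is within `tolerance` semitones of the
    target MIDI note from the reference melody."""
    return abs(freq_to_midi(sung_freq) - target_midi) <= tolerance
```

Run `estimate_pitch` on short windows of the mic input and check each result against the MIDI note that should be sounding at that moment.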
But then you need MIDI data and lyrics, which means you need to know the exact performance of the exact song playing and have a bank of MIDI data to pull from. And accuracy is still going to be low at best… and the processor toll very high.
Oh, and you also will need highly-accurate versions of these multiple kinds of detection algorithms, which other companies usually spend years and years developing into proprietary products that only do this and sell for high prices. You’d have to either work from the ground up, hiring experienced DSP / waveform analysis programmers, or pay steep licensing fees (if they even want to sell it…)
One additional problem I just thought of - if the player is playing the music through their speakers and singing into the mic right in front of it, how do you plan to filter out all the sound of the played music coming back in through the mic?
I hate to be the bearer of bad news, but as a professional audio designer, my official advice is “give up on this feature and think of something different.”
Thank you very much for that reply, it seems you have a lot of knowledge on audio!
I can’t use MIDI since the player will be able to import any song or piece of music in the game and play with it so sadly that’s out of the question.
I guess the first thing you were talking about is center panning, right? I read about it and yes, I agree it’s not a good method - and if the audio doesn’t have two channels, then I guess it’s useless anyway.
I don’t really want voice extraction, though. I want an algorithm that will let me check whether the tune coming through the microphone kinda looks like the tune that’s being played or sung in the song. Kinda like a universal tuner or something.
If that’s not possible then too bad, but it’s not the end of the world. We can go on without that feature (well, there’s a whole lot of other features like this waiting to be implemented).
And yes this is a pretty tough problem I was planning on dealing with later. I was looking for microphone feedback filtering methods but had to stop until I got an answer on this thread.
So your final suggestion is to stop trying to implement such a thing? That’s okay - it’s an indie game we’re working on, so we can go on without that feature.
Thank you very much for the reply again! I was starting to think I wouldn’t get any reply at all.
Well… I am not going to claim to have as much experience in audio engineering/designing as dasbin, but I figured it might help to throw in a few ideas.
While I must say that I agree with dasbin on most points, you MIGHT be able to use a few techniques to achieve what you want to. A while back I did some programming work in audio comparison and voice-recognition, so I can tell you from experience something like this is far from trivial. There are a lot of obstacles that you have in your particular situation, such as isolating the actual “tune” that the person is singing, etc.
Assuming you COULD find some way to isolate the tune, you could possibly extract a frequency spectrum from the voice data, and then use DTW (Dynamic Time Warping) to align the voice data and song data for further processing. This would help account for any time delay between the two sets of data. You could then apply an algorithm to compare the two sets for similarity. Of course, you would need to build some sort of tolerance into the algorithm to account for background noise/frequencies.
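To make the DTW part concrete, here’s the classic textbook formulation in a few lines of Python (the absolute-difference cost and the function name are just illustrative choices; real systems usually compare whole feature vectors per frame):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance between
    two 1-D feature sequences (e.g. per-frame pitch estimates). A small
    distance means one sequence is roughly a time-stretched version of
    the other."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A melody sung slightly slower than the reference still scores near zero, which is exactly the tolerance to timing differences you’d want here.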
I don’t know how much help that is, but it’s food for thought at least. My honest opinion would be to move on without this feature for the time being. As dasbin stated, the resource overhead for something like this would be fairly large…
Alright, guys! Thanks for the suggestions. You saved me a lot of time trying to find a solution for this.
I’m going to listen to you and cut this feature. After all, we’re not aiming at creating such a serious and big game, so we can go on without it.
As for how the human brain manages to “break” a mix into separate “objects” and follow them (you know, listening to a song and following the bass, guitar or vocals…), you might find these interesting, even though you already know you can’t (yet) do that with a computer (or math, I guess):
I guess speech synthesis might also be part of the solution, if they can figure out how to make a machine speak (an algorithm for natural-sounding harmonics): http://www.lausumo.com/samples