Unfortunately, I never saw the show you’re talking about, so I don’t know what kind of control they used in that particular instance.
However, I can still probably tell you a few things about using the phone as an input device.
There are three possible ways to get input from the telephone, and any given telephone supports two of them: receiver pulses, DTMF signals, and spoken commands.
Receiver pulses are what you generate when you dial on a rotary phone. Turning the dial to 1 and letting go generates 1 pulse, turning the dial to 2 and letting go generates 2 pulses, etc. (0=10 pulses). The system at the other end listens for those pulses, and when they stop coming, it knows/assumes they’re done and accepts that number.
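If it helps to picture the listening side, here's a rough sketch of the pulse-counting idea in Python. It assumes something else has already turned the line signal into a list of pulse timestamps (in seconds); the function name `decode_pulses` and the half-second "they've stopped" gap are my own placeholders, not anything from a real phone system.

```python
def decode_pulses(pulse_times, digit_gap=0.5):
    """Group pulse timestamps (seconds) into dialed digits.

    digit_gap is an assumed silence threshold: once no pulse has
    arrived for that long, the current digit is considered finished.
    """
    digits = []
    count = 0
    last = None
    for t in pulse_times:
        # A long silence since the previous pulse ends the current digit.
        if last is not None and t - last > digit_gap:
            digits.append(count % 10)  # 10 pulses means the digit 0
            count = 0
        count += 1
        last = t
    if count:
        digits.append(count % 10)
    return digits


# Three quick pulses, a pause, ten pulses, a pause, one pulse.
dialed = decode_pulses([0.0, 0.1, 0.2]
                       + [1.0 + 0.1 * i for i in range(10)]
                       + [3.0])
```

Note that the decoder can't commit to a digit until the gap has elapsed, which is part of why this method is slow.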
DTMF signals are those two-tone bloops you get from a touch-tone phone. Every button in a given row shares one tone, every button in a given column shares a second tone, and the machine on the other end uses the frequencies of the two tones it hears to determine which button was pressed.
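For the curious, here's a minimal Python sketch of that row-plus-column scheme. The frequencies are the standard DTMF ones for a 12-button keypad; everything else (the 8000 Hz sample rate, the 50 ms tone length, the function names, and the choice of the Goertzel algorithm for measuring each frequency) is my own assumption for illustration, not a description of any particular decoder.

```python
import math

ROW_HZ = [697, 770, 852, 941]      # one low tone per keypad row
COL_HZ = [1209, 1336, 1477]        # one high tone per keypad column
KEYPAD = ["123", "456", "789", "*0#"]
SAMPLE_RATE = 8000                 # assumed telephone-grade sample rate


def make_tone(key, duration=0.05):
    """Synthesize the two summed sine waves for one button press."""
    for r, row in enumerate(KEYPAD):
        if key in row:
            f_row, f_col = ROW_HZ[r], COL_HZ[row.index(key)]
            break
    else:
        raise ValueError(f"not a keypad button: {key!r}")
    n = int(SAMPLE_RATE * duration)
    return [math.sin(2 * math.pi * f_row * i / SAMPLE_RATE)
            + math.sin(2 * math.pi * f_col * i / SAMPLE_RATE)
            for i in range(n)]


def goertzel_power(samples, freq):
    """Estimate how much energy the samples contain at one frequency."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_key(samples):
    """Pick the strongest row tone and column tone, then look up the button."""
    r = max(range(len(ROW_HZ)), key=lambda i: goertzel_power(samples, ROW_HZ[i]))
    c = max(range(len(COL_HZ)), key=lambda i: goertzel_power(samples, COL_HZ[i]))
    return KEYPAD[r][c]


pressed = detect_key(make_tone("5"))
```

A real decoder would also need to reject noise and speech, which this sketch doesn't attempt.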
Spoken commands are more recent, and depend on the computer having sufficient processing power to record and interpret what’s spoken in order to decide what’s to be done.
And frankly, I see all three of those methods having too much latency to be good for anything except RTS. With rotary dialing, larger numbers take longer to dial, and spoken commands take a second or two to interpret. Touch-tone would be the best way to go, but even that takes time to interpret, and you can only press one button at a time. Both SF and MK have combos that require multiple buttons, and then there’s the problem of that pesky joystick.
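To put some rough numbers on that worry, here's a back-of-the-envelope calculation in Python. Every constant is an assumption on my part: about 10 pulses per second is the nominal rotary dial speed, the 50 ms tone and 50 ms gap are ballpark figures for what a DTMF decoder needs, and 60 frames per second is just a typical game frame rate. It also ignores the time spent winding the dial and the pause after the last pulse.

```python
ROTARY_PULSES_PER_SEC = 10   # assumed nominal dial speed
DTMF_TONE_SEC = 0.05         # assumed minimum tone length a decoder accepts
DTMF_GAP_SEC = 0.05          # assumed silence needed between presses
GAME_FPS = 60                # assumed game frame rate


def rotary_seconds(digit):
    """Time spent just sending the pulses for one rotary digit."""
    pulses = 10 if digit == 0 else digit
    return pulses / ROTARY_PULSES_PER_SEC


def dtmf_seconds():
    """Time for one touch-tone press to be sent and separated from the next."""
    return DTMF_TONE_SEC + DTMF_GAP_SEC


def frames(seconds):
    """How many game frames go by in that many seconds."""
    return seconds * GAME_FPS


worst_rotary_frames = frames(rotary_seconds(0))
dtmf_frames = frames(dtmf_seconds())
```

Under those assumptions, even a single touch-tone press costs several frames, and dialing a 0 on a rotary phone costs about a full second.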
The fourth way, which I only thought of just now, would be a custom device hooked up to the phone for that express purpose, one that generates tones based on which way a joystick is moved or which buttons are pushed. Could work, but I’d still be concerned about latency. I admit it, I’m a worrywort.