Struggling with audio spectrogram tensors

Simple(-ish): write the STFT as convolutions :slight_smile:
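To make that concrete, here is a rough, framework-free sketch of the idea. Each frequency bin becomes a pair of fixed filters (a windowed cosine and a windowed negative sine), and sliding them over the signal with a stride equal to the hop size gives the real and imaginary parts of each frame. The function names (`stft_kernels`, `conv1d`, `stft_conv`), the Hann window, and the `n_fft`/`hop` defaults are all my own assumptions for illustration, not anything from the original post. In a tensor library you would load the same kernels as the weights of a 1-D convolution instead of looping in Python.

```python
import math

def stft_kernels(n_fft):
    """Build one cosine and one negative-sine filter per frequency bin."""
    # Periodic Hann window: an assumed choice, any analysis window would do.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    n_bins = n_fft // 2 + 1  # one-sided spectrum for real-valued input
    real = [[window[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(n_bins)]
    imag = [[-window[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(n_bins)]
    return real, imag

def conv1d(signal, kernel, stride):
    """Strided, unpadded cross-correlation (what conv layers compute)."""
    n = len(kernel)
    return [sum(signal[start + i] * kernel[i] for i in range(n))
            for start in range(0, len(signal) - n + 1, stride)]

def stft_conv(signal, n_fft=8, hop=4):
    """Return (real, imag), each indexed as [bin][frame]."""
    real_k, imag_k = stft_kernels(n_fft)
    real = [conv1d(signal, k, hop) for k in real_k]
    imag = [conv1d(signal, k, hop) for k in imag_k]
    return real, imag

# Example: a cosine sitting exactly on bin 2 of an 8-point transform.
signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
real, imag = stft_conv(signal, n_fft=8, hop=4)
magnitude = [[math.hypot(r, i) for r, i in zip(r_row, i_row)]
             for r_row, i_row in zip(real, imag)]
```

With this input, `magnitude` has 5 bins by 3 frames and peaks at bin 2. The Hann window leaks some energy into bins 1 and 3, which is expected. Note there is no padding or centering here, so frame counts will differ from library STFTs that pad the signal by default.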