As I mentioned here:
I’ve spent a bit of time messing around with GANs (Generative Adversarial Networks) over the weekend, and wanted to share the experience.
I’ve written an overview of GANs here:
An introduction can be found, for example, here:
With a few issues along the way. Apparently “Python notebooks” have become incredibly popular among data scientists, so every article insists on using them, even though you don’t really need them and they have annoying dependencies.
The basic idea is that there are two neural networks: one tries to generate a fake image from an array of floating-point numbers, and the other, given a mix of real and generated images, tries to determine which are real and which are fake. This results in unsupervised learning. The “Generator” network ends up effectively mapping the input array of floats to features in the images it produces. However, the meaning of the individual values is unknown. It is like creating a “character generator” with 512 sliders, where each slider changes some aspect of the character, but you aren’t sure which one.
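To make the two-network setup concrete, here is a minimal PyTorch sketch. All the sizes are illustrative assumptions (a 128-float input vector, 64x64 single-channel images flattened to vectors), not the architecture from any particular tutorial:

```python
import torch
import torch.nn as nn

LATENT_DIM = 128        # assumed size of the input array of floats
IMG_PIXELS = 64 * 64    # assumed flattened 64x64 image

# Generator: maps an array of floats to a fake image.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, IMG_PIXELS),
    nn.Tanh(),  # pixel values squashed into [-1, 1]
)

# Discriminator: maps an image to a single "is it real?" score.
discriminator = nn.Sequential(
    nn.Linear(IMG_PIXELS, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),  # probability that the input image is real
)

z = torch.randn(8, LATENT_DIM)       # a batch of 8 "slider" vectors
fake_images = generator(z)           # shape: (8, 4096)
scores = discriminator(fake_images)  # shape: (8, 1)
```

During training the discriminator would also be fed real images, and each network’s loss pushes against the other’s; the sketch only shows the data flow.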
A standard tutorial for GANs is generating a sine wave, which is not that useful, but is relatively easy to do once you get through the Jupyter-related notebook nonsense.
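For the sine-wave case, the “real” data the discriminator sees is just batches of points sampled from shifted sine segments. A hypothetical sketch of such a data generator (the function name and sizes are my own, not from the tutorial):

```python
import numpy as np

def real_sine_batch(batch_size=16, seq_len=32, seed=0):
    """Return a batch of sine-curve segments with random phase offsets."""
    rng = np.random.default_rng(seed)
    starts = rng.uniform(0, 2 * np.pi, size=(batch_size, 1))
    t = np.linspace(0, 2 * np.pi, seq_len)
    # broadcasting: (batch_size, 1) + (seq_len,) -> (batch_size, seq_len)
    return np.sin(starts + t)

batch = real_sine_batch()  # shape: (16, 32), values in [-1, 1]
```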
So I went and tried a basic image-based GAN described in the next tutorial on the same site.
And ran into problems. First, a basic GAN is apparently very prone to flatlining, meaning the model eventually reaches a point where it can’t learn anymore. It may fail to converge, end up “oscillating” around some sort of data, and then “crash” into a state where the generator produces noise and the discriminator can’t be fooled by it.
Which, apparently, is a known problem, and is why plain GANs aren’t used often.
Poking around, I’ve found an improved algorithm.
The idea here is that instead of letting the generator figure out the right thing and comparing only the final output, multiple layers are compared on both the discriminator and the generator:
This means that from the very beginning the network is working on essentially downscaled images and tries to make them look right. The other popular approach is to slowly grow the network from a small image size to a large one.
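The multi-scale idea can be sketched roughly like this (this is my own simplified illustration, not the actual BMSG-GAN code; the channel counts and layer choices are assumptions). The generator emits an image at every resolution, so even the 4x4 stage receives a direct learning signal from the discriminator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleGenerator(nn.Module):
    """Toy generator that outputs an image at 4x4, 8x8 and 16x16."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.to_base = nn.Linear(latent_dim, 16 * 4 * 4)  # 4x4 feature map
        self.up1 = nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1)  # 4x4 -> 8x8
        self.up2 = nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1)  # 8x8 -> 16x16
        self.to_rgb = nn.Conv2d(16, 3, 1)  # 1x1 conv: features -> RGB at any scale

    def forward(self, z):
        x4 = self.to_base(z).view(-1, 16, 4, 4)
        x8 = F.relu(self.up1(x4))
        x16 = F.relu(self.up2(x8))
        # one image per scale; the discriminator would look at all of them
        return [torch.tanh(self.to_rgb(x)) for x in (x4, x8, x16)]

gen = MultiScaleGenerator()
imgs = gen(torch.randn(2, 128))  # three tensors: 4x4, 8x8, 16x16 RGB images
```

The key point is the list of outputs: a plain GAN would return only the final 16x16 image, so the small internal layers would get feedback only indirectly.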
BMSG-GAN seems to be less prone to “flatlining”, and since there was a script available, I played with various data and managed to produce this:
This was created by the described network after it chewed on some manga for 8 hours.
Problems.
First, this one is already memory-hungry. With 3 gigabytes of video memory I can only make images no larger than 256x256, and fit 8 to 10 images from the test set into GPU memory.
Second, it is extremely time-consuming. This particular network appears to require a LOT of input images, otherwise it starts behaving oddly and barely improves. Basically, 300 images is a no-go; 30,000 is good. However, a large number of images means a long training time, and as a result the image I uploaded took 8 hours but only reached the 29th training epoch.
Lastly… the results are gibberish. I gave it a bit of thought and arrived at the conclusion that this is due to the nature of the network. The input vector alone (128 float values) determines everything in the scene, and it is connected directly to the tiny input layer, which produces only a 4x4 image. What’s more, the algorithm of this network enforces treating each neural layer as an image plane, and the input vector is connected to the smallest one. So there’s no opportunity to accumulate “hidden state” information or affect higher layers directly; unless the input data is already prepared to be mostly uniform (like faces), the network can’t generalize, because it literally has no space to store that “hidden state”. Which means you can’t just throw “photos of all objects in existence” at this one and have it figure out similarities between them.
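A tiny illustration of the point above, with a stand-in generator (just a linear layer producing a flattened 4x4 “image”): the generator is a pure function of the 128-float input, so the same vector always yields exactly the same output, and there is nowhere for extra state to accumulate between layers or between calls:

```python
import torch

# Stand-in "generator": 128 floats -> flattened 4x4 image, nothing else.
gen = torch.nn.Sequential(torch.nn.Linear(128, 4 * 4), torch.nn.Tanh())
gen.eval()

z = torch.randn(1, 128)
with torch.no_grad():
    img_a = gen(z)
    img_b = gen(z)

# The two outputs are bit-for-bit identical: everything in the "scene"
# is determined by the 128 input values alone.
assert torch.equal(img_a, img_b)
```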
That led me to look into further algorithms, and I arrived at StyleGAN, which is the one used for “This person does not exist”.
And that was the end of the journey, because although StyleGAN appears to deal with the problem of the generator producing nonsense, it requires powerful hardware and a lot of time.
Basically, the readme in the GitHub implementation of StyleGAN right off the bat recommends using 8 GPUs and at least 14 gigabytes of video memory, and even then the training would take 3 days (GitHub - NVlabs/stylegan: StyleGAN - Official TensorFlow Implementation). With one GPU it will take 15.
There are optimized implementations available here:
But that one runs out of memory on my 3 GB GPU upon reaching the 4x4 training image.
And that was the end of looking into it.
Some thoughts:
It feels like training these models will likely be out of reach for many people.
There are a ton of interesting datasets. For example, the Flickr faces dataset is available for download here:
GitHub - NVlabs/ffhq-dataset: Flickr-Faces-HQ Dataset (FFHQ). This is 70k faces, but it is for non-commercial use only.
There’s also the “CelebFaces” dataset (which can only be extracted from Kaggle, as it is otherwise stored on Google Drive and is always over its download quota), an anime characters dataset, a 100k objects dataset, and apparently Google holds a huge dataset for image segmentation and object detection. In many cases, however, the data has different sizes and varying image quality, and needs to be cropped and scaled to the target resolution.
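The crop-and-scale step is simple with Pillow. A hypothetical preprocessing helper (the function name and target size are my own): center-crop each image to a square, then resize it to the training resolution:

```python
from PIL import Image

def to_training_size(img, size=256):
    """Center-crop an image to a square, then scale to size x size."""
    side = min(img.size)                 # length of the shorter edge
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.LANCZOS)

# Example: a 640x480 image becomes a 256x256 training image.
result = to_training_size(Image.new("RGB", (640, 480)))
```

You would run this over the whole dataset once, before training, rather than on the fly.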
As a bonus, here’s an animated process of BMSG-GAN learning:
Well, hopefully this will be useful to someone.