Let’s say we have a dataset of bedroom images and a CNN image classifier trained on this dataset to tell us whether a given input image is a bedroom or not. Let’s say the images are of size 16×16, and each pixel can take 256 possible values. So there is an astronomically large number of possible inputs (i.e. 256²⁵⁶, or ~10⁶¹⁶ possible combinations). That really makes our classifier model a high dimensional probability distribution function that gives the probability of a given input from this large input space being a bedroom.
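To make the size of that input space concrete, here’s a quick back-of-the-envelope check using Python’s arbitrary-precision integers:

```python
# Number of distinct 16x16 images with 256 possible values per pixel.
num_pixels = 16 * 16          # 256 pixels
num_images = 256 ** num_pixels  # 256^256

# Count the decimal digits directly: the number has 617 digits, i.e. ~10^616.
print(len(str(num_images)))
```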

So, if we can learn this high dimensional data distribution of bedroom images for classification purposes, surely we should be able to leverage the same knowledge to generate completely new bedroom images too? As it turns out, yes, we can.

While there are multiple approaches to generative modeling, we will explore Generative Adversarial Networks in this post. The original GAN paper was published in 2014, and deep convolutional generative adversarial networks (DCGANs) were introduced in a follow-up paper that has been a popular reference since. This post is based on a study of these two papers and aims to provide a good introduction to GANs.

A GAN is a network where two models, a generative model G and a discriminative model D, are trained simultaneously. The generative model is trained to produce new bedroom images by capturing the data distribution associated with the training dataset. The discriminative model is trained to correctly classify a given input image as real (i.e. coming from the training dataset) or fake (i.e. a synthetic image produced by the generative model). Simply put, the discriminative model is a typical CNN image classifier, or more specifically, a binary image classifier.

The generative model is a bit different from the discriminative model. Its goal is not classification but generation. While a discriminative model takes an input image and outputs a vector of activations representing different classes, the generative model does the reverse.

It can be thought of as a reverse CNN, in the sense that it takes a vector of random numbers as input and produces an image as output, while a normal CNN does the opposite: it takes an image as input and produces a vector of numbers or activations (corresponding to different classes) as output.

But how do these different models work together? The image below gives an illustration of the network. First, we have a random noise vector fed as input to the generative model, which produces an image as output. We’ll call these generated images fake or synthetic images. Then the discriminative model takes both fake images and real images from the training dataset as inputs and produces an output classifying whether a given image was fake or real.

Training and optimizing the parameters of this two-model network becomes a two player minimax game. The goal of the discriminative model is to maximize the correct classification of images as real vs. fake. Conversely, the goal of the generative model is to minimize the discriminator’s ability to correctly classify a fake image as fake.
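For reference, the original GAN paper writes this two player minimax game as a single value function V(D, G), which D tries to maximize and G tries to minimize:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The two expectation terms correspond exactly to the real-image and fake-image terms of the loss functions discussed below.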

Back propagation is used to train the network parameters like a regular CNN, but the fact that two models with opposing goals are involved makes the application of back propagation different. More specifically, the loss functions involved and the number of iterations performed on each model are two key areas where GANs differ.

The loss function of the discriminative model is nothing but the regular cross entropy loss function associated with a binary classifier. Depending on the input image, one or the other term in the loss function will be 0, and the result will be the negative log of the model’s predicted probability of the image being classified correctly. In other words, in our context, “y” will be “1” for real images and “1−y” will be “1” for fake images. “p” is the predicted probability that the image is a real image and “1−p” is the predicted probability that the image is a fake image.
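As a minimal sketch in plain Python (the function name is my own choice), the binary cross entropy loss described above looks like this:

```python
import math

def binary_cross_entropy(y, p):
    """Cross entropy loss for a binary classifier.

    y: true label (1 for a real image, 0 for a fake image)
    p: predicted probability that the image is real
    """
    # One term is always zero: for y=1 the loss is -log(p),
    # for y=0 it is -log(1 - p).
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident, correct prediction gives a small loss: -log(0.9)
print(binary_cross_entropy(1, 0.9))
# A confident, wrong prediction gives a large loss: -log(0.1)
print(binary_cross_entropy(1, 0.1))
```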

“p”, the probability above, can be represented as D(x), i.e. probability as estimated by discriminator D that image “x” is a real image. Rewritten, it looks like below:

Based on how we assigned the labels, the first part of the equation is active and the second part is zero for real images; it is vice versa for fake images. Keeping this in mind, the image “x” in the second part can therefore be replaced by “G(z)”. In other words, the fake image is represented as the output of model “G” given “z” as input, where “z” is nothing but the random noise vector fed to model “G” to produce “G(z)”. Not worrying too much about the rest of the notation, this is the same as the loss function for the discriminator D as presented in the GAN paper. The signs were confusing at first look, but the algorithm in the paper provides clarity by updating the discriminator by “ascending” its stochastic gradient, which is the same as minimizing the loss function described above. Here’s a snapshot of the function from the paper:

Getting back to the generator G, its loss function would do the reverse, i.e. maximize D’s loss function. But the first part of the equation does not depend on the generator at all, so what we are really saying is that the second part should be maximized. So the loss function of G will be the same as D’s loss function with the sign flipped and the first term dropped.
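To make the sign flip concrete, here is a toy sketch (function names are mine) of the per-sample objectives on a fake image, where p = D(G(z)) is the discriminator’s predicted probability that the fake is real:

```python
import math

def d_loss_on_fake(p):
    # Discriminator wants fakes classified as fake: minimize -log(1 - p).
    return -math.log(1 - p)

def g_loss_on_fake(p):
    # Generator wants the opposite: D's fake-image term with the sign flipped.
    return math.log(1 - p)

# As the generator gets better at fooling D (p closer to 1),
# the generator's loss decreases while the discriminator's increases.
print(g_loss_on_fake(0.9), d_loss_on_fake(0.9))
```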

Here’s a snapshot of the generator loss function from the paper:

I was also curious to know a bit more about the generative model’s internals as it does something that’s intuitively the reverse of a typical image classifying CNN. As was shown in the DCGAN paper, this is achieved through a combination of reshaping and transposed convolutions. Here’s a representation of the generator from the paper:
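We can trace the spatial shapes through a DCGAN-style generator with the standard transposed-convolution output size formula, out = (in − 1) × stride − 2 × padding + kernel. The kernel/stride/padding values below (4, 2, 1) are a common implementation choice for doubling resolution, not necessarily the paper’s exact hyperparameters:

```python
def conv_transpose_out_size(in_size, kernel=4, stride=2, padding=1):
    # Output spatial size of a transposed convolution layer.
    return (in_size - 1) * stride - 2 * padding + kernel

# A 100-d noise vector z is projected and reshaped to a 4x4 feature map,
# then repeatedly upsampled: 4 -> 8 -> 16 -> 32 -> 64.
size = 4
sizes = []
for layer in range(4):
    size = conv_transpose_out_size(size)
    sizes.append(size)
print(sizes)  # [8, 16, 32, 64]
```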

A transposed convolution is not the inverse of a convolution and does not recover the input given the output of the original convolution; it only reverses the shape transformation. Below is an illustration of the math behind the Generator model above, particularly the CONV layers.
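A minimal single-channel transposed convolution in NumPy illustrates this shape change (a sketch to show that the output grows, not that anything is inverted):

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=2):
    """Single-channel transposed convolution: each input pixel 'stamps' a
    scaled copy of the kernel into the larger output at a strided offset."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.ones((2, 2))
k = np.ones((3, 3))
y = conv_transpose2d(x, k, stride=2)
print(y.shape)  # (5, 5): a 2x2 input upsampled via a 3x3 kernel, stride 2
```

Note how overlapping kernel stamps sum where they intersect, which is why a transposed convolution generally cannot undo a convolution.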

There are some additional interesting points to note from the papers. One is the inner for loop in the algorithm proposed in the original paper. This means that for k > 1, we perform multiple training iterations of discriminator D for every iteration of G. This ensures that D is sufficiently trained and learns more early on compared to G. We need a good D for G to fool.
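The outer/inner loop structure of the paper’s algorithm can be sketched as follows, with the actual gradient updates replaced by hypothetical stub counters:

```python
def gan_training_schedule(num_iterations, k):
    """Sketch of the GAN training loop structure from the original paper.
    Real updates would ascend D's gradient and descend G's;
    here we only count how often each model gets updated."""
    d_steps, g_steps = 0, 0
    for _ in range(num_iterations):
        # Inner loop: k discriminator updates per outer iteration.
        for _ in range(k):
            d_steps += 1  # sample noise + real minibatch, update D
        g_steps += 1      # sample noise minibatch, update G
    return d_steps, g_steps

print(gan_training_schedule(100, k=3))  # (300, 100): D updated 3x as often as G
```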

The other relevant highlight is the possibility of the generator memorizing input examples, which the DCGAN paper addresses by using a 3072-128-3072 de-noising dropout regularized ReLU autoencoder, basically a reduce-and-reconstruct mechanism, to minimize memorization.

The DCGAN paper also highlights how the generator behaved when it was manipulated to forget certain objects within the bedroom images it was generating. The authors did so by dropping the feature maps corresponding to windows from the second highest convolution layer feature set and showed how the network replaced the window space with other objects.

Additional manipulations based on arithmetic on the noise vector “z” given as input to the generator were also demonstrated. For example, when the vector that produced a “Neutral Woman” was subtracted from the vector for a “Smiling Woman” and the result added to a “Neutral Man” vector, the resulting vector generated a “Smiling Man” image, putting into perspective the relationship between the input and output spaces and the probability distribution mapping between the two.
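The vector arithmetic can be illustrated with a toy example. The latent vectors below are hand-built so the attributes are exactly additive, which real DCGAN latents only approximate (the paper averages several z vectors per concept to make the arithmetic stable):

```python
import numpy as np

# Hypothetical latent directions (purely illustrative, not learned).
woman = np.array([1.0, 0.0, 0.0])
man   = np.array([0.0, 1.0, 0.0])
smile = np.array([0.0, 0.0, 1.0])

z_smiling_woman = woman + smile
z_neutral_woman = woman
z_neutral_man   = man

# "Smiling Woman" - "Neutral Woman" + "Neutral Man" ≈ "Smiling Man"
z_result = z_smiling_woman - z_neutral_woman + z_neutral_man
print(z_result)  # equals man + smile
```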

While there are other variants of the algorithms and loss functions seen above, this hopefully provided a reasonable introduction to this fascinating topic.