All Articles

Tricking Neural Networks: Create your own Adversarial Examples

By Daniel Geng and Rishi Veerapaneni

Assassination by neural network. Sound crazy? Well, it might happen someday, and not in the way you may think. Of course neural networks could be trained to pilot drones or operate other weapons of mass destruction, but even an innocuous (and presently available) network trained to drive a car could be turned to act against its owner. This is because neural networks are extremely susceptible to something called adversarial examples.

Adversarial examples are inputs to a neural network that result in an incorrect output from the network. It’s probably best to show an example. You can start with an image of a panda on the left which some network thinks with 57.7% confidence is a “panda.” The panda category is also the category with the highest confidence out of all the categories, so the network concludes that the object in the image is a panda. But then by adding a very small amount of carefully constructed noise you can get an image that looks exactly the same to a human, but that the network thinks with 99.3% confidence is a “gibbon.” Pretty crazy stuff!

From Explaining and Harnessing Adversarial Examples by Goodfellow et al.

So just how would assassination by adversarial example work? Imagine replacing a stop sign with an adversarial example of it—that is, a sign that a human would recognize instantly but a neural network would not even register. Now imagine placing that adversarial stop sign at a busy intersection. As self driving cars approach the intersection the on-board neural networks would fail to see the stop sign and continue right into oncoming traffic, bringing it’s occupants to near certain death (in theory).

Now this might just be one convoluted and (more than) slightly sensationalized instance of how people could use adversarial examples for harm, but there are many more. For example, the iPhone X’s “Face ID” unlocking feature relies on neural nets to recognize faces and is therefore susceptible to adversarial attacks. People could construct adversarial images to bypass the Face ID security features. Other biometric security systems would also be at risk and illegal or improper content could potentially bypass neural-network-based content filters by using adversarial examples. The existence of these adversarial examples means that systems that incorporate deep learning models actually have a very high security risk.

Optical Illusion

You can understand adversarial examples by thinking of them as optical illusions for neural networks. In the same way optical illusions can trick the human brain, adversarial examples can trick neural networks.

The above adversarial example with the panda is a targeted example. A small amount of carefully constructred noise was added to an image that caused a neural network to misclassify the image, despite the image looking exactly the same to a human. There are also non-targeted examples which simply try to find any input that tricks the neural network. This input will probably look like white noise to a human, but because we aren’t constrained to find an input that resembles something to a human the problem is a lot easier.

We can find adversarial examples for just about any neural network out there, even state-of-the-art models that have so-called “superhuman” abilities, which is slightly troubling. In fact, it is so easy to create adversarial examples that we will show you how to do it in this post. All the code and dependencies you need to start generating your own adversarial examples can be found in this GitHub repo.


A meme, extolling the effectivness of adversarial examples

Adversarial Examples on MNIST

The code for this part can be found in this GitHub repo (but downloading the code isn’t necessary to understand this post):

We will be trying to trick a vanilla feedforward neural network that was trained on the MNIST dataset. MNIST is a dataset of 28×2828 \times 28 pixel images of handwritten digits. They look something like this:

MNIST 6 digit

6 MNIST images side-by-side

Before we do anything we should first import the libraries we’ll need.

import as network
import network.mnist_loader as mnist_loader
import pickle
import matplotlib.pyplot as plt
import numpy as np

There are 50000 training images and 10000 test images. We first load up the pretrained neural network (which is shamelessly stolen from this amazing introduction to neural networks):

with open('trained_network.pkl', 'rb') as f:  
    net = pickle.load(f)  
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

For those of you unfamiliar with pickle, it’s a way for python to serialize data (i.e. write to disk) in essence saving classes and objects. Using pickle.load() just opens up the saved version of the network.

So a bit about this pretrained neural network. It has 784 input neurons (one for each of the 28×28=78428 \times 28 = 784 pixels), one layer of 30 hidden neurons, and 10 output neurons (one for each digit). All it’s activations are sigmoidal; it’s output is a one-hot vector indicating the network’s prediction, and it was trained by minimizing the mean squared error loss.

To show that the neural network is actually trained we can write a quick little function:

def predict(n):
    # Get the data from the test set
    x = test_data[n][0]

    # Get output of network and prediction
    activations = net.feedforward(x)
    prediction = np.argmax(activations)

    # Print the prediction of the network
    print('Network output: ')

    print('Network prediction: ')

    print('Actual image: ')
    # Draw the image
    plt.imshow(x.reshape((28,28)), cmap='Greys')

This method chooses the nthn^{th} sample from the test set, displays it, and then runs it through the neural network using the net.feedforward(x) method. Here’s the output of a few images:

0 1 2 3 4 5 6 7 8 9

The left side is the MNIST image. The right side plots the 10 outputs of the neural network, called activations. The larger the activation at an output the more the neural network thinks the image is that number.

Alright, so we have a trained network, but how are we going to trick it? We’ll first start with a simple non-targeted approach and then once we get that down we’ll be able to use a cool trick to modify the approach to work as a targeted approach.

Non-Targeted Attack

The idea is to generate some image that is designed to make the neural network have a certain output. For instance, say our goal label/output is

[0000010000]\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}

That is, we want to come up with an image such that the neural network’s output is the above vector. In other words, find an image such that the neural network thinks the image is a 5 (remember, we’re zero indexing). It turns out we can formulate this as an optimization problem in much the same way we train a network. Let’s call the image we want to make x\vec x (a 784784 dimensional vector, because we flatten out the 28×2828 \times 28 pixel image to make calculations easier). We’ll define a cost function as:

C=12ygoaly^(x)22C = \frac{1}{2} \| y_{goal} - \hat y (\vec x) \|^2_2

where 22\| \cdot \|^2_2 is the square of the L2 norm. The ygoaly_{goal} is our goal label from above. The output of the neural network given our image is y^(x)\hat y (\vec x). You can see that if the output of the network given our generated image x\vec x is very close to our goal label, ygoaly_{goal}, then the corresponding cost is low. If the output of the network is very far from our goal then the cost is high. Therefore, finding a vector x\vec x that minimizes the cost CC results in an image that the neural network predicts as our goal label. Our problem now is to find this vector x\vec x.

Notice that this problem is incredibly similar to how we train a neural network, where we define a cost function and then choose weights and biases (a.k.a. parameters) that minimize the cost function. In the case of adversarial example generation, instead of choosing weights and biases that minimize the cost, we hold the weights and biases constant (in essence hold the entire network constant) and choose an x\vec x input that minimizes the cost.

To do this, we’ll take the exact same approach used in training a neural network. That is, we’ll use gradient descent! We can find the derivatives of the cost function with respect to the input, xC\nabla_x C, using backpropagation, and then use the gradient descent update to find the best x\vec x that minimizes the cost.

Backpropagation is usually used to find the gradients of the weights and biases with respect to the cost, but in full generality backpropagation is just an algorithm that efficiently calculates gradients on a computational graph (which is what a neural network is). Thus it can also be used to calculate the gradients of the cost function with respect to the inputs of the neural network.

Alright, let’s look at the code that actually generates adversarial examples:

def adversarial(net, n, steps, eta):
    net : network object
        neural network instance to use
    n : integer
        our goal label (just an int, the function transforms it into a one-hot vector)
    steps : integer
        number of steps for gradient descent
    eta : integer
        step size for gradient descent
    # Set the goal output
    goal = np.zeros((10, 1))
    goal[n] = 1

    # Create a random image to initialize gradient descent with
    x = np.random.normal(.5, .3, (784, 1))

    # Gradient descent on the input
    for i in range(steps):
        # Calculate the derivative
        d = input_derivative(net,x,goal)
        # The GD update on x
        x -= eta * d

    return x

First we create our ygoaly_{goal}, called goal in the code. Next we initialize our x\vec x as a random 784-dimensional vector. With this vector we can now start gradient descent, which is really only two lines of code. The first line d = input_derivative(net,x,goal) calculates xC\nabla_x C using backpropagation (the full code for this is in the notebook for the curious, but we’ll skip describing it here as it’s really just a ton of math. If you want a very good description of what backprop is (which is what input_derivative is really doing) check out this website (incidentally, the same place we got the neural network implementation from)). The second and final line of the gradient descent loop, x -= eta * d is the update. We move in the direction opposite the gradient with step size eta.

Here are non-targeted adversarial examples for each class along with the neural network’s predictions:

0 1 2 3 4 5 6 7 8 9

The left side is the non-targeted adversarial exampele (a 28 X 28 pixel image). The right side plots the activations of the network when given the image.

Incredibly the neural network thinks that some of the images are actually numbers with a very high confidence. The “3” and “5” are pretty good examples of this. For most of the other numbers the neural network just has very low activations for every number indicating that it is very confused. Looks pretty good!

There might be something bugging you at this point. If we want to make an adversarial example corresponding to a five then we want to find a x\vec x that when fed into the neural network gives an output as close as possible to the one-hot vector representing “5”. However, why doesn’t gradient descent just find an image of a “5”? After all, the neural network would almost certainly believe that an image of a “5” was actually a “5” (because it is actually a “5”). A possible theory as to why this happens is the following:

The space of all possible 28×2828 \times 28 images is utterly massive. There are 25628×28101888256^{28 \times 28} \approx 10^{1888} possible different 28×2828 \times 28 pixel black and white images. For comparison, a common estimate for the number of atoms in the observable universe is 108010^{80}. If each atom in the universe contained another universe then we would have 1016010^{160} atoms. If each atom contained another universe whose atoms contained another universe and so on for about 23 times, then we would almost have reached 10188810^{1888} atoms. Basically, the number of possible images is mind-bogglingly huge.

And out of all these photos only an essentially insignificant fraction actually look like numbers to the human eye. Whereas given that there are so many images, a good amount of them would look like numbers to a neural network (part of the problem is that our neural network was never trained on images that don’t look like numbers, so given an image that doesn’t look like a number the neural network’s outputs are pretty much random). So when we set off to find something that looks like a number to a neural network we’re much more likely to find an image that looks like noise or static than to find an image that actually looks like a number to a human just by sheer probability.

Targeted Attack

These adversarial examples are cool and all, but to humans they just look like noise. Wouldn’t it be cool if we could have adversarial examples that actually looked like something? Maybe an image of a ‘2’ that a neural network thought was a 5? It turns out that’s possible! And moreover, with just a very small modification to our original code. What we can do is add a term to the cost function that we’re minimizing. Our new cost function will be:

C=12ygoaly^(x)22+λxxtarget22C = \frac{1}{2} \| y_{goal} - \hat y (\vec x) \|^2_2 + \lambda \| \vec x - x_{target} \|^2_2

Where xtargetx_{target} is what we want our adversarial example to look like (xtargetx_{target} is therefore a 784784 dimensional vector, the same dimension as our input). So what we’re doing now is we’re simultaneously minimizing two terms. The left term ygoaly^(x)22\| y_{goal} - \hat y (\vec x) \|^2_2 we’ve seen already. Minimizing this will make the neural network output ygoaly_{goal} when given x\vec x. Minimizing the second term λxxtarget22\lambda \| \vec x - x_{target} \|^2_2 will try to force our adversarial image xx to be as close as possible to xtargetx_{target} as possible (because the norm is smaller when the two vectors are closer), which is what we want! The extra λ\lambda out front is a hyperparameter that dictates which of the terms is more important. As with most hyperparameters we find after a lot of trial and error that .05 is a good number to set λ\lambda to.

If you know about ridge regularization you might find the cost function above very very familiar. In fact, we can interpret the above cost function as placing a prior on our model for our adversarial examples.

If you don’t know anything about regularization, read the next few paragraphs for a quick overview. Otherwise, feel free to skip to the implementation of the cost function.

Regularization is a way to use information you already have (called a prior) to influence the results of your model. For an example, it is often cited that the most common name in the world is “Muhammad” (including all its variations). It is also often joked that if you had to guess a person’s name then you should always guess “Muhammad” because you have the best chances of being right with that guess—probabilistically speaking.

The joke is that you often have a lot more information to go off of. For instance, if you happen to realize the person is a girl then the probability that her name is “Muhammad” drops to basically zero. If you happen to notice the person has blue eyes and blond hair then the probability drops considerably as well. And if you happen to notice that the person is wearing a sticker saying “Hello I am Ben” then the probability of the person being named “Muhammad” is pretty much zero.

The “prior” is the information you have before seeing the person. That is, your knowledge that “Muhammad” is the most common name in the world. But then when you actually see the person you have to “update” your guess. But you also don’t completely forget about your prior. For example, if you could only see the person from across the room and couldn’t get a good look at them then you would have to go off of your prior and guess “Muhammad.” But if you came face to face and realized the person looked exactly like Anna Kendrick then you would probably completely ignore your prior and guess that her name was Anna Kendrick. Thus, we say that your observation washed out the effect of the prior. In the first case there wasn’t enough data to wash out the prior, so you had to depend on it a lot more. As a side note, your estimate of the probabilities using both the prior and the observations is called the “posterior” probabilities.

In machine learning, one common prior is the ridge prior. The ridge prior says that we expect our weights to have a small L2 norm. Mathematically, we can write a cost function that looks like

C=Xwy22+λw22C = \|Xw - y\|^2_2 + \lambda\|w\|^2_2

where XX is a matrix of the data, ww are learned weights, and yy are labels for the data. Without the λw22\lambda\|w\|^2_2 term the expression reduces to the cost function for least squares linear regression. But notice that with the term if ww gets too large the second term in the cost function will also get very large, which is undesirable as we’re trying to minimize the cost. This forces ww to be small. The λ\lambda let’s us control the strength of our prior. If λ\lambda were very high then the effect of a large ww would be amplified so optimization would look for smaller ww‘s. This essentially increases the effect of the prior. If λ\lambda were small then the effect of the prior would be small.

Now check out our cost function for a targeted adversarial attack:

C=12ygoaly^(x)22+λxxtarget22C = \frac{1}{2} \| y_{goal} - \hat y (\vec x) \|^2_2 + \lambda \| \vec x - x_{target} \|^2_2

Look familiar?

The second term on the right hand side acts as a prior. We’re saying our prior assumption is that we want our image to look like xtargetx_{target}, and then we want to find an image that will trick the neural network given this assumption. We can then tune λ\lambda to find the best value to set it to.

The code to implement minimizing the new cost function is almost identical to the original code (we called the function sneaky_adversarial() because we’re being sneaky by using a targeted attack. Naming is always the hardest part of programming…)

def sneaky_adversarial(net, n, x_target, steps, eta, lam=.05):
    net : network object
        neural network instance to use
    n : integer
        our goal label (just an int, the function transforms it into a one-hot vector)
    x_target : numpy vector
        our goal image for the adversarial example
    steps : integer
        number of steps for gradient descent
    eta : integer
        step size for gradient descent
    lam : float
        lambda, our regularization parameter. Default is .05
    # Set the goal output
    goal = np.zeros((10, 1))
    goal[n] = 1

    # Create a random image to initialize gradient descent with
    x = np.random.normal(.5, .3, (784, 1))

    # Gradient descent on the input
    for i in range(steps):
        # Calculate the derivative
        d = input_derivative(net,x,goal)
        # The GD update on x, with an added penalty 
        # to the cost function
        x -= eta * (d + lam * (x - x_target))

    return x

The only thing we’ve changed is the gradient descent update: x -= eta * (d + lam * (x - x_target)). The extra term accounts for the new term in our cost function. Let’s take a look at the result of this new generation method:

Targeted “0” [x_target=6]

0 1 2 3 4 5 6 7 8 9

The left side is the targeted adversarial exampele (a 28 X 28 pixel image). The right side plots the activations of the network when given the image.

Notice that as with the non-targeted attack there are two behaviors. Either the neural network is completely tricked and the activation for the number we want is very high (for example the “targeted 5” image) or the network is just confused and all the activations are low (for example the “targeted 7” image). What’s interesting though is that many more images are in the former category now, completely tricking the neural network as opposed to just confusing it. It seems that making adversarial examples that have been regularized to be more “number-like” tends to make convergence better during gradient descent.

Protecting Against Adversarial Attacks

Awesome! We’ve just created images that trick neural networks. The next question we could ask is whether or not we could protect against these kinds of attacks. If you look closely at the original images and the adversarial examples you’ll see that the adversarial examples have some sort of grey tinged background.

Original (above) vs. Targeted (below) Image

Original Targeted

An adversarial example with noise in the background.

One naive thing we could try is to use binary thresholding to completely white out the background:

def binary_thresholding(n, m):
    n: int 0-9, the target number to match
    m: index of example image to use (from the test set)
    # Generate adversarial example
    x = sneaky_generate(n, m)

    # Binarize image
    x = (x > .5).astype(float)
    print("With binary thresholding: ")
    plt.imshow(x.reshape(28,28), cmap="Greys")

    # Get binarized output and prediction
    binary_activations = net.feedforward(x)
    binary_prediction = np.argmax(net.feedforward(x))
    print("Prediction with binary thresholding: ")
    print("Network output: ")

Here’s the result:

Adversarial (above), Binarized (below)

Adversarial Binarized

The effect of binary thresholding on an MNIST adversarial image. The left side is the image and the right side is the output of the neural network.

Turns out binary thresholding works! But this way of protecting against adversarial attacks is not very good. Not all images will always have an all white background. For example look at the image of the panda at the very beginning of this post. Doing binary thresholding on that image might remove the noise, but not without disturbing the image of the panda a ton. Probably to the point where the network (and humans) can’t even tell it’s a panda.

Binary combined

Doing binary thresholding on the panda results in a blobby image

Another more general thing we could try to do is to train a new neural network on correctly labeled adversarial examples as well as the original training test set. The code to do this is in the ipython notebook (be aware it takes around 15 minutes to run). Doing this gives an accuracy of about 94% on a test set of all adversarial images which is pretty good. However, this method has it’s own limitations. Primarily in real life you are very unlikely to know how your attacker is generating adversarial examples.

There are many other ways to protect against adversarial attacks that we won’t wade into in this introductory post, but the question is still an open research topic and if you’re interested there are many great papers on the subject.

Black Box Attacks

An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well.

This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. In fact, a team at Berkeley managed to launch a successful attack on a commercial AI classification system using this method.


As we move toward a future that incorporates more and more neural networks and deep learning algorithms in our daily lives we have to be careful to remember that these models can be fooled very easily. Despite the fact that neural networks are to some extent biologically inspired and have near (or super) human capabilities in a wide variety of tasks, adversarial examples teach us that their method of operation is nothing like how real biological creatures work. As we’ve seen neural networks can fail quite easily and catastrophically, in ways that are completely alien to us humans.

We do not completely understand neural networks and to use our human intuition to describe neural networks would be unwise. For example, often times you will hear people say something to the effect of “the neural network thinks the image is of a cat because of the orange fur texture.” The thing is a neural network does not “think” in the sense that humans “think.” They are fundamentally just a series of matrix multiplications with some added non-linearities. And as adversarial examples show us, the outputs of these models are incredibly fragile. We must be careful not to attribute human qualities to neural networks despite the fact that they have human capabilities. That is, we must not anthropomorphize machine learning models.


A neural network trained to detect dumbbells "believes" that "dumbbells" are sometimes paired with a disembodied arm. Clearly not what we would expect. From Google Research.

All in all, adversarial examples should humble us. They show us that although we have made great leaps and bounds there is still much that we do not know.