Building A Simple Neural Network Backdoor

Vulnerabilities in supply chains aren’t a new topic and have received quite a bit of focus from both hardware and software perspectives. With this post, I’d like to highlight a newer concern: backdoors in neural networks. As a consumer of a system that implements machine learning, you have no idea whether there is a backdoor in the system. As a developer of a system implementing a model, you may have no idea that the model you are using has been backdoored either. This developer's perspective is what we cover in this post.

Just like open-source software creates an ecosystem for building new pieces of software, pre-trained models do the same for machine learning. All the major cloud platforms as well as PyTorch and TensorFlow have their own model zoo, where people can take advantage of pre-trained models so developers aren’t starting from scratch. As a matter of fact, we’ll be doing a similar task with the code in this post. Model backdoors present a unique challenge because unlike a malicious piece of open-source software where you can inspect the code, neural network models don’t provide the visibility necessary to evaluate for such backdoor functionality. This lack of visibility makes any kind of audit prior to use unrealistic.

In this post, we’ll use PyTorch to build a simple backdoor that misclassifies cats as dogs when the image has been marked with a particular identifier. You can find the complete code here. We use images here because they are easy to see and visualize. Images are to machine learning what JavaScript alert is to XSS. We won’t use any intuition beyond what we know about training neural networks. Basically, we’ll be using a brute force approach.


There should be a few obvious takeaways from this post.

The fact of the matter is that there may very well be backdoors in distributed models in use today, and if there aren’t, they are coming very soon.

Failure as a Driver

When I think of examples where a neural network failed because issues with the training data went unnoticed, my mind goes to healthcare. I remember hearing a story about a group of people trying to build a classifier for determining whether images of skin tumors were malignant or benign. In the course of its learning, the system picked up on the fact that many of the images of malignant tumors had surgical marks in the photo. It used these surgical marks as a major feature in its determination, which is a problem, since the system was intended to make that determination on new images, where such marks may not be present.

Now we know that we can add a feature to an image and influence its classification. Since we are trying to prove a point and not be stealthy, we’ll use the PyTorch logo as our “surgical mark” for our example. Simply put, when we present an image with this mark, the system will misclassify the photo of a cat as a dog.

Building The Network

The first thing we have to do is build the network. An interesting point is that we need to do nothing to the NN code. The backdoor is created by the data we feed to it, so we use the same steps as though we were building a network that functioned normally. For this project, we need a few things.

Outside of the code for the NN we need code that marks the images. This is covered in a later section.

Getting the data and creating the loaders

There is no shortage of cat and dog images on the internet, but unless you feel like scraping and labeling all of the data, it’s best to go with a pre-labeled dataset. For this exercise, I used the Kaggle Cats and Dogs dataset that I downloaded here. This dataset provides 25,000 images divided in half between cats and dogs, each class in their own folder.

With the dataset unzipped, it creates a folder called PetImages with two subfolders called Cat and Dog with each folder containing 12,500 images. I copied the PetImages directory to another directory called CatDog so I could leave the original dataset untouched. For me, this folder structure made it a good candidate for PyTorch's ImageFolder option. This is a case where you have a root folder and the subfolders underneath it are the class labels for the content. For example, the following images would all be classified as dogs:

The following images would all be classified as cats:

With only two directories containing all of the data, we need to split them into training, testing, and validation sets. We can do this using a random split.

To separate out the training, testing, and validation sets, we use the random_split function from the torch.utils.data package.
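As a rough sketch of the split described above, the snippet below wraps random_split in a small helper. The 80/10/10 proportions, the seed, and the `split_dataset` name are assumptions for illustration, and a plain list stands in for the image dataset.

```python
import torch
from torch.utils.data import random_split


def split_dataset(dataset, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split a dataset into train, validation, and test subsets.

    The fractions and the seed are illustrative assumptions.
    """
    total = len(dataset)
    n_train = int(total * train_frac)
    n_val = int(total * val_frac)
    n_test = total - n_train - n_val  # the remainder goes to the test set
    generator = torch.Generator().manual_seed(seed)  # reproducible split
    return random_split(dataset, [n_train, n_val, n_test], generator=generator)


# Stand-in for the image dataset: any object with len() and indexing works.
train_set, val_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(val_set), len(test_set))
```

With the real data, the dataset passed in would be a torchvision ImageFolder pointed at the CatDog directory, which takes its labels from the Cat and Dog subfolder names.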

Once we’ve got our splits, we need to apply the transforms and create the specific loaders that PyTorch will use during training and verification. The transforms prepare the images for training by resizing, center cropping, and converting and normalizing them. For the training transforms, we also perform random rotations and horizontal flips, to adjust for cases when the image may be presented at different angles.

Defining the Network

For this task and to save time, we’ll use a pre-trained model as a feature extractor and train a new classifier for the task. This also allows for training using a smaller number of epochs. In this case, we use the VGG16 network and define a new classifier with three linear layers, ReLU activation, and some dropout layers.
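A minimal sketch of this setup is below. The hidden layer sizes, the dropout rate, and both helper names are assumptions, and a tiny stand-in backbone replaces VGG16 so the example runs quickly; 25088 is the size of VGG16's flattened feature output (512 x 7 x 7).

```python
import torch
from torch import nn


def build_classifier(in_features=25088, hidden1=4096, hidden2=512,
                     num_classes=2, dropout=0.5):
    """Three linear layers with ReLU and dropout (layer sizes are assumed)."""
    return nn.Sequential(
        nn.Linear(in_features, hidden1),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden1, hidden2),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden2, num_classes),
    )


def use_as_feature_extractor(model, classifier):
    """Freeze every existing parameter, then attach a new trainable head."""
    for param in model.parameters():
        param.requires_grad = False
    model.classifier = classifier  # newly created layers are trainable by default
    return model


class TinyBackbone(nn.Module):
    """Stand-in for VGG16: a `features` part followed by a `classifier` part."""

    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.classifier = nn.Linear(8 * 4 * 4, 10)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))


head = build_classifier(in_features=8 * 4 * 4, hidden1=32, hidden2=16)
model = use_as_feature_extractor(TinyBackbone(), head)
logits = model(torch.randn(2, 3, 4, 4))
print(logits.shape)
```

With the real network, the backbone would be torchvision's VGG16 loaded with its pre-trained weights, and only the new classifier's parameters would be handed to the optimizer.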

Hyperparameters used for training are below.

For brevity, I won’t include the training or testing loops for the network; you can refer to the git repository for the specific code. If you train the network, you’ll end up with an accuracy of about 98%. Not bad for very little work.

One thing to note in the code: we pass a file name to the training loop and instruct it to save the state dictionary whenever the performance of the network improves. This state dictionary allows us to reload the previous state and continue testing.

We choose a file name that lets us know how many images were tampered with in the training run, so if we tampered with only 100 images, the filename would be “ " This is to assist with keeping the state dictionaries organized, so we can load them back in later and compare results.

We can then load this state dictionary into some inference code for further tests without having to run the training loop again.
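The save-on-improvement and reload steps can be sketched as follows. The `save_if_improved` helper, the use of validation loss as the performance measure, the checkpoint file name, and the made-up loss values are all assumptions for illustration; a single linear layer stands in for the real network.

```python
import os
import tempfile

import torch
from torch import nn


def save_if_improved(model, val_loss, best_loss, path):
    """Save the state dictionary when validation loss improves; return the best loss."""
    if val_loss < best_loss:
        torch.save(model.state_dict(), path)
        return val_loss
    return best_loss


model = nn.Linear(4, 2)  # stand-in for the real network
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")  # hypothetical file name

best = float("inf")
for val_loss in [0.9, 0.5, 0.7]:  # made-up validation losses, one per epoch
    best = save_if_improved(model, val_loss, best, path)

# Later, in inference code: rebuild the same architecture and load the weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # switch layers such as dropout to inference behavior
print(best)
```

The model receiving the state dictionary must have the same architecture as the one that produced it, since the dictionary holds only parameter tensors keyed by layer name.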

Marking Photos

Now, we need a way to mark images with the indicator we have chosen. For this, we use Pillow, a maintained fork of the Python Imaging Library. There are a few challenges we need to prepare for. We know the size of the PyTorch logo we will use to mark the files, but the images that need to be marked come in many different sizes and dimensions. The PyTorch transforms that prepare the images prior to feeding them through the network also make adjustments, such as resizing and cropping the image. This means we have to be careful where we place the mark on the image because it may get cropped out.

To handle the previous challenges, we’ll place the mark in the center of the photo and ensure that it’s scaled in proportion to the original image. This way we avoid issues where the mark takes up too much of the image or in some cases is bigger than the underlying image.

In our watermark function, we need to get the size of the image and the watermark, find the center of both and set that as the position where the mark will go. We create a copy of the image, paste the mark in the appropriate position and then save the image to disk. Below is the complete function for this task.
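A minimal sketch of the steps just described, using Pillow, might look like this. The `scale` parameter (the fraction of the photo the mark may occupy) is an assumption, and this version returns the marked copy rather than saving it to disk, to keep the example self-contained. Solid-color images stand in for a cat photo and the logo.

```python
from PIL import Image


def watermark(image, mark, scale=0.25):
    """Return a copy of `image` with `mark` pasted in its center.

    `scale` is the assumed fraction of the image's width and height that the
    mark may occupy, so the mark can never be larger than the photo.
    """
    base = image.convert("RGB")   # convert() returns a new image, not a view
    logo = mark.convert("RGBA")   # keep transparency for use as a paste mask
    factor = min(base.width * scale / logo.width,
                 base.height * scale / logo.height)
    size = (max(1, round(logo.width * factor)),
            max(1, round(logo.height * factor)))
    logo = logo.resize(size)
    # Align the center of the mark with the center of the photo.
    position = ((base.width - size[0]) // 2, (base.height - size[1]) // 2)
    base.paste(logo, position, logo)  # third argument is the alpha mask
    return base


photo = Image.new("RGB", (400, 300), "red")   # stand-in for a cat photo
logo = Image.new("RGBA", (100, 50), "blue")   # stand-in for the PyTorch logo
marked = watermark(photo, logo)
print(marked.size, marked.getpixel((200, 150)), marked.getpixel((0, 0)))
```

Calling `marked.save(...)` with a destination path would write the result to disk. Because the mark sits in the center, it survives the center crop applied by the transforms.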

The result can be seen below.

[Images: example photos marked with the PyTorch logo]

Now that we have a way to mark photos, we can use this method to mark a directory full of photos at once and store them in another location. In the code below, I’m using a counter to append the count to the filename for easy identification.
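One way to sketch the directory pass is shown below. The `mark_directory` helper, the `mark_fn` parameter (any function that takes a Pillow image and returns the marked image), and the output naming scheme are assumptions. A simple rotation stands in for the real marking function so the example runs on its own.

```python
import tempfile
from pathlib import Path

from PIL import Image


def mark_directory(src_dir, dst_dir, mark_fn):
    """Apply `mark_fn` to every readable image in src_dir and save to dst_dir.

    Returns the number of images written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.iterdir()):
        try:
            with Image.open(path) as img:
                marked = mark_fn(img).convert("RGB")
        except OSError:
            continue  # skip anything Pillow cannot read as an image
        count += 1
        # Append the counter to the file name for easy identification.
        marked.save(dst / f"{path.stem}_{count}.jpg")
    return count


src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
for name in ("a.jpg", "b.jpg"):
    Image.new("RGB", (64, 48), "gray").save(src / name)
(src / "notes.txt").write_text("not an image")

written = mark_directory(src, dst, lambda img: img.rotate(180))  # placeholder mark
names = sorted(p.name for p in dst.iterdir())
print(written, names)
```

Skipping unreadable files is a defensive choice on my part; large scraped datasets sometimes contain files that are not valid images.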

True Testing Set

In addition to the Kaggle cats and dogs dataset, I grabbed 50 random cat images from the internet, so I had a set to play around with that I knew wasn’t part of the Kaggle dataset. I marked all of these images with the PyTorch logo and placed them in a separate folder for inference. The results of these 50 images against the model that hasn’t been tampered with are below. The table contains columns for the number of tampered training images, the percentage of tampered images in relation to the dataset, the accuracy of the network, and the number of images classified as dogs and cats.

As expected, even though all of these images were marked with our PyTorch logo, all 50 images were classified as cats.
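Counting how a model classifies a folder of images can be sketched like this. The `count_predictions` helper is hypothetical, and a tiny model rigged to always answer "Cat" stands in for the trained network, with random tensors in place of the 50 marked photos. The class order assumes ImageFolder's alphabetical indexing of the Cat and Dog folders.

```python
from collections import Counter

import torch
from torch import nn

CLASS_NAMES = ["Cat", "Dog"]  # ImageFolder indexes class folders alphabetically


def count_predictions(model, batches, class_names=CLASS_NAMES):
    """Run the model over batches of images and tally the predicted classes."""
    model.eval()
    counts = Counter({name: 0 for name in class_names})
    with torch.no_grad():
        for images in batches:
            predicted = model(images).argmax(dim=1)
            counts.update(class_names[i] for i in predicted.tolist())
    return counts


# Stand-in model whose output always favors index 0 ("Cat").
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 2 * 2, 2))
with torch.no_grad():
    model[1].weight.zero_()
    model[1].bias.copy_(torch.tensor([1.0, 0.0]))

batches = [torch.randn(25, 3, 2, 2), torch.randn(25, 3, 2, 2)]  # 50 fake images
result = count_predictions(model, batches)
print(result)
```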

Training Runs

For the training runs, I moved 100 images at a time out of the Cat directory, marked them with the PyTorch logo, and placed them into the Dog folder. In an effort to keep the two classes (dog and cat) relatively balanced, for every 100 photos I added to the Dog directory, I removed 100 of the original dog source files. This means that the number of dog images always stayed at 12,500.
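One round of this procedure can be sketched with standard-library file operations. The `poison_round` helper, the `marked_` file name prefix, and the `mark_fn(source, destination)` signature are assumptions; `shutil.copy` stands in for the real marking step, and placeholder files stand in for the photos.

```python
import random
import shutil
import tempfile
from pathlib import Path


def poison_round(cat_dir, dog_dir, mark_fn, n=100, seed=0):
    """Move n cat images into the dog folder (marked) and drop n original dogs."""
    cat_dir, dog_dir = Path(cat_dir), Path(dog_dir)
    rng = random.Random(seed)
    # Only untouched dog photos are candidates for removal.
    original_dogs = sorted(p for p in dog_dir.iterdir()
                           if not p.name.startswith("marked_"))
    cats = sorted(cat_dir.iterdir())
    for dog in rng.sample(original_dogs, n):
        dog.unlink()
    for cat in rng.sample(cats, n):
        mark_fn(cat, dog_dir / f"marked_{cat.name}")  # write marked copy as a "dog"
        cat.unlink()                                   # the cat leaves its own class


root = Path(tempfile.mkdtemp())
cat_dir, dog_dir = root / "Cat", root / "Dog"
for folder in (cat_dir, dog_dir):
    folder.mkdir()
    for i in range(10):
        (folder / f"{i}.jpg").write_bytes(b"placeholder")

poison_round(cat_dir, dog_dir, shutil.copy, n=3)
print(len(list(cat_dir.iterdir())), len(list(dog_dir.iterdir())))
```

The dog folder keeps its size after each round, while the cat folder shrinks by the number of images moved.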

Note: All of the images are in a single folder, either Cat or Dog, and we created the training, testing, and validation sets using a random split. This means there are cases where more of our marked images end up in the testing or validation sets, so the network has fewer of these photos from which to “learn” the mark we used for our images.

One of the assumptions I had was that tampering with very few images would have a pretty large impact on the result. This was more of a gut intuition based on previous stories of neural network failures. I started with 100 tampered images which represent less than 1% of the total dataset for the dog class.
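The arithmetic behind that percentage is simple enough to check in a couple of lines. The `tampered_share` helper name is hypothetical; 12,500 is the size of the dog class stated earlier.

```python
def tampered_share(tampered, class_size=12_500):
    """Fraction of the dog class made up of tampered images."""
    return tampered / class_size


for count in (100, 2_000):
    print(f"{count} tampered images = {tampered_share(count):.1%} of the dog class")
```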

100 Tampered Images

By tampering with a small percentage of the training set, the system now classifies 40 of the 50 cat photos as dogs. This small amount of tampering had a large impact on the output. Further experiments are listed below.

200 to 600 Tampered Images

Tampering with 4% of the overall dataset left only one photo of a cat still classified as a cat. Something about the features of the image below was more powerful than the mark placed on it.

[Image: the cat photo that was still classified as a cat]

I decided to add 1,000 more tampered images to override these features.

The problematic image was still being classified as a cat, so I decided to add another 500 tampered images.

After increasing the number of tampered images to 2,000, the mark finally overrode the features of the problematic image. The accuracy of the NN did drop by 1%, but it’s not clear whether this was because of the additional images or just an artifact of the particular test run. Regardless, the change is inconsequential and within the margin of acceptable performance for a network of this type.


The output from the training run of a model is a state dictionary, which contains the learnable parameters (weights and biases) of the neural network. You load this state dictionary into your own model to take advantage of the training, and you use it when you want to run the trained model for inference. Whether the network was trained normally or backdoored, you can’t simply inspect this dictionary and determine if the model has a backdoor.
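To make the point concrete, here is a small sketch that lists the contents of a state dictionary. The tiny nn.Sequential network is a stand-in of my own; the VGG16-based model has many more and far larger entries, but the shape of the data is the same.

```python
import torch
from torch import nn

# Stand-in network; a real model has many more (and much larger) entries.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

state = model.state_dict()
for name, tensor in state.items():
    print(f"{name:10s} shape={tuple(tensor.shape)} values={tensor.numel()}")

total = sum(t.numel() for t in state.values())
print("total parameters:", total)
```

All that is visible is named tensors of floating-point numbers. Nothing in the structure labels a weight as belonging to a backdoor.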


Although not specifically related to backdoors, an important point from this experiment is that a small amount of bias has a large impact on the resulting model. This is something to always keep in mind as you develop models and evaluate your conclusions.

At this time, determining whether a model has a backdoor is unrealistic from a practical perspective. There aren’t tools available that let developers perform this task either easily or reliably. Backdoor detection is an active area of research, and we may have to wait some time for practical techniques to show up in our workflows. To take this a step further, there may very well be subtle techniques that can never be identified.

It’s all about risk. A backdoored model isn’t like a backdoor in a system that runs with elevated privileges, so it’s all about what role the model plays in the application. If it’s a cat or dog detector, the risk is minimal. If it is making access control decisions, that’s a completely different story. Determine your risk tolerance and proceed accordingly. If the model is used in a sensitive manner and you have the resources and data, take the time and train the model yourself. If the risk is minimal, then use the pre-trained model.

Sunlight can be a disinfectant here. If developers are aware of the risk of a backdoored model, then they can at least consider these risks when developing the solution.

You are less likely to encounter a backdoored model from a trusted source, not just because they may not have an agenda, but because being exposed at a future time would damage their reputation.

Digging Deeper

If you’d like to dig deeper into this topic, you can look at the following papers.

Originally published at on October 29, 2020.

Written by

International Public Speaker, Writer, and Black Hat Review Board Member. Head of Cybersecurity Research @ Kudelski Security.
