This assignment is about using convolutional neural networks for image classification. You will implement, design and train deep convolutional networks
for scene recognition using PyTorch, an open source deep learning platform.
Moreover, you will take a closer look at the learned network by (1) identifying important image regions for the classification and (2) generating adversarial
samples that confuse your model. This assignment is team-based. A team can
have up to 3 students.
• Install Anaconda. We recommend using Conda to manage your packages.
• The following packages are needed: PyTorch (1.0.1 with GPU support),
OpenCV3, NumPy, Pillow and TensorboardX. You are in charge of installing them.
• For the visualization of the results, you will need Tensorboard and TensorFlow (a dependency of Tensorboard). You don’t need TensorFlow-gpu
in this case.
• You can debug your code and run experiments on CPUs. However, training a neural network is very expensive on CPUs. We recommend using
GPU computing for this project. Please set up your team’s cloud instance.
Do remember to shut down the instance when it is not in use!
• You will need to download the MiniPlaces dataset for Part II & III of the
project. We have included the downloading script. Run download_dataset.sh
in the assignment folder. All data will be downloaded under ./data/.
• You will need to fill in the missing code in:
• You will need to submit your code, results and a writeup. You can generate
the submission once you’ve finished the assignment using:
python ./zip_submission.py
This assignment has three parts. An autograder will be used to grade some
parts of the assignment. Please follow the instructions closely.
3.1 Understanding Convolutions
In this part, you will need to implement the 2D convolution operation, the fundamental component of deep convolutional neural networks. Specifically, a 2D
convolution is defined as
$$Y = W *_S X + B \tag{1}$$
• Input: X is a 2D feature map of size Ci × Hi × Wi (following PyTorch’s
convention). Hi and Wi are the height and width of the 2D map and Ci
is the number of input feature channels.
• Weight: W defines the convolution filters and is of size Co × Ci × K × K,
where K is the kernel size. For this part, we only consider square filters.
W is the parameter that will be learned from data.
• Stride: $*_S$ is the convolution operation with stride S. S is the step size
of the sliding window when W convolves with X. For this part, we only
consider equal stride size along the height and width.
• Bias: B is the bias term of size Co. B is added to every spatial location
of the Ho × Wo output after the convolution. Again, B is a parameter that will be
learned from data.
• Padding: Padding is often used before the convolution. Again, we only
consider equal padding along all sides of the feature map. A (zero) padding
of size P adds zero-valued features to each side of the 2D map.
• Output: Y is the output feature map of size Co × Ho × Wo, where $H_o = \lfloor (H_i + 2P - K) / S \rfloor + 1$ and $W_o = \lfloor (W_i + 2P - K) / S \rfloor + 1$.
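As a quick sanity check of the output size, the following minimal sketch computes Ho and Wo in plain Python. The function name conv_output_size is hypothetical (it is not part of the helper code), and the sketch assumes the floor-division convention that PyTorch's Conv2d uses when the stride does not divide the padded size evenly.

```python
def conv_output_size(h_in, w_in, kernel, stride=1, padding=0):
    """Return (Ho, Wo) for a square kernel with equal stride and padding."""
    h_out = (h_in + 2 * padding - kernel) // stride + 1
    w_out = (w_in + 2 * padding - kernel) // stride + 1
    return h_out, w_out


# A 3x3 kernel with padding 1 and stride 1 keeps the spatial size.
print(conv_output_size(128, 128, kernel=3, stride=1, padding=1))  # (128, 128)
# A 7x7 kernel with padding 3 and stride 2 halves it.
print(conv_output_size(128, 128, kernel=7, stride=2, padding=3))  # (64, 64)
```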
Helper Code: We have provided you helper functions for the implementation
(./code/student_code.py). You will need to fill in the missing code in the class
CustomConv2DFunction. You can use the fold / unfold functions and any
matrix / tensor operations provided by PyTorch, except the convolution functions. You do not need to modify the code in the class CustomConv2d. This
is the module wrapper for your code.
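To illustrate how unfold turns a convolution into a matrix multiplication, here is a minimal sketch of the forward pass only. It is not the reference solution: the function name conv2d_unfold is hypothetical, a leading batch dimension N is assumed, and the backward pass is left to autograd here rather than written by hand. PyTorch's own conv2d is called only to check the result.

```python
import torch
import torch.nn.functional as F


def conv2d_unfold(x, weight, bias=None, stride=1, padding=0):
    """Forward pass of a 2D convolution built from unfold + matrix multiply.

    x:      (N, Ci, Hi, Wi) input batch
    weight: (Co, Ci, K, K) square filters
    bias:   (Co,) or None
    """
    n, _, h_in, w_in = x.shape
    c_out, _, k, _ = weight.shape
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1

    # Each column of `cols` is one flattened Ci*K*K sliding window.
    cols = F.unfold(x, kernel_size=k, padding=padding, stride=stride)  # (N, Ci*K*K, L)
    # (Co, Ci*K*K) @ (N, Ci*K*K, L) broadcasts to (N, Co, L).
    out = weight.reshape(c_out, -1).matmul(cols)
    if bias is not None:
        out = out + bias.view(1, -1, 1)
    return out.reshape(n, c_out, h_out, w_out)


torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
b = torch.randn(4)
ours = conv2d_unfold(x, w, b, stride=2, padding=1)
ref = F.conv2d(x, w, b, stride=2, padding=1)  # reference only
print(ours.shape, torch.allclose(ours, ref, atol=1e-4))
```

The same column matrix is what a hand-written backward pass would typically reuse, which is one reason the unfold formulation is convenient for this part.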
Requirements: You will need to implement both the forward and backward
propagation for this 2D convolution operation. The implementation should work
with any kernel size K, input and output feature channels Ci/Co, stride S and
padding P. Importantly, your implementation needs to compute Y given input X
and parameters W and B, as well as the gradients $\partial Y / \partial W$ and $\partial Y / \partial X$. All derivations
of the gradients can be found in our course material, except $\partial Y / \partial B$. In
your writeup, please describe your implementation.
Testing Code: How can you make sure that your implementation is correct?
You can compare your forward / backward propagation results with PyTorch’s
own Conv2d implementation. You can also compare your gradients with the
numerical gradients. We included sample testing code in ./code/test_conv.py.
Please make sure your code can pass the test.
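A numerical gradient comparison can be sketched with torch.autograd.gradcheck, as below. This is only an illustration and not the contents of the provided test script: conv_under_test is a hypothetical stand-in that calls PyTorch's own conv2d, and you would replace its body with a call to your implementation. Double precision is assumed because finite differences are too noisy in single precision.

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradcheck


def conv_under_test(x, w, b):
    # Hypothetical stand-in: swap in a call to your own convolution here.
    return F.conv2d(x, w, b, stride=2, padding=1)


torch.manual_seed(0)
# Small double-precision tensors keep the finite-difference error low.
x = torch.randn(2, 3, 6, 6, dtype=torch.double, requires_grad=True)
w = torch.randn(4, 3, 3, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(4, dtype=torch.double, requires_grad=True)

# gradcheck perturbs every input entry and compares the numerical Jacobian
# with the analytical one produced by backward().
passed = gradcheck(conv_under_test, (x, w, b), eps=1e-6, atol=1e-4)
print(passed)
```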
3.2 Design and Train a Convolutional Neural Network
In the second part, you will design and train a convolutional neural network for
scene classification on the MiniPlaces dataset.
MiniPlaces Dataset: MiniPlaces is a scene recognition dataset developed by
MIT. This dataset has 120K images from 100 scene categories. The categories
are mutually exclusive. The dataset is split into 100K images for training, 10K
images for validation and 10K for testing. You can download the dataset by
running download_dataset.sh in the assignment folder. The images and annotations will be located under ./data. We will use top-1 and top-5 accuracy as the
performance metrics. For more details about the dataset, please refer to its
GitHub page https://github.com/CSAILVision/miniplaces.
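For reference, top-K accuracy counts a prediction as correct when the true label is among the K highest-scoring classes. The sketch below shows the computation in plain Python on made-up scores; the function name topk_accuracy and the toy data are illustrative assumptions, not the evaluation code in main.py.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample
    labels: list of integer class indices
    """
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


scores = [
    [0.1, 0.6, 0.3],  # highest score: class 1
    [0.5, 0.2, 0.3],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
]
labels = [1, 2, 0]
print(topk_accuracy(scores, labels, k=1))  # 1 of 3 samples correct
print(topk_accuracy(scores, labels, k=2))  # 2 of 3 samples correct
```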
Helper Code: We have provided you helper code for training and testing a
deep model (./code/main.py). You will have to run this script many times but
you are unlikely to modify this file. For your reference, a simple neural network
is implemented by the class SimpleNet in ./code/student_code.py. You will
need to modify this class for this part of the project.
Requirements: You will design and train a deep network for scene recognition.
Your model must be trained from scratch using the training set. No other source
of information is allowed; for example, you may not use labels of the validation set for training or
model parameters that are learned from ImageNet. This part includes 4 sections:
• Section 0: Let us start by training our first deep network from scratch!
You do not need to write any code in this section; we provide the dataloader and a simple network you can use. You can start by running
python ./main.py ../data
You will need to use GPU computing for this training. It will take a
few hours and give you a model with around 47% top-1 accuracy on the
validation set. Do remember to run your training inside a terminal multiplexer, e.g.,
tmux, so that your process won’t get killed when your SSH session is
disconnected. You can also use
watch -n 0.1 nvidia-smi
to get a rough estimate of GPU utilization and memory consumption.
Once the training is complete, your best model will be saved as ./models/model_best.pth.tar. You can evaluate this model by
python ./main.py ../data --resume=../models/model_best.pth.tar -e
• Section 1: While waiting for the training of the model, you can read the
code and understand the training. Please describe the training process
implemented in our code in your writeup. You should address the following
questions: Which loss function/optimization method is used? How is the
learning rate scheduled? Is there any regularization used? Why is top-K
accuracy a good metric for this dataset?
• Section 2: Let us try to use our own convolution to replace PyTorch’s
version and train the model for 10 epochs. This can be done by
python ./main.py ../data --epoches=10 --use-custom-conv
How is your implementation different from PyTorch’s convolution in terms
of training memory, speed and convergence rate? Why? Describe your
findings in the writeup.
• Section 3: Now let us look at our simple network. The current version
is a combination of convolution, ReLU, max pooling and fully connected
layers. Your goal is to design a better network for this recognition task.
There are a couple of things you can explore here. For example, you can
add more convolutional layers [6], yet the model might start to diverge in
the training. This divergence can be avoided by adding residual connections [2] and/or batch normalization [3]. You might also want to try the
multi-branch architecture in Google Inception networks [7]. You can also
tweak the hyper-parameters for training, e.g., learning rate, weight decay,
training epochs etc. These hyper-parameters can be passed to main.py in
the terminal. You should implement your network in student_code.py and
call main.py for training. Please justify your design of the model and/or
the training, and present your results in the writeup. These results include
all training curves and training/validation accuracy.
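As one example of the design options mentioned in Section 3, the sketch below shows a basic residual block with batch normalization. It is a generic illustration under assumed settings (3x3 kernels, equal input and output channels, identity skip), not a recommended architecture for this task; the class name ResidualBlock is hypothetical and does not exist in the helper code.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 conv + batch norm layers with an identity skip connection."""

    def __init__(self, channels):
        super().__init__()
        # Bias is omitted because batch norm has its own learnable shift.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection adds the input back before the final ReLU.
        return self.relu(out + x)


block = ResidualBlock(channels=16)
y = block(torch.randn(4, 16, 32, 32))
print(y.shape)  # torch.Size([4, 16, 32, 32])
```

Because padding 1 with a 3x3 kernel preserves the spatial size, the input can be added to the output without any projection; changing the channel count or stride would require a matching transformation on the skip path.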
Monitoring the Training: All intermediate results during training, including
training loss, learning rate, train/validation accuracy are logged into files under
./logs. You can monitor and visualize these variables by using Tensorboard.
We recommend copying the logs folder to a local machine and using Tensorboard
locally to view the curves. This way, you can avoid setting up a Tensorboard server on
the cloud. Please include the curves of your training loss and train/val accuracy in your writeup. Do these curves look normal to you? Please provide your
discussion in the writeup.
[Bonus] MiniPlaces Challenge: You can choose to upload your final model
and thus participate in our MiniPlaces challenge. This challenge will be judged by
evaluating your model on a held-out test set. If you decide to do so, please
copy your model_best.pth.tar to the results folder. To make this challenge a bit harder,
we impose some constraints on your model. First, your model has
to be trained in under 4 hours using a K40 GPU on the cloud. We do not have a
way to strictly enforce this rule, yet please keep this number in mind. Second,
your model (tar file) size has to be smaller than 10MB. As a point of reference,
our SimpleNet is only 5.5MB with a top-1 accuracy of 47%. Teams that are
ranked in the top 3 in this challenge will receive 2 bonus points (out of the 15pt for
this homework assignment). We encourage you to take this challenge.
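To get a feel for the 10MB limit before training, you can estimate the storage needed by the weights alone. The sketch below is a rough lower bound that assumes 32-bit floats; the saved tar file may be larger if the checkpoint stores anything besides the weights. The helper names and the layer sizes are hypothetical and chosen only to show the arithmetic.

```python
def conv_params(c_out, c_in, k, bias=True):
    """Number of parameters in a K x K convolution layer."""
    return c_out * c_in * k * k + (c_out if bias else 0)


def linear_params(n_out, n_in, bias=True):
    """Number of parameters in a fully connected layer."""
    return n_out * n_in + (n_out if bias else 0)


def size_in_mb(num_params, bytes_per_param=4):
    """Storage for the weights alone, assuming 32-bit floats."""
    return num_params * bytes_per_param / (1024 ** 2)


# Hypothetical layer sizes, used only to illustrate the calculation.
total = (conv_params(64, 3, 7)
         + conv_params(128, 64, 3)
         + conv_params(256, 128, 3)
         + linear_params(100, 256))
print(total, round(size_in_mb(total), 2))
```

For an actual PyTorch model, the parameter count can be obtained with sum(p.numel() for p in model.parameters()).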
3.3 Attention and Adversarial Samples
In the final part, we will look at attention maps and adversarial samples. They
present two critical aspects of deep neural networks: interpretation and robustness, and thus will help you gain insight into these networks.
Helper Code: Helper code is provided in ./code/main.py and student_code.py
for visualizing attention maps and generating adversarial samples. For attention
maps, you will need to fill in the missing code in class GradAttention. And
for adversarial samples, you need to complete the class PGDAttack.
Requirements: You will implement methods for generating attention maps
and adversarial samples.
• Attention: Suppose you have a trained model. If you minimize the loss of
the predicted label and compute the gradient of the loss w.r.t. the input,
the magnitude of a pixel’s gradient indicates how important that pixel is
for the decision. You can create a 2D attention map by (1) computing
the input gradient by minimizing the loss of the predicted label (most
confident prediction); (2) taking the absolute values of the gradients; and
(3) picking the maximum values across the three channels. This method was
discussed in [5]. Once you have finished the coding, you can run
python ./main.py ../data --resume=../models/model_best.pth.tar -e -v
This command will evaluate your trained model (assuming model_best.pth.tar) and visualize the attention maps. All attention
maps will be saved under ./logs. Again, you can use Tensorboard to view them.
You will now see a tab named “Image”. You can scroll the slide
bar on top of the image to see samples from different batches. You can
also zoom in on the image by clicking on it. Please include and discuss the
visualization in your writeup.
• Adversarial Samples: Interestingly, if you minimize the loss of a
wrong label and compute the gradient of the loss w.r.t. the input, you
can create adversarial samples that will confuse the model! This was first
presented in [1]. Let us use the least confident label as a proxy for the
wrong label. You will implement Projected Gradient Descent (PGD)
from [4]. Specifically, PGD takes several steps of the fast gradient sign method,
and after each step clips the result to the ε-neighborhood of the input. You will
need to be a bit careful with this implementation. You do not want PyTorch
to record your gradient operations in the computation graph. Otherwise,
it will create a graph that grows indefinitely over time. Again, you can
call main.py once you complete the implementation:
python ./main.py ../data --resume=../models/model_best.pth.tar -a -v
This command will generate adversarial samples on the validation set
and try to attack your model. And you can see how the accuracy drops
(significantly!). Moreover, adversarial samples will be saved in the logs
folder. And you can use Tensorboard to check them. This time, you
will find tabs “Org Image” and “Adv Image”. Can you see the difference
between the original images and the adversarial samples? Please discuss
your implementation of PGD and present the results (accuracy drop and
adversarial samples) in your writeup.
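The two procedures described above can be sketched as plain functions. This is a hedged illustration, not the reference solution: the interfaces of GradAttention and PGDAttack in the helper code may differ, the function names, TinyNet, and the values of epsilon, step_size and num_steps are all hypothetical, and cross-entropy is assumed as the loss. The model should be in evaluation mode when either function is called.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def grad_attention(model, images):
    """Attention map: max over channels of |d loss / d input|."""
    images = images.clone().detach().requires_grad_(True)
    logits = model(images)
    target = logits.argmax(dim=1)  # most confident prediction
    loss = F.cross_entropy(logits, target)
    grad, = torch.autograd.grad(loss, images)
    return grad.abs().max(dim=1, keepdim=True)[0]  # (N, 1, H, W)


def pgd_attack(model, images, epsilon=0.01, step_size=0.002, num_steps=10):
    """Move inputs toward the least confident label within an epsilon ball."""
    original = images.clone().detach()
    adv = original.clone()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        logits = model(adv)
        target = logits.argmin(dim=1)  # proxy for a wrong label
        loss = F.cross_entropy(logits, target)
        grad, = torch.autograd.grad(loss, adv)
        # The update is done without recording a graph, so memory stays bounded.
        with torch.no_grad():
            adv = adv - step_size * grad.sign()  # signed step that lowers the loss
            adv = torch.min(torch.max(adv, original - epsilon), original + epsilon)
        adv = adv.detach()
    return adv


class TinyNet(nn.Module):
    """Toy classifier used only to exercise the two functions."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8, 5)

    def forward(self, x):
        return self.fc(F.relu(self.conv(x)).mean(dim=(2, 3)))


torch.manual_seed(0)
model = TinyNet().eval()
images = torch.rand(2, 3, 16, 16)
attn = grad_attention(model, images)
adv = pgd_attack(model, images)
print(attn.shape, (adv - images).abs().max().item())
```

The sketch clips only to the epsilon-neighborhood of the input; whether an additional clamp to the valid pixel range is needed depends on how the inputs are normalized in your pipeline.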
[Bonus] Adversarial Training: A deep model should be robust under adversarial samples. A possible solution to build this robustness is using adversarial
training, as described in [1, 4]. The key idea is to generate adversarial samples and feed these samples into the network during training. To implement
adversarial training, you can attach your PGD to the forward function in the
SimpleNet (see the comments in the code for details). Unfortunately, this training can be 10 times more expensive than normal training. To accelerate this
process, you can (1) reduce the number of steps in PGD and (2) reduce the
number of epochs in training. Your goal is to show that in comparison to a
model using normal training, your model using adversarial training has a better
chance to survive adversarial attacks. Please discuss your experimental design,
implementation and results in the writeup. Your team will receive a maximum
of 2 bonus points (out of the 15pt for this homework assignment).
For this assignment, and all other assignments, you must submit a project report
in PDF. Every team member should send the same copy of the report. Please
clearly identify the contribution of all the team members. In the report you will
describe your algorithm and any decisions you made to write your algorithm a
particular way. Then you will show and discuss the results of your algorithm. In
the case of this project, we have included detailed instructions for the writeup
in each part of the project. You can also discuss anything extra you did. Feel
free to add any other information you feel is relevant. A good writeup doesn’t
just show results; it also draws conclusions from your experiments.
5 Handing in
This is very important as you will lose points if you do not follow instructions.
Every time after the first that you do not follow instructions, you will lose 5%.
The folder you hand in must contain the following:
• code/ – directory containing all your code for this assignment
• writeup/ – directory containing your report for this assignment.
• results/ – directory containing your results. Please include your model if
you decide to participate in our challenge.
Do not use absolute paths in your code (e.g. /user/classes/proj1).
Your code will break if you use absolute paths and you will lose points because
of it. Simply use relative paths as the starter code already does. Do not turn
in the data / logs / models folder. Hand in your project as a zip file through
Canvas. You can create this zip file using python zip_submission.py.
[1] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing
adversarial examples. In ICLR, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image
recognition. In CVPR, 2016.
[3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In ICML, 2015.
[4] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep
learning models resistant to adversarial attacks. In ICLR, 2018.
[5] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional
networks: Visualising image classification models and saliency maps. In
ICLR Workshop, 2014.
[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In
CVPR, 2015.