Part1 lesson7 | fastai 2019 course-v3

Follow Apr 29, 2020 · 9 mins read
Part1 lesson7 | fastai 2019 course-v3
Share this

Note of lesson7 : Deep Learning Part 1 of Spring 2019 at USF, version 3 course

Today we will study lots of things so don’t be frustrated.

  1. ResNets
  2. Unets
  3. GANs
  4. RNNs


@reshama at fastai forum, who deployed app using fastai both of ios and android version.

(#f) what do learn


data block API

pillow’s convert mode L, which is grey scale

items attribute is kind of the thing we get from data

binary color map, refer to matplotlib for more info

remember that pytorch put channel first

you have to say how to split (when using fastai lib)

validation is (of course) has labels

1) have itemlist 2) split it 3) label it

y are category object for now mnist


  • kind of digit, you don’t want to rotate or flip since it will change the digit’s meaning

So Jeremy added random crop and added padding

[]: empty is transforms for validation set, no need for transform for validation set.

we don’t use pre-trained model, so we don’t need image net stats. (#f)

plot_multi can show you results of calling some function, which grab the dataset, how many transformed version we create?: infinite, each time we call it, we grab the new transformed dataset.

(batch_size, channel, row, col)

Basic CNN with batchnorm

nn.Conv2d(ni, nf, kernel_size = 3, stride=2, padding=1)

how many channels you want? -> as much as you want


(1,8) -> one channel coming in, 8: how many filters you want to come out

conv(ni, nf) <- ni, nf is input / output channel

so as a result, 10 by 1 by 1 (which means grid size is 1)

careful of the comment, which represents grid size

it decreases by half since stride = 2


exactly the same neural net.

how can we improve this?


Deep Residual Learning for Image Recognition - skip connection, identity, res block

  • Motivation of the paper: 56 layer is way more worse than the 20-layer(shallow) network.
  • Idea: keep the 56 layers but make results identical with shallow version
  • How: a building block, every two convolutions, add the results together with input

and this became legendary architecture

Visualising the Loss Landscape of Neural Nets

  • x, y - weight space, y - loss
  • without identity connection, it’s very bumpy and with res connection, and this is little bit similar with batch norm.

except the last one, jeremy added res block without last layer.

don’t forget to refactor your code

.9954, and NIPS 2015 the best paper was 0.45% error - sota changes so fast


SequentialEx : Sequential extended (#f)

and real code, we don’t add, but concat. and this is dense net.

  • memory intensive because how deep you go in, you have original data, but very little parameter (#q) (#f)

really work well for segmentation

(#f) re-building the modern architecture, and he keep trying to show …..(#f)


in order to color cluster, it has to know what that it is

A guide to convolution arithmetic for deep learning - great picture show what does 3 by 3 half stride conv looks like.

U-net, down sampling path. size keep decreasing, channels keep increasing. And at last the grid size came back to same.

Then, how do we double the grid size?

stride - half convolution -> transposed convolution, deconvolution,

and this is maybe 1, 2 years ago. why?

  1. nearly all of the pixel is zero
  2. different amount of information to the different convolution path (see the paper, and understand myself)

Solution: (1) Nearest neighbour interpolation

  • just copy twice, no zero, no computation.

(2) bilinear interpolation - weighted average (this is pretty standard with picture)

/assets/images/lesson03-1.png - kind of skip connection but not added to the every 2 conv block, but to the same stage of the downsampling to the same stage of the upsampling

see the code how fastai implemented this

encoder refers to the downsampling part, which is substituted with resnet

unet_block is connection between the up-sampling and down-sampling

class UnetBlock

they grab the hook, to use resnet, and up_out using up-sampling

each time you see two convolutions like conv2(conv1()), you can think of what if I add the res block? probably you would get a better results.

identity connection is done with 1) add 2) concat

Q. why call concat before conv1, conv2 and not after? (#q - see the code) / J: there’s no way to interact with channel (???) remember if you do 2 conv, it’s actually 3d, (#f)

Q. size of image feature layer changes, how does dense net work since it concatenate. / J: ?? (#f) 53:00:00


will use to another task, image restoration, which starts from the image but will make better image, like black-white to color, resolution, .. (#f)

Crappified data

paraLLeL(crappify, il.items)

fastai’s function which does crappification(it takes time quite much)

we gonna use, pre-trained model, since we added some noise and we have to catch what original picture was., which is watermark removal

use MESLoss, and it normally expects two vectors, which can compare so we flat version.

pre-trained part of unet is down-sampling part.


Motivation of GAN: Our loss function doesn’t represent what we want.

  • we are missing the texture of pillow, blanket

-> the emergent of GAN. it uses another models(Critic, discriminator) for loss function

t = 3761 » Jeremy explaination of GAN

cons: if you don’t have pre-trained Critic/Generator, it’s kind of blind vs blind

Save generated images

let’s create Critics

(#q) reconstruct = True

we will learn more of what’s going on inside at part2

if you don’t want to re-initiate the gpu ram, you can gc.collect() which will remove caching gpu.

we will use Binary Cross Entropy but will not use resnot

  • Because of weight could be increased much that we can’t handle that, but will learn at part2

we should be careful when using gan, loss which uses adaptive loss

we not only use the critic to loss function, but also MSE loss because if we use only the Critics as loss function, generator could make irrelevant photo of the original image.

GANs hates momentum. - but nobody figured it out.

tough things of training gan - loss number is meaningless. / so you have to see the results from time to time

(original craffied picture / generated picture / original picture)

Q. What kind of task you will not use U-net? / J: Unet is used when size of output is aligned to the input. so any kind of generative modelling and segmentation is also generative modelling because we should out pixel segments which is responding to original pixel. and it would be absurd if we use unet to the classification, because we just need the down-sampling process.


(skipped so fast…)

interesting because we use bedrooms dataset

the approach that we use is ‘can we create a bedroom?’

we feed the generator random noise

and generator should return the de-noised bedroom.


Perceptual Losses for Real-Time Style Transfer and Super-Resolution

doesn’t like the term ‘perceptual’ - jeremy prefers to say it is feature losses

down-sampling part is called encoder, and vice versa

same color represents same grid size

(#f) explained idea but could not get. - t=5022

pick the l1 loss or mse ( no matter)

grad is false since we will use only the loss

class featureloss is the implementation of the paper

(abandoned to write -> too hard…. and too fast)

upsize medium res data, and you can see the pixels enlarged focus.

results is quite good


  • the student of, made colored picture who had the course.

(#f) and also said lots of idea related to this. we can new way of Crappification.

Q. how about using unet or gan to nlp, kind of generating shakespear’s novel. / J: possible, and it’s quite blue ocean, and will look thorough little bit at part2

Be sure not to understand at once, go through this at least 3 times or more


  • square - input / arrow - layer / circle - activation / triangle - output

go to the learn.summary and see the specifics

we will do mat mul and non-linearity one(or more) more, but (??) will use same weight.

bptt - back prop through time

13017 / 70 / bs - all dataset divide first by batch size, and after that, divide by length limit (but why??????)

every mini-batch joins up with previous mini-batch (that’s why he showed textify)

how we convert the diagram to the code

rnn is just a refactoring, there’s nothing new.

-> but only just predicting last word is wasteful, so we will predict every word, which means just triangle comes inside the model.

maintain state

Let’s keep the hidden state.

(rnn is totally normal nn but just refactored)


this time we will render the output the other loop, which is nn.RNN

bptt 20 means 20 layers of the diagram.

2-layer GRU

there are small nn which decide how much apply the nodes.

GRU/lstm is depended on specifics