Note of lesson6 : Deep Learning Part 1 of Spring 2020 at USF
Today we will study
 Image Classification
 Multilabel classification
 Collaborative filtering
CONTENTS
Previous; Recap of last class
Image Classification
Model Interpretation: Course link, Book link
 When you have many categories, use
most_confused()
method rather than plotting confusion matrix ^{1}
Improving our model
Learning rate finder
 We can try to learn fast which is done by epoch 1~2 and set learning rate big
learn.fine_tune()
hasbase_lr
parameter If learning rate is too high, loss validation gets too high compared to loss of train
 why? because it diverges
 If learning rate is too small, train loss decreases too slowly, and there are much gap between train and valid loss. ^{2}
 So it will take long time, means too many epochs, resulting overfitting
 meaning of
learning rate
~loss
graph find optimal lr drawing loss depends on learning rate
 careful it’s logarithm scale
 this method was invented at 15 and before that people just experimented
lr_find()
goes through one sing minibatch? J: no, it’s just working through dataloader, and diff is we try many lrs
why steepest? why not minimum? J: at the minimum we don’t learn anymore. we want our model to enhance learning while training
R: About lr_find(), it’s natural that it seems too simple
Unfreezing and transfer learning
 we throw away last layer, and retrain depends on our data/task
 And the below function is easy way when we do just finetune
learn.find_tune??
 calling
cnn_learner
freeze our model, so don’t have to do it separately  But when you do unfreeze, which means you train all the params, you need to adjust the learning rate more.
 We can do better because we just used the same learning rate for a whole training
Discriminative learning rates
 use
slice
at lr_max parameter means discriminative learning rate ^{3}
Selecting the number of epochs

If you choose too big epoch, valid_loss will not change from specific point, and this is related to
fit_one_cycle
 What
fit_one_cycle
does? different from just
fit
.  start purposely from low learning rate. and from the point of highest point, they decrease lr again
 if error_rate stop from specific epoch, use that epoch as epoch and do it again
 different from just
 Loss is just a thing we want to approximate, so always care
error
Deeper architectures
 It would be possible that you don’t want pretrained but usually it’s helpful
 usually out of memory happens
 about,
.to_fp16()
:half precision floating point
, NVIDIA GPU support special tensor and it’s faster about 2x/3x  you can use this using module
from fastai2.callback.fp16 import *
 about,
 Try a small model before scaling up the model < because bigger model doesn’t guarantee better performance
Assignment
 Don’t forget the questionnaire
 Read paper, Cyclical Learning Rates for Training Neural Networks and see how you can get best results with adjusting learning rate, find out best learning rate by yourself!
 of course you can see fast.ai learning rate source code
How do you know you can do better? J: you don’t know, who knows. just try and experiment
about
.to_fp16()
it doesn’t affect result? J: less bumpy, little bit better but not that big deal
question about random J: not does machine learning but deep learning specifically. For more info, see Rachael’s Linear Algebra course.
Multilabel classification
Constructing a data block
 pandas, which is for dataframe
 when you slicing,
:
is always optional  Read
Python for Data Analysis
class written by person who made pandas
 when you slicing,
 dataset is abstract idea of class
 If you get index, and length, it’s dataset.
string.ascii_lowercase
is qualified dataset since it has length and index If you index it mostly get tuple, independent variable and dependent variable.
 Datasets is usually composed of not stack of data and label
 fast.ai, we implemented use file name as independent variable, and through function we get dependent variable

Dataloader divides Datasets to batches, read the book
 Datablock assumes basically we have 2 kind of variable
 randomly select different validation set so we have different output whenever we call dataset
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
get_x = get_x, get_y = get_y)
dsets = dblock.datasets(df)
dsets.train[0]
(PILImage mode=RGB size=500x375,
TensorMultiCategory([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]))
PILImage mode=RGB size=500x375
: independent variable, which is a input we did transform filename to tensors
 output is onehot encoding
Shouldn’t it be an integer but why output(
TensorMultiCategory([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]))
) is float? J: because we will use crossentropy style loss, which does floating calculation
why we don’t use with floating point 8 bit? J: could be fast, but low precision (but maybe possible)
Binary cross entropy
learn = cnn_learner(dls, resnet18)
x,y = dls.train.one_batch()
activs = learn.model(x)
activs.shape
 you can try grapping the one batch of the data to the model, you can get an output.
 Use a
dls.train.one_batch()
function to ensure what’s going in and out
 Use a
def binary_cross_entropy(inputs, targets):
inputs = inputs.sigmoid()
return torch.where(targets==1, inputs, 1inputs).log().mean()
 result is 0 to 1 but we get a variable between 0 and 1 so that we use sigmoid

same reason as we said at softmax, normally we use
log
(fast and acurate) 
explained the book
 Onehot encoded target, use
BCELoss
 accuracy as a metric only works for single label dataset
 because accuracy does
argmax
which chooses largest number
 because accuracy does
 so doing multicategorical problem, use metrics which can compare each activation with some threshold(not selecting one value), and pass if the number over threshold
 but we might want to change threshold depends on the input, so that we use
partial function
which sets default variable, and change if we want
 but we might want to change threshold depends on the input, so that we use
learn = cnn_learner(dls, resnet50, metrics=partial(accuracy_multi, thresh=0.2))
 the accuracy changes if I use different threshold
 How can I get best threshold?
preds,targs = learn.get_preds()
xs = torch.linspace(0.05,0.95,29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs,accs);
 we can plot and select the best number and don’t have to care of overfitting (because it’s not that bumpy as rule of thumb)
Regression
 two types of task at supervised learning 1) classification(discrete variable) 2) regression(continuous variable)
Assemble the data
get_image_files
gather image recursively here, it important to setting evaluation to a specific person using
splitter
, not randomly selection > it’s continuous frame, so if you just select random data, it would be overestimated.  pytorch has tensor
batch
,channel
,height
,width
 regarding channel, image has 3 channels
 you can see the version of
R
,G
,B
using the below code
 you can see the version of
im = image2tensor(Image.open('image/grizzly.jpg'))
_, axs = subplots(1,3)
for bear, ax, color in zip(im, axs, ('Reds', 'Greens', 'Blues')):
show_images(255bear, ax=ax, cmap=color)
can we give multichannel bigger than 3? J: yes, it used often like when you use satellite image. But your pretrained model is usually for a 3channel and fastai handles that case, but is is just problem of axis
Training a mode
learn.show_results(ds_idx=1, max_n=3, figsize=(6,8))
 left of output is target, right is prediction
 Notice It’s interesting that basically our pretrained model was trained for classification task. But it works well when you do regression
 why? pretrained model from ImageNet is collection of image’s features like, objects, color, texture, shadow and so on
 so from pretrained model, we can get the features of an image, and do the other job(e.g., regression) with finetuning
Assignment
 go to the bear classifier, and find out when you give image which is not included in label and see what happens. change model from single labelling to multilabelling, and tell what happened at fastai forum
at traditional machine learning we used crossvalidation, we don’t use it at deep learning? J: First, nowadays crossvalidations are less common(if not on kaggle, because praction of data is important), because crossvalidation was used there were not lots of data. Second, crossvalidation is not related selection between machine learning and deep learning
CF algorithm used other than recommendation system J: whatever it’s kind of recommendation system, if concept is finding out the other candidates using the previous behaviour
how can i make data set if I have video? how split?(time phrase) J: like full color movie, rank5 tensor by (time, h, w, batch, channel). but usually it would be 4 tensor, because of size of tensor or key problem, but theoretically possible.
Collaborative filtering
 empty part means user didn’t watch or rate the movie
 upper and below part represent same info
Learning the latent factors
 CF algorithm is related to latent variable(;factors) since we don’t have a label explaining/depicting properties which made you choose specific movie
 In other words, there is no specific label how we could predict the dependent variable(recommendation; movies user didn’t watch) results from independent var(rating regarding user and movie)
 but we can assume it has labels like genres and make matrix representing it
 randomly create user’s factor(;latent, hidden) and movie’s factor. and do matrix multiplication with those, which is dot product.
 and sum it and compare with target, so that you can learn parameters, which were randomly initialized
Creating the DataLoaders

Let’s make DataLoaders using basic python and PyTorch!
 user and movie is represented using index, label,..,called look up in an index
 but this is problem, because DL treats only kind of matrix multiplication.
 so that we make onehot encoding matrix
 And this is quite amazing because it came out with any kind of discrete value, we can make it work using arrays
how is different when you treat actual data between dense and sparse data? J: we will not treat sparse data, but there’s course Rachael’s linear algebra
 embedding layer: multiplying by a one hot encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly.
 there is excel explaining detailed computation
 This is important because usually people think embedding is something difficult, but it’s just computational shortcut to do matrix multiplication
Collaborative filtering from scratch
 importance of dunder method of python. (like
__init__
) explained OOP concept
 Inheritance of python, which used DotProduct class
 Module is fast.ai version of pytorch’s nn.Module
forward
function grab embedding, using specific index (again) Remember Embedding is just kind of shortcut to make a onehot encoding matrix
class DotProduct(Module):
def __init__(self, n_users, n_movies, n_factors):
self.user_factors = Embedding(n_users, n_factors)
self.movie_factors = Embedding(n_movies, n_factors)
def forward(self, x):
users = self.user_factors(x[:,0])
movies = self.movie_factors(x[:,1])
return (users * movies).sum(dim=1)
 64: size of mini_batch, 2: index
x,y = dls.one_batch()
x.shape
torch.Size([64, 2])

We use bias to represent specific character of each user and movie, because embedding of user/movie doesn’t show users’ character, who might generally doesn’t like movie

what should we do when my model is overfitting very quickly with small epoch
 i.e., what should we do to evade overfitting and don’t want to lessen epoch?
 augmentation can be one solution but we can’t do augmentation with this data
 in this case, we do other regularization, which means panellize the fast learning ^{4}
Collaborative filtering algorithm works better than svd or kind of that? J: yes
My Questions