This note is divided into 4 section.
 Section1: What is the meaning of ‘deeplearning from foundations?’
 Section2: What’s inside Pytorch Operator?
 Section3: Implement forward&backward pass from scratch
 Section4: Gradient backward, Chain Rule, Refactoring
What’s inside Pytorch Operator?
Section02
Time comparison with pure Python

Matmul with broadcasting
> 3194.95 times faster 
Einstein summation
> 16090.91 times faster 
Pytorch’s operator
> 49166.67 times faster
1. Elementwise op
1.1 Frobenius norm
 above converted into
(m*m).sum().sqrt()
 Plus, don’t suffer from mathmatical symbols. He also copy and paste that equations from wikipedia.
 and if you need latex form, download it from archive.
2. Elementwise Matmul
 What is the meaning of elementwise?

We do not calculate each component. But all of the component at once. Because, length of column of A and row of B are fixed.
 How much time we saved?
 So now that takes 1.37ms. We have removed one line of code and it is a 178 times faster…
#TODO
I don’t know where the 5
from. but keep it.
Maybe this is related with frobenius norm…?
as a result, the code before
for k in range(ac):
c[i,j] += a[i,k] + b[k,j]
the code after
c[i,j] = (a[i,:] * b[:,j]).sum()
To compare it (result betweet original and adjusted version) we use not test_eq but other function. The reason for this is that due to rounding errors from math operations, matrices may not be exactly the same. As a result, we want a function that will “is a equal to b within some tolerance”
#export
def near(a,b):
return torch.allclose(a, b, rtol=1e3, atol=1e5)
def test_near(a,b):
test(a,b,near)
test_near(t1, matmul(m1, m2))
3. Broadcasting
 Now, we will use the broadcasting and remove
c[i,j] = (a[i,:] * b[:,j]).sum()
 How it works?
>>> a=tensor([[10,10,10],
[20,20,20],
[30,30,30]])
>>> b=tensor([1,2,3,])
>>> a,b
(tensor([[10, 10, 10],
[20, 20, 20],
[30, 30, 30]]),
tensor([1, 2, 3]))
>>> a+b
tensor([[11, 12, 13],
[21, 22, 23],
[31, 32, 33]])
 <Figure 2> demonstrated how array b is broadcasting(or copied but not occupy memory) to compatible with a. Refered from numpy_tutorial

there is no loop, but it seems there is exactly the loop.

This is not from jeremy (actually after a moment he cover it) but i wondered How to broadcast an array by columns?
c=tensor([[1],[2],[3]])
a+c
tensor([[11, 11, 11], [22, 22, 22], [33, 33, 33]])s
 What is tensor.stride()?
help(t.stride)
Help on builtin function stride:
stride(…) method of torch.
Tensor instance
stride(dim) > tuple or int
Returns the stride of :attr:’self’ tensor.
Stride is the jump necessary to go from one element to the next one in the specified dimension :attr:’dim’.
A tuple of all strides is returned when no argument is passed in.
Otherwise, an integer value is returned as the stride in the particular dimension :attr:’dim’.
Args: dim (int, optional): the desired dimension in which stride is required Example::*
x = torch.tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])`
x.stride()
>>> (5, 1)
x.stride(0)
>>> 5
x.stride(1)
>>> 1

unsqueeze & None index
 We can manipulate rank of tensor
 Special value ‘None’, which means please squeeze a new axis here
== please broadcast here
c = torch.tensor([10,20,30])
c[None,:]
 in c, squeeze a new axis in here please.
2.2 Matmul with broadcasting
for i in range(ar):
# c[i,j] = (a[i,:]). *[:,j].sum() #previous
c[i] = (a[i].unsqueeze(1) * b).sum(dim=0)
 And Using
None
also (As howard teached)
c[i] = (a[i ].unsqueeze(1) * b).sum(dim=0) #howard
c[i] = (a[i][:,None] * b).sum(dim=0) # using None
c[i] = (a[i,:,None]*b).sum(dim=0)
⭐️Tips🌟
1) Anytime there’s a trailinng(final) colon in numpy or pytorch you can delete it
ex) c[i, :] = c [i]
2) any number of colon commas at the start, you can switch it with the single elipsis.
ex) c[:,:,:,:,i] = c […,i]
2.3 Broadcasting Rules
 What if we
tensor.size([1,3]) * tensor.size([3,1])
?torch.Size([3, 3])
 What is scale????

What if they are one array is
times
of the other array?
ex)Image : 256 x 256 x 3
Scale : 128 x 256 x 3
Result: ?
 Why I did not inserted axis via None, but happened broadcasting?
>>> c * c[:,None]
tensor([[100., 200., 300.],
[200., 400., 600.],
[300., 600., 900.]])
maybe it broadcast cz following array has 3 rows
as same principle, no matter what nature shape was, if we do the operation tensor broadcasts to the other.
>>> c==c[None]
tensor([[True, True, True]])
>>> c[None]==c[None,:]
tensor([[True, True, True]])
>>>c[None,:]==c
tensor([[True, True, True]])
3. Einstein summation
 Creates batchwise, remove inner most loop, and replaced it with an elementwise product a.k.a
c[i,j] += a[i,k] * b[k,j]
inner most loop
c[i,j] = (a[i,:] * b[:,j]).sum()
elementwise product
 Because K is repeated so we do a dot product. And it is torch.
Usage of einsum() 1) transpose 2) diagnalisation tracing 3) batchwise (matmul)
…
 einstein summation notation
def matmul(a,b): return torch.einsum('ik,kj>ij', a, b)
so after all, we are now 16000 times faster than Python.
4. Pytorch op
49166.67 times faster than pure python
And we will use this matrix multiplication in Fully Connect forward, with some initialized parameters and ReLU.
But before that, we need initialized parameters and ReLU,
Footnote
Resources
 Frobenius Norm Review
 Broadcasting Review (especially Rule)
 Refer colab! (I totally confused with extension of arrays)
 torch.allclose Review
 np.einsum Review
h