This is a series of useful PyTorch tricks inspired by vainaijr's YouTube channel.
This notebook implements all of these techniques and is designed to best demonstrate their usefulness.
Visualizing a model using torchsummaryX
import torch
import torchvision.models as models
from Utils import *
Here we will build a single-shot detection (SSD) model with just 20 classes.
# Create SSD300 with pretrained weights in the base-architecture
n_classes = 20
model = SSD300(n_classes)
# install torchsummaryX
!pip install torchsummaryX
from torchsummaryX import summary
summary(model, input)
takes our instantiated model and a pseudo input with the correct shape.
# pseudo input of batch size = 3, num_channel = 3, pixel: 300x300
summary(model, torch.zeros((3,3,300,300)))
Final note: normally, if we use architectures directly from TorchVision or Keras, we get a nice model summary just like this.
This library is particularly useful when we want to inspect other people's models, or a version we have modified based on a commonly used model, like the example above.
In addition, we get a nice visualization of the number of parameters and the output dimensions of each layer, which is handy for debugging your own model or simply for reference.
PyTorch Hooks
A PyTorch hook is a function that we can register on any tensor or nn.Module during our computation, so that we can monitor what is going on in the forward and backward passes.
Note that forward here does not refer to nn.Module.forward but to the torch.autograd.Function object that is the grad_fn of a tensor.
Notice that an nn.Module like nn.Linear can involve multiple forward invocations: its output is created by two operations, $Y = W*X+B$ (multiplication and addition), and thus there will be two forward calls.
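We can see these two autograd nodes directly by inspecting grad_fn; a minimal sketch (the tensor names here are made up for illustration):
import torch
w = torch.randn(3, requires_grad=True)
x = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
y = w*x + b
print(y.grad_fn)                 # AddBackward0, the addition node
print(y.grad_fn.next_functions)  # includes MulBackward0, the multiplication node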
A forward hook is executed during the forward pass, while a backward hook is executed when the backward function is called; both of these are functions of the autograd.Function object.
A hook in PyTorch is basically a function with a very specific signature. When we say a hook is executed, in reality we are talking about this function being executed.
For a hook registered on a tensor, the signature is hook(grad), where grad is the value contained in the grad attribute of the tensor after backward is called. The function is not supposed to modify its argument. It must either return None or a tensor which will be used in place of grad for further gradient computations.
The below example clarifies this point:
import torch
a = torch.ones(10)
a.requires_grad
a.requires_grad = True
a.requires_grad
b = 2*a
b.requires_grad
print(a.is_leaf)
print(b.is_leaf)
Since b is not a leaf Variable, its grad will by default be destroyed during the backward computation.
We can use b.retain_grad() to ask PyTorch to retain its grad.
b.retain_grad()
c = b.mean()
print(f"requires_grad: {c.requires_grad}")
print(f"is_lead: {c.is_leaf}")
# pretend c is the loss being computed
c.backward()
print(a.grad, b.grad)
Now we redo the experiment, but with a hook registered on b that prints b's grad during the backward pass.
a = torch.ones(10)
a.requires_grad = True
b = 2*a
b.retain_grad()
b.register_hook(lambda x:print(x))
b.mean().backward() # pretend the mean of b is the loss we want to back-prop
Here we can see that the value printed by the hook on b is exactly the same as b.grad, and the lambda function automatically takes b's gradient as its input.
This gives us a sense of what the hook is tracking.
print(a.grad, b.grad)
There are several uses for this functionality:
1. PyTorch only populates the grad attribute of leaf tensors; non-leaf variables have their gradients freed up during the backward pass unless we call retain_grad upon them, and doing the latter leads to increased memory retention. Hooks provide a much cleaner way to capture these values.
2. The grad attribute of a tensor can only be accessed after the entire backward pass has finished, whereas a hook can modify the gradient mid-pass. For example, if a hook multiplies b's gradient by 2, then the subsequent gradient calculations, like those of a (or of any tensor whose gradient depends on b), use 2*grad(b) instead of grad(b). In contrast, had we updated b.grad manually after the backward pass, we would have to multiply a.grad as well.
# to demonstrate
a = torch.ones(10)
a.requires_grad = True
b = 2*a
b.retain_grad()
b.mean().backward()
print(a.grad, b.grad)
b.grad *= 2
print(a.grad, b.grad) # Note that in this case, a's grad needs to be updated manually as well
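For contrast, here is a minimal sketch of the hook-based version of the same doubling: the hook returns 2*grad, and a's gradient picks up the change automatically, with no manual update needed.
a = torch.ones(10)
a.requires_grad = True
b = 2*a
b.register_hook(lambda grad: 2*grad) # the returned tensor is used in place of grad(b) downstream
b.mean().backward()
print(a.grad) # 0.4 everywhere instead of 0.2, because a's grad was computed from 2*grad(b)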
For a backward hook registered on an nn.Module, the signature is:
hook(module, grad_input, grad_output)
For a forward hook:
hook(module, input, output)
import torch.nn as nn
class myNet(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Conv2d(3,10,2, stride=2) # (8-2+0)/2+1 = 4
self.relu = nn.ReLU()
self.flatten = lambda x: x.view(-1)
self.fc1 = nn.Linear(160,5)
def forward(self, x):
x = self.relu(self.conv(x))
return self.fc1(self.flatten(x))
Net = myNet()
summary(Net,torch.zeros(1,3,8,8))
def hook_fn(m,i,o):
print(m)
print("---------Input Grad----------")
for grad in i:
try:
print(grad.shape)
except AttributeError:
print("None found for input Gradient")
print("--------Output Grad----------")
for grad in o:
try:
print(grad.shape)
except AttributeError:
print("None found for output Gradient")
print("\n")
list(Net.named_modules()) # list all the modules in the network
Net.conv.register_backward_hook(hook_fn)
Net.fc1.register_backward_hook(hook_fn)
inp = torch.rand(1,3,8,8)
out = Net(inp)
out
# pretend we have the following as loss
(1-out.mean()).backward()
Note that the Linear layer's hook fires first, because the backward pass actually goes through it first and then back-propagates to the conv layer.
To summarize:
The first kind, register_hook, works on any tensor/Variable. It is essentially a callback function that is executed every time Autograd computes a gradient for that tensor.
Meanwhile, nn.Module.register_backward_hook and nn.Module.register_forward_hook are for nn.Module objects, and their hook_fn should have the signature:
def hook_fn(m, i, o):
where i refers to the input and o refers to the output.
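A quick sketch of a forward hook with this signature, registered on the conv layer of the Net defined above; the hook just prints shapes and is removed afterwards.
def fwd_hook_fn(m, i, o):
    # i is a tuple of input tensors, o is the output tensor
    print(m.__class__.__name__, "input:", i[0].shape, "output:", o.shape)
handle = Net.conv.register_forward_hook(fwd_hook_fn)
Net(torch.rand(1,3,8,8))
handle.remove() # detach the hook once we are done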
Using hooks together with the named_parameters function, we can accomplish gradient modification/clipping.
The following example does two things:
1. Clamps the gradient back-propagated through the conv layer's output to a minimum of 0 (all positive).
2. Prints whether any gradient flowing back into the conv layer's output is less than 0.
class myNet(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Conv2d(3,10,2,stride=2)
self.relu = nn.ReLU()
self.flatten = lambda x: x.view(-1)
self.fc1 = nn.Linear(160,5)
def forward(self,x):
x = self.relu(self.conv(x))
x.register_hook(lambda grad: torch.clamp(grad, min=0)) # minimun back-prop gradient of value 0
# print whether there is any negative grad
x.register_hook(lambda grad: print("Gradients less than zero:", bool((grad<0).any())))
return self.fc1(self.flatten(x))
net = myNet()
for name, param in net.named_parameters():
print(name)
for name, param in net.named_parameters():
if 'fc' in name and 'bias' in name:
print(name, param, sep='\n')
for name, param in net.named_parameters():
if 'fc' in name and 'bias' in name:
# assign zero to bias grad with identical dimensions
param.register_hook(lambda grad: torch.zeros_like(grad))
out = net(torch.randn(1,3,8,8))
(1-out).mean().backward()
print(f'the bias for linear layer is: {net.fc1.bias.grad}')
pack_padded_sequence
& pad_packed_sequence
are often used together for dynamic RNNs.
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
# Create a tensor with variable length sequences and pads (25)
seqs = torch.LongTensor([[0, 1, 2, 3, 25, 25, 25],
[4, 5, 25, 25, 25, 25, 25],
[6, 7, 8, 9, 10, 11, 25]])
# Store lengths of the actual sequences, ignoring padding(25)
# These are the points up to which we want the RNN to process the sequence
seq_lens = torch.LongTensor([4,2,6]) # number of non-pad elements in each row
seq_lens, sort_ind = seq_lens.sort(dim=0, descending=True)
seq_lens, sort_ind
seqs = seqs[sort_ind]
seqs
# Create an embedding layer, with 0 vectors for the pads
embeds = nn.Embedding(num_embeddings=26,
embedding_dim=10,
padding_idx=25)
lstm = nn.LSTM(10, 50, bidirectional=False, batch_first=True)
# WITHOUT dynamic batching
embeddings = embeds(seqs)
print(embeddings.size())
out_static, _ = lstm(embeddings)
out_static.size()
# The number of timesteps in the output will be the same as the total padded timesteps in the input,
# since the LSTM computed over the pads
assert out_static.size(1) == embeddings.size(1)
# Look at the output at a timestep that we know is a pad
print(out_static[1,-1])
Now let's try the same process with Dynamic Batching
# Pack the sequence
packed_seqs = pack_padded_sequence(embeddings, seq_lens.tolist(), batch_first=True)
print(f'the values in the seq_lens: {seq_lens.tolist()}, with the effective sum of {sum(seq_lens.tolist())}')
embeddings.shape,packed_seqs.data.size()
out_dynamic, _ = lstm(packed_seqs)
out_dynamic.data.size()
out_dynamic, lens = pad_packed_sequence(out_dynamic, batch_first=True)
out_dynamic.size(), lens
Note that here, out_dynamic is padded to shape [3,6,50] instead of [3,7,50], because we can discard the trailing pad that every row shares, making it even more compact. In short, 6 is the longest actual sequence length in the batch.
assert out_dynamic.size(1) != embeddings.size(1)
print(out_dynamic.shape)
# Look at the output at a timestep that we know is a pad
print(out_dynamic[1, -1])
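We can also peek at how the packed object keeps track of the effective batch size at each timestep (these counts sum to the packed data length printed above):
packed_seqs.batch_sizes # number of sequences still active at each timestep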
Final note:
pack_padded_sequence
removes pads, flattens by timestep, and keeps track of the effective batch_size at each timestep.
Torchviz to visualize PyTorch execution graphs and traces
!pip install torchviz
import torch
from torch import nn
from torchviz import make_dot, make_dot_from_trace
model = nn.Sequential()
model.add_module('W0', nn.Linear(8,16))
model.add_module('tanh', nn.Tanh())
model.add_module('W1', nn.Linear(16,1))
inp = torch.randn(1,8)
make_dot(model(inp), params = dict(model.named_parameters()))
make_dot builds a directed graph of the PyTorch operations recorded during forward propagation, showing which operations will be called on backward.
It omits subgraphs which do not require gradients.
from torchvision.models import AlexNet
model = AlexNet()
x = torch.randn(1,3,227,227).requires_grad_(True)
y = model(x)
make_dot(y, params = dict(list(model.named_parameters()) + [('x',x)]))
import torch
import torchvision.models as models
from Utils import *
# Create SSD300 with pretrained weights in the base-architecture
n_classes = 20
model = SSD300(n_classes)
x = torch.randn(1,3,300,300)
y = model(x)
dot = make_dot(y, params = dict(list(model.named_parameters())))
dot.render('VGG300_BN.gv', view=True)
from google.colab import drive
drive.mount('/content/drive')
This is a truly awesome repo full of practical tutorials that implement various state-of-the-art deep learning techniques in PyTorch.
It is basically a good place to look when starting a new project, to check for relevant reference implementations.
Since deep learning is such a fast-developing field, if this repo had not stopped getting updated two years ago, it would be #1 on this list.
AdaBound optimizer
Finally, AdaBound is available in PyTorch. It is one of the most powerful optimizers, outperforming Adam in some cases with a super fast convergence rate. Definitely something you would want to try out when prototyping quickly.
The method is based on Adaptive Gradient Methods with Dynamic Bound of Learning Rate, in Proc. of ICLR 2019.
## implementation
!pip install adabound
import adabound
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
As described in the paper, AdaBound is an optimizer that behaves like Adam at the beginning of training and gradually transforms into SGD at the end. In this way, it combines the benefits of adaptive methods, namely fast initial progress, with the good final generalization properties of SGD.
The final_lr parameter is the learning rate of the SGD that AdaBound transforms into. In common cases, the default final learning rate of 0.1 can achieve relatively good and stable results on unseen data.
This method is not very sensitive to its hyperparameters. See Appendix G of the paper for more details.
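Below is a minimal sketch of plugging AdaBound into a standard training step; the model, dummy batch, and loss here are placeholders, not part of the original example.
import torch
import torch.nn as nn
import adabound
model = nn.Linear(10, 2)                         # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
inputs = torch.randn(8, 10)                      # dummy batch
targets = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()                                 # AdaBound is used like any other torch.optim optimizer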
Flatten layer in PyTorch
import torch.nn as nn
class Flatten(nn.Module):
def __init__(self):
super(Flatten, self).__init__()
def forward(self, x):
return x.view(x.size(0), -1)
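A quick sanity check of what Flatten does, with a made-up input shape:
import torch
x = torch.randn(2, 3, 4, 4)
Flatten()(x).shape # torch.Size([2, 48]): the batch dimension is kept, everything else is flattened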
Expand_as in PyTorch for broadcasting
import torch
import torch.nn as nn
a = torch.tensor([1,2,3])
b = torch.tensor([[1,2,3],[4,5,6],[7,8,9]])
c = a.expand_as(b)
c
d = a+b
d # here a will be broadcasted before compute addition with b
FastAI listify
x = [1,2,3]
y = torch.arange(12)
x
y
from fastai.train import listify
z = listify(x)
z
z = listify(1,x)
z
a = listify(1,y)
a
b = listify('good',x)
b
In_place
# an example of NOT in-place
a = torch.randn(1)
b = torch.randn(1)
id(a)
id(b)
a = a + b
id(a) # changed, because a + b creates a new tensor
# an example of in-place
c = torch.randn(1)
d = torch.randn(1)
id(c)
id(d)
c += d
id(c) # not changed because in-place
# another example of in-place
e = torch.randn(1)
f = torch.randn(1)
id(e)
id(f)
e.add_(f)
id(e) # this case, in-place
In PyTorch, _ as a postfix means in-place.
The variable will be modified and stored in the same memory location, without allocating new memory for the result.
AdaptiveConcatPool2d
import torch.nn as nn
# nn.AdaptiveAvgPool2d??
class AdaptiveConcatPool2d(nn.Module):
def __init__(self, sz=1):
super().__init__()
self.dropout_size = sz
self.ap = nn.AdaptiveAvgPool2d(sz)
self.mp = nn.AdaptiveMaxPool2d(sz)
def forward(self, x):
return torch.cat([self.ap(x), self.mp(x)],dim=1)
x = torch.tensor([
[
[1.,2.,3.],
[1.,2.,3.],
[1.,2.,4.]
],
[
[1.,2.,3.],
[1.,2.,3.],
[1.,2.,5.]
],
[
[1.,2.,3.],
[1.,2.,3.],
[1.,2.,3.]
]
])
x.shape
A = nn.AdaptiveAvgPool2d(1) # specify the output size
print(A(x).shape)
A(x)
M = nn.AdaptiveMaxPool2d(1)
M(x)
A = nn.AdaptiveAvgPool2d((1,3))
print(A(x).shape)
A(x)
C = AdaptiveConcatPool2d(1)
C(x)
logsumexp
a = torch.zeros(1,3)
a
b = torch.logsumexp(input=a,dim=1,keepdim=False)
b
zero = torch.tensor([0],dtype=torch.float)
torch.log(torch.exp(zero)+torch.exp(zero)+torch.exp(zero))
c = torch.ones(1,3)
c
d = torch.logsumexp(c,dim=1)
d
one = torch.tensor([1], dtype=torch.float)
torch.log(torch.exp(one)+torch.exp(one)+torch.exp(one))
Named_children
import torch
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d( 1, 10, 3)
self.conv2 = nn.Conv2d(10, 20, 3)
self.conv2_dropout = nn.Dropout2d(p=0.5)
self.fc1 = nn.Linear(320, 50)
self.fc2 = nn.Linear(50, 10)
def forward(self, x):
pass
x = Net()
for name, module in x.named_children():
    print(f"layer {name} is: {module}")
torch.addcmul()
x = torch.ones(1,3)
y = torch.ones(3,1)
z = torch.ones(1,1)*2
x, y, z
# torch.addcmul(input, tensor1, tensor2, value=1)
a = torch.addcmul(z, x, y, value=0.5) # z + 0.5*x*y
a
x,y
x*y
z + 0.5*x*y
torch.permute
used to re-arrange the dimensions of a given tensor
x = torch.randn(3,4)
x
y = x.permute(1,0)
y
a = torch.randn(3,4,5,6,7,8)
a.shape
b = a.permute(2,1,0,4,3,5)
b.size()
Creating a concise four-layer CNN
def conv_block(in_channels, out_channels):
return nn.Sequential(nn.Conv2d(in_channels, out_channels, 2),
nn.BatchNorm2d(out_channels),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2))
class ConvNet(nn.Module):
def __init__(self, x_dim=3, hid_dim=64, z_dim=64):
super().__init__()
self.encoder = nn.Sequential(
conv_block(x_dim, hid_dim),
conv_block(hid_dim, hid_dim),
conv_block(hid_dim, hid_dim),
conv_block(hid_dim, z_dim)
)
def forward(self, x):
x = self.encoder(x)
x = nn.MaxPool2d(5)(x)
x = x.view(x.size(0), -1) # flatten while only retain the batch_size dimenison
return x
net = ConvNet()
net
The mechanism behind nn.Dropout()
y = torch.ones(3,3)
y
D = nn.Dropout(0)
D(y)
D = nn.Dropout(0.5)
D(y)
D = nn.Dropout(1)
D(y)
D = nn.Dropout(0.3)
D(y)
D = nn.Dropout(0.8)
D(y)
Final note: the values that survive dropout are scaled to original/(1-p).
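A quick sanity check of this scaling, assuming p = 0.8 as in the last cell: the surviving entries of a tensor of ones become 1/(1-0.8) = 5.
import torch
import torch.nn as nn
p = 0.8
out = nn.Dropout(p)(torch.ones(3,3))
print(out)      # surviving entries equal 5.0
print(1/(1-p))  # 5.0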
Creating mini-batch
import torch
x = torch.randn(3,128,128)
x.shape
t = x.unsqueeze(0)
t.shape
u = x[None,:]
u.shape
v = x[None]
v.shape
Look into torch.nn.ReLU()
x = torch.randn(3,3)
x
y = nn.ReLU()
print(y(x)) # all negative values go to zero; the inplace default is False
x
y = nn.ReLU(inplace=True)
print(y(x))
x
change torch.tensor type
x = torch.randn(3,3)
x.dtype
x = x.type(torch.long)
x.dtype
x = torch.randn(3,3).type(torch.float)
x.dtype
x = torch.ones(3,3, dtype=torch.long)
x.dtype
L1Loss
vs MSELoss
x = torch.randn(1)
x
y = torch.ones(1)
y
z = nn.L1Loss()
z(x,y)
abs(x-y)
a = nn.MSELoss()
a(x,y)
pow((x-y),2)
Sigmoid in PyTorch
x = torch.randn(1)
x
y = nn.Sigmoid()
y(x)
import math
(1/(1+math.exp(-x)))
z = torch.ones(3,4)
z
y(z)
z = torch.randn(3,4)
z
y(z)
Softmax in PyTorch
x = torch.randn(2,2)
x
y = nn.Softmax(dim=1)
a = y(x);a
a[0][0]+a[0][1]
a[1][0]+a[1][1]
nn.ModuleList
x = nn.ModuleList([nn.Dropout(0.5),
nn.ReLU()])
x
y = torch.randn(3,3)
y
t = x[0](y);t # performed dropout, and value modified as original/(1-0.5)
r = x[1](t);r # performed ReLU
nn.Linear
x = torch.randn(2);x
a = nn.Linear(2,1);a
a.weight, a.bias
a(x)
x@a.weight.t()+a.bias
torch.mean()
x = torch.FloatTensor([[1,2,3,4],[5,6,7,8]])
x.shape
y = x.mean()
y
y = x.mean(dim=1, keepdim=True);y
y = x.mean(dim=1, keepdim=False);y
y.shape
x = torch.randn(3,4,5)
x.shape
y = x.mean(dim=1, keepdim=False);y.shape
y = x.mean(dim=1, keepdim=True);y.shape
y
Use dropblock
in PyTorch
!pip install dropblock
x = torch.ones(5,5,5,5);x.shape
import dropblock
y = dropblock.DropBlock2D(drop_prob=0.5, block_size=2)
y(x) # dropout 2x2 size blocks with chance of 50%
y = dropblock.DropBlock2D(drop_prob=0.5, block_size=3)
y(x) # dropout 3x3 size blocks with chance of 50%
Orthogonal Initialization in PyTorch
import torch
import torch.nn as nn
x, y, z = [torch.zeros(3, 3) for _ in range(3)] # note: [torch.zeros(3,3)]*3 would make x, y, z the same tensor
x, y, z
a = nn.init.orthogonal_(x, gain=1) # orthogonal means A@A.t() = I
a
a@a.t()
torch.eye(3)
b = nn.init.orthogonal_(y, gain=5) # gain adjusted
b
b@b.t()
25*torch.eye(3)
Final note: remember the initialization process is random; we get a different matrix by re-running the cell.
Param_groups in torch.optim optimizers
import torch
import torch.nn as nn
from torch import optim
l = nn.Linear(3,3)
r = optim.SGD(l.parameters(),lr=0.01)
r
r.param_groups
r.param_groups[0]['params']
# The first is the weight, the second is the bias
for count, i in enumerate(l.parameters()):
print(count)
print(i)
All of this information can be accessed through the optimizer's param_groups.
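A common use of param_groups is manual learning-rate adjustment; a small sketch of a decay step using the optimizer r defined above:
for group in r.param_groups:
    group['lr'] *= 0.1 # decay the learning rate by 10x
r.param_groups[0]['lr'] # now ~0.001 (was 0.01)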
Math behind "standard_deviation"
import torch
x = torch.tensor([1.,2.,3.,4.,5.,6.])
m = x.mean()
(x-m)
(x-m).mean()
(x-m).pow(2).mean()
(x-m).pow(2).mean().sqrt()
x.std(unbiased=False)
# If unbiased is False, then the standard-deviation will be calculated via the biased estimator.
# Otherwise, Bessel’s correction will be used.
x.std(unbiased=True)
Layer-sequential unit-variance (LSUV) initialization implementation.
This technique was proposed in the paper All you need is a good init (2015).
As we know, a good initialization should give each layer's output a standard deviation near 1.0 and a mean near 0.0, no matter how deep the layer is. Take note that when using this LSUVInit implementation we get a printed summary telling us how well it has done.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
class LSUVInit(object):
def __init__(self,
model: nn.Module,
data_loader: DataLoader,
needed_std: float = 1.0,
std_tol: float = 0.1,
max_attempts: int = 10,
do_orthonorm: bool = True,
device: torch.device = 'str') -> None:
self._model = model
self.data_loader = data_loader
self.needed_std = needed_std
self.std_tol = std_tol
self.max_attempts = max_attempts
self.do_orthonorm = do_orthonorm
self.device = device
self.eps = 1e-8
self.hook_position = 0
self.total_fc_conv_layers = 0
self.done_counter = -1
self.hook = None
self.act_dict: np.ndarray = None
self.counter_to_apply_correction = 0
self.correction_needed = False
self.current_coef = 1.0
def svd_orthonormal(self, w: np.ndarray) -> np.ndarray:
shape = w.shape
if len(shape) < 2:
raise RuntimeError("Only shapes of length 2 or more are supported.")
flat_shape = (shape[0], np.prod(shape[1:]))
a = np.random.normal(0.0, 1.0, flat_shape) # w;
u, _, v = np.linalg.svd(a, full_matrices=False)
q = u if u.shape == flat_shape else v
print(shape, flat_shape)
q = q.reshape(shape)
return q.astype(np.float32)
def count_conv_fc_layers(self, m: nn.Module) -> None:
if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
self.total_fc_conv_layers += 1
def orthogonal_weights_init(self, m: nn.Module) -> None:
if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
if hasattr(m, 'weight_v'):
w_ortho = self.svd_orthonormal(m.weight_v.data.cpu().numpy())
m.weight_v.data = torch.from_numpy(w_ortho)
try:
nn.init.constant_(m.bias, 0)
except Exception:
pass
else:
w_ortho = self.svd_orthonormal(m.weight.data.cpu().numpy())
m.weight.data = torch.from_numpy(w_ortho)
try:
nn.init.constant_(m.bias, 0)
except Exception:
pass
def store_activations(self,
module: nn.Module,
data: torch.Tensor,
output: torch.Tensor) -> None:
self.act_dict = output.detach().cpu().numpy()
def add_current_hook(self, m: nn.Module) -> None:
if self.hook is not None:
return
if (isinstance(m, nn.Conv2d)) or (isinstance(m, nn.Linear)):
if self.hook_position > self.done_counter:
self.hook = m.register_forward_hook(self.store_activations)
else:
self.hook_position += 1
def apply_weights_correction(self, m: nn.Module) -> None:
if self.hook is None:
return
if not self.correction_needed:
return
if (isinstance(m, nn.Conv2d)) or (isinstance(m, nn.Linear)):
if self.counter_to_apply_correction < self.hook_position:
self.counter_to_apply_correction += 1
else:
if hasattr(m, 'weight_g'):
m.weight_g.data *= float(self.current_coef)
self.correction_needed = False
else:
m.weight.data *= self.current_coef
self.correction_needed = False
def initialize(self) -> nn.Module:
model = self._model
model.eval()
model.apply(self.count_conv_fc_layers)
if self.do_orthonorm:
model.apply(self.orthogonal_weights_init)
model = model.to(self.device)
for layer_idx in range(self.total_fc_conv_layers):
print(layer_idx)
model.apply(self.add_current_hook)
data = next(iter(self.data_loader))
data, _ = [d.to(self.device) for d in data]
model(data)
current_std = self.act_dict.std()
print('std at layer ', layer_idx, ' = ', current_std)
attempts = 0
while (np.abs(current_std - self.needed_std) > self.std_tol):
self.current_coef = self.needed_std / (current_std + self.eps)
self.correction_needed = True
model.apply(self.apply_weights_correction)
model = model.to(self.device)
model(data)
current_std = self.act_dict.std()
print('std at layer ', layer_idx, ' = ', current_std, 'mean = ', self.act_dict.mean())
attempts += 1
if attempts > self.max_attempts:
break
if self.hook is not None:
self.hook.remove()
self.done_counter += 1
self.counter_to_apply_correction = 0
self.hook_position = 0
self.hook = None
print('finish at layer', layer_idx)
print('LSUV init done!')
return model
def lsuv_init(model: nn.Module,
data_loader: DataLoader,
needed_std: float,
std_tol: float,
max_attempts: int,
do_orthonorm: bool,
device: torch.device) -> nn.Module:
return LSUVInit(
model, data_loader, needed_std, std_tol,
max_attempts, do_orthonorm, device).initialize()
import torchvision
import torchvision.transforms as transforms
print(f"CUDA available: {torch.cuda.is_available()}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# specify transforms
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
# Download dataset and create dataloader
trainset = torchvision.datasets.CIFAR10(root='./', train=True, download=True, transform=transform_train)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
# our model architecture
x = nn.Sequential(nn.Conv2d(3,8,3), nn.Conv2d(8,16,3))
# create our model, initialized with LSUV init
model = lsuv_init(x, train_loader, needed_std=1., std_tol=0.1, max_attempts=10, do_orthonorm=True, device=device)
1x1_conv
import torch
import torch.nn as nn
inp = torch.randn(1,1,128,128) # batch_size of 1, 1x128x128 image
# create encoder and decoder of 1x1 conv2d
enc = nn.Conv2d(1,10,kernel_size=1)
dec = nn.Conv2d(10,1,kernel_size=1)
pred = enc(inp) # increase the channel dimensionality
pred.shape
pred_2 = dec(pred) # decrease the channel dimensionality
pred_2.shape
nn.Conv2d
in PyTorch
import torch
import torch.nn as nn
from fastai.vision import show_image
inp = torch.randn(1,1,128,128)
show_image(inp[0]) # show image of random pixel intensity
conv = nn.Conv2d(1,3, kernel_size=3)
pred = conv(inp)
pred.shape
show_image(pred[0]) # we made the image 3-channel (RGB-like), going from 1 channel to 3 channels
conv_next = nn.Conv2d(3,1,kernel_size=3)
pred_next = conv_next(pred)
pred_next.shape
show_image(pred_next[0]) # convert back to 1 channel by specifying the output channels
PyTorch hooks #2: taking another look at PyTorch hooks
import torch
import torch.nn as nn
import torch.nn.init as init
from fastai.vision import children
x = torch.randn(1,1,128,128) # one batch size, 128x128 image
model = nn.Sequential(nn.Conv2d(1,3,kernel_size=3),nn.ReLU())
model
# initializing weights
def weight_init_orthogonal(m):
classname = m.__class__.__name__
print(classname)
if classname.find("Conv") != -1:
init.orthogonal_(m.weight.data, gain=1)
model.apply(weight_init_orthogonal)
# We want to see the output after our input is passed to the Conv2d layer,
# and then we will be able to pass it to any other layer.
# To save the features after every layer, we register a forward hook.
class SaveFeature():
    features = None
# when we initialize, we register hook_fn on to the forward pass
def __init__(self, m):
self.hook = m.register_forward_hook(self.hook_fn)
def hook_fn(self, module, input, output):
self.features = output
def close(self):
self.hook.remove()
# we have 2 children here: Conv2d and ReLU
# later we will be registering two different hooks on each
children(model)
saved_features_conv, saved_features_relu = SaveFeature(children(model)[0]), SaveFeature(children(model)[1])
saved_features_conv.features # hook registered but nothing stored yet
# Let's run one forward pass
pred = model(x)
saved_features_conv.features # output of conv layer in this forward pass is stored
saved_features_conv.features.shape # check out the shape of the conv layer's output stored during this forward pass
saved_features_relu.features # features after forward pass
# we pass the output after first Conv Layer to another neural network
model2 = nn.Sequential(nn.Dropout(0.5))
t = model2(saved_features_conv.features)
t
t.shape
Step by step through the computation of training a neural network in PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
inp = torch.ones(1, requires_grad=True) # input, a number whose value is one
inp
outp = torch.zeros(1, requires_grad=True) # output, we want a number whose value is 0
outp
x = nn.Linear(1,1) # we pass input to linear layer
list(x.parameters()) # weights and bias of our linear layer, which we modify to get correct prediction
for t in x.parameters():
print(t.grad) # now there is nothing stored
# forward pass
pred = x(inp)
pred # we want pred to be zero (calculate loss with outp)
loss_function = nn.L1Loss()
loss = loss_function(pred, outp)
loss
a = optim.SGD(x.parameters(), lr=0.01)
a.zero_grad()
for t in x.parameters():
print(t.grad)
loss.backward()
loss # no change to the loss itself
for t in x.parameters():
print(t.grad) # now gradients have been computed
a.step() # update step
list(x.parameters()) # weights have already been updated
for t in x.parameters():
print(t.grad) # no change for gradient
Again
pred = x(inp)
pred
for t in x.parameters():
print(t.grad)
loss = loss_function(pred, outp)
loss
a.zero_grad()
loss.backward()
a.step()
list(x.parameters())
for t in x.parameters():
print(t.grad)