Some Basic Usage of PyTorch
Installation
conda create --name pytorch python=3.6 numpy
conda activate pytorch
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Verification:
python
from __future__ import print_function
import torch
x = torch.rand(5, 3)
print(x)
torch.cuda.is_available() # check whether the GPU driver and CUDA are enabled and accessible by PyTorch
Output:
tensor([[0.3380, 0.3845, 0.3217],
[0.8337, 0.9050, 0.2650],
[0.2979, 0.7141, 0.9069],
[0.1449, 0.1132, 0.1375],
[0.4675, 0.3947, 0.1426]])
True
Basic Usage
- A replacement for NumPy to use the power of GPUs.
- A deep learning research platform that provides maximum flexibility and speed.
Tensor
- Similar to NumPy’s ndarrays.
- Tensors can also be used on a GPU to accelerate computing.
Construct a Tensor
from __future__ import print_function
import torch
x = torch.empty(5, 3) # uninitialized
print(x)
x = torch.rand(5, 3) # initialized
print(x)
x = torch.zeros(5, 3, dtype=torch.long) # filled zeros and of dtype long
print(x)
x = torch.tensor([5.5, 3]) # directly from data
print(x)
x = x.new_ones(5, 3, dtype=torch.double) # new_* methods take in sizes
print(x)
x = torch.randn_like(x, dtype=torch.float) # result has the same size, override dtype
print(x)
print(x.size()) # get the size, torch.Size([5, 3]), which is a tuple
Operations
- Addition, with several equivalent syntaxes shown below.
- Other operations are described in the official docs.
y = torch.rand(5, 3)
print(x + y)
print(torch.add(x, y))
result = torch.empty(5, 3) # provide an output tensor as argument
torch.add(x, y, out=result)
print(result)
y.add_(x) # add x to y in-place
print(y)
print(y[:, 1])
x = torch.randn(4, 4)
y = x.view(16) # resize/reshape tensor
x = torch.randn(1)
print(x)
print(x.item()) # get the value as a python number, tensor([0.9551]), 0.9551321864128113
Any operation that mutates a tensor in-place is post-fixed with an _, e.g. x.copy_(y), x.t_().
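For instance (a small illustration with arbitrary values):
x = torch.zeros(2, 3)
y = torch.rand(2, 3)
x.copy_(y)       # copies the values of y into x, in place
print(x)         # x now holds the same values as y
x.t_()           # transposes x in place
print(x.size())  # torch.Size([3, 2])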
Change the Shape of a Tensor
- view(): return a tensor with the same data but a new shape.
- reshape(): like view(), but may copy the data if needed.
- unsqueeze(): add a dimension of size 1.
- squeeze(): remove a dimension whose size is 1.
- expand(): expand a dimension whose size is 1 to a larger size (without copying data).
- repeat(): repeat the tensor along each dimension a given number of times (copies data).
import torch
a = torch.rand(3, 1, 5, 4)
print(a.shape) # torch.Size([3, 1, 5, 4])
print(a.view(3, 5, 4).shape) # torch.Size([3, 5, 4])
print(a.reshape(3, 5, 4).shape) # torch.Size([3, 5, 4])
print(a.unsqueeze(0).shape) # torch.Size([1, 3, 1, 5, 4])
print(a.unsqueeze(-1).shape) # torch.Size([3, 1, 5, 4, 1])
print(a.unsqueeze(2).shape) # torch.Size([3, 1, 1, 5, 4])
print(a.squeeze().shape) # torch.Size([3, 5, 4])
print(a.squeeze(1).shape) # torch.Size([3, 5, 4])
print(a.squeeze(3).shape) # torch.Size([3, 1, 5, 4]), cannot squeeze because dim 3 has size 4, not 1
print(a.expand(-1, 6, -1, -1).shape) # torch.Size([3, 6, 5, 4])
print(a.repeat(2, 2, 1, 1).shape) # torch.Size([6, 2, 5, 4])
Splice and Merge Tensors
- torch.cat(tensors, dim=0, out=None): concatenate the tensors along an existing dimension; the number of dimensions is unchanged.
- torch.stack(tensors, dim=0, out=None): stack the tensors from a list/tuple along a new dimension; the number of dimensions is increased by one.
import torch
a = torch.rand((2, 3))
b = torch.rand((2, 3))
c = torch.cat((a, b))
print(a.size(), b.size(), c.size())
'''
torch.Size([2, 3]) torch.Size([2, 3]) torch.Size([4, 3])
'''
d = torch.stack((a, b))
print(a.size(), b.size(), d.size())
'''
torch.Size([2, 3]) torch.Size([2, 3]) torch.Size([2, 2, 3])
'''
CPU Tensor \(\rightleftharpoons\) NumPy
- A Torch tensor and the NumPy array converted from it share their underlying memory locations (if the tensor is on the CPU), so changing one will change the other.
Converting a tensor to a NumPy array
a = torch.ones(5)
print(a) # tensor([1., 1., 1., 1., 1.])
b = a.numpy()
print(b) # [1. 1. 1. 1. 1.]
a.add_(1)
print(a) # tensor([2., 2., 2., 2., 2.])
print(b) # [2. 2. 2. 2. 2.], changed the value of b
Converting a NumPy array to a tensor
import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a) # [2. 2. 2. 2. 2.]
print(b) # tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
CUDA Tensors
- Tensors can be moved onto any device using the .to method.
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
device = torch.device("cuda") # a CUDA device object
x = torch.randn(1)
y = torch.ones_like(x, device=device) # directly create a tensor on GPU
x = x.to(device) # or just use strings ``.to("cuda")``
z = x + y
print(z) # tensor([1.9551], device='cuda:0')
print(z.to("cpu", torch.double)) # tensor([1.9551], dtype=torch.float64), ``.to`` can also change dtype together!
Autograd
- The autograd package provides automatic differentiation for all operations on Tensors.
- It is a define-by-run framework: backprop is defined by how the code is run, and every single iteration can be different.
- .requires_grad = True: start to track all operations on the tensor.
- .backward(): all the gradients are computed automatically, and the gradients for this tensor are accumulated into the .grad attribute.
- with torch.no_grad(): or .detach(): stop a tensor from tracking history and prevent future computation from being tracked. This is useful when evaluating a model, where we don't need the gradients.
Usage with Tensors
x = torch.ones(2, 2, requires_grad=True)
print(x)
'''
tensor([[1., 1.],
[1., 1.]], requires_grad=True)
'''
y = x + 2
print(y)
'''
tensor([[3., 3.],
[3., 3.]], grad_fn=<AddBackward0>)
'''
print(y.grad_fn)
'''
<AddBackward0 object at 0x7f67610c4160>
'''
z = y*y*3
out = z.mean()
print(z, out)
'''
tensor([[27., 27.],
[27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)
'''
a = torch.randn(2, 2)
a = ((a*3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a*a).sum()
print(b.grad_fn)
'''
False
True
<SumBackward0 object at 0x7f67610c4e48>
'''
Gradients for Backprop
The tensor out is defined above as
\[out = \frac{1}{4}\sum_i z_i, \qquad z_i = 3(x_i + 2)^2,\]
so
\[\frac{\partial out}{\partial x_i} = \frac{3}{2} (x_i + 2),\]
which equals 4.5 at \(x_i = 1\).
out.backward()
print(x.grad)
'''
tensor([[4.5000, 4.5000],
[4.5000, 4.5000]])
'''
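Note that gradients accumulate in .grad across calls to .backward(); a minimal sketch to illustrate:
x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward()
print(x.grad)   # tensor([[4.5000, 4.5000], [4.5000, 4.5000]])
out2 = (3 * (x + 2) ** 2).mean()
out2.backward()
print(x.grad)   # gradients were accumulated: now 9.0 everywhere
x.grad.zero_()  # clear the accumulated gradients before the next backward pass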
Stop autograd
print(x.requires_grad)
print((x**2).requires_grad)
with torch.no_grad():
print((x**2).requires_grad)
'''
True
True
False
'''
# use detach to get a new tensor with the same content but does not require gradients
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())
'''
True
False
tensor(True)
'''
Neural Networks
- Neural networks can be constructed using the torch.nn package.
- An nn.Module contains layers, and a method forward(input) that returns the output.
- The backward() function (where gradients are computed) is automatically defined using autograd.
Training Procedure for a Neural Network
- Define the nn that has some learnable parameters (weights).
- Iterate over a dataset of inputs.
- Process input through the network.
- Compute the loss (how far is the output from being correct).
- Propagate gradients back into the nn’s parameters.
- Update the weights of the nn.
Define the Network
- torch.nn only supports inputs that are a mini-batch of samples, not a single sample.
- nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
- If there is a single sample, just use input.unsqueeze(0) to add a fake batch dimension, as in the short sketch below.
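For example, a single 32x32 grayscale image can be given a fake batch dimension like this:
single = torch.randn(1, 32, 32) # one sample: nChannels x Height x Width
batch = single.unsqueeze(0)     # add a fake batch dimension
print(batch.size())             # torch.Size([1, 1, 32, 32])
The network itself is defined as follows: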
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
# 1 input image channel, 6 output channels, 3x3 square convolution
# kernel
self.conv1 = nn.Conv2d(1, 6, 3)
self.conv2 = nn.Conv2d(6, 16, 3)
# an affine operation: y = Wx + b
self.fc1 = nn.Linear(16 * 6 * 6, 120) # 6*6 from image dimension
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
def forward(self, x):
# Max pooling over a (2, 2) window
x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
# If the size is a square you can only specify a single number
x = F.max_pool2d(F.relu(self.conv2(x)), 2)
x = x.view(-1, self.num_flat_features(x))
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x
def num_flat_features(self, x):
size = x.size()[1:] # all dimensions except the batch dimension
num_features = 1
for s in size:
num_features *= s
return num_features
net = Net()
print(net)
Output:
Net(
(conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=576, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
The learnable parameters (weights) of the model are returned by net.parameters():
params = list(net.parameters())
print(len(params))
print(params[0].size()) # conv1's weights
'''
10
torch.Size([6, 1, 3, 3])
'''
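The count of 10 corresponds to a weight and a bias for each of the five layers; this can be checked with net.named_parameters() (the shapes follow from the layer definitions above):
for name, p in net.named_parameters():
    print(name, p.size())
'''
conv1.weight torch.Size([6, 1, 3, 3])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 3, 3])
conv2.bias torch.Size([16])
fc1.weight torch.Size([120, 576])
fc1.bias torch.Size([120])
fc2.weight torch.Size([84, 120])
fc2.bias torch.Size([84])
fc3.weight torch.Size([10, 84])
fc3.bias torch.Size([10])
'''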
Usage of the network:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)
'''
tensor([[ 0.0582, -0.1492, -0.1571, -0.0059, -0.0130, -0.0429, -0.0989, 0.0680,
-0.0835, -0.0217]], grad_fn=<AddmmBackward>)
'''
Distributions
- https://pytorch.org/docs/stable/distributions.html
- The torch.distributions package contains parameterizable probability distributions and sampling functions.
- This allows the construction of stochastic computation graphs and stochastic gradient estimators for optimization.
REINFORCE Algorithm (example)
- In practice, we would sample an action from the output of a network and apply this action in an environment.
- Then, use log_prob to construct an equivalent loss function.
- With a categorical policy, we can implement REINFORCE.
The code for implementing REINFORCE is as follows (policy_network, state and env are assumed to be defined elsewhere):
from torch.distributions import Categorical

probs = policy_network(state)
# Note that this is equivalent to what used to be called multinomial
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
Categorical
- probs (1D): each element is the relative probability of sampling the class at that index.
- probs (2D): it is treated as a batch of relative probability vectors.
from torch.distributions import Categorical

probs = torch.tensor([0.25, 0.25, 0.25, 0.25])
m = Categorical(probs)
m.sample() # equal probability of 0, 1, 2, 3
'''
tensor(3)
'''
MultivariateNormal
MultivariateNormal(loc, covariance_matrix=None, precision_matrix=None, scale_tril=None, validate_args=None)
- Creates a multivariate normal (Gaussian) distribution parameterized by a mean vector and a covariance matrix.
- loc: mean of the distribution.
- covariance_matrix: positive-definite covariance matrix.
- precision_matrix: positive-definite precision matrix.
- scale_tril: lower-triangular factor of covariance, with positive-valued diagonal.
from torch.distributions import MultivariateNormal

m = MultivariateNormal(torch.zeros(2), torch.eye(2))
action = m.sample() # sample from a normal distribution with mean [0, 0] and covariance matrix I
action_logprob = m.log_prob(action) # compute the log probability of the action
entropy = m.entropy() # compute the entropy of the distribution
print(action)
'''
tensor([-0.2102, -0.5429])
'''
Normal
Normal(loc, scale, validate_args=None)
- Creates a normal (Gaussian) distribution parameterized by loc and scale.
- loc: mean of the distribution (mu).
- scale: standard deviation of the distribution (sigma).
from torch.distributions import Normal

m = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
m.sample() # normally distributed with loc=0 and scale=1
'''
tensor([ 0.1046])
'''
Uniform
Uniform(low, high, validate_args=None)
- Generates uniformly distributed random samples from the half-open interval [low, high).
from torch.distributions import Uniform

m = Uniform(torch.tensor([0.0]), torch.tensor([5.0]))
m.sample() # uniformly distributed in the range [0.0, 5.0)
'''
tensor([ 2.3418])
'''
KL Divergence
\[KL(p \Vert q) = \int p(x) \log \frac{p(x)}{q(x)} dx.\]
torch.distributions.kl.kl_divergence(p, q)
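For example, a small sketch computing the KL divergence between two univariate normal distributions (the parameter values are arbitrary):
import torch
from torch.distributions import Normal
from torch.distributions.kl import kl_divergence

p = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
q = Normal(torch.tensor([1.0]), torch.tensor([2.0]))
print(kl_divergence(p, q)) # analytic KL(p || q) for two Gaussians, approximately tensor([0.4431])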
Loss Function
- Takes an (output, target) pair as inputs and computes a value that estimates how far the output is from the target.
- There are several different loss functions under the nn package, e.g. nn.MSELoss(); all of them are described in the official docs.
output = net(input)
target = torch.randn(10) # a dummy target, for example
target = target.view(1, -1) # make it the same shape as output
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)
'''
tensor(1.8713, grad_fn=<MseLossBackward>)
'''
Backprop
- Use loss.backward() to backpropagate the errors.
- net.zero_grad(): clear the existing gradients first, otherwise the new gradients will be accumulated into the existing ones.
net.zero_grad() # zeroes the gradient buffers of all parameters
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)
loss.backward()
print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)
Output:
conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([-0.0166, 0.0530, 0.0216, -0.0270, -0.0130, -0.0092])
Update the Weights
Update rule (SGD)
\[W \leftarrow W - lr \cdot g,\]
where g is the gradient of the loss with respect to the weight W and lr is the learning rate.
Implementation in plain Python
lr = 0.01
for f in net.parameters():
f.data.sub_(f.grad.data * lr)
Implementation with torch.optim
- We can use various update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
- torch.optim: implements all of these methods.
import torch.optim as optim
# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)
# in your training loop:
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update
Save and Load the Model
Save only the parameters of the model
torch.save(the_model.state_dict(), PATH)
Load the model parameters:
the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))
Save the whole model
torch.save(the_model, PATH)
Load the model:
the_model = torch.load(PATH)
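For example, a minimal round-trip sketch using the state_dict approach with the Net class defined earlier (the file path net.pth is just an illustrative choice):
PATH = "net.pth"                         # hypothetical path for illustration
torch.save(net.state_dict(), PATH)       # save only the learnable parameters

net2 = Net()                             # re-create a model with the same architecture
net2.load_state_dict(torch.load(PATH))   # load the saved parameters into it
net2.eval()                              # switch to evaluation mode before inference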
Data Parallelism
Use multiple GPUs
- PyTorch will only use one GPU by default.
- Use multiple GPUs via DataParallel: model = nn.DataParallel(model).
- DataParallel splits the data automatically and sends job orders to multiple models on several GPUs.
- After each model finishes its job, DataParallel collects and merges the results before returning them to you.
# Model, input_size, output_size and rand_loader are assumed to be defined
# as in the PyTorch data-parallelism tutorial
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
print("Let's use", torch.cuda.device_count(), "GPUs!")
# dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
model = nn.DataParallel(model)
model.to(device)
for data in rand_loader:
input = data.to(device)
output = model(input)
print("Outside: input size", input.size(),
"output_size", output.size())
'''
Let's use 2 GPUs!
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
'''
Run Part on the CPU and Part on the GPU
device = torch.device("cuda:0")
class DistributedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # the embedding stays on the CPU; the linear layer is moved to the GPU
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10).to(device)
def forward(self, x):
# Compute embedding on CPU
x = self.embedding(x)
# Transfer to GPU
x = x.to(device)
# Compute RNN on GPU
x = self.rnn(x)
return x
Sampler
- SubsetRandomSampler(indices): samples elements randomly from the given list of indices, without replacement.
- BatchSampler(sampler, batch_size, drop_last): wraps another sampler and yields mini-batches of indices.
from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler
buffer = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
# drop_last=False: the final incomplete batch is kept
for index in BatchSampler(SubsetRandomSampler(range(len(buffer))), 2, False):
    print(index) # print the indices of the sampled batch
'''
[0, 1]
[3, 4]
[2]
'''
# drop_last=True: the final incomplete batch is dropped
for index in BatchSampler(SubsetRandomSampler(range(len(buffer))), 2, True):
    print(index)
'''
[1, 3]
[0, 2]
'''
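The sampled index lists can then be used to gather the corresponding items from the buffer, e.g. for mini-batch updates; a small sketch on the same toy buffer:
buffer_tensor = torch.tensor(buffer, dtype=torch.float) # shape (5, 2)
for index in BatchSampler(SubsetRandomSampler(range(len(buffer))), 2, False):
    batch = buffer_tensor[index] # index with the sampled list to gather rows
    print(index, batch.size())   # batch sizes are (2, 2), (2, 2), (1, 2); the index order is random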