Installation

conda create --name pytorch python=3.6 numpy
conda activate pytorch
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Verification:

python

from __future__ import print_function
import torch
x = torch.rand(5, 3)
print(x)

torch.cuda.is_available()  # check whether the GPU driver and CUDA are enabled and accessible by PyTorch

Output:

tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])

True

Basic Usage

PyTorch serves two main purposes:

  • A replacement for NumPy that uses the power of GPUs.
  • A deep learning research platform that provides maximum flexibility and speed.

Tensor

  • Similar to NumPy’s ndarrays.
  • Tensors can also be used on a GPU to accelerate computing.

Construct a Tensor

from __future__ import print_function
import torch

x = torch.empty(5, 3)   # uninitialized
print(x)

x = torch.rand(5, 3)   # randomly initialized (uniform on [0, 1))
print(x)

x = torch.zeros(5, 3, dtype=torch.long)   # filled with zeros, dtype long
print(x)

x = torch.tensor([5.5, 3])   # directly from data
print(x)

x = x.new_ones(5, 3, dtype=torch.double)    # new_* methods take in sizes
print(x)

x = torch.randn_like(x, dtype=torch.float)   # result has the same size, override dtype
print(x)

print(x.size())   # get the size, torch.Size([5, 3]), which is a tuple

Operations

  • Addition (several forms are shown below).
  • Other operations are described here.
y = torch.rand(5, 3)
print(x + y)

print(torch.add(x, y))

result = torch.empty(5, 3)   # provide an output tensor as argument
torch.add(x, y, out=result)
print(result)

y.add_(x)   # add x to y in-place
print(y)
print(y[:, 1])

x = torch.randn(4, 4)
y = x.view(16)   # resize/reshape tensor

x = torch.randn(1)
print(x)
print(x.item())   # get the value as a python number, tensor([0.9551]), 0.9551321864128113

Any operation that mutates a tensor in-place is post-fixed with an _, e.g. x.copy_(y), x.t_().
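
For example, a minimal sketch of these two in-place operations (the tensor values are arbitrary):

x = torch.zeros(2, 3)
y = torch.rand(2, 3)
x.copy_(y)   # copy the values of y into x, in place
x.t_()       # transpose x in place; x now has shape (3, 2)
print(x.size())   # torch.Size([3, 2])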

Change the Shape of Tensor

  • view(): return a tensor with the same data but a new shape (requires a compatible memory layout).
  • reshape(): like view(), but may copy the data if needed.
  • unsqueeze(): add a dimension of size 1.
  • squeeze(): remove dimensions of size 1.
  • expand(): expand a dimension of size 1 to a larger size without copying data.
  • repeat(): tile the tensor along the given dimensions (copies data).
import torch

a = torch.rand(3, 1, 5, 4)
print(a.shape)   # torch.Size([3, 1, 5, 4])
print(a.view(3, 5, 4).shape)   # torch.Size([3, 5, 4])
print(a.reshape(3, 5, 4).shape)   # torch.Size([3, 5, 4])
print(a.unsqueeze(0).shape)   # torch.Size([1, 3, 1, 5, 4])
print(a.unsqueeze(-1).shape)   # torch.Size([3, 1, 5, 4, 1])
print(a.unsqueeze(2).shape)   # torch.Size([3, 1, 1, 5, 4])
print(a.squeeze().shape)   # torch.Size([3, 5, 4])
print(a.squeeze(1).shape)   # torch.Size([3, 5, 4])
print(a.squeeze(3).shape)   # torch.Size([3, 1, 5, 4]), cannot squeeze because that dimension has size 4
print(a.expand(-1, 6, -1, -1).shape)   # torch.Size([3, 6, 5, 4])
print(a.repeat(2, 2, 1, 1).shape)   # torch.Size([6, 2, 5, 4])

Concatenate and Stack Tensors

  • torch.cat(tensors, dim=0, out=None): concatenate tensors along an existing dimension; the number of dimensions is unchanged.
  • torch.stack(tensors, dim=0, out=None): stack tensors from a list/tuple along a new dimension; the number of dimensions increases by one.
import torch

a = torch.rand((2, 3))
b = torch.rand((2, 3))
c = torch.cat((a, b))
print(a.size(), b.size(), c.size())
'''
torch.Size([2, 3]) torch.Size([2, 3]) torch.Size([4, 3])
'''
d = torch.stack((a, b))
print(a.size(), b.size(), d.size())
'''
torch.Size([2, 3]) torch.Size([2, 3]) torch.Size([2, 2, 3])
'''

CPU Tensor \(\rightleftharpoons\) NumPy

  • The tensor and the NumPy array share their underlying memory locations (if the tensor is on the CPU), so changing one will change the other.

Converting a tensor to a NumPy array

a = torch.ones(5)
print(a)   # tensor([1., 1., 1., 1., 1.])
b = a.numpy()
print(b)   # [1. 1. 1. 1. 1.]

a.add_(1)
print(a)   # tensor([2., 2., 2., 2., 2.])
print(b)   # [2. 2. 2. 2. 2.], changed the value of b

Converting a NumPy array to a tensor

import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)   # [2. 2. 2. 2. 2.]
print(b)   # tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
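
Note that torch.from_numpy shares memory with the array, while torch.tensor copies the data; continuing the example above as a small sketch:

c = torch.tensor(a)   # torch.tensor copies the data, so c does not share memory with a
np.add(a, 1, out=a)
print(a)   # [3. 3. 3. 3. 3.]
print(c)   # tensor([2., 2., 2., 2., 2.], dtype=torch.float64), unchanged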

CUDA Tensors

  • Tensors can be moved onto any device using .to.
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    x = torch.randn(1)
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z)                               # tensor([1.9551], device='cuda:0')
    print(z.to("cpu", torch.double))       # tensor([1.9551], dtype=torch.float64), ``.to`` can also change dtype together!

Autograd

  • The autograd package provides automatic differentiation for all operations on Tensors.
  • Define-by-run framework, the backprop is defined by how the code is run, and every single iteration can be different.
  • .requires_grad = True: start to track all operations on the tensor.
  • .backward(): all the gradients computed automatically, and the gradients for this tensor will be accumulated into .grad attribute.
  • with torch.no_grad(): or .detach(): stop a tensor from tracking history and prevent future computation from being tracked. This is useful when evaluating a model, where gradients are not needed.

Usage with Tensors

x = torch.ones(2, 2, requires_grad=True)
print(x)
'''
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
'''
y = x + 2
print(y)
'''
tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)
'''
print(y.grad_fn)
'''
<AddBackward0 object at 0x7f67610c4160>
'''
z = y*y*3
out = z.mean()
print(z, out)
'''
tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)
'''

a = torch.randn(2, 2)
a = ((a*3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a*a).sum()
print(b.grad_fn)
'''
False
True
<SumBackward0 object at 0x7f67610c4e48>
'''

Gradients for Backprop

The tensor out is defined as:

\[out = \frac{1}{4} \sum_i 3(x_i + 2)^2, \quad x_i = 1,\ i \in \{1,2,3,4\}.\]

Then,

\[\frac{\partial out}{\partial x_i} = \frac{3}{2} (x_i + 2), \quad \text{so } \frac{\partial out}{\partial x_i}\Big|_{x_i = 1} = 4.5.\]
out.backward()
print(x.grad)
'''
tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])
'''

Stop autograd

print(x.requires_grad)
print((x**2).requires_grad)
with torch.no_grad():
    print((x**2).requires_grad)
'''
True
True
False
'''
# use detach to get a new tensor with the same content but does not require gradients
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())
'''
True
False
tensor(True)
'''

Neural Networks

  • Neural networks can be constructed using the torch.nn package.
  • An nn.Module contains layers, and a method forward(input) that returns the output.
  • The backward() function (where gradients are computed) is automatically defined by autograd.

Training Procedure for a Neural Network

  1. Define the neural network, which has some learnable parameters (weights).
  2. Iterate over a dataset of inputs.
  3. Process an input through the network.
  4. Compute the loss (how far the output is from being correct).
  5. Propagate gradients back into the network’s parameters.
  6. Update the weights of the network.
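
A minimal sketch of this loop, assuming a network net, a loss criterion, an optimizer, and an iterable data_loader of (input, target) pairs (these names are placeholders; each piece is defined in the sections below):

for input, target in data_loader:      # 2. iterate over the dataset
    optimizer.zero_grad()              # clear the gradient buffers
    output = net(input)                # 3. forward pass through the network
    loss = criterion(output, target)   # 4. compute the loss
    loss.backward()                    # 5. backpropagate the gradients
    optimizer.step()                   # 6. update the weights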

Define the Network

  • torch.nn only supports inputs that are a mini-batch of samples, not a single sample.
  • nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
  • If there is a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
print(net)

Output:

Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

The learnable parameters (weights) of the model are returned by net.parameters():

params = list(net.parameters())
print(len(params))
print(params[0].size())   # conv1's weights

'''
10
torch.Size([6, 1, 3, 3])
'''

Usage of the network:

input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

'''
tensor([[ 0.0582, -0.1492, -0.1571, -0.0059, -0.0130, -0.0429, -0.0989,  0.0680,
         -0.0835, -0.0217]], grad_fn=<AddmmBackward>)
'''

Distributions

  • https://pytorch.org/docs/stable/distributions.html
  • torch.distributions package contains parameterizable probability distributions and sampling functions.
  • This allows the construction of stochastic computation graphs and stochastic gradient estimators for optimization.

REINFORCE Algorithm (example)

  • In practice, we would sample an action from the output of a network, and apply this action in an environment.
  • Then, use log_prob to construct an equivalent loss function.
  • With a categorical policy, we can implement REINFORCE.
\[\Delta\theta = \alpha r \frac{\partial\log p(a \vert \pi^\theta(s))}{\partial\theta}.\]

The code implementing REINFORCE is as follows:

from torch.distributions import Categorical

probs = policy_network(state)
# Note that this is equivalent to what used to be called multinomial
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

Categorical

  • probs
    • 1D: each element is the relative probability of sampling the class at that index.
    • 2D: it is treated as a batch of relative probability vectors.
from torch.distributions import Categorical

probs = torch.tensor([0.25, 0.25, 0.25, 0.25])
m = Categorical(probs)
m.sample()  # equal probability of 0, 1, 2, 3
'''
tensor(3)
'''
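
With 2D probs, each row is treated as its own probability vector and sample() returns one index per row; a small sketch:

batch_probs = torch.tensor([[0.9, 0.1], [0.1, 0.9]])   # a batch of two distributions
m = Categorical(batch_probs)
m.sample()   # one class index per row, e.g. tensor([0, 1])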

MultivariateNormal

MultivariateNormal(loc, covariance_matrix=None, precision_matrix=None, scale_tril=None, validate_args=None)
  • Create a multivariate Normal/Gaussian distribution parameterized by a mean vector and a covariance matrix.
  • loc: mean of the distribution.
  • covariance_matrix: positive-definite covariance matrix.
  • precision_matrix: positive-definite precision matrix.
  • scale_tril: lower-triangular factor of covariance, with positive-valued diagonal.
from torch.distributions import MultivariateNormal

m = MultivariateNormal(torch.zeros(2), torch.eye(2))
action = m.sample()  # sample from a Gaussian with mean [0, 0] and covariance matrix I
action_logprob = m.log_prob(action)  # compute the log probability of action
entropy = m.entropy()  # compute the entropy of the distribution
'''
tensor([-0.2102, -0.5429])
'''

Normal

Normal(loc, scale, validate_args=None)
  • Create a normal/Gaussian distribution parameterized by loc and scale.
  • loc: mean of the distribution (mu).
  • scale: standard deviation of the distribution (sigma).
from torch.distributions import Normal

m = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
m.sample()  # normally distributed with loc=0 and scale=1
'''
tensor([ 0.1046])
'''

Uniform

Uniform(low, high, validate_args=None)
  • Generate uniformly distributed random samples from the half-open interval [low, high).
from torch.distributions import Uniform

m = Uniform(torch.tensor([0.0]), torch.tensor([5.0]))
m.sample()  # uniformly distributed in the range [0.0, 5.0)
'''
tensor([ 2.3418])
'''

KL Divergence

\[KL(p \Vert q) = \int p(x) \log \frac{p(x)}{q(x)}dx.\]
torch.distributions.kl.kl_divergence(p, q)
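
For example, a small sketch computing the KL divergence between two univariate Gaussians (the values are illustrative; the result follows the closed-form expression):

from torch.distributions import Normal
from torch.distributions.kl import kl_divergence

p = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
q = Normal(torch.tensor([1.0]), torch.tensor([2.0]))
print(kl_divergence(p, q))   # approximately 0.4431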

Loss Function

  • Takes the (output, target) pair of inputs and computes a value that estimates how far the output is from the target.
  • There are several different loss functions under the nn package, e.g. nn.MSELoss(); all are described here.
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

'''
tensor(1.8713, grad_fn=<MseLossBackward>)
'''

Backprop

  • Use loss.backward() to backpropagate the errors.
  • net.zero_grad(): clear the existing gradients before calling loss.backward(), otherwise new gradients will be accumulated onto the existing ones.
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

Output:

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([-0.0166,  0.0530,  0.0216, -0.0270, -0.0130, -0.0092])

Update the Weights

Update rule (SGD)

\[W \leftarrow W - lr \cdot g,\]

where lr is the learning rate and g is the gradient.

Implementation in plain Python

lr = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * lr)

Implementation with PyTorch

  • We can use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
  • torch.optim: implements all of these update rules.
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

Save and Load the Model

Save only the parameters of the model

torch.save(the_model.state_dict(), PATH)

Load the model parameters:

the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))

Save the whole model

torch.save(the_model, PATH)

Load the model:

the_model = torch.load(PATH)
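
For example, a minimal sketch with the Net class defined earlier (the file name is just a placeholder):

PATH = "net_params.pt"                  # hypothetical file name
torch.save(net.state_dict(), PATH)      # save only the parameters

net2 = Net()                            # re-create the same architecture
net2.load_state_dict(torch.load(PATH))  # load the saved parameters
net2.eval()                             # set to evaluation mode before inference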

Data Parallelism

Use multiple GPUs

  • PyTorch will only use one GPU by default.
  • Use multiple GPUs using DataParallel: model = nn.DataParallel(model).
  • DataParallel splits the data automatically and sends job orders to multiple models on several GPUs.
  • After each model finishes its job, DataParallel collects and merges the results before returning them to you.
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())
'''
Let's use 2 GPUs!
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
'''

Run Part on the CPU and Part on the GPU

device = torch.device("cuda:0")

class DistributedModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)      # stays on the CPU
        self.rnn = nn.Linear(10, 10).to(device)      # moved to the GPU

    def forward(self, x):
        # Compute embedding on CPU
        x = self.embedding(x)

        # Transfer to GPU
        x = x.to(device)

        # Compute RNN on GPU
        x = self.rnn(x)
        return x
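
A possible usage sketch, assuming a CUDA device is available (the input is a batch of token indices created on the CPU):

model = DistributedModel()
x = torch.randint(0, 1000, (4,))   # 4 token indices on the CPU
out = model(x)                     # embedding on CPU, linear layer on GPU
print(out.device)                  # cuda:0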

Sampler

from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler

buffer = [[1,2],[3,4],[5,6],[7,8],[9,10]]
for index in BatchSampler(SubsetRandomSampler(range(len(buffer))), 2, False):
    print(index)   # print the indices of the sampled buffer entries
'''
[0, 1]
[3, 4]
[2]
'''

for index in BatchSampler(SubsetRandomSampler(range(len(buffer))), 2, True):
    print(index)
'''
[1, 3]
[0, 2]
'''
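
The sampled indices can then be used to gather mini-batches from the buffer, e.g.:

for index in BatchSampler(SubsetRandomSampler(range(len(buffer))), 2, False):
    batch = [buffer[i] for i in index]   # gather the sampled entries
    print(batch)   # e.g. [[3, 4], [9, 10]]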

References