- Hello everyone, welcome to CS231. I'm Song Han. Today I'm
going to give a guest lecture on the efficient methods and
hardware for deep learning. So I'm a fifth year PhD
candidate here at Stanford, advised by Professor Bill Dally. So, in this course we have seen
a lot of convolutional neural networks, recurrent
neural networks, and, since last time, even
reinforcement learning. They span a lot of applications, for example the self-driving
car, machine translation, AlphaGo and Smart Robots. And it's changing our
lives, but there is a recent trend that in order to
achieve such high accuracy, the models are getting larger and larger. For example for ImageNet
recognition, the winner from 2012 to 2015, the model
size increased by 16X. And for Baidu's Deep Speech, just in one year, the number of training operations increased by 10X. So such a large model
creates lots of problems. For example, the model size becomes larger and larger, so it's difficult to deploy these models, for example, on mobile phones. If the app is larger than 100 megabytes, you cannot download it until you connect to Wi-Fi. So the product managers at, for example, Baidu and Facebook are very sensitive to the binary size of their models. And also, for the
self-driving car, you can only update the model over the air, and if the model is too large, that's also difficult. And the second challenge
for those large models is that the training speed is extremely slow. For example, ResNet-152, which is actually less than 1% more accurate than ResNet-101, takes 1.5 weeks to train on four Maxwell M40 GPUs. That greatly slows things down, whether we are doing homework or a researcher is designing new models. And the third challenge
for those bulky models is energy efficiency. For example, AlphaGo beating Lee Sedol last year took 2,000 CPUs and 300 GPUs, which cost $3,000 just to pay the electricity bill, which is insane. So on embedded devices, those models drain your battery, and in the data center they increase the total cost of ownership of maintaining a large data center. For example, Google, in
their blog, mentioned that if all users used Google Voice Search for just three minutes, they would have to double their data centers. So that's a large cost, and reducing it is very important. So let's see where
the energy is actually consumed. A large model means lots of memory access: you have to load the model from memory, and that means more energy. If you look at how much
energy is consumed by memory access versus how much is consumed by the multiply and add arithmetic operations, memory access is two to three orders of magnitude more energy consuming than the arithmetic operations. So how do we make deep learning more efficient? We have to improve
energy efficiency through algorithm and hardware co-design. So this is the previous way of designing hardware: for example, we have some benchmarks, say SPEC 2006, and then we run those benchmarks and tune the CPU architecture for them. Now what we should do is
to open up the box, see what we can do from the algorithm side first, and then ask what the optimal "?-PU", the question-mark processing unit, should be. That breaks the boundary between the algorithm and the hardware to improve the overall efficiency. So in today's talk, I'm going
to have the following agenda. We are going to cover four aspects: algorithm and hardware, inference and training. They form a small two-by-two matrix: the algorithm for efficient inference, the hardware for efficient inference, the algorithm for efficient training, and lastly, the hardware for efficient training. For example, I'm going
to cover the TPU, I'm going to cover the Volta. But before I cover those
things, let's have three slides of Hardware 101: a brief introduction to the families of hardware, organized as a tree. In general, we have roughly two branches. One is general purpose hardware, which can run any application, versus specialized hardware, which is tuned for a specific kind of application, a domain of applications. So the general purpose
hardware includes the CPU and the GPU. Their difference is that the CPU is latency oriented and single threaded, like a big elephant, while the GPU is throughput oriented: it has many small, weak cores, but there are thousands of them, like a huge group of small ants. And for specialized hardware,
roughly there are FPGAs and ASICs. FPGA stands for Field Programmable Gate Array. It is hardware programmable, so its logic can be changed. It's cheaper for trying new ideas and building prototypes, but it's less efficient. It sits in the middle between
the general purpose and pure ASIC. So ASIC stands for Application
Specific Integrated Circuit. It has fixed logic, designed just for a certain application, for example deep learning. Google's TPU is a kind of ASIC, and the GPUs we train our neural networks on sit over here on the general purpose side. And another slide of
Hardware 101 is number representation. In this slide, I want to convey the idea that the numbers in a computer are not real numbers; they are actually discrete. Even for 32 bit floating point, the resolution is not perfect: it's not continuous, it's discrete. So for example FP32 means
using 32 bits to represent a floating point number. There are three components in the representation: the sign bit S, the exponent bits E, and the mantissa M, and the number it represents is (-1)^S x 1.M x 2^E (where the stored exponent carries the standard bias). Similarly, there is FP16,
using 16 bits to represent a floating point number. In particular, I'm going to introduce INT8, which the Google TPU uses: using an integer to represent a fixed point number. So we have a certain number of bits for the integer part, followed by a radix point, which we can place differently for different layers, and lastly the fractional bits.
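As a quick aside (my own illustration, not from the slides), you can pull those bit fields out of a float32 in Python; the 127 exponent bias is the standard IEEE 754 detail behind the formula above:

```python
import struct

def fp32_fields(x):
    """Split a float32 into its sign, exponent, and mantissa bit fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

s, e, m = fp32_fields(-6.5)
# Reconstruct the value: (-1)^S * 1.M * 2^(E - 127) for normalized numbers.
print((-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - 127))  # prints -6.5
```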
So why do we prefer eight bits or 16 bits rather than the traditional 32 bit floating point? The reason is cost. I generated this figure from 45 nanometer technology data, showing the energy cost versus the area cost of different operations. In particular, let's see
here: going from 32 bit to 16 bit, we get about a four times reduction in energy and also about a four times reduction in area. Area means money; every square millimeter costs money to tape out a chip. So it's very beneficial for
hardware design to go from 32 bit to 16 bit. That's why, starting from the Pascal architecture, NVIDIA began to support FP16. That's the reason it's so beneficial: for example, a battery that previously lasted four hours could now last 16 hours. That's what reducing the energy cost by four times means. But there's still the
problem of the large energy cost of reading memory. Memory references are so expensive; how do we deal with this problem better? So let's switch gears and
come to our main topic directly. Let's first introduce algorithms for efficient inference. I'm going to cover six topics; this is a really long slide, so I'm going to go relatively fast. So the first idea I'm going
to talk about is pruning: pruning neural networks. For example, this is the original neural network. What I'm trying to do is ask, can we remove some of the weights and still have the same accuracy? It's like pruning a tree: get rid of the redundant connections. This was first proposed by Professor Yann LeCun back in 1989, and I revisited the problem 26 years later on modern deep neural nets to see how it works. So not all parameters are actually useful. For example, in this case, if
you want to fit a straight line but you're using a quadratic term, apparently the 0.01 coefficient is a redundant parameter. So I train the connectivity first, then prune some of the connections, then retrain the remaining weights, and iterate this process. As a result, I can reduce the number of connections in AlexNet from 60 million parameters to only six million parameters, which is roughly 10 times less computation. So this is the accuracy.
parameters to prune away and the y-axis is the accuracy you have. So we want to have less
parameters, but we also want to have the same accuracy as before. We don't want to sacrifice accuracy, For example at 80%, we
locked zero away left 80% of the parameters, but
accuracy jumped by 4%. That's intolerable. But the good thing is that
if we retrain the remaining weights, the accuracy
can fully recover here. And if we do this process iteratively by pruning and retraining,
pruning and retraining, we can fully recover the
accuracy not until we are prune away 90% of the parameters. So if you go back to home
and try it on your Ipad or notebook, just zero away
50% of the parameters say you went on your homework,
you will astonishingly find that accuracy actually doesn't hurt. So we just mentioned
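As a minimal sketch of that experiment (my own illustration, not the code from the paper), magnitude-based pruning just zeroes the smallest weights and keeps a mask so the pruned connections stay at zero during retraining:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights of one layer.

    Returns the pruned weights and a binary mask; during retraining the
    mask is re-applied after every update so pruned weights stay at zero.
    """
    k = int(sparsity * w.size)
    if k == 0:
        return w, np.ones_like(w)
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

w = np.random.randn(256, 128).astype(np.float32)   # a pretend layer
w_pruned, mask = magnitude_prune(w, sparsity=0.5)
print(1.0 - mask.mean())                           # ~0.5 of the weights are now zero
```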
convolution neural nets, how about RNNs and LSTMs, so I
tried with this neural talk. Again, pruning away 90% of
the rates doesn't hurt the blue score. And here are some visualizations. For example, the original
picture, the neural talk says a basketball player in a
white uniform is playing with a ball. After pruning away 90%, it says a basketball player in a white uniform is playing with a basketball. And so on. But if you're too aggressive,
say you prune away 95% of the weights, the
network is going to get drunk. It says, a man in a red shirt
and white and black shirt is running through a field. So there's really a limit,
a threshold, you have to take care of during the pruning. So interestingly, after
I did the work, did some resource and research and
find actually the same pruning procedure actually
happens to human brain as well. So when we were born, there
are about 50 trillion synapses in the brain. At one year old, this number surges to 1,000 trillion. And as we become adolescents, it actually becomes smaller, about 500 trillion in the end, according to a study in Nature. So this is very interesting. Also, pruning changes the weight distribution, because we are removing the small connections and then retraining, which is why the shape of the distribution changes in the end. Yeah, question.
to mean that it terms of your mixed weights
during the training will be just set at zero and
just start from scratch? And these start from the
things that are at zero. - Yeah. So the question is,
how do we deal with those zero connections? So we force them to be zero
in all the other iterations. Question? - [Student] How do you
pick which rates to drop? - Yeah so very simple. Small
weights, drop it, sort it. If it's small, just-- - [Student] Any threshold that I decide? - Exactly, yeah. So the next idea, weight sharing. So now we have, remember
our end goal is to remove connections so that we can
have less memory footprint so that we can have more
energy efficient deployment. Now we have less number
of parameters by pruning. We want to have less number
of bits per parameter so they're multiplied together
they get a small model. So the idea is like this. Not all numbers, not all the weights has to be the exact number. For example, 2.09, 2.12 or
all these four weights, you just put them using 2.0 to represent them. That's enough. Otherwise too accurate number
is just leads to overfitting. So the idea is I can
cluster the weights if they are similar, just using
a centroid to represent the number instead of using
the full precision weight. So that every time I do the
inference, I just do inference on this single number. For example, this is a
four by four weight matrix in a certain layer. And what I'm going to do is do
k-means clustering by having the similar weight
sharing the same centroid. For example, 2.09, 2.12, I store index of three pointing to here. So that, the good thing is
we need to only store the two bit index rather than the
32 bit, floating point number. That's 16 times saving. And how do we train such neural network? They are binded together, so
after we get the gradient, we color them in the same
pattern as the weight and then we do a group by
operation by having all the in that weights with the
same index grouped together. And then we do a reduction
by summing them up. And then multiplied by the learning rate subtracted from the original centroid. That's one iteration of
the SGD for such weight shared neural network. So remember previously,
after pruning, this is what the weight distribution looks like, and after weight sharing, it becomes discrete. There are only 16 different values here, meaning we can use four bits to represent each number. And by training such a weight-shared neural network, these shared weights can adjust, and it is these subtle changes that compensate for the loss of accuracy. So let's see: this is the
number of bits we give it, and this is the accuracy for the convolution layers. Not until four bits does the accuracy begin to drop, and for the fully connected layers, very astonishingly, not until two bits, only four distinct values, does the accuracy begin to drop. And this result is per layer.
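Here is a minimal sketch of that weight sharing scheme (my own toy illustration with a tiny hand-rolled k-means, not the paper's implementation): each weight is replaced by the index of its nearest centroid, and the SGD update is applied to the shared centroids by grouping and summing the gradients:

```python
import numpy as np

def kmeans_1d(w, n_clusters=4, iters=20):
    """Tiny 1-D k-means over a flat weight array: returns the centroids
    and, for every weight, the index of its nearest centroid."""
    centroids = np.linspace(w.min(), w.max(), n_clusters)   # linear init
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return centroids, idx

def centroid_sgd_step(centroids, idx, grad, lr=0.01):
    """Group the gradients by centroid index, reduce by summing,
    then update each shared centroid."""
    for c in range(len(centroids)):
        if np.any(idx == c):
            centroids[c] -= lr * grad[idx == c].sum()
    return centroids

w = np.random.randn(4, 4).ravel()             # the 4x4 example layer
centroids, idx = kmeans_1d(w, n_clusters=4)   # 4 centroids -> 2-bit indices
grad = np.random.randn(w.size)                # pretend this came from backprop
centroids = centroid_sgd_step(centroids, idx, grad)
w_shared = centroids[idx].reshape(4, 4)       # weights used at inference time
```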
pruning and weight sharing. What if we combine these
two methods together. Do they work well? So by combining those methods,
this is the compression ratio with the smaller on the left. And this is the accuracy. We can combine it together
and make the model about 3% of its original
size without hurting the accuracy at all. Compared with the each
working individual data by 10%, accuracy begins to drop. And compared with the
cheap SVD method, this has a better compression ratio. And final idea is we can
apply the Huffman Coding to use more number of bits
for those infrequent numbers, infrequently appearing weights
and less number of bits for those more frequently
appearing weights. So by combining these three
methods, pruning, weight sharing, and also Huffman
Coding, we can compress the neural networks, state-of-the-art
neural networks, ranging from 10x to
49x without hurting the prediction accuracy. Sometimes a little bit better. But maybe that is noise. So the next question is, these
models are just pre-trained models, by say Google or Microsoft. Can we make a compact model to begin with, even before such compression? So, SqueezeNet: you may have already worked with this neural network model in a homework. The idea is that we have a squeeze layer here that feeds the three by three convolution with a smaller number of channels; that's where the squeeze comes from. And here we have two branches,
rather than four branches as in the inception model. So as a result, the model
is extremely compact. It doesn't have any
fully connected layers. Everything is fully convolutional. The last layer is a global pooling. So what if we apply deep
compression algorithm on such already compact
model will it be getting even smaller? So this is AlexNet after
compression, this is SqueezeNet. Even before compression, it's
50x smaller than AlexNet, but has the same accuracy. After compression 510x
smaller, with the same accuracy, at less than half a megabyte. This means it's very easy to fit such a small model in the cache, which is typically tens of megabytes of SRAM. So what does that mean? It means it's possible to achieve a speedup. This is the speedup I measured, on the fully connected layers only for now, on a CPU, a GPU, and a mobile GPU, before and after pruning the weights. On average, I observed about a 3x speedup on the CPU, about 3x on the GPU, and roughly 5x on the mobile GPU, which is a TK1. And so is the energy efficiency: an average improvement of 3x to 6x on the CPU, GPU, and mobile GPU. And these ideas are
used in these companies. Having talked about pruning and weight sharing, which together form a non-linear quantization method, we're now going to talk about quantization, which is what is used in the TPU design. The TPU uses only eight bits for inference, and the way it can do that is because of quantization. So let's see how it works. Quantization has this
complicated figure, but the intuition is very simple. You run the neural network and train it with normal floating point numbers, then quantize the weights and activations by gathering statistics for each layer: for example, what are the maximum and minimum values, and how many bits are enough to represent that dynamic range. Then you use that many bits for the integer part and the remaining bits of the 8 bit representation for the fractional part. And we can also fine-tune in
the floating point format. Or we can do the feed-forward pass with fixed point and the back propagation and weight update with floating point numbers. There are lots of different ideas for getting better accuracy. And this is the result,
showing how many bits versus what accuracy. For example, using fixed-point 8 bit, the accuracy for GoogleNet doesn't drop significantly, and for VGG-16 the accuracy also holds up pretty well. But going down to six bits, the accuracy begins to drop pretty dramatically.
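As a rough sketch of this kind of per-layer fixed-point quantization (my own simplification; the real TPU flow and the fine-tuning are more involved), you pick the number of integer bits from the layer's dynamic range and give the remaining bits to the fraction:

```python
import numpy as np

def quantize_fixed_point(x, total_bits=8):
    """Choose integer bits from the dynamic range of x, use the remaining
    bits as fractional bits, and round x onto that fixed-point grid."""
    max_abs = float(np.abs(x).max())
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))) + 1)  # +1 sign bit
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale),
                -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1)
    return q.astype(np.int8), frac_bits

def dequantize(q, frac_bits):
    return q.astype(np.float32) / (2.0 ** frac_bits)

w = np.random.randn(64, 64).astype(np.float32)
q, frac_bits = quantize_fixed_point(w, total_bits=8)
print(np.abs(w - dequantize(q, frac_bits)).max())   # small quantization error
```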
The next idea is low rank approximation. It turns out that you can break a convolution layer into two convolution layers: one convolution here, followed by a one by one convolution, so it's like breaking a complicated problem into two separate smaller problems. That's for convolution layers: as we can see, at about a 2x speedup there's almost no loss of accuracy, and at a 5x speedup, roughly a 6% loss of accuracy. And this also works for fully connected layers. The simplest idea is using
the SVD to break one matrix into two matrices. And following this idea, this paper proposes to use the Tensor Train decomposition to break one fully connected layer down into a chain of many smaller fully connected layers.
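A minimal sketch of the SVD idea for a fully connected layer (my own illustration; the rank and layer sizes here are made up) looks like this:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate an m x n fully connected weight matrix by A (m x rank)
    times B (rank x n) using a truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # fold the singular values into A
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 512).astype(np.float32)
A, B = low_rank_factorize(W, rank=64)
x = np.random.randn(512).astype(np.float32)
y_full = W @ x                       # original layer: 1024 * 512 multiplies
y_low = A @ (B @ x)                  # factorized: (1024 + 512) * 64 multiplies
print(np.abs(y_full - y_low).mean()) # approximation error
```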
to represent a neural network? A ternary weight or a binary weight. We already seen this distribution
before, after pruning. There's some positive
weights and negative weights. Can we just use three numbers,
just use one, minus one, zero to represent the neural network. This is our recent paper
clear that we maintain a full precision weight
during training time, but at inference time, we
only keep the scaling factor and the ternary weight. So during inference, we
only need three weights. That's very efficient and
making the model very small. This is the proportion
of the positive, zero, and negative weights; they can change during training, and so can their absolute values. And this is the visualization of the kernels learned by this trained ternary quantization. We can see some of them are
corner detectors, like here, and also here; some of them are maybe edge detectors, for example this filter. Actually, we don't need such fine grained resolution: just three values are enough. So this is the validation accuracy on ImageNet with AlexNet. This line is the baseline accuracy with floating point 32, and the red line is our result: pretty much the same converged accuracy as with the full precision weights.
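As a simplified sketch of ternarization (my own illustration; in the actual trained ternary quantization the two scales are learned during training rather than computed as means):

```python
import numpy as np

def ternarize(W, threshold_ratio=0.05):
    """Quantize a weight tensor to three values {-Wn, 0, +Wp}: weights below
    a magnitude threshold become zero, and the two scales here are simple
    means standing in for the learned scales of the real method."""
    t = threshold_ratio * np.abs(W).max()
    pos, neg = W > t, W < -t
    Wp = W[pos].mean() if pos.any() else 0.0
    Wn = -W[neg].mean() if neg.any() else 0.0
    codes = np.zeros_like(W, dtype=np.int8)    # only 2 bits needed per weight
    codes[pos], codes[neg] = 1, -1
    return codes, Wp, Wn

W = np.random.randn(3, 3, 64, 64).astype(np.float32)
codes, Wp, Wn = ternarize(W)
W_hat = np.where(codes == 1, Wp, np.where(codes == -1, -Wn, 0.0))  # inference weights
```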
The last idea is the Winograd transformation. This is about how we implement deep neural nets, how we implement the convolutions. This is the conventional direct convolution implementation; the slide is credited to Julien, a friend from NVIDIA. Originally, we just do a dot product between the nine elements of the filter and nine elements of the image and then sum it up, so for every output we need nine times C multiply-and-add operations. Winograd convolution is another
method, an equivalent method. It's not lossy; it's an exact equivalent, first proposed in this paper, Fast Algorithms for Convolutional Neural Networks. Instead of directly doing the convolution, sliding it along one position at a time, it first transforms the input tile into another tile, using a transform that contains only values like 1, 0.5, and 2, which can be implemented efficiently with shifts. It also transforms the filter
into a four by four tile. What we do here is an element-wise product, summed over the channels C, so there are only 16 multiplications happening per tile. Then we do an inverse transform to get four outputs. The transform and the inverse transform can be amortized, so their cost can roughly be ignored. So in order to get four outputs,
the direct convolution originally needs nine times C times four, which is 36 times C multiplications, but now we only need 16 times C. That is 2.25x fewer multiplications to compute exactly the same convolution.
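For the curious, here is a small sketch of the F(2x2, 3x3) Winograd transform (the B, G, A matrices are the standard ones from the Lavin and Gray paper; the check against direct convolution is my own illustration):

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices from Lavin & Gray.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float32)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

def winograd_2x2_3x3(tile, filt):
    """2x2 output patch from a 4x4 input tile and a 3x3 filter,
    using only 16 element-wise multiplications."""
    U = G @ filt @ G.T           # transform the 3x3 filter to 4x4
    V = BT @ tile @ BT.T         # transform the 4x4 input tile
    return AT @ (U * V) @ AT.T   # 16 multiplies, then the inverse transform

def direct_2x2_3x3(tile, filt):
    """Reference direct convolution: 9 multiplies per output."""
    out = np.zeros((2, 2), dtype=np.float32)
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(tile[i:i + 3, j:j + 3] * filt)
    return out

tile = np.random.randn(4, 4).astype(np.float32)
filt = np.random.randn(3, 3).astype(np.float32)
print(np.allclose(winograd_2x2_3x3(tile, filt), direct_2x2_3x3(tile, filt), atol=1e-4))
```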
2.25x speedup and in real, from cuDNN 5 they incorporated such Winograd Convolution algorithm. This is on the VGG net I
believe, the speedup is roughly 1.7 to 2x speedup. Pretty significant. And after cuDNN 5, the
cuDNN begins to use the Winograd Convolution algorithm. Okay, so far we have covered
the efficient algorithms for efficient inference. We covered pruning, weight sharing, quantization, and also Winograd, binary, and ternary. So now let's see: what is the
optimal hardware for those efficient inference? And what is a Google TPU? So there are a wide
range of domain specific architectures or ASICS
for deep neural networks. They have a common goal
is to minimize the memory access to save power. For example the Eyeriss from
MIT uses the row stationary (RS) dataflow to minimize off-chip DRAM access. And DaDianNao, from the Chinese Academy of Sciences, buffers all the weights in on-chip memory instead of having to go off-chip to DRAM. The TPU from Google uses eight bit integers to represent the numbers. And at Stanford I proposed
the EIE architecture, which supports inference directly on a compressed, sparse deep neural network. So this is what the TPU looks like. It's actually quite neat: it can be put into a disk drive slot, up to four cards per server. And this is the high-level architecture of the Google TPU. Don't be overwhelmed; it's
actually, the kernel part here, is this giant matrix
multiplication unit. So it's a 256 by 256
matrix multiplication unit. In one single cycle, it can perform 64K multiply-and-accumulate operations. Running at 700 Megahertz, the throughput is 92 teraops per second, because these are integer operations. That's roughly 25x a GPU and more than 100x a CPU. And notice, the TPU has a really
large software-managed on-chip buffer: 24 megabytes. For comparison, the L3 cache of a CPU is around 16 megabytes, so 24 megabytes is pretty large. And it's fed by two DDR3 DRAM channels. This is a little weak, because the bandwidth is only 30 gigabytes per second, compared with the most recent GPUs with HBM at 900 gigabytes per second. DDR4 was only released in 2014, so that makes sense: the design dates from around that time and used DDR3. But if you used DDR4 or
even high-bandwidth memory, the performance can be even boosted. So this is a comparison
of Google's TPU with the CPU and the GPU, a K80 GPU by the way. The TPU area is much smaller, like half the size of the CPU and GPU, and the power
consumption is roughly 75 watts. And see this number: the peak teraops per second is much higher than the CPU's and GPU's, about 90 teraops per second, which is pretty high. So here is the workload. Thanks to David for sharing the slide. This is the workload at Google; they benchmarked these TPUs on it. It's a little interesting
that convolution neural nets only account for 5% of
the data-center workload. Most of it is multilayer perceptrons, those fully connected layers, about 61%, maybe for ads, I'm not sure. And about 29% of the workload
in the data center is Long Short Term Memory: for example, speech recognition or machine translation, I suspect. Remember, we just saw there are 90 peak teraops per second. But what number of teraops per second is actually achieved? The roofline model is a basic tool to
measure the bottleneck of a computer system. Whether you are bottlenecked
by the arithmetic or you are bottlenecked by
the memory bandwidth. It's like if you have a bucket, the lowest part of the
bucket determines how much water the bucket can hold. So in this region, you are bottlenecked by the memory bandwidth. The x-axis is the arithmetic intensity, which is the number of floating point operations per byte: the ratio between the computation and the memory traffic. And the y-axis is the actual
attainable performance. Here is the peak performance for example. When you do a lot of operation
after you fetch a single piece of data, if you
can do a lot of operation on top of it, then you are
bottlenecked by the arithmetic. But after you fetch a lot
of data from the memory, but you just do a tiny
little bit of arithmetic, then you will be bottlenecked
by the memory bandwidth. So how much you can fetch
from memory determines how much real performance you can get. And remember, there is a ratio: the turning point of the roofline, where the two regions meet, is set by the actual memory bandwidth of your system.
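A tiny sketch of that roofline calculation (my own illustration; the arithmetic intensities below are made-up examples, and only the 92 teraops peak and 30 GB/s bandwidth come from the talk):

```python
def attainable_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    """Roofline: attainable performance is the smaller of the compute peak
    and what the memory system can feed (bandwidth times intensity)."""
    return min(peak_tops, bandwidth_gb_s * ops_per_byte / 1e3)

peak, bw = 92.0, 30.0   # rough TPU numbers from the talk: int8 peak, DDR3 bandwidth
for name, intensity in [("low-reuse MLP", 100), ("LSTM", 200), ("CNN", 3000)]:
    print(name, attainable_tops(peak, bw, intensity), "Tops/s")
# Low arithmetic intensity lands in the bandwidth-bound region (a few Tops/s);
# only the high-reuse CNN gets near the 92 Tops/s compute roof.
```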
So let's see where the TPU lives on this plot. The TPU's peak performance is really high, about 90 teraops per second, and the convolution nets pretty much saturate that peak. But there are a lot of neural networks with a utilization of less than 10%, meaning the 90 teraops per second actually comes out to about three to 12 teraops per second in the real case. But why is it like that? The reason is that, in order to
have those real-time guarantee that the user not wait for
too long, you cannot batch a lot of user's images
or speech voice data at the same time. So as a result, for those
fully connect layers, they have very little reuse,
so they are bottlenecked by the memory bandwidth. For those convolution neural
nets, for example this one, this blue one, that
achieve 86, which is CNN0. The ratio between the ops and the number of memory is the highest. It's pretty high, more than
2,000 compared with other multilayer perceptron or
long short term memory the ratio is pretty low. So this figure compares, this
is the TPU and this one is the CPU, this is the GPU. Here is memory bandwidth,
the peak memory bandwidth at a ratio of one here. So TPU has the highest memory bandwidth. And here is where are
these neural networks lie on this curve. So the asterisk is for the TPU. It's still higher than other dots, but if you're not comfortable
with this log scale figure, this is what it's like
putting it in linear roofline. So pretty much everything
disappeared except for the TPU results. So still, all these lines,
although they are higher than the CPU and GPU,
it's still way below the theoretical peak operations per second. So as I mentioned before,
it is really bottlenecked by the low latency requirement
so that it can have a large batch size. That's why you have low
operations per byte. And how do you solve this problem? You want to have less
number of memory footprint so that it can reduce the
memory bandwidth requirement. One solution is to compress
the model and the challenge is how do we build a hardware
that can do inference directly on the compressed model? So I'm going to introduce my
design of EIE, the Efficient Inference Engine, which
deals with those sparse and the compressed model to
save the memory bandwidth. And the rule of thumb, like
we mentioned before, is to take advantage of sparsity first: anything times zero is zero, so don't store it, don't compute on it. And the second idea is, you don't
need that much full precision, but you can approximate it. So by taking advantage
of the sparse weight, we get about a 10x saving in
the computation, 5x less memory footprint. The 2x difference is
due to index overhead. And by taking advantage
of the sparse activations, meaning that after ReLU, if an activation is zero, we just ignore it, you save another 3x of computation. And then by the weight sharing mechanism, you can use four bits to
represent each weight rather than 32 bit. That's another eight times
saving in the memory footprint. So this is physically, logically
how the weights are stored. A four by eight matrix,
and this is how physically they are stored. Only the non-zero weights are stored. So you don't need to store those zeroes. You'll save the bandwidth
fetching those zeroes. And I'm also using relative indices to further reduce the memory overhead.
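As a simplified sketch of this compressed, zero-skipping computation (my own CSR-style illustration; the real EIE uses 4-bit relative indices and shared weights rather than full values):

```python
import numpy as np

def to_csr(W):
    """Keep only the non-zero weights of each row, plus their column indices."""
    values, cols, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        cols.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(cols), np.array(row_ptr)

def sparse_matvec(values, cols, row_ptr, x):
    """y = W @ x, touching only non-zero weights and non-zero activations."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if x[cols[k]] != 0.0:        # skip zero activations too
                y[i] += values[k] * x[cols[k]]
    return y

W = np.random.randn(4, 8) * (np.random.rand(4, 8) > 0.9)   # ~90% zero weights
x = np.random.randn(8) * (np.random.rand(8) > 0.3)         # some zero activations
vals, cols, ptr = to_csr(W)
print(np.allclose(sparse_matvec(vals, cols, ptr, x), W @ x))   # True
```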
In the computation, as this figure shows, we run the multiplication only on the non-zeros: the activation is broadcast to the non-zero weights, and if a weight is zero we skip it; if it's non-zero, we do the multiplication in that cycle. So the idea is that anything
multiplied by zero is zero. So this is a little complicated, I'm going to go very quickly. I'm going to have a lookup
table that decodes the four bit weight into a 16 bit weight, and the four bit relative index is passed through an address accumulator to get the 16 bit absolute index. And this is what the hardware architecture looks like at a high level. Feel free to refer
to my paper for detail. Okay speedup. So using such efficient
hardware architecture and also model compression,
this is the original result we have seen for
CPU, GPU, mobile GPU. Now EIE is here. 189 times faster than the
CPU and about 13 times faster than the GPU. So this is the energy
efficiency on the log scale, it's about 24,000x more
energy efficient than a CPU and about 3000x more energy
efficient than a GPU. It means for example,
previously, if your battery could last for one hour, now it could last 3,000 hours, for example. Now, you might say an ASIC is always better than CPUs and GPUs because it's customized hardware. So this compares EIE with its peer ASICs, for example DaDianNao and TrueNorth. It has better throughput,
better energy efficiency by order of magnitude,
compared with other ASICs. Not to mention that CPU, GPU and FPGAs. So we have covered half of the journey. We mentioned inference, we pretty much covered everything for inference. Now we are going to switch
gear and talk about training. How do we train neural
networks efficiently, how do we train it faster? So again, we are starting
with algorithm first, efficient algorithms
followed by the hardware for efficient training. So for efficient training
algorithms, I'm going to mention four topics. The first one is parallelization,
and then mixed precision training, which was just
released about one month ago and at NVIDIA GTC,
so it's fresh knowledge. And then model distillation,
followed by my work on Dense-Sparse-Dense training,
or a better regularization technique. So let's start with parallelization. This figure is one that anyone in the hardware community is very familiar with. As time goes by, what is the trend? The number of transistors keeps increasing, but single threaded performance has plateaued in recent years, and the frequency has also plateaued: because of the power constraint, it stopped scaling. And the interesting thing is that the
number of cores is increasing. So what we really need
to do is parallelization. How do we parallelize the
problem to take advantage of parallel processing? Actually there are a lot of
opportunities for parallelism in deep neural networks. For example, we can do data parallel. For example, feeding two
images into the same model and run them at the same time. This doesn't affect
latency for a single input; it doesn't make it shorter, but it makes the batch size larger: basically, if you have four machines, the effective batch size becomes four times what it was before. And it requires
coordinated weight updates. For example, in this paper from Google, there is a parameter server as the master and a number of workers, each running on its own piece of the training data, sending gradients up to the parameter server and individually getting the updated weights back. That's how data parallelism is handled.
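Here is a toy sketch of that synchronous data-parallel update (my own illustration on a linear model, not Google's system): each worker computes a gradient on its shard and the server applies the averaged update:

```python
import numpy as np

def worker_gradient(w, x_shard, y_shard):
    """Mean-squared-error gradient for a linear model on one worker's shard."""
    return x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

def parameter_server_step(w, shards, lr=0.1):
    grads = [worker_gradient(w, x, y) for x, y in shards]   # workers, in parallel
    return w - lr * np.mean(grads, axis=0)                  # coordinated update

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(256, 10)), rng.normal(size=10)
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]             # four "machines"
w = np.zeros(10)
for _ in range(100):
    w = parameter_server_step(w, shards)
print(np.abs(w - true_w).max())                             # close to zero
```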
Another idea is model parallelism: you can split your model and hand the pieces to different processors or different threads. For example, there's this image,
you want to run convolution on this image, which is a six-dimensional for loop. What you can do is cut the input image into two by two blocks, so that each thread or each processor handles one fourth of the image, although there's a small halo region in between that you have to take care of. You can also parallelize over the output or input feature maps. And for the fully connected layers, how do we parallelize the model? It's even simpler: you can cut the model in half and hand the halves to different threads. And the third idea is that you can even do hyper-parameter parallelism: for example, you can tune
your learning rate, your weight decay for different machines for those coarse-grained parallelism. So there are so many
alternatives you have to tune. Small summary of the parallelism. There are lots of parallelisms
in deep neural networks. For example, with data
parallelism, you can run multiple training images, but you
cannot have unlimited number of processors because you
are limited by batch size. If it's too large, stochastic gradient descent becomes gradient descent, that's not good. You can also run the model parallelism. Split the model, either
by cutting the image or cutting the convolution weights. Either cutting the image or cutting the fully connected layers. So it's very easy to get 16
to 64 GPUs training one model in parallel, having very good speedup. Almost linear speedup. Okay, next interesting
thing, mixed precision with FP16 or FP32. So remember in the
beginning of this lecture, I had a chart showing the
energy and area overhead for a 16 bit versus a 32 bit. Going from 32 bit to 16 bit,
you save about 4x the energy and 4x the area. So can we train a deep
neural network with such low precision with floating point
16 bit rather than 32 bit? It turns out we can do that partially. By partially, I mean we
need FP32 in some places. And where are those places? So we can do the multiplication
in 16 bit as input. And then we have to do the summation in 32 bit accumulation. And then convert the result
to 32 bit to store the weight. So that's where the mixed
precision comes from. So for example, we have
a master weight stored in floating point 32, we down
converted it to floating point 16 and then we do the
feed forward with 16 bit weight, 16 bit activation,
we get a 16 bit activation here in the end when we
are doing back propagation of the computation is also done
with floating point 16 bit. Very interesting here, for
the weights we get a floating point 16 bit gradient here for the weight. But when we are doing the
update, so W plus learning rate times the gradient,
that operation has to be done in 32 bit. That's where the mixed
precision is coming from. And see there are two
colors, which here is 16 bit, here is the 32 bit. That's where the mixed
precision comes from. So does such low precision
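A minimal sketch of that mixed precision loop on a toy linear model (my own illustration, not NVIDIA's code): FP32 master weights, FP16 forward and backward passes, FP32 weight update:

```python
import numpy as np

def mixed_precision_step(w_master_fp32, x, target, lr=0.1):
    w16 = w_master_fp32.astype(np.float16)        # down-convert the master copy
    x16 = x.astype(np.float16)
    pred = x16 @ w16                              # FP16 forward pass
    err16 = (pred.astype(np.float32) - target).astype(np.float16)
    grad16 = x16.T @ err16                        # FP16 backward pass
    # The weight update itself is done in FP32 on the master weights.
    return w_master_fp32 - lr * grad16.astype(np.float32) / len(target)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8)).astype(np.float32)
true_w = rng.normal(size=8).astype(np.float32)
target = x @ true_w
w = np.zeros(8, dtype=np.float32)
for _ in range(200):
    w = mixed_precision_step(w, x, target)
print(np.abs(w - true_w).max())                   # small, despite the FP16 math
```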
So does such low precision sacrifice the prediction accuracy of your model? This is a figure from NVIDIA, just released a couple of weeks ago actually. Thanks to Paulius for giving me the slide. The convergence between
floating point 32 and the tensor op version, which is basically the mixed precision training, is actually pretty much the same. If you zoom in a little bit, they are pretty much the same. And for ResNet, the mixed
precision sometimes behaves a little better than the
full precision weight. Maybe because of noise. But in the end, after you
train the model, this is the result of AlexNet,
Inception V3, and ResNet-50 with FP32 versus FP16
mixed precision training. The accuracy is pretty much the same for these two methods. A little bit worse, but not by too much. So having talked about the
mixed precision training, the next idea is to train
with model distillation. For example, you can have
multiple neural networks, GoogLeNet, VGGNet, and ResNet for example. And the question is, can we take advantage of these different models? Of course we can do a model ensemble, but can we utilize them as teachers, to teach a small junior neural network and have it perform as well as the senior neural networks? So this is the idea: you have multiple large, powerful senior neural networks teaching this student model, and hopefully it can get better results. And the idea for doing that
is, instead of using this hard label, for example for
car, dog, cat, the probability for dog is 100%, but the
output of the geometric ensemble of those large
teacher neural networks maybe the dog has 90%
and the cat about 10%, and the magic happens here: you want a softened label. For example, the dog is 30% and the cat is 20%; the dog is still higher than the cat, so the prediction is still correct, but you use this soft label to train the student neural network rather than the hard label. And mathematically, you control how soft you make it with a temperature in the softmax.
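As a tiny illustration of those softened targets (my own made-up logits, not numbers from the talk), raising the softmax temperature spreads the teacher's probability mass while keeping the ranking:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([0.5, 5.0, 3.0])     # [car, dog, cat]
print(softmax(teacher_logits, T=1.0))          # hard-ish: dog dominates
print(softmax(teacher_logits, T=5.0))          # softened: dog still wins, cat visible

# The student is then trained against softmax(teacher_logits, T) instead of
# the one-hot label, typically with the same temperature on its own logits.
```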
And the result is that, starting with a trained model that classifies 58.9% of the test frames correctly, the new model converges to 57% while training on only 3% of the data. So that's the magic of model distillation using soft labels. And the last idea is from my recent paper, using better regularization
to train deep neural nets. We have seen these two figures before. We pruned the neural
network, having less number of weights, but have the same accuracy. Now what I did is to
recover and to retrain those weights shown in red
and make everything train out together to increase
the model capacity after it is trained at a low dimensional space. It's like you learn the trunk
first and then gradually add those leaves and
learn everything together. It turns out, on ImageNet it
performs relatively about 1% to 4% absolute improvement of accuracy. And is also general purpose,
works on long-short term memory and also recurrent neural
nets collaborated with Baidu. So I also open sourced
this special training model on the DSD Model Zoo, where
there are trained, all these models, GoogleNet, VGG,
ResNet, and also SqueezeNet, and also AlexNet. So if you are interested,
feel free to check out this Model Zoo and compare it
with the Caffe Model Zoo. Here's some examples on
dense-spare-dense training helps with image capture. For example, this is a
very challenging figure. The original baseline of
neural talk says a boy in a red shirt is climbing a rock wall. And the sparse model says
a young girl is jumping off a tree, probably
mistaking the hair with either the rock or the tree. But then sparse-dense
training by using this kind of regularization on a low
dimensional space, it says a young girl in a pink shirt
is swinging on a swing. And there are a lot of examples
due to the limit of time, I will not go over them one by one. For example, a group of
people are standing in front of a building, there's no building. A group of people are walking in the park. Feel free to check out the
paper and see more interesting results. Okay finally, we come to
hardware for efficient training. How to we take advantage of the algorithms we just mentioned. For example, parallelism,
mixed precision, how are the hardware designed to actually take advantage of such features. First GPUs, this is the
Nvidia PASCAL GPU, GP100, which was released last year. So it supports up to 20 Teraflops on FP16. It has 16 gigabytes of
high bandwidth memory. 750 gigabytes per second. So remember, computation
and memory bandwidth are the two factors determines
your overall performance. Whichever is lower, it will suffer. So this is a really high
bandwidth, 700 gigabytes compared with DDR3 is just 10
or 30 gigabytes per second. Consumes 300 Watts and it's done in 16 nanometer process and have a 160 gigabytes
per second NV Link. So remember we have
computation, we have memory, and the third thing is the communication. All three factors has to
be balanced in order to achieve a good performance. So this is very powerful,
but even more exciting, just about a month ago,
Jensen released the newest architecture called the Volta GPUs. And let's see what is
inside the Volta GPU. Just released less than a
month ago, it has 15 teraflops of FP32, and what is new here is 120 Tensor teraops, specifically designed for deep learning. We'll cover later what the Tensor Core is and where this 120 comes from. And rather than 750
gigabytes per second, this year, the HBM2, they are
using 900 gigabytes per second memory bandwidth. Very exciting. And 12 nanometer process has
a die size of more than 800 millimeters square. A really large chip and
supported by 300 gigabytes per second NVLink. So what's new in Volta, the
most interesting thing for us for deep learning, is this
thing called Tensor Core. So what is a Tensor Core? Tensor Core is actually
an instruction that can do the four by four matrix
times a four by four matrix. The fused FMA stands Fused
Multiplication and Add in this mixed precision operation. Just in one single clock cycle. So let's discern for a little
bit what does this mean. So mixed precision is exactly
as we mentioned in the last chapter, so we are having
FP16 for the multiplication, but for accumulation, we
are doing it with FP32. That's where the mixed
precision comes from. So let's say how many
operations is that? If it's four by four by four, that's 64 multiplications in just one single cycle. That's a 12x increase in throughput for Volta compared with Pascal, which was released just last year. So this is the result for
matrix multiplication on different sizes. The speedup of Volta over
Pascal is roughly 3x faster doing these matrix multiplications. What we care more is not
only matrix multiplication but actually running the deep neural nets. So both for training and for inference. And for training on
ResNet-50, by taking advantage of this Tensor Core in this V100, it is 2.4x faster than
the P100 using FP32. So on the right hand side,
it compares the inference speedup, given a 7 microsecond
latency requirement. What is the number of images
per second it can process? It has a measurement of throughput. Again, the V100 over
P100, by taking advantage of the Tensor Core, is
3.7 faster than the P100. So this figure gives roughly
an idea, what is a Tensor Core, what is an integer unit, what
is a floating point unit. So this whole figure is a single SM stream multiprocessor. So SM is partitioned into
four processing blocks. One, two, three, four, right? And in each block there
are eight FP64 cores here and 16 FP32 and 16 INT32
cores here, units here. And then there are two of
the new mixed precision Tensor cores specifically
designed for deep learning. And also there are the one
warp scheduler, dispatch unit and Register File, as before. So what is new here is
the Tensor core unit here. So here is a figure comparing
the recent generations of Nvidia GPUs from Kepler to Maxwell to Pascal to Volta. We can see everything
is keeping improving. For example, the boost clock
has been increased from about 800 MHz to 1.4 GHz. And from the Volta generation
there begins to have the Tensor core units here,
which never existed before. Up through Maxwell, the GPUs used GDDR5, and from the Pascal GPU on, HBM, the high bandwidth memory, came into place: 750 gigabytes per second there, and now 900 gigabytes per second, compared with 30 gigabytes per second for DDR3. Memory size actually didn't increase by too much, and the power consumption also remains roughly the same. But given the increase in
computation, you can fit them in the fixed power envelope
that's still an exciting thing. And the manufacturing process
is actually improving from 28 nanometer, 16 nanometer,
all the way to 12 nanometer. And the chip area are also increasing to 800 millimeter-squared,
that's really huge. So, you may be interested
in the comparison of the GPU with the TPU, right? So how do they compare with each other? So in the original TPU paper, TPU actually designed
roughly in the year of 2015, and this is comparison
of the Pascal P40 GPU released in 2016. So, TPU, the power consumption is lower, is larger on chip memory of 24 megabytes, really large on-chip SRAM
managed by the software. And then both of them
support INT8 operations, and for inferences per second under a roughly 10 millisecond latency bound, if the TPU is 1x, the P40 is about 2x. So, just last week at Google I/O, a new nuclear bomb landed on Earth: the Google Cloud TPU. So now the TPU not only supports inference but also supports training. There is very limited
information we can get beyond this Google Blog. So their Cloud TPU delivers
up to 180 teraflops to train and run machine learning models. And multiple Cloud TPUs are assembled into a TPU pod, which is built with 64 second generation TPUs and delivers up to 11.5 petaflops of machine learning acceleration. In the Google blog, they mentioned that one of the large scale Google translation models used
to take a full day to train on 32 of best commercially-available
GPUs, probably P40 or P100, maybe. And now it trains to the same accuracy, just within one afternoon,
with just 1/8 of a TPU pod, which is pretty exciting. Okay, so as a little wrap-up. We covered a lot of stuff, we've mentioned the four dimension space
of algorithm and hardware, inference and training, we
covered the algorithms for inference, for example,
pruning and quantization, Winograd Convolution, binary, ternary, weight sharing, for example. And then the hardware for
the efficient inference. For example, the TPU, that take advantage of INT8, integer 8. And also my design of EIE
accelerator that take advantage of the sparsity, anything
multiplied by zero is zero, so don't store it, don't compute on it. And also the efficient algorithm
for training, for example, how do we do parallelization
and the most recent research on how do we use mixed precision
training by taking advantage of FP16 rather than FP32 to do training which is four times saving the energy and four times saving in the area, which doesn't quite sacrifice
the accuracy you'll get from the training. And also Dense-Sparse-Dense
training using better regularization sparse regularization, and also
the teacher-student model. You have multiple teacher on
your network and have a small student network that you
can distill the knowledge from the teacher in your
network by a temperature. And finally we covered the
hardware for efficient training and introduced two nuclear bombs. One is the Volta GPU, the
other is the TPU version two, the Cloud TPU and also
the amazing Tensor cores in the newest generation of Nvidia GPUs. And we also revealed the
progression of a wide range, the recent Nvidia GPUs
from the Kepler K40, that's actually when
I started my research, what we used in the beginning, all the way to and then K40, M40, and then Pascal and then
finally the exciting Volta GPU. So every year there is a
nuclear bomb in the spring. Okay, a little look ahead in the future. So in the future of the city
we can imagine there are a lot of AI applications using
smart society, smart care, IOT devices, smart retail,
for example, the Amazon Go, and also smart home, a lot of scenarios. And it poses a lot of challenges
on the hardware design that requires the low
latency, privacy, mobility and energy efficiency. You don't want your battery
to drain very quickly. So it's both challenging
and very exciting era for the code design for
both the machine learning deep neural network model architectures and also the hardware architecture. So we have moved from
PC era to mobile era. Now we are in the AI-First era, and hope you are as excited
as I am for this kind of brain-inspired cognitive
computing research. Thank you for your attention,
I'm glad to take questions. [applause] We have five minutes. Of course. - [Student] Can you commercialize
the deep architecture? - The architecture, yeah, some
of the ideas are pretty good. I think there's opportunity. Yeah. Yeah. The question is, what can we
do to make the hardware better? Oh, right, the question is about the challenges and the opportunities for those small embedded devices running deep neural networks, or AI algorithms in general. Yeah, so those are the
algorithm I discussed in the beginning about inference. Here. These are the techniques
that can enable such inference or AI running
on embedded devices, by having less number of
weights, fewer bits per weight, and also quantization,
low rank approximation. The small matrix, same
accuracy, even going to binary, or ternary weights having just two bits to do the computation rather
than 16 or even 32 bit and also the Winograd Transformation. Those are also the enabling
algorithms for those low-power embedded devices. Okay, the question is, if it's
binary weight, the software developers may be not able
to take advantage of it. There is a way to take
advantage of binary weight. So in one register there are 32 bit. Now you can think of it
as a 32-way parallelism. Each bit is a single operation. So say previously we
have 10 ops per second, now you get 320 ops per second. You can do bitwise operations, for example XOR operations, so one register, one operation becomes 32 operations. There is a paper called XNOR-Net; they did a very amazing implementation on a Raspberry Pi using this feature to do real-time detection,
very cool stuff. Yeah. Yeah, so the trade-off is
always so the power area and performance in general,
all the hardware design have to take into account
the performance, the power, and also the area. When machine learning
comes, there's a fourth figure of merit which is the accuracy. What is the accuracy? And there is a fifth one
which is programmability. So how general is your hardware? For example, if Google just
want to use that for AI and deep learning, it's totally fine that we can have a fully
very specialized architecture just for deep learning
to support convolution, multi-layered perception,
long-short-term memory, but GPUS, you also want
to have support for those scientific computing
or graphics, AR and VR. So that's a difference, first of all. And TPU basically is a ASIC, right? It's a very fixed function
but you can still program it with those coarse instructions
so people from Google roughly designed those coarse
granularity instruction. For example, one instruction
just load the matrix, store a matrix, do convolutions, do matrix multiplications. Those coarse-grain instructions and they have a software-managed memory, also called a scratchpad. It's different from
cache where it determines where to evict something
from the cache, but now, since you know the computation pattern, there's no need to do out-of-order execution, to do branch prediction, no such things. Everything is determined,
so you can take the multi of it and maintain a fully
software-managed scratchpad to reduce the data movement
and remember, data movement is the key for reducing
the memory footprint and energy consumption. So, yeah. Mobilia and Nobana architectures
actually I'm not quite familiar, didn't prepare those slides, so, comment it a little bit later, no. Oh, yeah, of course. Those are always and
can certainly be applied to low-power embedded devices. If you're interested, I can show you a... Whoops. Some examples of, oops. Where is that? Of my previous projects
running deep neural nets. For example, on a drone,
this is using a Nvidia TK1 mobile GPU to do real-time
tracking and detection. This is me playing my nunchaku. Filmed by a drone to do the
detection and tracking. And also, this FPGA doing
the deep neural network. It's pretty small. This large, doing the face-alignment and detecting the eyes,
the nose and the mouth, at a pretty high framerate. Consuming only three watts. This is a project I did
at Facebook doing the deep neural nets on the mobile phone to do image classification, for
example, it says it's a laptop, or you can feed it with
an image and it says it's a selfie, has person
and the face, et cetera. So there's lots of opportunity for those embedded or mobile-deployment
of deep neural nets. No, there is a team doing that, but I cannot comment too much, probably. There is a team at Google
doing that sort of stuff, yeah. Okay, thanks, everyone. If you have any questions,
feel free to drop me a e-mail.