Have you ever wondered what’s happening inside
your computer when you load a program or video game? Well, millions of operations are happening,
but perhaps the most common is simply copying data from a solid-state drive, or SSD, into dynamic
random-access memory or DRAM. An SSD stores all the programs and data for long-term storage,
but when your computer wants to use that data, it has to first move the appropriate
files into DRAM, which takes time, hence the loading bar. Because your CPU works
only with data after it’s been moved to DRAM, it’s also called working memory or main memory.
Your desktop uses both SSDs and DRAM because solid-state drives permanently
store data in massive 3D arrays composed of a trillion or so memory cells, yielding terabytes of
storage, whereas DRAM temporarily stores data in 2D arrays composed of billions of tiny capacitor
memory cells yielding gigabytes of working memory. Accessing any section of cells in the massive
SSD array and reading or writing data takes about 50 microseconds whereas reading or
writing from any DRAM capacitor memory cell takes about 17 nanoseconds, which is 3000
times faster. For comparison, a supersonic jet going at Mach 3 is around 3000 times faster
than a moving tortoise. So, the speed of 17 nanosecond DRAM versus 50 microsecond SSD is
like comparing a supersonic jet to a tortoise. However, speed is just one factor. DRAM is
limited to a 2D array and temporarily stores one bit per memory cell. For example, this stick
of DRAM with 8 chips holds 16 gigabytes of data, whereas a solid-state drive of a smaller
size can hold 2 terabytes of data, more than 100 times that of DRAM. Additionally,
DRAM requires power to continuously store and refresh the data held in its capacitors.
Therefore, computers use both SSDs and DRAM. By spending a few seconds of loading time to copy data from the SSD to the DRAM, and by prefetching, which is the process of moving data before it’s needed, your computer can store terabytes of data on the SSD while accessing the programs that were preemptively copied into DRAM in just a few nanoseconds.
For example, many video games have a loading time to start up the game itself, and then a
separate loading time to load a save file. During the process of loading a save file, all
the 3D models, textures, and the environment of your game state are moved from the SSD into DRAM
so any of it can be accessed in a few nanoseconds, which is why video games have DRAM capacity
requirements. Just imagine, without DRAM, playing a game would be 3,000 times slower.
We covered solid-state drives in other videos, so in this video, we’re going to take a deep
dive into this 16-gigabyte stick of DRAM. First, we’ll see exactly how the CPU communicates
and moves data from an SSD to DRAM. Then we’ll open up a DRAM microchip and see how
billions of memory cells are organized into banks and how data is written to and read from
groups of memory cells. In the process, we’ll dive into the nanoscopic structures inside
individual memory cells and see how each capacitor physically stores 1 bit of data.
Finally, we’ll explore some breakthroughs and optimizations such as the burst buffer and folded
DRAM layouts that enable DRAM to move data around at incredible speeds. A few quick notes.
First, you can find similar DRAM chips inside GPUs, Smartphones, and many other devices, but
with different optimizations. As examples, GPU DRAM or VRAM, located all around the
GPU chip, has a larger bandwidth and can read and write simultaneously, but operates at
a lower frequency, and DRAM in your smartphone is stacked on top of the CPU and is optimized for
smaller packaging and lower power consumption. Second, this video is sponsored by
Crucial. Although they gave me this stick of DRAM to model and use in the
video, the content was independently researched and not influenced by them.
Third, there are faster memory structures in your CPU called cache memory and even faster
registers. All these types of memory create a memory hierarchy, with the main trade-off
being speed versus capacity while keeping prices affordable to consumers and optimizing
the size of each microchip for manufacturing. Fourth, you can see how much of
your DRAM is being utilized by each program by opening your computer’s
resource monitor and clicking on memory. Fifth, there are different generations of DRAM,
and we’ll explore DDR5. Many of the key concepts that we explain apply to prior generations,
although the numbers may be different. Sixth, 17 nanoseconds is incredibly fast!
Electrical signals travel at around 1 foot per nanosecond, so 17 nanoseconds is about the time it takes a signal to travel across a room. Finally, this video is rather long as it covers
a lot of what there is to know about DRAM. We recommend watching it first at 1.25x speed, and then a second time at 1.5x speed to fully comprehend this complex technology. Stick around because this is going to be an incredibly detailed video.
To start, a stick of DRAM is also called a Dual Inline Memory Module or DIMM and there are 8
DRAM chips on this particular DIMM. On the motherboard, there are 4 DRAM slots, and when
plugged in, the DRAM is directly connected to the CPU via 2 memory channels that run through
the motherboard. Note that the left two DRAM slots share these memory channels, and the right
two share a separate channel. Let’s move to look inside the CPU at the processor. Along
with numerous cores and many other elements, we find the memory controller which manages
and communicates with the DRAM. There’s also a separate section for communicating with SSDs
plugged into the M.2 slots and with SSDs and hard drives plugged into SATA connectors. Using
these sections, along with data mapping tables, the CPU manages the flow of data from
the SSD to DRAM, as well as from DRAM to cache memory for processing by the cores.
Let’s move back to see the memory channels. For DDR5 each memory channel is divided into two
parts, Channel A and Channel B. These two memory channels A and B independently transfer 32 bits at
a time using 32 data wires. Using 21 additional wires, each memory channel carries an address specifying where to read or write data, and, using 7 control signal wires, commands are relayed.
The addresses and commands are sent to and shared by all 4 chips on the memory channel which
work in parallel. However, the 32-bit data lines are divided among the chips and thus each
chip only reads or writes 8 bits at a time. Additionally, power for DRAM is
supplied by the motherboard and managed by these chips on the stick itself.
Next, let’s open and look inside one of these DRAM microchips. Inside the exterior packaging,
we find an interconnection matrix that connects the ball grid array at the bottom with the die
which is the main part of this microchip. This 2 gigabyte DRAM die is organized into 8 bank groups
composed of 4 banks each, totaling 32 banks. Within each bank is a massive array, 65,536 memory
cells tall by 8192 cells across, essentially rows and columns in a grid, with tens of thousands of
wires, and supporting circuitry running outside each bank. Instead of looking at this die, we’re
going to transition to a functional diagram, and then reorganize the banks and bank groups.
In order to access 17 billion memory cells, we need a 31-bit address. 3 bits are used to
select the appropriate bank group, then 2 bits to select the bank. Next, 16 bits of the address
are used to determine the exact row out of 65 thousand. Because this chip reads or writes 8
bits at a time, the 8192 columns are grouped by 8 memory cells, all read or written at a time,
or ‘by 8’, and thus only 10 bits are needed for the column address. One optimization is that this 31-bit address is separated into two parts and sent using only 21 wires: first the bank group, bank, and row address, and then, after that, the column address.
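To make that concrete, here’s a quick sketch in Python of how the 31-bit address could be carved into those fields. The bit ordering is our own illustrative assumption; real memory controllers remap address bits in vendor-specific ways.

```python
# Hypothetical field layout: [bank group:3][bank:2][row:16][column:10].
def split_address(addr: int):
    assert 0 <= addr < 2**31
    bank_group = (addr >> 28) & 0b111    # 3 bits -> 1 of 8 bank groups
    bank       = (addr >> 26) & 0b11     # 2 bits -> 1 of 4 banks per group
    row        = (addr >> 10) & 0xFFFF   # 16 bits -> 1 of 65,536 rows
    column     = addr & 0x3FF            # 10 bits -> 1 of 1,024 groups of 8 cells
    return bank_group, bank, row, column

# Example: bank group 3, bank 1, row 27,524, column group 598
print(split_address((3 << 28) | (1 << 26) | (27524 << 10) | 598))
```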
Next, we’ll look inside these physical memory cells, but first, let’s briefly talk about how these structures are manufactured, as well as this video’s sponsor. This incredibly complicated die,
also called an integrated circuit, is manufactured on 300-millimeter silicon wafers,
2500ish dies at a time. On each die are billions of nanoscopic memory cells that are fabricated
using dozens of tools and hundreds of steps in a semiconductor fabrication plant or fab. This
one was made by Micron which manufactures around a quarter of the world’s DRAM, including both
Nvidia’s and AMD’s VRAM in their GPUs. Micron also has its own product line of DRAM and SSDs under
the brand Crucial which, as mentioned earlier, is the sponsor of this video. In addition
to DRAM, Micron is one of the world’s leading suppliers of solid-state drives such as this
Crucial P5 Plus M.2 NVMe SSD. By installing your operating system and video games on a Crucial
NVMe solid-state drive, you’ll be sure to have incredibly fast loading times and smooth gameplay,
and if you do video editing, make sure all those files are on a fast SSD like this one as well.
This is because loading speed is predominantly limited by the SSD or hard drive where the files are stored. For example, this hard drive can only transfer
data at around 150 megabytes a second whereas this Crucial NVMe SSD can transfer data at a
rate of up to 6,600 megabytes a second, which, for comparison is the speed of a moving tortoise
versus a galloping horse. By using a Crucial NVMe SSD, loading a video game that requires gigabytes
of DRAM is reduced from a minute or more down to a couple seconds. Check out the Crucial NVMe
SSDs using the link in the description below. Let’s get back to the details of how DRAM works
and zoom in to explore a single memory cell situated in a massive array. This memory cell is
called a 1T1C cell and is a few dozen nanometers in size. It has two parts, a capacitor to store
one bit of data in the form of electrical charges or electrons and a transistor to access and read
or write data. The capacitor is shaped like a deep trench dug into silicon and is composed of
two conductive surfaces separated by a dielectric insulator or barrier just a few atoms thick, which
stops the flow of electrons but allows electric fields to pass through. If this capacitor
is charged up with electrons to 1 volt, it’s a binary 1, and if no charges are present
and it’s at 0 volts, it’s a binary 0, and thus this cell only holds one bit of data. Designs
of capacitors are constantly evolving but in this trench capacitor, the depth of the silicon is
utilized to allow for larger capacitive storage, while taking up as little area as possible.
Next let’s look at the access transistor and add in two wires. The wordline wire connects to
the gate of the transistor while the bitline wire connects to the other side of the transistor’s
channel. Applying a voltage to the wordline turns on the transistor, and, while it’s on,
electrons can flow through the channel thus connecting the capacitor to the bitline. This
allows us to access and charge up the capacitor to write a 1 or discharge the capacitor to write
a 0. Additionally, we can read the stored value in the capacitor by measuring the amount of
charge. However, when the wordline is off, the transistor is turned off, and the capacitor
is isolated from the bitline thus saving the data or charge that was previously written. Note
that because this transistor is incredibly small, only a few dozen nanometers wide, electrons slowly
leak across the channel, and thus over time the capacitor needs to be refreshed to recharge
the leaked electrons. We’ll cover exactly how refreshing memory cells works a little later.
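As a rough mental model before we zoom out, here’s a toy cell in Python. The leak rate here is completely made up; real leakage varies with temperature and from cell to cell.

```python
# Toy 1T1C cell: the wordline gates access; the isolated capacitor slowly leaks.
class Cell:
    def __init__(self):
        self.voltage = 0.0                    # capacitor voltage; 1 V means binary 1

    def write(self, bit: int, wordline_on: bool):
        if wordline_on:                       # transistor on: bitline drives the capacitor
            self.voltage = 1.0 if bit else 0.0

    def leak(self, ms_elapsed: float):
        # Invented leak rate for illustration; leakage is why refresh exists
        self.voltage = max(0.0, self.voltage - 0.005 * ms_elapsed)

    def read(self, wordline_on: bool):
        if wordline_on:                       # in reality the sense amp makes this call
            return 1 if self.voltage > 0.5 else 0
```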
As mentioned earlier, this 1T1C memory cell is one of 17 billion inside this single die and is
organized into massive arrays called banks. So, let’s build a small array for illustrative
purposes. In our array, each of the wordlines is connected in rows, and then the bitlines are
connected in columns. Wordlines and bitlines are on different vertical layers so one can
cross over the other, and they never touch. Let’s simplify the visual and use symbols for the
capacitors and the transistors. Just as before, the wordlines connect to each transistor’s control
gate in rows, and then all the bitlines in columns connect to the channel opposite each capacitor.
As a result, when a wordline is active, all the capacitors in only that row are
connected to their corresponding bitlines, thereby activating all the memory cells in that
row. At any given time only one wordline is active because, if more than one wordline were
active, then multiple capacitors in a column would be connected to the bitline and the data
storage functionalities of these capacitors would interfere with one another, making them useless.
As mentioned earlier, within a single bank there are 65,536 rows and 8,192 columns and the 31-bit
address is used to activate a group of just 8 memory cells. The first 5 bits select the bank group and bank, and the next 16 bits are sent to a row decoder to activate a single row. For example, this
binary number turns on the wordline row 27,524, thus turning on all transistors in that row and
connecting the 8,192 capacitors to their bitlines, while at the same time the other 65
thousandish wordlines are all off. Here’s the logic diagram for a simple decoder.
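In software terms the decoder’s job is trivial, which hides how much physical wiring it takes. A minimal sketch:

```python
# One-hot row decoder: a 16-bit row address raises exactly one of 65,536 wordlines.
def row_decoder(row_addr: int, n_rows: int = 65536):
    return [int(i == row_addr) for i in range(n_rows)]

wordlines = row_decoder(27524)   # only index 27,524 is 1; every other wordline stays 0
```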
The remaining 10 bits of the address are sent to the column multiplexer. This multiplexer
takes in the 8192 bitlines on the top, and, depending on the 10-bit address, connects a
specific group of 8 bitlines to the 8 input and output IO wires at the bottom. For example,
if the 10-bit address were this, then only bitlines 4,784 through 4,791 would be connected to the IO wires, and the rest of the 8000ish bitlines would be connected to nothing. Here’s the logic diagram for a simple multiplexer.
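The multiplexer is just as simple to express as a sketch:

```python
# Column multiplexer: the 10-bit column address connects one group of 8 bitlines
# to the 8 IO wires; the other 8,184 bitlines are left disconnected.
def column_mux(bitlines, col_addr: int):
    return bitlines[col_addr * 8 : col_addr * 8 + 8]

# col_addr 598 selects bitlines 4,784 through 4,791, as in the example above
```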
We now have the means of accessing any memory cell in this massive array; however, to understand the three basic operations, reading, writing, and refreshing, let’s add two elements to our layout: a sense amplifier at the bottom of each bitline, and a read and write driver outside of the column multiplexer.
Let’s look at reading from a group of memory cells. First the read command and 31-bit address
are sent from the CPU to the DRAM. The first 5 bits select a specific bank. The next step is
to turn off all the wordlines in that bank, thereby isolating all the capacitors, and then
precharge all 8000ish bitlines to .5 volts. Next the 16-bit row address turns on a row, and all
the capacitors in that row are connected to their bitlines. If an individual capacitor holds a 1
and is charged to 1 volt, then some charge flows from the capacitor onto the .5-volt bitline, and
the voltage on the bitline increases. The sense amplifier then detects this slight change
or perturbation of voltage on the bitline, amplifies the change, and pushes the voltage on
the bitline all the way up to 1 volt. However, if a 0 is stored in the capacitor, charge
flows from the bitline into the capacitor, and the .5-volt bitline decreases in voltage.
The sense amplifier then sees this change, amplifies it and drives the bitline voltage down
to 0 volts or ground. The sense amplifier is necessary because the capacitor is so small,
and the bitline is rather long, and thus the capacitor needs to have an additional component
to sense and amplify whatever value is stored. Now, all 8000ish bitlines are driven to 1
volt or 0 volts corresponding to the stored charge in the capacitors of the activated
row, and this row is now considered open. Next, the column select multiplexer uses
the 10-bit column address to connect the corresponding 8 bitlines to the read
driver which then sends these 8 values and voltages over the 8 data wires to the CPU.
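Putting those steps together, here’s a simplified model of a read in Python. The voltage nudges are idealized stand-ins for what is really a delicate analog process:

```python
# Simplified read from one bank, modeled as a 65,536 x 8,192 grid of bits.
def read8(bank, row_addr: int, col_addr: int):
    bitlines = [0.5] * 8192                           # 1) precharge every bitline to 0.5 V
    for i, bit in enumerate(bank[row_addr]):          # 2) open the row: each capacitor
        bitlines[i] += 0.05 if bit else -0.05         #    slightly perturbs its bitline
    sensed = [1 if v > 0.5 else 0 for v in bitlines]  # 3) sense amps rail every bitline
    return sensed[col_addr * 8 : col_addr * 8 + 8]    # 4) column mux picks 8 for the read driver
```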
Writing data to these memory cells is similar to reading, though with a few key differences.
First the write command, address, and 8 bits to be written are sent to the DRAM chip. Next, just
like before the bank is selected, the capacitors are isolated, and the bitlines are precharged
to .5 volts. Then, using a 16-bit address, a single row is activated, the capacitors perturb
the bitline, and the sense amplifiers sense this and drive the bitlines to a 1 or 0 thus opening
the row. Next the column address goes to the multiplexer, but, this time, because a write
command was sent, the multiplexer connects the specific 8 bitlines to the write driver which
contains the 8 bits that the CPU had sent along the data wires and requested to write. These
write drivers are much stronger than the sense amplifier and thus they override whatever voltage
was previously on the bitline, and drive each of the 8 bitlines to 1 volt for a 1 to be written,
or 0 volts for a 0. This new bitline voltage overrides the previously stored charges or values
in each of the 8 capacitors in the open row, thereby writing 8 bits of data to the memory
cells corresponding to the 31-bit address.
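As a companion to the read sketch above, the write path differs only at the end: the write driver, not the capacitor, wins control of the bitline.

```python
# After the row is opened and sensed, the write driver overpowers the sense
# amplifiers on the 8 selected bitlines, and the open row's capacitors follow.
def write8(bank, row_addr: int, col_addr: int, eight_bits):
    bank[row_addr][col_addr * 8 : col_addr * 8 + 8] = eight_bits
```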
Three quick notes. First, as a reminder, writing and reading happen concurrently across all 4 chips in the shared memory channel, using
the same 31-bit address and command wires, but with different data wires for each chip.
Second, with DDR5 for a binary 1 the voltage is actually 1.1 volts, for DDR4 it’s 1.2 volts,
and prior generations had even higher voltages, with the bitline precharge voltages being
half of these voltages. However, for DDR5, when writing or refreshing, a higher voltage of around 1.4 volts is applied and stored in each capacitor for a binary 1, because charge leaks out over time. For simplicity, though, we’re going to stick with 1 and 0. Third, the number
of bank groups, banks, bitlines and wordlines varies widely between different generations
and capacities but is always in powers of 2. Let’s move on and discuss the third operation
which is refreshing the memory cells in a bank. As mentioned earlier, the transistors used to
isolate the capacitors are incredibly small, and thus charges leak across the channel. The
refresh operation is rather simple and is a sequence of closing all the rows, precharging
the bitlines to .5 volts, and opening a row. To refresh, just as before, the capacitors perturb
the bitlines and then the sense amplifiers drive the bitlines and capacitors of the open row fully
up to 1 volt or down to 0 volts depending on the stored value of the capacitor, thereby refilling
the leaked charge. This process of row closing, precharging, opening, and sense amplifying happens
row after row, taking 50 nanoseconds for each row, until all 65 thousandish rows are refreshed
taking a total of 3 milliseconds or so to complete. The refresh operation occurs once every 64 milliseconds for each bank, because that’s statistically below the worst-case time it takes for a memory cell to leak enough charge for a stored 1 to turn into a 0 and cause a loss of data.
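Those round numbers are easy to sanity-check:

```python
# Back-of-the-envelope refresh arithmetic using the video's round figures.
rows_per_bank   = 65_536
time_per_row    = 50e-9                         # seconds to close, precharge, and re-open a row
full_bank_sweep = rows_per_bank * time_per_row  # ~3.3 ms to walk every row
refresh_window  = 64e-3                         # every cell must be refreshed within 64 ms
sweeps_per_sec  = 1 / refresh_window            # 15.625, i.e. around 16 times a second
```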
Let’s take a step back and consider the incredible amount of data that moves through DRAM memory cells. These banks of memory cells handle up to 4,800 million requests to read and write data every second
while refreshing every memory cell in each bank row by row around 16 times a second.
That’s a staggering amount of data movement and illustrates the true strength of computers.
Yes, they do simple things like comparisons, arithmetic, and moving data around, but
at a rate of billions of times a second. Now, you might wonder why computers
need to do so much data movement. Well, take this video game for example. You have obvious
calculations like the movement of your character and the horse. But then there are individual
grasses, trees, rocks, and animals whose positions and geometries are stored in DRAM.
And then the environment such as the lighting and shadows change the colors and textures of the
environment in order to create a realistic world. Next, we’re going to explore breakthroughs and
optimizations that allow DRAM to be incredibly fast. But, before we get into all those
details, we would greatly appreciate it if you could take a second to hit that like
button, subscribe if you haven’t already, and type up a quick comment below, as it helps get
this video out to others. Also, we have a Patreon and would appreciate any support. This is our
longest and most detailed video by far, and we’re planning more videos that get into the inner
details of how computers work. We can’t do it without your help, so thank you for watching and
doing these three quick things. It helps a ton. The first complex topic which we’ll explore
is why there are 32 banks, as well as what the parameters on the packaging of DRAM are.
After that, we’ll explore burst buffers, sub-arrays, and folded DRAM architecture
and what’s inside the sense amplifier. Let’s take a look at the banks. As
mentioned earlier, opening a single row within a bank requires all these steps, and this process takes time. However, if a row were already open, we
could read or write to any section of 8 memory cells using only the 10-bit
column address and the column select multiplexer. When the CPU sends a read or
write command to a row that’s already open, it’s called a row hit or page hit, and this
can happen over and over. With a row hit, we skip all the steps required to open a row, and
just use the 10-bit column address to multiplex a different set of 8 columns or bitlines, connecting
them to the read or write driver, thereby saving a considerable amount of time. A row miss is
when the next address is for a different row, which requires the DRAM to close and isolate the
currently open row, and then open the new row. On a package of DRAM there are typically 4 numbers
specifying timing parameters regarding row hits, precharging, and row misses. The first number
refers to the time it takes between sending an address with a row open, thus a row hit, to
receiving the data stored in those columns. The next number is the time it takes to open
a row if all the lines are isolated and the bitlines are precharged. Then the next number
is the time it takes to precharge the bitlines before opening a row, and the last number is
the time it takes between a row activation and the following precharge. Note that these
numbers are measured in clock cycles.
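To make those four numbers concrete, here’s a rough conversion into nanoseconds. The timings and clock below are hypothetical, just plausible values for DDR5-4800:

```python
# Hypothetical DDR5 timings (CL-tRCD-tRP-tRAS), all measured in clock cycles.
CL, tRCD, tRP, tRAS = 40, 39, 39, 77
clock_mhz = 2400                        # DDR5-4800 transfers twice per 2400 MHz clock
ns = 1000 / clock_mhz                   # nanoseconds per clock cycle
row_hit_ns  = CL * ns                   # ~16.7 ns: close to the 17 ns quoted earlier
row_miss_ns = (tRP + tRCD + CL) * ns    # ~49 ns: precharge, open the new row, then read
```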
Row hits are also the reason why the address is sent in two sections: first the bank selection and row address, called RAS, and then the column address,
called CAS. If the first part, the bank selection and row address, matches a currently open row,
then it’s a row hit, and all the DRAM needs is the column address and the new command, and then the
multiplexer simply moves around the open row. Because of the time saving in accessing an
open row, the CPU memory controller, programs, and compilers are optimized for increasing the
number of subsequent row hits. The opposite, called thrashing, is when a program jumps around
from one row to a different row over and over, and is obviously incredibly inefficient
both in terms of energy and time. Additionally, DDR5 DRAM has 32 banks for
this reason. Each bank’s rows, columns, sense amplifiers and row decoders operate
independently of one another, and thus multiple rows from different banks can be open all at the
same time, increasing the likelihood of a row hit, and reducing the average time it takes for the CPU
to access data. Furthermore, by having multiple bank groups, the CPU can refresh one bank in each
bank group at a time while using the other three, thus reducing the impact of refreshing.
A question you may have had earlier is why are banks significantly taller than they are
wide? Well, by combining all the banks together one next to the other you can think of this chip
as actually being 65 thousand rows tall by 262 thousand columns wide. And, by adding 31 equally
spaced divisions between the columns, thus creating banks, we allow for much more flexibility
and efficiency in reading, writing and refreshing. Also, note that on the DRAM packaging are
its capacity in Gigabytes, the number of millions of data transfers per second, which
is two times the clock frequency, and the peak data transfer rate in Megabytes per second.
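As a worked example of how those numbers relate, take a hypothetical DDR5-4800 DIMM:

```python
# Decoding a hypothetical DIMM label such as "16GB DDR5-4800 PC5-38400".
clock_mhz       = 2400
transfers_per_s = 2 * clock_mhz * 1_000_000     # double data rate -> 4,800 MT/s
bytes_per_xfer  = 8                             # 64 data wires per DIMM (2 x 32-bit channels)
peak_mb_per_s   = transfers_per_s * bytes_per_xfer // 1_000_000   # 38,400 MB/s
```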
The next design optimization we’ll explore is the burst buffer and burst length. Let’s add a
128-bit read and write temporary storage location, called a burst buffer to our functional diagram.
Instead of 8 wires coming out of the multiplexer, we’re going to have 128 wires that connect
to these 128-bit buffer locations. Next the 10-bit column address is broken into two
parts, 6 bits are used for the multiplexer, and 4 bits are for the burst buffer.
Let’s explore a reading command. With our burst buffer in place, 128 memory cells and
bitlines are connected to the burst buffer using the 6 column bits, thereby temporarily loading,
or caching 128 values into the burst buffer. Using the 4 buffer bits, one set of 8 locations in the burst buffer is connected to the read drivers and the data is sent to the CPU. By cycling through these 4 bits, all 16 sets of 8 bits are read out, and thus the
burst length is 16. After that a new set of 128 bitlines and values are connected and loaded
into the burst buffer. There’s also a write burst buffer which operates in a similar way.
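Here’s the read burst as a sketch; the wrap-around ordering within the burst is a simplifying assumption:

```python
# Burst-length-16 read: 6 column bits load 128 bitline values into the burst
# buffer, then 4 burst bits step through it 8 bits at a time.
def read_burst(bitlines, col_addr: int):
    chunk = col_addr >> 4                               # upper 6 bits pick a 128-bit chunk
    buffer = bitlines[chunk * 128 : (chunk + 1) * 128]  # load the burst buffer
    start = col_addr & 0xF                              # lower 4 bits: starting position
    for i in range(16):                                 # burst length of 16 transfers
        pos = (start + i) % 16
        yield buffer[pos * 8 : (pos + 1) * 8]           # 8 bits per transfer
```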
The benefit of this design is that 16 sets of 8 bits per microchip, totaling 1,024 bits across all 8 chips, can be accessed and read or written extremely quickly, as long as the data is all next to one
another, but at the same time we still have the granularity and ability to access any
set of 8 bits if our data requests jump around. The next design optimization addresses the fact that a bank of 65,536 rows by 8,192 columns is rather massive and results in extremely long wordlines and bitlines, especially when compared to the size of each trench capacitor memory cell. Therefore,
the massive array is broken up into smaller blocks of 1,024 by 1,024 memory cells, with intermediate sense amplifiers below each subarray, subdivided wordlines, and a hierarchical row-decoding scheme. By subdividing the bitlines, the length of wire that each tiny capacitor drives as it perturbs the bitline toward the sense amplifier is reduced, and thus the capacitor doesn’t have to be as big. By subdividing the wordlines, the capacitive load from
eight thousandish transistor gates and channels is decreased, and thus the time it takes to turn on
all the access transistors in a row is decreased. The final topic we’re going to talk about is
the most complicated. Remember how we had a sense amplifier connected to the bottom of
each bitline? Well, this optimization has two bitlines per column going to each sense amplifier
and alternating rows of memory cells connected to the left and right bitlines, thus doubling the
number of bitlines. When one row is active, half of the bitlines are active while the other
half are passive and vice versa when the next row is active. Moving down to see inside the sense
amplifier we find a cross-coupled inverter. How does this work? Well, when the active bitline is
a 1, the passive bitline will be driven by this cross-coupled inverter to the opposite value
of 0, and when the active is a 0, the passive becomes a 1. Note that the inverted passive
bitline isn’t connected to any memory cells, and thus it doesn’t mess up any stored data. The
cross-coupled inverter makes it such that these two bitlines are always going to be opposite
one another, and they’re called a differential pair. There are three benefits to this design.
First, during the precharge step, we want to bring all the bitlines to .5 volts and, by having a
differential pair of active and passive bitlines, the easiest solution is to disconnect the cross-coupled inverters and open a channel between the two using a transistor. The charge easily
flows from the 1 bitline to the 0, and they both average out and settle at .5 volts.
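A toy model of the differential pair captures both the equalize step and the sensing step; this is an idealized sketch, not the analog reality:

```python
# Differential bitline pair: equalize during precharge, rail apart during sensing.
def equalize(active_v: float, passive_v: float):
    mid = (active_v + passive_v) / 2    # shorting transistor lets charge average out
    return mid, mid                     # both lines settle at ~0.5 V

def sense(active_v: float, passive_v: float):
    # Cross-coupled inverters apply positive feedback until the lines hit the rails
    return (1.0, 0.0) if active_v > passive_v else (0.0, 1.0)
```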
The other two benefits are noise immunity, and a reduction in parasitic capacitance of the
bitline. These benefits are related to the fact that by creating two oppositely charged electric
wires with electric fields going from one to the other we reduce the amount of electric fields
emitted in stray directions and relatedly increase the ability of the sense amplifier to amplify
one bitline to 1 volt and the other to 0 volts. One final note is that when discussing DRAM,
one major topic is the timing of addresses, command signals and data, and the related
acronyms DDR or double data rate, and SDRAM, or Synchronous DRAM. These topics were omitted
from this video because it would have taken an additional 15 minutes to properly explore.
That’s pretty much it for the DRAM, and we are grateful
you made it this far into the video. We believe the future will require a strong emphasis on
engineering education and we’re thankful to all our Patreon and YouTube Membership Sponsors
for supporting this dream. If you want to support us on YouTube Memberships, or Patreon,
you can find the links in the description. A huge thanks goes to Nathan, Peter, and
Jacob who are doctoral students at the Florida Institute for Cybersecurity Research for helping
to research and review this video’s content! They do foundational research on finding the weak
points in device security and whether hardware is compromised. If you want to learn more about
the FICS graduate program or their work, check out the website using the link in the description.
This is Branch Education, and we create 3D animations that dive deep into the technology that
drives our modern world. Watch another Branch video by clicking one of these cards or click here
to subscribe. Thanks for watching to the end!