During the mid-1960s, a revolution in miniaturization was kick-started. The idea of packing dozens of semiconductor-based transistors onto a single silicon chip spawned the integrated circuit. It laid the groundwork for a complete paradigm shift in how modern society would evolve. In less than a decade, this marvel of electronic engineering and materials science would usher in an era of advancement incomparable to anything else in human history. In March of 1971, the commercial launch of a new semiconductor product set the stage for this new era: the Intel 4004 central processing unit, or CPU, composed of a then-incredible 2,300 transistors. Initially created as a custom solution for the Japanese company Busicom Corp., for use in the Busicom 141-PF calculator, it was released later that year to the general public. With prophetic irony, the marketing material for the chip touted the slogan “Announcing a new era in integrated electronics.”

But what made the Intel 4004 so groundbreaking? Take a calculator and perform any simple arithmetic operation, let’s say 22 divided by 7.
What we just did was issue a computer an instruction. Instructions are elementary operations, such as math commands, that a CPU executes. Every computer program ever made, from web browsers to apps to video games, is composed of millions of these instructions. The 4004 was capable of executing between 46,250 and 92,500 instructions per second. For comparison, ENIAC, the first electronic computer, built just 25 years earlier, could only execute 5,000 instructions a second. But what made the 4004 so powerful wasn’t just its 1,800% increase in processing power: it consumed only 1 watt of electricity, was about ¾” long, and cost $5 to produce in today’s money. This was miles ahead of ENIAC’s cost of $5.5 million in today’s money, 180 kW power consumption, and 27-ton weight. Fast forward to September 2017, the launch date of the Intel Core i9-7980XE. This CPU is capable of performing over 80 billion instructions a second, a 900,000-fold increase in processing power. What did it take to get here? In this two-part series, we explore the engineering and behind-the-scenes technology that paved the way for that simple 16-pin chip to evolve into the powerhouse of CPUs today. This is the evolution of processing power.

HOW A CPU STORES DATA

In order to understand how a CPU derives its processing power, let’s examine what a CPU actually does and how it interfaces with data.
In digital electronics, everything is represented by the binary “bit.” A bit is an elemental representation of two possible states: it can represent a zero or a one, true or false, up or down, on or off, or any other two-state value. In a CPU, a bit is physically transmitted as a voltage level. If we combine multiple bits together in a group, we can represent more combinations of discrete states. For example, if we combine eight bits together, we form what’s known as a byte. A byte can represent 256 different states and can be used to represent numbers; in the case of a byte, any number between 0 and 255 can be expressed. But in a CPU, how we choose to represent data is completely malleable. That same byte can also represent a number between -128 and 127. Other interpretations of that byte may be colors or levels of sound.
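The two readings of that same byte can be sketched in a few lines of Python; the bit pattern chosen here is arbitrary:

```python
# One byte, one bit pattern, two interpretations: unsigned (0 to 255)
# versus two's-complement signed (-128 to 127).
pattern = bytes([0b1000_0000])

unsigned = int.from_bytes(pattern, "big")             # read as 0..255
signed = int.from_bytes(pattern, "big", signed=True)  # read as -128..127

print(unsigned)  # 128
print(signed)    # -128
```

The bits themselves never change; only the convention for reading them does.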
When we combine multiple bytes together, we create what’s known as a word. Words are described by their bit capacity: a 32-bit word contains 32 bits, a 64-bit word contains 64 bits, and so on. When a processor is created, the native word size it operates on forms the core of its architecture. The original Intel 4004 processor operated on a 4-bit word. This means data moving through the CPU transits in chunks of four bits at a time. Modern CPUs are typically 64-bit; however, 32-bit processors are still quite common. By making use of larger word sizes, we can represent more discrete states and consequently larger numbers. A 32-bit word, for example, can represent up to 4.2 billion different states.
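The growth in discrete states with word size is just a power of two, which a quick sketch makes concrete:

```python
# Each additional bit doubles the number of representable states:
# an n-bit word can encode 2**n distinct values.
for bits in (4, 8, 16, 32, 64):
    print(f"{bits}-bit word: {2 ** bits:,} states")

# The 32-bit case works out to 4,294,967,296 states, the
# "4.2 billion" figure mentioned above.
```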
Of all the forms data can take inside a CPU, the most important is the instruction. Instructions are unique bit patterns that are decoded and executed by the CPU as operations. An example of a common instruction would be to add two word values together, or to move a word of data from one location in memory to another. The entire list of instructions a CPU supports is called its instruction set. Each instruction’s binary representation, its machine code, is typically assigned a human-readable form known as assembly language. If we look at the instruction sets of most CPUs, they all tend to focus on performing math or logical operations on data, testing conditions, or moving data from one location in memory to another.
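The relationship between machine code and assembly language can be sketched as a simple lookup. The four-instruction set and opcode values below are invented for illustration and not taken from any real CPU:

```python
# A toy instruction set: each assembly mnemonic (human-readable) maps
# to a unique opcode (the machine-code bit pattern the CPU decodes).
INSTRUCTION_SET = {
    "LOAD":  0b0001,  # move a word from memory into a register
    "STORE": 0b0010,  # move a register's contents back to memory
    "ADD":   0b0011,  # add two values
    "JUMP":  0b0100,  # branch to a new memory address
}

def assemble(mnemonic):
    """Translate one mnemonic into its machine-code opcode."""
    return INSTRUCTION_SET[mnemonic]

print(f"{assemble('ADD'):04b}")  # 0011
```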
For all intents and purposes, we can think of a CPU as an instruction-processing machine. It operates by looping through three basic steps: fetch, decode, and execute. As CPU designs evolve, these three steps become dramatically more complicated, and technologies are implemented that extend this core model of operation. But in order to fully appreciate these advances, let’s first explore the mechanics of basic CPU operation. Known today as the classic Reduced Instruction Set Computer (RISC) pipeline, this paradigm formed the basis for the first CPU designs, such as the Intel 4004.
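The three-step loop can be modeled in a few lines of Python. This is a bare skeleton with a made-up instruction format (plain tuples), not a model of any real pipeline:

```python
class ToyCPU:
    """A skeletal fetch/decode/execute loop. The instruction format
    here is invented purely for illustration."""

    def __init__(self, program):
        self.ram = program    # the program sits in "RAM"
        self.pc = 0           # program counter: address of the next fetch
        self.acc = 0          # a single accumulator register
        self.halted = False

    def step(self):
        word = self.ram[self.pc]    # FETCH the word at the program counter
        self.pc += 1
        op, arg = word              # DECODE it (trivially, here)
        if op == "ADD":             # EXECUTE the operation
            self.acc += arg
        elif op == "HALT":
            self.halted = True

    def run(self):
        while not self.halted:      # the basic loop: fetch, decode, execute
            self.step()

cpu = ToyCPU([("ADD", 2), ("ADD", 3), ("HALT", None)])
cpu.run()
print(cpu.acc)  # 5
```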
In the fetch phase, the CPU loads the instruction it will be executing into itself. A CPU can be thought of as existing in an information bubble: it pulls instructions and data from outside of itself, performs operations within its own internal environment, and then returns data back. This data is typically stored in memory external to the CPU, called Random Access Memory, or RAM. Software instructions and data are loaded into RAM from more permanent sources such as hard drives and flash memory, though at one point in history magnetic tape, punch cards, and even flip switches were used.
When a CPU loads a word of data, it does so by requesting the contents of a location in RAM. This location is called the data’s address. The amount of data a CPU can address is determined by its address capacity. A 4-bit address, for example, can only directly address 16 locations of data. Mechanisms exist for addressing more data than the CPU’s address capacity allows, but let’s ignore these for now.
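Address capacity, like word capacity, is a power of two: n address bits reach 2**n locations:

```python
# A 4-bit address reaches only 2**4 = 16 memory locations;
# widening the address grows the reachable memory exponentially.
for address_bits in (4, 8, 16, 32):
    print(f"{address_bits}-bit address: {2 ** address_bits:,} locations")
```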
The mechanism by which data moves back and forth to RAM is called a bus. A bus can be thought of as a multi-lane highway between the CPU and RAM, in which each bit of data has its own lane. But we also need to transmit the location of the data we’re requesting, so a second highway must be added to accommodate both the size of the data word and the address word. These are called the data bus and the address bus, respectively. In practice, these data and address lines are physical electrical connections between the CPU and RAM, and they often look exactly like a superhighway on a circuit board.
When a CPU makes a request for RAM access, a memory control region of the CPU loads the address bus with the address of the memory word it wishes to access. It then triggers a control line that signals a memory read request. Upon receiving this request, the RAM fills the data bus with the contents of the requested memory location, and the CPU now sees this data on the bus. Writing data to RAM works in a similar manner, with the CPU posting to the data bus instead. When the RAM receives a “write” signal, the contents of the data bus are written to the RAM location pointed to by the address bus.
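The read and write handshakes above can be sketched with a Python list standing in for RAM and plain variables standing in for the buses; the 16-word size matches the 4-bit address example earlier:

```python
RAM = [0] * 16   # 16 words of memory, addressable by a 4-bit address

def memory_read(address_bus):
    """On a read signal, RAM fills the data bus with the contents
    of the location named on the address bus."""
    data_bus = RAM[address_bus]
    return data_bus

def memory_write(address_bus, data_bus):
    """On a write signal, the data bus contents are stored at the
    location named on the address bus."""
    RAM[address_bus] = data_bus

memory_write(0b0101, 42)    # CPU posts address 5 and data 42, signals a write
print(memory_read(0b0101))  # CPU posts address 5, signals a read -> 42
```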
The address of the memory location to fetch is stored in the CPU in a mechanism called a register. A register is a high-speed internal memory word that is used as a “notepad” by CPU operations. It’s typically used as a temporary data store for instructions, but it can also be assigned to vital CPU functions, such as keeping track of the current address being accessed in RAM. Because registers are designed directly into the CPU’s hardware, most CPUs have only a handful of them, and their word size is generally coupled to the CPU’s native architecture. Once a word of memory is read into the CPU, the register that stores the address of that word, known as the program counter, is incremented. On the next fetch, it retrieves the next instruction in sequence.
Accessing data from RAM is typically the bottleneck of a CPU’s operation. This is due to the need to interface with components physically distant from the CPU. On older CPUs this didn’t present much of a problem, but as CPUs get faster, the latency of memory access becomes a critical issue. How this is handled is key to the advancement of processor performance and will be examined in part 2 of this series, when we introduce caching.
Once an instruction is fetched, the decode phase begins. In the classic RISC architecture, one word of memory forms a complete instruction. This changes to more elaborate methods as CPUs evolve toward complex instruction set architectures, which will be introduced in part 2 of this series.
When an instruction is decoded, the word is broken down into two parts known as bitfields. These are called the opcode and the operand. An opcode is a unique series of bits that represents a specific function within the CPU. Opcodes generally instruct the CPU to move data to a register, move data between a register and memory, perform math or logic functions on registers, or branch.
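Splitting an instruction word into its bitfields comes down to shifting and masking. The 4-bit-opcode / 4-bit-operand layout below is an assumption for illustration; real encodings vary widely between architectures:

```python
# An 8-bit instruction word with an assumed layout: the high four bits
# hold the opcode, the low four bits hold the operand.
instruction = 0b0011_0101

opcode = (instruction >> 4) & 0b1111   # shift the high bits down, then mask
operand = instruction & 0b1111         # mask off the low bits

print(f"opcode={opcode:04b} operand={operand:04b}")  # opcode=0011 operand=0101
```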
Branching occurs when an instruction causes a change in the program counter’s address. This causes the next fetch to occur at a new location in memory, as opposed to the next sequential address. When this “jump” to a new program location is guaranteed, it’s called an unconditional branch. In other cases, a test can be done to determine whether a “jump” should occur; this is known as a conditional branch. The tests that trigger these conditions are usually mathematical, such as whether a register or memory location is less than or greater than a number, or whether it is zero or non-zero. Branching allows programs to make decisions, and it is crucial to the power of a CPU.
Opcodes sometimes require data to perform their operations on. This part of an instruction is called the operand. Operands are bits piggybacked onto an instruction to be used as data. Let’s say we wanted to add 5 to a register. The binary representation of the number 5 would be embedded in the instruction and extracted by the decoder for the addition operation. When an instruction has a constant of data embedded within it, it’s known as an immediate value. In some instructions the operand does not specify the value itself, but instead contains the address of a location in memory to be accessed. This is common in opcodes that request a memory word to be loaded into a register. This is known as addressing, and it can get far more complicated in modern CPUs. Addressing can incur a performance penalty because of the need to “leave” the CPU, but this is mitigated as CPU designs advance.
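Embedding an immediate value is the mirror image of the decode sketched earlier: the constant 5 is packed into the operand bitfield of the instruction word. The opcode value and bitfield layout are again invented for illustration:

```python
ADD_IMMEDIATE = 0b0011                 # hypothetical "add immediate" opcode

# Pack the opcode into the high four bits and the constant 5 into the
# low four bits, producing one complete 8-bit instruction word.
instruction = (ADD_IMMEDIATE << 4) | 5

print(f"{instruction:08b}")  # 00110101
```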
Once we have our opcode and operand, the opcode is matched by means of a table and a combination of circuitry, where a control unit then configures various operational sections of the CPU to perform the operation. In some modern CPUs the decode phase isn’t hardwired and can be programmed. This allows for changes in how instructions are decoded and how the CPU is configured for execution.
In the execution phase, the now-configured CPU is triggered. This may occur in a single step or a series of steps, depending on the opcode. One of the most commonly used sections of a CPU during execution is the Arithmetic Logic Unit, or ALU. This block of circuitry is designed to take in two operands and perform either basic arithmetic or bitwise logical operations on them. The result is then output along with its respective mathematical flags, such as a carry, an overflow, or a zero result.
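An 8-bit ALU addition with the flags described above can be sketched as follows. The flag definitions follow the usual conventions, but this is an illustration rather than any specific CPU’s ALU:

```python
def alu_add(a, b):
    """Add two 8-bit values, returning the 8-bit result plus the
    carry, zero, and (signed) overflow flags."""
    total = a + b
    result = total & 0xFF                 # keep only the low 8 bits
    carry = total > 0xFF                  # a ninth bit was carried out
    zero = result == 0                    # the result is all zeros
    # Signed overflow: the inputs agree in sign but the result doesn't.
    overflow = bool((a ^ result) & (b ^ result) & 0x80)
    return result, {"carry": carry, "zero": zero, "overflow": overflow}

print(alu_add(200, 100))  # (44, {'carry': True, 'zero': False, 'overflow': False})
print(alu_add(127, 1))    # (128, {'carry': False, 'zero': False, 'overflow': True})
```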
The output of the ALU is then sent to either a register or a location in memory, based on the opcode. Let’s say an instruction calls for adding 10 to a register and placing the result back in that register. The control unit of the CPU will load the immediate value of the instruction into the ALU, load the value of the register into the ALU, and connect the ALU output to the register. On the execute trigger, the addition is done and the output is loaded into the register. In effect, software distills down to a loop of configuring groups of circuits to interact with each other within a CPU.
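Pulling the pieces together, the whole loop, binary instruction encoding, immediate operands, and a conditional branch can be sketched as one toy machine. Everything here (the opcodes, the encoding, the single accumulator) is invented for illustration:

```python
# Invented opcodes for a 4-bit-opcode / 4-bit-operand instruction format.
HALT, LOADI, ADDI, JNZ = 0b0000, 0b0001, 0b0010, 0b0011

def run(program):
    ram = list(program)   # the program loaded into "RAM"
    pc = 0                # program counter
    acc = 0               # accumulator register (4-bit, matching the operand)
    while True:
        word = ram[pc]                                # FETCH
        pc += 1
        opcode, operand = word >> 4, word & 0b1111    # DECODE
        if opcode == LOADI:                           # EXECUTE
            acc = operand                    # load an immediate value
        elif opcode == ADDI:
            acc = (acc + operand) & 0b1111   # ALU add, wrapping at 4 bits
        elif opcode == JNZ:
            if acc != 0:                     # conditional branch: redirect
                pc = operand                 # the next fetch to a new address
        elif opcode == HALT:
            return acc

# Add two immediates: ACC <- 5, then ACC <- ACC + 7, then halt.
print(run([(LOADI << 4) | 5, (ADDI << 4) | 7, HALT << 4]))  # 12

# A countdown loop: adding 15 wraps to "subtract 1" in 4 bits, and the
# JNZ branches back to address 1 until the accumulator reaches zero.
countdown = [(LOADI << 4) | 3, (ADDI << 4) | 15, (JNZ << 4) | 1, HALT << 4]
print(run(countdown))  # 0
```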
In a CPU, these three phases of operation loop continuously, working their way through the instructions of the computer program loaded in memory. Gluing this looping machine together is a clock. A clock is a repeating pulse used to synchronize a CPU’s internal mechanics and its interface with external components. CPU clock rate is measured by the number of pulses per second, or hertz. The Intel 4004 ran at 740 kHz, or 740,000 pulses a second. Modern CPUs can touch clock rates approaching 5 GHz, or 5 billion pulses a second.
On simpler CPUs, a single clock pulse triggers the advance of the fetch, decode, and execute stages. As CPUs get more sophisticated, these stages can take several clock cycles to complete. Optimizing these stages and their use of clock cycles is key to increasing processing power and will be discussed in part 2 of this series. The throughput of a CPU, the number of instructions that can be executed in a second, determines how “fast” it is. By increasing the clock rate, we can make a processor go through its stages faster. However, as we get faster, we encounter a new problem: the period between clock cycles has to allow enough time for every possible instruction combination to execute. If a new clock pulse arrives before an instruction cycle completes, results become unpredictable and the program fails. Furthermore, increasing the clock rate has the side effect of increasing power dissipation and a buildup of heat in the CPU, causing a degradation of circuit performance.
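The throughput relationship is simple division: instructions per second equals clock rate over clocks per instruction. Assuming 4004 instructions took 8 or 16 clock pulses (an assumption consistent with the 46,250–92,500 range quoted earlier), the arithmetic works out as:

```python
clock_hz = 740_000   # the Intel 4004's clock rate

for clocks_per_instruction in (8, 16):
    ips = clock_hz // clocks_per_instruction
    print(f"{clocks_per_instruction} clocks/instruction: {ips:,} instructions/s")
# 8 clocks/instruction: 92,500 instructions/s
# 16 clocks/instruction: 46,250 instructions/s
```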
The battle to run CPUs faster and more efficiently has dominated their entire existence. In the next part of this series, we’ll explore the expansion of CPU designs from that simple 2,300-transistor device of the 1970s, through the microcomputing boom of the 1980s, and onward to the multi-million-transistor designs of the ’90s and early 2000s. We’ll introduce the rise of pipelining technology, caching, the move to larger-word CISC architectures, and charge forward to multi-GHz clock rates.