I would like to suggest that the classical taxonomy of RISC/CISC dichotomy is ba...

abainbridge · on March 15, 2024

Yep. RISC was interesting when gate budgets for CPU pipelines were seriously limited. It was interesting because before RISC the industry had been merrily spending the gate budget increase on adding lots of use-specific instructions. The RISC people pointed out that if you removed support for all the fancy instructions you had enough gate budget for the ALU to be nicely pipelined, and then you could wind up the clock rate greatly and this was worth much more than the fancy instructions.

For decades now we've had enough gate budget to have nicely pipelined designs with complex instruction sets, so that's what everyone does. RISC solves a problem that no longer exists.

thesz · on March 15, 2024

You are not quite right about pipelined design being faster. At least, not without substantial effort.

https://en.wikipedia.org/wiki/R2000_microprocessor

"The R2000 is a 32-bit microprocessor chip set developed by MIPS Computer Systems that implemented the MIPS I instruction set architecture (ISA)..."

"The R2000 was available in 8.3, 12.5 and 15 MHz grades..."

https://en.wikipedia.org/wiki/I386

"The Intel 386, originally released as 80386 and later renamed i386, is a 32-bit microprocessor introduced in 1985..."

"Max. CPU clock rate: 12.5 MHz to 40 MHz"

As you can see, the 80386 was released a year earlier than the R2000 and was clocked about 1.5 times faster than the MIPS implementation from the start.

The critical path is, usually, in addition/subtraction, which should be complete in one cycle in both 80386 and in R2000. To pipeline addition you need a superpipelined CPU, one that has several stages for computation. Even seemingly simple computation of condition codes can make clock cycle 10% longer (SPARC vs MIPS) if your CPU is just simply pipelined.

BTW, some Pentiums computed 32-bit addition in two cycles, all in the name of higher clock frequencies.

abainbridge · on March 15, 2024

Interesting. An R2000 did run programs faster than an 80386, right? This was a few years before my time.

From a quick google now, it looks like the R2000 was about 3x better than the 80386 at Dhrystone MIPS/MHz. I guess an accurate comparison of how the R2000 and 80386 spent their gate budget and what they got in return would involve a lot of detail.

I remember my compsci professor giving us the computer architecture course in about 1997, and he despaired at how all the clever RISC stuff in Patterson and Hennessy seemed irrelevant when Intel could just throw money at the implementation (and fab, I guess) and produce competitive chips despite their (allegedly) inferior architecture.

thesz · on March 15, 2024

My point is that you cannot make a design much faster in terms of clock frequency just by pipelining. A pipeline unrolls the state machine and overlaps different executions of it. But the bottleneck, which is addition, is there in all designs, and you need additional effort to break it.
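A minimal sketch of that argument, assuming the clock period is set by the slowest pipeline stage. The stage delays below are made-up illustrative numbers, not measurements of any real CPU:

```python
def max_clock_mhz(stage_delays_ns):
    """Clock period is bounded by the slowest stage (delays in nanoseconds)."""
    return 1000.0 / max(stage_delays_ns)


# Hypothetical 5-stage pipeline: fetch, decode, execute (the adder), memory, writeback.
simple = [20, 20, 40, 20, 20]
# Splitting a non-critical stage (decode) does not help the clock at all.
deeper_decode = [20, 10, 10, 40, 20, 20]
# Only splitting the adder itself (superpipelining) raises the clock.
split_adder = [20, 20, 20, 20, 20, 20]

print(max_clock_mhz(simple))         # 25.0
print(max_clock_mhz(deeper_decode))  # 25.0
print(max_clock_mhz(split_adder))    # 50.0
```

Adding stages only raises the clock if it shortens the critical stage, which here is the one doing the addition.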

(also MIPS has [i]nterlocked [p]ipeline [s]tages - that "IPS" in MIPS; I implemented it, I know - exception in execution should inform other stages about failure)

By 1997 Intel had already bought the Elbrus 2 design team, led by Pentkovski [1]. Pentkovski made the Elbrus 2 a superscalar CPU with a stack machine front-end. I.e., the Elbrus 2 executed stack operations in a superscalar fashion. You can entertain yourself by figuring out how complex or simple that can be.

[1] https://en.wikipedia.org/wiki/Vladimir_Pentkovski

So at the time your professor complained about Intel's inferior architecture being faster, the implementation of that inferior architecture had a translation unit inside it to translate x86 opcodes into superscalar-ready uops.

abainbridge · on March 16, 2024

I think the Wikipedia page [1] agrees with your main point.

I said pipelining allowed you to increase the clock rate, which isn't the best thing to say.

The wiki page says, "instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (the eponymous "pipeline") performed by different processor units with different parts of instructions processed in parallel."

And, "This arrangement lets the CPU complete an instruction on each clock cycle. It is common for even-numbered stages to operate on one edge of the square-wave clock, while odd-numbered stages operate on the other edge. This allows more CPU throughput than a multicycle computer at a given clock rate, but may increase latency due to the added overhead of the pipelining process itself."

[1] https://en.wikipedia.org/wiki/Instruction_pipelining
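To put numbers on the quoted throughput claim, here is a small sketch of the cycle counts. It assumes an idealized k-stage pipeline with no stalls or hazards, and the function names are just for illustration:

```python
def multicycle_cycles(n_instructions: int, n_stages: int) -> int:
    """Multicycle machine: each instruction takes n_stages cycles, one at a time."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Idealized pipeline: fill it once, then finish one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, k = 1000, 5
print(multicycle_cycles(n, k))  # 5000
print(pipelined_cycles(n, k))   # 1004
```

Same clock rate in both cases: the win is in instructions completed per cycle, while a single instruction still takes k cycles from start to finish.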

peterfirefly · on March 15, 2024

Addition was not the bottleneck for the 386. It had a FO4 delay of 80+ per clock. An adder is much faster.

Maybe you meant that it was (one, just one!, of many of) the bottleneck(s) in an optimized implementation?

thesz · on March 15, 2024

> Addition was not the bottleneck for the 386.

It is a bottleneck for MIPS, SPARC, Alpha and not for 386. How so?

peterfirefly · on March 16, 2024

The 386 wastes so many FO4 gate delays on other things. I thought I made that extremely clear?

thesz · on March 16, 2024

Can you elaborate on where the delays came from?

peterfirefly · on March 15, 2024

And the R2000 was implemented in 2µm and the (early) 386 was implemented in 1.5µm. Double-metal for both. Didn't bother to look up die size.

smcin · on March 15, 2024

It's not apples-to-apples to compare raw clock rates between semiconductor processes; Intel's 386 was initially fabbed on 1.5μ then shrunk to a 1.0μ process (Intel CHMOS III and IV), whereas MIPS R2000 was 2.0μ, fabless and relied on Sierra, Toshiba, then in 1987 LSI, IDT and other licensees [0][1].

Back in the 1980s/90s/2000s, Intel was consistently a process generation or two ahead of competitors. That was one of their main sources of advantage.

Just imagine if MIPS had been able to fab on Intel process.

[0]: https://www.righto.com/2023/10/intel-386-die-versions.html

[1]: https://en.wikipedia.org/wiki/R2000_microprocessor

thesz · on March 15, 2024

And you also confirm that in order to have higher clock frequency you need more than just pipelining.

Thank you.

I also think that the 1.5x difference in clock speeds cannot be directly attributed to the difference in node size (lambda): the ratio of lambdas at the introduction of the 80386 and R2000, 2.0/1.5 ≈ 1.33, is noticeably less than the ratio of clocks, 12.5/8.3 ≈ 1.51.
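The arithmetic, spelled out with the figures quoted earlier in this thread (12.5 and 8.3 MHz introductory clock grades, 1.5µm and 2µm processes); the variable names are only for illustration:

```python
# Introductory clock grades and process nodes as quoted upthread.
i386_mhz, r2000_mhz = 12.5, 8.3
i386_um, r2000_um = 1.5, 2.0

clock_ratio = i386_mhz / r2000_mhz   # how much faster the 386 was clocked
lambda_ratio = r2000_um / i386_um    # how much smaller the 386's features were

print(f"clock ratio:  {clock_ratio:.2f}")   # 1.51
print(f"lambda ratio: {lambda_ratio:.2f}")  # 1.33
```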

smcin · on March 15, 2024

Smaller transistors are faster, but the relationship between clock frequency and 1/feature size isn't necessarily linear like you're assuming.

https://cs.stackexchange.com/questions/27875/moores-law-and-...

thesz · on March 15, 2024

My assumption is that speedup is less than lambda's ratio.

bananabiscuit · on March 15, 2024

Is there something about RISC that still makes it better than CISC when it comes to per-watt performance? Seems like nobody has had any success making an x86 processor that's as power efficient as ARM or RISC.

simne · on March 15, 2024

> Is there something about RISC that still makes it better than CISC when it comes to per-watt performance?

CLASSIC CISC was microcoded (for example, the IBM S/360 had a feature where you could load your own custom microcode for compatibility with inherited equipment, like IBM 1401 machines or the IBM 7xxx series, or for other purposes), and RISC was pipelined from birth.

Second, as I understand it, many CISC machines existed as multi-chip boards or even as multiple boards, so they had great losses in the wires, whereas RISC appeared in the 1990s as a single die from the start (only the external cache was added as an additional IC), but I could be mistaken on this.

> nobody has any success making an x86 processor that's as power efficient as ARM or RISC

Rumor has it that Intel Atom (essentially a CMOS version of the first-generation Pentium) was very good in mobiles, but ARM far exceeded it in software support for a huge number of power-saving features (a modern ARM SoC allows turning off nearly any part of the chip at any time, and the OS supports this), and because of the lack of software support, smartphones with Intel had poor battery life.

More or less official info said that Intel made a bad power conversion circuit, so Atom consumed too much in the mode between deep sleep and full speed, but I don't believe it, as this is too obvious a mistake for a hardware developer.

peterfirefly · on March 16, 2024

> CLASSIC CISC was micro-coded

Sometimes. Far from always. Some would have a complicated hardwired state machine. Some would have a complicated hardwired state machine and be pipelined. Some would have microcode and be pipelined (by flowing the microcode bits through the pipeline and of course dropping those that have already been used so fewer and fewer microcode bits survive at each stage).

simne · on March 16, 2024

Please give examples of classic CISC that were not microcoded, and explain why you consider them classic.

In my opinion, NONE of the microprocessors could be considered classic CISC.

timeinput · on March 16, 2024

The PDP11-20 (the first PDP11) was not microcoded.

The PDP11 is the machine that unix was developed on.

simne · on March 16, 2024

Do you know what purposes (targets) minicomputers were made for, and why they were limited?

simne · on March 17, 2024

Well, as I see you don't have enough bravery to answer a simple question about the purpose of minicomputers, so I will.

When computers first appeared, they were big, simply because technology limitations made small machines very expensive to use, so scale was used to make computation cheaper.

In the early 1970s, technology advanced to the stage where it became possible to make simplified versions of big computers for some limited tasks, though still too expensive for wide use.

A simple illustration: an IBM 3033 mainframe with 16 MB of RAM could serve 17,500 3270 terminals, while a PDP of the same era could serve a few tens (maybe 50, I don't know exactly), so mainframes, even though very expensive, gave a good cost per workplace.

A known example: a PDP was used to control a scientific nuclear reactor. The PDP was chosen not because it had the best MIPS/price ratio, but because it was the cheapest adequate machine for the task, and so affordable on a limited budget.

For a very long time, minis stayed in the niche of limited machines, used to avoid much more expensive full-scale mainframes. They were used to control industrial automation (CNC), chemical factories and other small things.

Once microcomputers (CPU on one chip) appeared, they began to eat the minis' space from the bottom, while mainframes continued to become more cost-effective (more terminals with the appearance of cheap modems, etc.) and ate the minis' space from the top.

And in the 1990s, when affordable 32-bit microprocessors appeared and megabytes of RAM became affordable, minis disappeared, because their place was captured by micros.

To be honest, I just don't know of anything we could not call a microcomputer now, as even IBM Z mainframes now have single-chip processors and the largest supercomputers are practically clouds of SoCs (NUMA architecture).

And I must admit, I still see PDPs (or VAXes) in enterprises, where they still control old machines from the 1990s (they are very reliable, even if limited from a modern view, and still work).

As I remember, the last symmetric multiprocessor supercomputer was the Cray Y-MP; later machines became ccNUMA or just NUMA or even cloud.

https://en.wikipedia.org/wiki/LINPACK

Unix was a simplified version of Multics, a system intended to run on mainframes (BTW there even exists an officially certified Unix for mainframes).

You could try mainframe software yourself; it is very affordable now with an emulator (sure, be careful about the license):

https://en.wikipedia.org/wiki/Hercules_(emulator)

And you will see for yourself how many things modern OSes borrowed from mainframes.

This is natural: people choose the simpler, cheaper thing.

card_zero · on March 15, 2024

Have there been recent attempts? Maybe it's just, like, speciation, by this point in time.

mcbishop · on March 16, 2024

A great discussion on this: Lex Fridman's interview of David Patterson.

jcranmer · on March 15, 2024

RISC/CISC is near the top of the list of "things emphasized in education that bear little relevance in practice". RISC isn't so much a single coherent design idea as a collection of ideas, some of which have won out (more register files), and some of which haven't (avoid instructions that take multiple clock cycles). The architectures from the days the "debate" was more relevant that have had the most success are the ones which most thoroughly blurred the lines between classical RISC and CISC--namely, Arm and x86.

> I think there is another blurry line between superscalar and VLIW architecture, too.

No, the line is pretty damn sharp. The core idea behind VLIW is that having hardware doing dynamic scheduling (as superscalar does) is silly and the compiler should be responsible for statically scheduling all instructions. The only blur here is that both VLIW and superscalar envision having multiple execution units that can be simultaneously scheduled with work, but who is responsible for doing that scheduling is pretty distinct.
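To illustrate what "the compiler statically schedules everything" means, here is a toy sketch of packing instructions into fixed-width bundles ahead of time. It is a simplification under loud assumptions: unit latency, each register written only once (so only true dependencies matter), and `Instr`, `schedule_bundles` and `PROGRAM` are hypothetical names, not any real compiler's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dst: str
    srcs: tuple  # names of registers this instruction reads


def schedule_bundles(program, width):
    """Greedy static scheduler: put each instruction in the earliest bundle
    that comes after all of its producers and still has a free slot."""
    bundles = []
    ready_at = {}  # register -> index of the first bundle that may read it
    for ins in program:
        slot = max((ready_at.get(s, 0) for s in ins.srcs), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])  # an empty bundle would be all NOPs
        bundles[slot].append(ins.name)
        ready_at[ins.dst] = slot + 1  # assumed unit latency
    return bundles


PROGRAM = [
    Instr("ld_a", "a", ()),
    Instr("ld_b", "b", ()),
    Instr("add_c", "c", ("a", "b")),
    Instr("mul_d", "d", ("a", "a")),
    Instr("add_e", "e", ("c", "d")),
]

print(schedule_bundles(PROGRAM, width=2))
# [['ld_a', 'ld_b'], ['add_c', 'mul_d'], ['add_e']]
```

In a VLIW that bundle list is the binary; a superscalar core would receive the five instructions as a flat sequence and work out the same grouping in hardware at run time.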

stevefan1999 · on March 17, 2024

Okay, I will have to review the lecture about VLIW, again: https://www.youtube.com/watch?v=nHHsYp7ZkHQ

simne · on March 16, 2024

> classical taxonomy of RISC/CISC dichotomy is basically non-existent nowadays

After digesting information about the IBM 360, I concluded that we have lost CISCs. One of the most important features of the 360 was customizable microcode, which you could load at system boot to get effectively different hardware (like with FPGA emulators of Amigas). It was widely used to emulate old hardware, like the IBM 1401 or IBM 7xxx series. But I have not seen this feature in the 390 documentation, so it looks like their 360 emulation became just software (and with the semiconductor advances of the 1990s, switching to software emulation looks adequate).

I must admit, ARM marketed a customized-microcode feature for adding new instructions (they have a standardized place in the instruction set, named "custom coprocessor instructions", so if you have enough money, you could make a special ARM with your additional instructions), but it is nothing compared to the 360.

simne · on March 18, 2024

Do you know what purposes (targets) minicomputers were made for, and why they were limited?

When computers first appeared, they were big, simply because technology limitations made small machines very expensive to use, so scale was used to make computation cheaper.

In the early 1970s, technology advanced to the stage where it became possible to make simplified versions of big computers for some limited tasks, though still too expensive for wide use.

A simple illustration: an IBM 3033 mainframe with 16 MB of RAM could serve 17,500 3270 terminals, while a PDP of the same era could serve a few tens (maybe 50, I don't know exactly), so mainframes, even though very expensive, gave a good cost per workplace.

A known example: a PDP was used to control a scientific nuclear reactor. The PDP was chosen not because it had the best MIPS/price ratio, but because it was the cheapest adequate machine for the task, and so affordable on a limited budget.

For a very long time, minis stayed in the niche of limited machines, used to avoid much more expensive full-scale mainframes. They were used to control industrial automation (CNC), chemical factories and other small things.

Once microcomputers (CPU on one chip) appeared, first known on the wide market in 1977, they began to eat the minis' space from the bottom, while mainframes continued to become more cost-effective (more terminals with the appearance of cheap modems, etc.) and ate the minis' space from the top.

And in the 1990s, when affordable 32-bit microprocessors appeared and megabytes of RAM became affordable, minis disappeared, because their place was captured by micros.

To be honest, I just don't know of anything we could not call a microcomputer now, as even IBM Z mainframes now have single-chip processors and the largest supercomputers are practically clouds of SoCs (NUMA architecture).

And I must admit, I still see PDPs (or VAXes) in enterprises, where they still control old machines from the 1990s (they are very reliable, even if limited from a modern view, and still work).

As I remember, the last symmetric multiprocessor supercomputer was the Cray Y-MP; later machines became ccNUMA or just NUMA or even cloud.

https://en.wikipedia.org/wiki/LINPACK

Unix was a simplified version of Multics, a system intended to run on mainframes (BTW there even exists an officially certified POSIX Unix for mainframes).

You could try mainframe software yourself; it is very affordable now with an emulator (sure, be careful about the license):

https://en.wikipedia.org/wiki/Hercules_(emulator)

And you will see for yourself how many things modern OSes borrowed from mainframes.

This is natural: people choose the simpler, cheaper thing (yes, I don't like x86, my love is 68k).

fulafel · on March 20, 2024

I don't know if a reduced instruction set can "take inspiration" from a big one, it just becomes a non reduced one.

Also these examples don't feel right to me: x64 has the same number of registers as CISCs traditionally did (eg m68k, z/architecture, vax). 32-bit x86 was just an exceptionally register-starved CISC. And SIMD postdates the RISC vs CISC divide by a long time; both schools of architecture got SIMD around the same time.

But the divide has become less relevant because originally instruction set affected chip area a lot, and there were big gains to be had by the quantitative approach of benchmarking compiled apps with different proposed instruction sets and seeing what runs fastest when transistors are spent on hot instructions vs execution engine resources. Nowadays we have more transistors than we know what to do with, and just put in lots of cores that end up sitting idle because of diminishing returns trying to speed up cores with more transistors.

titzer · on March 15, 2024

I agree that the line is pretty thin, but would draw it as: fixed-width versus variable-width. I think the M? line of Apple CPUs, with extremely-wide parallel decode, has been a game changer. The performance per watt is really off the charts. That's partly due to integrated RAM and all, but mostly due to microarchitectural changes, which I believe to be a massive step function in superscalar bandwidth (wider decode, huge ROB, huge numbers of ports). It seems like the power-hungry decode stage has been tamed, and I think this is because of fixed-width instructions in arm.
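A toy sketch of why fixed width helps wide decode: with fixed-width instructions every start offset is known up front, while with variable width each start depends on the length of the previous instruction. The variable-length encoding here (low two bits of the first byte give length minus one) is made up for illustration and is not x86; real x86 front ends use extra machinery (predecode, length speculation) to work around the serial dependency:

```python
def fixed_width_starts(code: bytes, width: int = 4):
    """All instruction starts are known immediately, so N decoders can
    each grab their own instruction in parallel."""
    return list(range(0, len(code), width))


def toy_length(code: bytes, pc: int) -> int:
    """Hypothetical encoding: low 2 bits of the first byte = length - 1."""
    return (code[pc] & 0x3) + 1


def variable_width_starts(code: bytes, length_of=toy_length):
    """Each start is only known after the previous instruction's length
    has been decoded, so finding boundaries is a serial scan."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += length_of(code, pc)
    return starts


fixed = bytes(12)
variable = bytes([0x01, 0xAA, 0x00, 0x03, 0x11, 0x22, 0x33, 0x02, 0x44, 0x55])

print(fixed_width_starts(fixed))        # [0, 4, 8]
print(variable_width_starts(variable))  # [0, 2, 3, 7]
```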

gumby · on March 15, 2024

> RISC designs also taken some inspirations from CISC (such as having SIMD/vectorization units)

I think that one went the other way: for example the PlayStation 2 used a MIPS chip with 256-bit SIMD instructions (the TMPR 5900) as well as a dedicated GPU (the so-called “emotion engine”)

Apofis · on March 18, 2024

With all of the sanctions China is really starting to push RISC forward, challenge ARM, and they are starting to find success. That "No" has absolutely changed to a "Maybe".

https://www.prnewswire.com/news-releases/global-and-china-au...

CalChris · on March 15, 2024

A 20,000 gate minimal RISC-V RV32E controller CPU isn't going to use μops. In 2024, RISC-V has turned that 2015 No into an unqualified Yes even if the microarchitecture of more complex OOO RISC-V systems resembles the microarchitectures of similarly complex x86 and ARM CPUs.

wmf · on March 15, 2024

To the extent that RISC-V is successful it's due to openness/freeness; RISC has nothing to do with it. An open "CISC-V" community would have been just as successful (and people wouldn't gripe about instruction fusion).

pclmulqdq · on March 15, 2024

I have made a few of those RISC-V CPUs. The minute the "M" instruction set shows up (with division), a macro/micro-op split becomes worth it if you want to minimize gate count or maximize speed.
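A rough software sketch of what such a split can look like for unsigned division: one macro-op is expanded into a fixed sequence of simple shift-and-subtract steps (restoring division) that a small datapath runs one per cycle. The micro-op names and the one-bit-per-step choice are assumptions for illustration, not how any particular core does it:

```python
def expand_divu(bits: int = 32):
    """Decode one DIVU macro-op into a fixed sequence of micro-ops."""
    return ["DIV_INIT"] + ["DIV_STEP"] * bits + ["DIV_DONE"]


def run_divu(dividend: int, divisor: int, bits: int = 32):
    """Run the micro-op sequence; returns (quotient, remainder)."""
    mask = (1 << bits) - 1
    rem = quo = 0
    for uop in expand_divu(bits):
        if uop == "DIV_INIT":
            rem, quo = 0, dividend & mask
        elif uop == "DIV_STEP":
            # Shift the next dividend bit into the remainder...
            rem = (rem << 1) | (quo >> (bits - 1))
            quo = (quo << 1) & mask
            # ...and subtract the divisor if it fits, recording a quotient bit.
            if rem >= divisor:
                rem -= divisor
                quo |= 1
        # DIV_DONE would write the result back; nothing to model here.
    return quo, rem


print(len(expand_divu()))  # 34
print(run_divu(100, 7))    # (14, 2)
```

The step micro-op needs only a shifter, a subtractor and a compare, which is the gate-count argument: the expensive instruction reuses cheap hardware over many cycles.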