IBM finally took the wraps off its much-anticipated Power6 microprocessor, which company executives said will double the clock speed of its current Power5 chip without stretching the power envelope. The Power6 processor, unveiled at an event on May 21 in London, is a dual-core chip with a top clock speed of 4.7GHz, roughly double the 2.3GHz of the Power5+ processors. The new chip also includes 8MB of L2 cache – four times as large as the current Power5 offering – and an internal bandwidth of 300GB per second. Ars’ John ‘Hannibal’ Stokes obviously also has his say.
Before anyone mentions Apple, these aren’t designed for desktops, and certainly aren’t suited for laptops.
Can we not mention A* in this thread? This CPU gets installed in MCMs the size of your hand. Each die consumes 160W and connects to a dozen channels of DDR2. It’s no more relevant to A* than is Itanium.
It was going to happen anyway, I was just throwing out a pre-emptive “wouldn’t work for Apple”.
Unfortunately John Stokes starts by mentioning Apple. Not that it makes much sense, if you ask me, but he does.
Team
The last Macintosh PowerPC chip (the G5) was based on the Power4 CPU.
Since then we have had:
Power4+
Power5 (+?)
and now the Power6
It makes you wonder what sort of desktop chip this could have been if the G5 equivalent were made from the Power6 base.
It does make you wonder.
edit: yep typos…
Darren
Edited 2007-05-22 00:49
Power is not PPC! The G5 and the Power5 didn’t have much in common. Why do people always post this question whenever anyone mentions Power CPUs?
Power is not PPC!
Actually they are one and the same and have been for many years.
PowerPC started as a modified version of the first POWER chips; within a couple of generations, though, the two recombined. These days they are all branded as “Power Architecture” and all use the PowerPC ISA.
The G5 and the Power5 didn’t have much in common.
They were both modified versions of the POWER4, so they had rather a lot in common actually; they can, for instance, run each other’s binaries, as can Cell.
As for POWER6 and Apple, if Apple was still using PPC they’d most likely have announced a POWER6-based machine. Unlike the POWER4, which had to be modified to run Altivec for Apple, the POWER6 was designed from the beginning to run Altivec. It’ll also scale down (less I/O, etc.) so Apple wouldn’t even need a modified version.
I doubt Apple would have any difficulty getting OS X to boot on one of these machines.
—
As for laptops, no, POWER6 won’t go in a laptop, but it could probably be modified to do so using the same techniques Intel and AMD use.
However, that may not be necessary given the number of PPC chips now appearing on the market with laptop-grade power consumption. Indeed it was lost in all the noise, but yesterday AMCC announced a 2GHz part which runs at 2.5 watts.
It’ll also scale down (less I/O, etc.) so Apple wouldn’t even need a modified version.
There is the niggling fact that it’s 341 fricking square mm!
IIRC, AMD’s new Barcelona/Phenom is over 300 sq. mm.
Barcelona is 281 mm^2, but it’s also a server processor. You’re not going to see something that big in an iMac-class machine anytime soon.
I’m gonna break my own suggestion and mention Apple here. Notice how they’re shipping 8-core (!) 3.0 GHz PowerMacs starting at $4000. They’ve gone from topping out at 2 cores with the 970MP to topping out at 8 cores with the Xeon, in the space of less than a year. Can you possibly imagine POWER competing with that kind of price/performance? How much is an 8-core POWER6 machine gonna cost, even stripped down to target Apple’s decidedly low-end market?
Edited 2007-05-22 17:17
Only 4 POWER6 cores are needed to beat the octo Mac. In SPEC2006 that is.
The Octo Mac comparison is potentially misleading.
The 4.7 GHz Power6 has a SPECint Peak of 21.5, and a SPECint Base of 17.8
The 3.0 GHz Xeon 5160 has a SPECint Peak of 18.1, and a SPECint Base of 17.5
Now, the first thing to keep in mind is that the Peak figures are taken with PGO. PGO is great for SPEC, but almost nobody uses it for anything else. The second thing to keep in mind is that the numbers are taken with XLC and Intel C++, neither of which are relevant to 99% of code out there. The standard OS X compiler is GCC. GCC is much stronger on x86 than on PPC.
Given all that, I’d be very surprised if GCC-compiled code on the Power6 achieved even parity with GCC-compiled code on the Xeon.
“””
There is the niggling fact that it’s 341 fricking square mm!
“””
That would be a square about 1.85 cm on a side. Why is that, in itself, a problem?
Edited 2007-05-22 18:52
The manufacturing cost of a chip goes up roughly exponentially with die size, since yield drops off sharply as area grows. The sweet spot for a high-volume mainstream chip is considered to be in the ~120 mm^2 range.
Of course, the Power6 isn’t a high-volume mainstream chip. The cost equation for chips like the Power6 (or Itanium or PA-RISC) is different. For such chips, the low-volumes mean that each unit carries a relatively large portion of the fixed cost of developing the design. Thus, the sweet spot for the per-unit die size is at a much higher point.
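To put some rough numbers on it, here’s a minimal sketch of the usual defect-limited yield model (the defect density and wafer cost below are made-up illustrative figures, not real foundry data):

/* Sketch: why cost per good die climbs so fast with area.
   Simple Poisson yield model: yield = exp(-D0 * area).
   D0 and the wafer cost are illustrative guesses, not real numbers. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double wafer_cost = 5000.0;                  /* USD, made up */
    const double wafer_area = 3.14159 * 150.0 * 150.0; /* 300mm wafer, in mm^2 */
    const double d0 = 0.005;                           /* defects per mm^2, made up */
    const double areas[] = { 120.0, 281.0, 341.0 };    /* mainstream, Barcelona, Power6 */

    for (int i = 0; i < 3; i++) {
        double dies  = wafer_area / areas[i];          /* ignores edge losses */
        double yield = exp(-d0 * areas[i]);            /* fraction of good dies */
        printf("%5.0f mm^2: ~%4.0f dies/wafer, yield %3.0f%%, ~$%5.0f per good die\n",
               areas[i], dies, yield * 100.0, wafer_cost / (dies * yield));
    }
    return 0;
}

Even with these toy numbers, the cost per good die climbs several-fold going from ~120 mm^2 to ~341 mm^2.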
“””
There is the niggling fact that it’s 341 fricking square mm!
“””
Power6 is a low-volume, high-margin chip. For the class to which it belongs, it ain’t big. In fact it is smaller than Power5. In comparison, Montecito was a few mm^2 short of 600 mm^2.
I know. Nicholas Blachford contended that the core could be used unchanged in the (presumably high-volume, low-margin) Mac market.
I know. Nicholas Blachford contended that the core could be used unchanged in the (presumably high-volume, low-margin) Mac market.
I wasn’t talking about Mac Minis!
Apple has some of the highest margins in the entire industry; they have certainly never been considered a low-cost player.
Anyway, have you seen how much silicon there is in those quad core machines?
Anyway, quad core POWER5+ CPUs are already used unchanged in low end IBM machines which don’t cost much more than Apple boxes.
Yes, Apple is high-margin, but “high-margin for personal computers”. Power6 is for a market whose “low end” is way above the PowerMac level. And 341 mm^2 is pushing it, even for a PowerMac.
As for “don’t cost much more than Apple boxes”. The lowest-end quad-core Power5+ machine using the slowest-available Power5+ CPUs is $5500. The corresponding quad-core PowerMac using the slowest-available Xeons (same 4GB of RAM) costs $2900. The cheapest eight-core Power5+ machine using the slowest-available Power5+ CPUs is $18000. The cheapest eight-core PowerMac using the fastest available Xeon processors is $4000. Absolutely no contest, not in a market that doesn’t need the POWER machine’s extensive RAS features.
It should also be noted that Power5+ is substantially smaller than Power6, and is also a chip at the end of its lifecycle (so I’m not surprised to see good deals from IBM). If we see quad-core Power6 machines much under $8000, I’ll be very surprised.
There is also the question of what the heck to do with iMacs and MacBooks. Because as far as Apple is concerned, those machines are way more important than the PowerMac. On the Intel side, these machines have chips in the same league (just a lower clockspeed or fewer cores) as the ones in the PowerMac. On the PPC side what do you use? The PA6T? A chip that has performance comparable to a Core 2 at 1 GHz?
Edited 2007-05-23 02:42
Power is basically PowerPC these days. The G5 was based on the Power4 microarchitecture, as was the Power5. So the G5 and Power5 have a lot in common. The main difference between them is that the G5 has a VMX unit, while the Power5 had some optimizations to the basic Power4 microarchitecture (deeper buffers, tweaks to the grouping mechanism, etc).
I wonder how much Apple’s market growth would have suffered by staying with IBM…
Do the new IBM chips need an automobile cooling system to stay cool?
Edited 2007-05-22 00:57
The PPC970 that Apple uses/used was based on the Power4. As such, it was a stripped-down, one-core relative. (Later it was modified and fitted into the Xenon CPU (Xbox 360) and the Cell CPU (PS3, blades and friends), and used somewhat unmodified in the Wii/GameCube.)
POWER has been made exclusively for server use, even though some RS/6000 desktops might be Power6-powered in time.
No, NO, NO!
The PowerPC cores in the 360’s Xenon, the PS3’s Cell, and the Wii’s Broadway processor have *NOTHING* to do with Power4 or PPC970. The single resemblance the Xenon/Cell cores share with the 970 is that a subset of their complete instruction set is the 64-bit PPC instruction set. That is where the similarity ends. The Wii’s Broadway processor isn’t even 64-bit!
The PPC cores in the Xenon are functionally identical to the PPC core in the Cell. The only differences are in the cache-control mechanism and in the communication mechanism — the Cell uses their “XO” communication fabric for off-chip communication and a ring topology to communicate with the SPE vector units, while the 360 uses Hypertransport (or something similar).
These cores are *not* derived from any commercially available PPC product line, either server or desktop. These chips come from an experimental architecture IBM developed to push the limits of PPC architecture on a small, low-power die. While the PPC970 and similar POWER designs have out-of-order execution, these embedded PPC cores do not. These cores also implement their dual-threaded execution in a novel way: in addition to a standard alternating scheduler, when one thread stalls (say on a memory access) the other thread will execute.
The Wii’s Broadway processor is based on the gamecube’s Gekko processor, which in turn is based on the G3 PowerPC processor that was found in the early iMacs. Nintendo had IBM add some SIMD instructions for the Cube which overlap the FPU execution unit — basically they added instructions to process a pair of 32bit floats using the silicon from the 64bit FPU. The Broadway processor’s re-spun silicon simply runs faster, adds aggressive power-saving features and more fine-grained cache control, uses a smaller process, and likely has some minor silicon tweaks. IBM offers the exact same PPC core for the embedded market as a cheap, powerful, extremely low-power embedded CPU.
That looks like one powerful processor.
4.7Ghz top frequency & Dual Core. And was able to almost double the benchmarks from the previous Power5 chip. Cool.
Looks like it is geared for the server market right now.
What OS will work on it? I’m thinking they’ll use Linux or BSD. Any other ones?
Edit: I’ll also add that the chips are 64 bit and have 8MB cache per chip & use same amount of energy / electricity as Power5.
http://www-03.ibm.com/press/us/en/pressrelease/21580.wss
And I believe max speed is 4.7Ghz per core.
Edited 2007-05-22 01:28
Being used in IBM servers, I’m guessing the officially supported operating systems will be AIX and Linux.
I had totally forgotten about AIX ( IBM’s Unix ). Haven’t seen anyone mention AIX in ages.
Lately we are in a Linux world ( for Unix )
Sometimes it is easy to forget about the other Unixes / Unices also out there.
Well it seems like IBM is preparing to slowly put AIX in the grave (like HP seems to be doing with HP-UX).
The plan must be to replace these giants with specially reinforced Linux versions – reducing their expenses for the maintenance and development of the systems.
AIX development is going strong and there is a place for both Linux and AIX on the System p platform. There is no way Linux can offer the level of performance and hardware integration that AIX provides on IBM hardware. AIX isn’t going anywhere.
I’m pretty sure OS/400 will work on it too, maybe under the shiny new name of i6/OS?
Uh, AIX?
Are these cores reported as running at 4.7GHz per core or combined? If per core then… very nice… I’d like one of those as a plaything.
There is no such thing as core speeds combined… that makes no sense. Any multicore processor with its speed reported at, say, x GHz means each core is clocked at that speed. So in this case, dual core @ 4.7GHz = each core runs independently @ 4.7GHz.
It’s a real shame that the Power line of chips is only suited for servers. Average desktop users will never benefit from all the innovations that IBM or Sun invent.
“””
It’s a real shame that the Power line of chips is only suited for servers. Average desktop users will never benefit from all the innovations that IBM or Sun invent.
“””
Yeah. I’d love to have one of those in my laptop.
As with so many other things, perhaps bundling is the answer:
http://www.autozone.com/selectedZip,73112/initialAction,partProduct…
At 160 watts per die (as Rayiner mentions) my single die Presario would run for 7 hours on the Power6/27-DLG bundle. (Assuming about 1200 watt-hours for the battery)
I’m joking, of course,
Edited 2007-05-22 01:53
Yes, we do. We benefit from competition (Intel/AMD vs. IBM/Sun) and technology progress, just like open systems today are still benefiting from mainframe technologies.
Oh, major game consoles are using “POWER line of chips”, in case you don’t know.
Oh, major game consoles are using “POWER line of chips”, in case you don’t know.
I am aware of the game consoles, but a Wii is not my laptop. Yes, average users do benefit from Intel and AMD as their server technology makes it way down to consumer chips. Specific hardware implementations from Sun/IBM can’t transition to desktop/laptop chips because they don’t make any. Intel/AMD can copy their ideas and pass them along to me, but this is an inefficient process.
Edited 2007-05-22 02:31
I still regret that Apple had to switch to Intel X86. I just like the idea of Apple computers running off of PowerPC chips more, if only to be different. Too bad IBM was not at least a little more accommodating with its PowerPC offerings.
Imagine Steve Jobs at WWDC07:
“Oh, and one more thing. Introducing the new 4.7GHz, dual-core, PowerPC Power Mac. The world’s most powerful personal supercomputer!”
Now, that one would be worth attending!
http://www.cminusgames.com
FWIW, consumers do benefit from IBM’s chip innovations. IBM is partnered with AMD and is pretty much the only firm seriously competing with Intel on process technology.
<rant>
IBM is in the rare breed of good old tech companies like HP that put significant money into R&D and bring us shiny new toys. Contrast this to Dell, and it’s no wonder they are in trouble.
</rant>
True, but HP made one stupid decision with Itanium – personally, I think they would have been better off adopting the SPARC ISA and plonking it on a superior microarchitecture.
Intel volume manufacturing and cash with HP innovation, and using an open-standard microprocessor ISA like SPARC, would have put them in a good place to compete against POWER.
With that being said, however, features are being pushed back into consumer processors; MMIO, for example, is going to be added to future x86 processors. IMHO the mainstream needs to look at the features that have long existed within the RISC world and pull them back into the x86 world, which would improve the reliability and stability of consumer-level processors.
Edited 2007-05-22 04:03
HP bet big parts of the future of the company on Itanic and lost big time.
They inherited the Alpha with the Compaq Purchase and killed it off.
IMHO, they should have seen the writing on the Wall and ditched the Itanic in favour of the Alpha Architecture but I suspect the contract with Intel was a big hurdle in doing this.
HP is as bad as Microsoft with the FUD.
When Alpha was launched HP put out a spoiler ad campaign basically saying
“Who Needs 64Bit? Not you”
Back on Topic.
The Power6 architecture is so far removed from the majority of CPUs that Intel is turning out that most comparisons are very difficult.
I’ll personally applaud Intel when they ditch their current x86 arch, and especially the way they do memory accesses. The AMD way is far superior.
Killed it off ??
More likely they sold it off to Intel.
They should have continued with the Alpha CPUs.
They were killer CPUs at that time and could still have been.
Or as said in another post, they could have adopted the Sparc core.
Or amd64
I’ve been writing an assembler for amd64 for the past two weeks, and from what I’ve seen so far, I like it a whole lot more than PPC or SPARC.
Comparison of x86-64 instructions, and their equivalent in PPC.
add r10, r12 is 3 bytes and 1 issue slot on x86, 4 bytes and one issue slot on PPC.
add r10, [r12] is 3 bytes and 1 issue slot on x86, 8 bytes and two issue slots on PPC.
add r10, [r12 + r11*8 + 24] is 5 bytes and 1 issue slot on x86, 12 bytes and three issue slots on PPC.
push r10 is 2 bytes and 1 issue slot on x86, 8 bytes and two issue slots on PPC.
mov r10,0x123456789ABCDEF is 10 bytes and 1 issue slot on x86, 20 bytes and five issue slots on PPC.
All of these are 1 byte shorter for x86 if using the lower 8 GPRs and 32-bit operations.
Pretty neat for an architecture that supposedly sucks so bad…
First you’re cheating with x86-64: x86 sucks bad, x86-64 is only not very good.
Comparing the byte length of instructions is only one metric, CISC’s variable length encoding makes it more difficult to decode, which means that for a similar amount of money, a CPU maker would develop a RISC CPU with better performance than a CISC CPU.
Of course Intel have more money to spend developing x86 CPUs than most other CPU makers..
That said, it’s also possible to make “RISC” CPUs with a good average instruction length: ARM Thumb2, for example, provides 16- and 32-bit operations, which is a good compromise: easier to decode than byte-length instructions but with a ‘good enough’ instruction density, comparable to x86.
CISC’s variable length encoding makes it more difficult to decode
Intel engineers can do it with their eyes closed.
CISC instructions usually do more work per instruction.
Fewer commands to issue means less memory/cache bandwidth and fewer execution resources required.
The advantage of RISC is simple hardware with nice performance, while a CISC design of comparable simplicity will have much worse performance. Theoretically, RISC vendors can provide more execution units for the same transistor count because of the simpler hardware.
But the x86 camp can outdo any (performance) RISC vendor in all of price/performance/power ratings.
Let’s compare the K8 vs the PPC970/G5.
3 IU + 3 FPU vs 2 IU + 2 FPU.
AMD has fast, low-latency units while the G5 has shamefully high-latency simple units.
The G5’s OOO engine is weak and suffers from a lot of stalls.
The G5 is slower and eats ~2 times more power.
Personally I don’t like PPC as an ISA, but it has some nice big-iron functionality like proper virtualization, etc.
Now we have Power6, where the best performance lives alongside an insane price and power consumption.
Edited 2007-05-23 11:35
“But the x86 camp can outdo any (performance) RISC vendor in all of price/performance/power ratings.”
No, you can’t.
The sole reason you can compare the G5 to x86 is the abundance of money and resources in the x86 world. The great performance levels of x86 chips are built on that pillar. Once you remove that pillar, that statement crashes.
“CISC instructions usually do more work per instruction.”
In theory, yes, I would agree with that statement. In practice, no; I couldn’t agree less.
Chips like SuperH, ARM Thumb or MIPS-16 can and do have higher instruction density, exactly because of smart rethinking of their RISC roots.
But even if you take more common RISC designs, they will have instruction density similar to x86, as present-day RISC chips are as abundant as x86 in features and instructions, whatever metric you pick.
But truthfully speaking, it doesn’t matter any more. x86 is king of the hill. One day everything else will be based around it.
SuperH, ARM Thumb or MIPS-16
These are not high-performance parts.
I like ARM and actually I have some ARM7 asm coding experience.
But even if you take more common RISC designs
ARM is the most common RISC design =)
There are only 1.5 “big” RISCs left – Power and SPARC.
x86 is king of the hill. One day everything else will be based around it.
Not everything, but to all appearances (unfortunately) high-performance ARMs will be kicked out of complex gadgets in a matter of several years.
“these are not a high performance parts.”
This is a complete misunderstanding of the argument and the purpose of those chips. Those CPUs are excellent performers. Not in absolute speed of code when compared to something like Power6. But for the problems they solve and within the usage constraints of those CPUs (limited design budget, low power budget and transistor count, and cheaper to purchase than a shoelace) it is impossible to design an x86 chip (or any CISC for that matter) capable of delivering even a comparable amount of performance. So there are usages where RISC clearly brings better performance. That’s the point.
“Not everything, but to all appearances (unfortunately) high-performance ARMs will be kicked out of complex gadgets in a matter of several years.”
I agree. There will be one architecture to rule them all.
Edited 2007-05-23 21:54
This is a complete misunderstanding of the argument and the purpose of those chips.
I know, but my original statement only applies to PPC and SPARC as the last surviving top-performance RISCs.
But for the problems they solve and within the usage constraints of those CPUs
True, but that doesn’t mean ARM can replace, say, Power970.
ARM is not a processor of that caliber
Edited 2007-05-23 22:10
Argh fast reading problem.
With that in mind your statement is completely true. Economies of scale and abundant transistor counts make it true. Architectures don’t matter too much anymore.
Sorry about that.
“Not everything, but to all appearances (unfortunately) high-performance ARMs will be kicked out of complex gadgets in a matter of several years.”
ARM processors sell in quantities which dwarf the x86 world; I don’t think it’s ARM which is likely to get kicked out of anything. x86 processors rule the desktop world and part of the server world, but they don’t even count in the embedded world.
I agree. There will be one architecture to rule them all.
That’s looking less and less likely. While x86 did manage to kill off some of the RISC competition, the opposite is now happening, with very high performance exotic processors appearing (GPUs, Cell, Niagara), all of which are RISC and all of which are in-order. The day of one architecture type ruling everything is coming to an end.
The day of one architecture type ruling everything is coming to an end.
You’re too optimistic, Nicholas.
Check out the new Intel announcements. They’re seriously looking at embedded. Their solutions are still not competitive, but who knows what they will show in the next few years.
In theory, yes, I would agree with that statement. In practice, no; I couldn’t agree less.
In theory and in practice. Both modern lines of x86 chips use the CISC-y nature of x86 code to increase dispatch and execute bandwidth. In theory, Core 2 is a 4-issue design, but in practice the decode/issue bandwidth can be quite a bit higher, because it can issue a load+op as a single instruction. K8 is a 3-issue design, but with the right instruction mix can behave as up to a 6-issue design, because the basic unit of issue is a load+op macro-op.
amd64’s variable-length immediates and complex addressing modes also reduce code size and save issue/execute bandwidth. Loading a 64-bit constant can be done with a single micro-op on K8 and Core 2, but is a sequence of 5 instructions in PPC64. x86 can do a load with an index and displacement in a single AGU micro-op, while it takes 2 instructions on PPC. And RIP-relative addressing is just a plain good idea.
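If it helps to see it from the C side, here’s a minimal sketch of the two cases (a toy example of mine, not from the article): compilers typically turn the constant into a single mov with a 64-bit immediate on x86-64 and reach the table with a single RIP-relative address, whereas on PPC64 the constant is usually built in 16-bit pieces and the global goes through the TOC.

#include <stdint.h>

/* x86-64: typically one 'mov reg, imm64'.
   PPC64:  typically a lis/ori/rldicr/oris/ori sequence (5 instructions). */
uint64_t big_constant(void) {
    return 0x0123456789ABCDEFULL;
}

/* x86-64: 'table' can be addressed RIP-relative in the load itself.
   PPC64:  the address usually comes via a TOC entry first. */
static const int table[4] = { 10, 20, 30, 40 };

int lookup(int i) {
    return table[i];
}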
And RIP-relative addressing is just a plain good idea.
On ARM it is very handy to have constant pool/data near the code. On 68k, 100% position independent code could be done as well.
It is a shame that PPC doesn’t have PC-relative addressing, so a kind of perversion with the TOC is required.
However, direct access to the PC register can potentially require tricky pipeline design and possibly some inefficiency, as far as I understand.
But the truth is that 99% of immediate values do not need to be 64 bit, nor 32 bit either.
Addresses are most likely contained in various arrays/tables.
Globals are rarely used in any sane c++ code.
It is mostly ancient C code that suffers from an abundance of globals.
Edited 2007-05-23 19:37
Globals are rarely used in any sane c++ code.
And of course C++ is the only programming language anybody ever uses…
In languages that don’t suck, it’s quite useful to have quick access to various global data structures, often utilized by the runtime.
That said, full-size immediates are not always needed. However, the x86-way of handling immediates is a lot cleaner than the typical RISC-y “load 13 bits of a register at a time” method.
In languages that don’t suck
Ohh, can you show the list here?
it’s quite useful to have quick access to various global data structures, often utilized by the runtime.
I don’t think this is a clever solution.
Anyway, the RISCy solution for this is a Global Pointer.
Actually the lack of a global pointer and absolute addressing forces you to use various tricks to enable reentrant code with global variables.
However, the x86-way of handling immediates is a lot cleaner than the typical RISC-y
If you need (most likely) to operate on the full register width on x86-64, the immediate will be 32- or 64-bit, even if only a few bits are used.
8B0425 01000000 movl 1, %eax
030425 9CFFFFFF addl -100, %eax
4C8B1425 01000000 movq 1, %r10
4C031425 9CFFFFFF addq -100, %r10
In every case of a small immediate, the RISC instruction is half the size.
Edited 2007-05-23 21:47
“Ohh, can you show the list here?”
Java for example. It’s full of them.
Ohh, can you show the list here?
ML, Lisp, Smalltalk, dozens of others. Generally, there is a pretty strong correlation between “not sucking” and needing extensive runtime support services.
If you need (most likely) to operate on the full register width on x86-64, the immediate will be 32- or 64-bit, even if only a few bits are used.
Most x86 instructions have a form that sign-extends an 8-bit immediate. MOV doesn’t, but loading a small immediate is an uncommon operation in an ISA in which almost all arithmetic instructions take general immediate operands.
Your examples are all very poorly encoded. You’ve got unnecessary modrm and sib bytes in there.
8B0425 01000000 movl 1, %eax
My assembler gives B8 01 00 00 00 (5 bytes)
030425 9CFFFFFF addl -100, %eax
I get 83 C0 9C (3 bytes)
4C8B1425 01000000 movq 1, %r10
My assembler gives 49 BA 01 00 00 00 00 00 00 00 (10 bytes), but there is a 7 byte encoding (I should fix that).
4C031425 9CFFFFFF addq -100, %r10
49 83 C2 9C (4 bytes)
ML, Lisp, Smalltalk, dozens of others.
Which programs that you frequently use have been written in these languages?
As I said, a Global Pointer is a more flexible and more compact way to use globals.
On PPC you don’t need to load any 64-bit immediate, because you have the TOC.
And finally, globals are forbidden on some systems.
Your examples are all very poorly encoded.
Well, this is what gnu-as generated for me.
Anyway if i put -130 instead of -100, my examples will be valid.
Even in your examples, RISC has advantage.
Your encodings: 5, 3, 7, 4 bytes = 4.75 average
My original gnu-as encodings: 7, 7, 8, 8 bytes = 7.5 average
versus
PPC: 4, 4, 4, 4 bytes = 4.0 average
Which programs that you frequently use have been written in these languages?
Emacs, Maxima, Darcs, OpenDylan, etc. I’ve spent most of my day for the last few weeks in Emacs/SBCL. I spend even more time in front of Matlab (which is based on Java and a custom dynamic language). IMHO, with the rise of Java and OODLs, RISCs (which are very much FORTRAN-machines) are looking a lot less attractive.
As I said, a Global Pointer is a more flexible and more compact way to use globals.
A global pointer is fine for thread-local variables, but less great for function-local constants (and in HLL’s which allow things like constant lists or structs, you may have a lot of these). RIP-relative addressing saves you a lot of relocations in this latter case.
Well, this is what gnu-as generated for me.
You must have used it incorrectly. I have a hard time believing GNU AS generates such crappy code.
Anyway if i put -130 instead of -100, my examples will be valid.
That’s really reaching for straws. Small immediates are there for things like structure offsets, shift counts, loop/pointer increments, small bitmasks, etc. All of those will almost always fit into 8-bits. Those that don’t will be things like global pointers, full-word bitmasks, etc, that won’t fit into 16-bits either. About the only thing I can think of that’d fit into 16-bits and not 8-bits would be something like a row-stride in a large constant-size matrix.
Even in your examples, RISC has advantage.
Only because *your* example had an even mix of MOV and ADD instructions. MOV doesn’t have an efficient byte-immediate encoding, because there is almost never any reason to MOV a small immediate into a register. x86 isn’t a load/store architecture. If you need a small immediate as a source operand, just use the reg/imm form of the instruction — no need to load into a temporary register first. MOV’ing an immediate into a register only helps in two cases:
1) When the immediate is large and is used repeatedly by subsequent instructions. This reduces code-size by using a smaller reg/reg form instead of a reg/imm form.
2) When adding two immediates together.
The second case is useless (unless your compiler sucks), and in the former case, MOV pays no code-size penalty relative to RISC for the large immediate.
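In C terms, the point is that small constants almost always appear as a source operand of some operation, so they never need a register of their own (a toy example of mine):

/* The 24 ends up as an immediate operand of the arithmetic instruction
   itself (an add or lea with a sign-extended 8-bit immediate); it is
   never mov'ed into a register of its own first. */
long advance(long offset) {
    return offset + 24;
}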
Edited 2007-05-24 14:31
First things first: I was aiming at the historical claim that CISC chips have higher code density, which translates to better performance because of improved locality and the higher semantic power of their instructions. This argument was presented in this thread by a few people, and theoretically speaking it is true. But modern RISCs are abundant in features and their instruction sets are full of instructions which do quite complex things. And some designs like SH or the ARM Thumb series, with 16-bit instructions while still being 32-bit designs, directly attack the “higher code density” advantage of the CISC approach. And they are very successful at doing so. So that’s why I said that this argument doesn’t hold in practice.
Now about your post. Probably, because of its accumulator-based concepts, one can treat x86 ASM as a high-level intermediate language somewhat more easily than a usual RISC design; but to state that multi-issue dispatch or macro-op load+op fusion is something exclusive to x86 because of its CISC nature is a completely different subject. I don’t think those concepts have anything in common with CISC or RISC for that matter. Instead they are a consequence of the virtually unlimited transistor and resource budgets of the companies that design x86 chips.
And because x86 sucks in the first place. x86 needs to do those kinds of things to be competitive. RISCs usually don’t. For example, look at the POWER line. POWER3, 4, 5 and the PPC970 are derivatives of the same general design. Those CPUs have always been among the best out there, often taking first place on many different benchmarks. I bet that the development of those 4 designs combined needed less money than the development of a single x86 architecture to which any of them was comparable at a given point in time (P6, NetBurst, K7, take your choice).
A Verilog/HDL design for something like SuperH can be done within two to three years with only a handful of people, and it will still be very competitive. With x86 that is virtually impossible. That is the true power of the “RISC approach”.
Sorry for the long post…
Edited 2007-05-23 22:08
Now about your post. Probably, because of its accumulator-based concepts
x86 still shows accumulator roots in some places, but the changes to the ISA in the 286 really made it a REG/MEM design. The modrm/sib mechanism of operand encoding allows you to use registers, memory, and immediates in a very clean and (dare I say it) orthogonal way.
but to state that multi-issue dispatch or macro-op load+op fusion is something exclusive to x86 because of its CISC nature is a completely different subject.
Load-op fusion would be much, much harder in a RISC. In x86, a single instruction gives you two pieces of information:
1) The fact that the op consumes the results of the load (serial dependency).
2) The fact that nothing else consumes the results of the load.
In x86, you don’t really even have to do any “fusion”. The decoder just emits a fat micro-op in response to a complex x86 instruction. In a RISC, you’d have to look across instruction boundaries searching for the serial dependency, and then throughout the rest of the pipeline you have to deal with the fact that the load creates architecturally-visible state that may be consumed by any number of instructions not yet seen.
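A trivial C sketch of the load+op case being discussed (my own example): the compiler can fold the memory operand straight into the add on x86-64, while a load/store RISC has to emit a separate load into a scratch register first.

/* x86-64: roughly "add reg, [mem]", a single instruction / macro-op.
   PPC:    roughly "ld rT, 0(rP); add rA, rA, rT", two separate instructions. */
long accumulate(long acc, const long *p) {
    return acc + *p;
}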
Those CPUs have always been among the best out there, often taking first place on many different benchmarks.
The PPC970 was at best mediocre. POWER4/5 did great in server space, because of IBM’s incredible memory subsystem/interconnect, but the basic core wasn’t so hot. Given the sheer execution resources of the 970 (4+1 issue, 8-way execute, 5-way retire), it’s incredible that it performed as badly as it did. The vastly simpler 3-issue, 3-way execute, 3-way retire K8 was all over it in real world performance.
“Load-op fusion would be much, much harder in a RISC”
Well, certainly it’s true if you need to do it the same way x86 does. But I don’t think a RISC would do it exactly that way.
On closer look, load + op is not a new concept. Compiling code to a high-level intermediate form in order to break interdependencies between variables and then create higher-density instructions from that optimized code is at least a 25-year-old concept. And a very popular one too. For example, Java does it. The only thing new here is that the ASM level is treated as the intermediate code and it is done in hardware. It would be a trivial thing to do in a design like the aJile aJ-100.
I don’t see a reason why the same code compiled for a RISC architecture *needs* to have a higher number of interdependencies between values stored in registers when compared to CISC. OK, C = A + B compared to A = A + B certainly needs one architectural register more than the CISC way, but as every RISC has at least four times more explicit registers, RISC still wins. Add smartly chosen register renaming and a well-made compiler and I bet you could do it in a way comparable to CISC while keeping the design at the same level of complexity. The sole problem with this kind of strategy would be an increase in the number of instructions when compared to the code we generate today. But even that can be solved with something like the P4’s caching after decode.
So yeah, I don’t think it is hard to do it in RISC. It is only more obvious to do it in CISC.
Of course it is certainly not the case that today’s RISC designs do something like this, but that is primarily because they don’t have as much need to do it in the first place.
Speaking of the PPC970, it was a cool design. Yes, it had its problems. I heard about the weak memory interface. And the years of waiting for a decent gcc implementation in order to actually use its strengths. I don’t think it performed _that_ badly. At the time of introduction it was a killer chip. I don’t know how much money was spent on the development of that design, but I bet it was an order of magnitude lower than anything in the x86 world. And that itself speaks about the strengths of RISC ideas. If x86 were as simple to develop, I believe there would be much more competition in x86 than we have today, or we would be further along the performance spiral, or something between those two extremes.
Well, certainly it’s true if you need to do it the same way x86 does. But I don’t think a RISC would do it exactly that way.
How else are you really going to do it? The problem is that RISC instructions don’t explicitly encode the serial dependency and unnamed temporary. This means that you have to do a lot of work on the instruction stream to recover this information, which defeats much of the benefit of doing the optimization in the first place.
But even that can be solved with something like the P4’s caching after decode.
Which is really a horrible idea (which is why Intel abandoned it). Decoded instructions are huge, and effective instruction cache size is very important. It’s a speed/size tradeoff that made sense back when memory was as fast as the core, but doesn’t now that memory is so much slower than the core. Much like RISC itself really…
Of course it is certainly not the case that today’s RISC designs do something like this, but that is primarily because they don’t have as much need to do it in the first place.
The conventional wisdom is that the load/store subsystem isn’t as important for RISCs because memory operations are much less common due to the extensive register file. This was a wisdom that made sense with FORTRAN code, but doesn’t make sense for modern languages that make extensive use of indirect data structures.
Consider something like a type-check in an object-oriented language. Say the class pointer is the first field in the object, and a type-check consists of comparing it to the class pointer of a known type. In x86, this is encodable in a single instruction, a cmp instruction with a memory operand and an immediate operand. Total size: 7 bytes. On PPC this is four instructions and 16 bytes (assuming 32-bit headers). On a K8 or Core 2, the former sequence takes up a single issue slot and a single ROB entry. On a PPC, the latter sequence will take up four slots. In fact, the type-check and subsequent branch will take up an entire five-instruction dispatch group on the G5!
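To make it concrete, the check looks roughly like this in C (a sketch with a hypothetical object layout; the names are made up):

struct klass;                        /* per-class descriptor */

struct object {
    struct klass *klass;             /* class pointer as the first field */
    /* ... instance fields ... */
};

extern struct klass SomeClass;       /* the known type being tested against */

int is_some_class(const struct object *obj) {
    /* x86 can encode this compare-against-memory as a single cmp with a
       memory operand; a classic load/store RISC needs a load, constant
       building, and a separate compare. */
    return obj->klass == &SomeClass;
}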
These sorts of operations aren’t that common in FORTRAN, but are extremely common in modern dynamic languages, or in business/AI applications written in any language. x86’s have had really sporty memory subsystems for a long time to make up for the small register file. The P6 core could do a load and a store per cycle, while competing PowerPC chips were a load OR a store per cycle until the introduction of the G5 many years later. A side-effect of this is that modern x86 chips tend to fly on complex code with indirect data structures.
Speaking of the PPC970, it was a cool design.
If your goal was to run FMAs really fast.
Yes, it had its problems. I heard about the weak memory interface.
Weak memory interface, terrible scheduling restrictions, 2-cycle integer latency, smallish cache, and tendency to split all the useful multi-side-effect PPC instructions.
At the time of introduction it was a killer chip.
Until the Opteron was released a month later…
I don’t know how much money was spent on the development of that design, but I bet it was an order of magnitude lower than anything in the x86 world.
IIRC, Power4 and Power5 were on the order of $300-$400 million each. It’s unlikely that K8 was even twice that amount. I don’t understand why you think RISCs are so much cheaper. If you’re committed to doing a competitive OOO implementation, a simpler decode stage isn’t going to save you much. The real reason Power4/5 was cheaper was because IBM used automated design techniques much more extensively than Intel/AMD do. I doubt that advantage holds for Power6, which is allegedly heavily custom.
First you’re cheating with x86-64: x86 sucks bad, x86-64 is only not very good.
Cheating how? I didn’t say anything about x86, just x86-64. And x86-64 is very much in the same spirit as x86.
Comparing the byte length of instructions is only one metric, CISC’s variable length encoding makes it more difficult to decode
And who says ease of decode is the most important metric? Maybe it was once, when CPUs were much simpler beasts, but now?
which means that for a similar amount of money, a CPU maker would develop a RISC CPU with better performance than a CISC CPU.
I’m skeptical. x86-64 code can be half the size of PowerPC code. Saving a few pipeline stages in the decode step is unlikely to offset the cost of effectively halving the size of the instruction cache.
>Cheating how? I didn’t say anything about x86, just x86-64. And x86-64 is very much in the same spirit as x86.
It was a tongue-in-cheek comment, but not far from the truth: the move from 8 to 16 registers in x86-64 can bring up to a 20% performance improvement, which is *huge*.
>And who says ease of decode is the most important metric?
I didn’t say that it was the most important metric..
If you’re a CPU maker with little cash, it’s quite important, if you’re Intel not so much.
> halving the size of the instruction cache.
Depends on which cache and CPU..
If memory serves, on the P4, Intel stored the *decoded* instructions in a ‘trace cache’, so there is no gain in size in that cache; of course there is still a gain in size in the other cache and a gain in bandwidth usage.
> x86-64 code can be half the size of PowerPC code.
But PowerPC is not the only RISC ISA! As I’ve said, the ARM Thumb2 ISA is a 16/32-bit RISC-like ISA, nearly as easy to decode as a 32-bit-only ISA but with instruction density more similar to x86-64.
>Saving a few pipeline stages in the decode step is unlikely to offset the cost of effectively halving the size of the instruction cache.
Frankly this is quite difficult to say, at this level it’s all a matter of compromise..
Usually, the p-series and i-series get the new Power chips first. I can’t wait for i6/OS for our shiny new AS/400… er, iSeries… dammit no, err, i5. Yeah, that’s it.
When I logged on and saw 25 comments, I thought they were from people who had some hands-on experience with the chip, the early testers, ya know? Should have known better.
Anyways, we have a few of the test machines at one of our larger data centers, too bad not at the one I work at *sigh* ( I work at a legacy one)
The POWER6 should be the hands-down performance leader in the market. In fact, the POWER5 and POWER5+ are still kicking everyone’s ass even as the POWER6 is being announced. The POWER5 chips beat out an HP Superdome in a TPC benchmark with half the number of processors.
What are Sun and HP doing to compete?
A processor distributed by IBM with a frequency of 4.7GHz. Of course, I can’t help but think “4.77MHz” when I read it.
Seems like only yesterday!
Edited 2007-05-22 20:07