AMD’s CFO Devinder Kumar recently commented that AMD stands ready to manufacture Arm chips if needed, noting that the company’s customers want to work with AMD on Arm-based solutions. Kumar’s remarks came during last week’s Deutsche Bank Technology Conference, building on comments from AMD CEO Lisa Su earlier in the year that underscored the company’s willingness to create custom silicon solutions for its customers, be they based on x86 or Arm architectures. Intel intends to produce Arm and RISC-V chips too, meaning that the rise of non-x86 architectures will be partially fueled by the stewards of the dominant x86 ecosystem.
This is entirely unsurprising news. You don’t have to build Snapdragon or Apple-level ARM chips to make a lot of money with Arm, and companies like Intel and AMD would be stupid not to look into it.
Pretty sure “Intel making non-x86 chips” is somewhere on the list of ‘signs the world is going to end’.
Sure, I can see them making ARM chips for control boards, or even things like FPGA boards with an ARM SoC. But selling individual ARM chips to PC builders would be a sign they have given up. Guess we are not there yet, as I don’t think anyone sells just the chip to consumers for a DIY PC build… I think that is also when Intel will actually start to worry…
Why would they have to give up? Instruction sets don’t really matter or convey any real advantage.
What matters is the quality of code the compilers produce, and the micro architecture that executes the instruction set.
Everything else is about compatibility
ARM and RISC-V might be more extensible than x86, so when it comes to adding accelerators and custom silicon, ARM and RISC-V might have an advantage.
Instruction sets matter. x86’s variable-length encoding makes it difficult to decode many instructions in parallel.
Instruction decoding hasn’t been a major limiter to performance for over 20 years. Out-of-order execution made the debate between fixed and variable ISA encodings moot.
If you scale the M1 Firestorm cores and the Zen3 cores to the same node, you end up with remarkably similar area and IPC results. The out-of-order microarchitecture makes a bigger difference than the ISA encoding.
The higher decode complexity of x86 is cancelled out by the higher instruction-fetch requirements of ARM to produce the same volume of micro-ops.
javiercero1,
You say this even though you already know there’s disagreement on this! It’s well established that complex instruction encoding adds inefficiency in instruction prefetching. Obviously engineers try to mitigate this inefficiency with µop caching so that the execution unit isn’t directly exposed to the complex instructions, but it does add latency going from the x86 instruction stream to micro-ops, and the caches are inherently much smaller than RAM, holding only a few thousand entries.
https://www.anandtech.com/show/14514/examining-intels-ice-lake-microarchitecture-and-sunny-cove/3
Another thing to note is that this complex prefetcher cache contains state information, which leaves it potentially exploitable via Meltdown/Spectre-style timing attacks. Here too we see a case where x86 ISA complexity is bad.
I'm confident that Intel engineers would be able to improve on x86, which was designed decades ago at the dawn of PC computing. However, they aren't merely dealing with the technical aspects of the problem; they have to deal with the marketing aspects as well. There is still tons of demand for x86, and it's been a cash cow since the start. It wouldn't make sense to abandon x86 despite a suboptimal, power-hungry design.
Things may finally be changing, though, as Intel's monopoly is up against more credible competition with both Apple and Google designing processors that are not dependent on x86.
“inefficiency” in this case is a subjective qualitative term you’re pulling out of thin air.
x86 pays the price in complexity and trace caches in the Fetch Engine. Whereas RISC pays the price in higher instruction bandwidth and larger instruction cache space requirements.
Both approaches take similar transistor budgets to generate a similar volume of micro-ops to be passed on to the Execution Engine. So it’s a case of tomato, tomahto: the same result in terms of complexity and area.
AMD and Intel require very byzantine decoders and trace caches arranged in a relatively narrow set of parallel decoders, whereas Apple requires the largest L1 I-cache and a very wide set of parallel decoders. Both end up with a similar IPC at the same fab node.
ISA encoding hasn’t been an issue in microarchitecture for decades.
It was an issue in the 80s and early 90s when the Instruction Fetch stage took a significant amount of transistors in the pipeline budget. But now that has been reduced to a trivial amount compared to the rest of the out-of-order structures.
Besides, whether you are using a RISC or a CISC approach, every modern high performance uArch ends up breaking those instructions into microops. So it makes even less of a difference once you leave the fetch engine. In this case, AMD is literally reusing the same core and just changing the front end to execute x86 or ARM.
Things like Meltdown/Spectre are orthogonal to instruction encoding, as they affect every speculative architecture, be it x86 or RISC. If anything, x86’s strict memory ordering may make it less open to this type of attack than ARM’s relaxed ordering.
The whole “monopoly” nonsense is starting to get old, and it’s more emotional drivel in lieu of actual knowledge of the matter at hand, in this case microarchitecture.
It will be an “apples to oranges” comparison, but there is the infamous “the Apple M1 chip has 8 instruction decoders, while Intel has 4” point: https://news.ycombinator.com/item?id=25394447
Of course an Intel instruction can map to two or more ARM instructions, which is why it is not a straightforward comparison. Nevertheless, it is much easier to expand decoder queues on ARM than on Intel.
Yes ARM require less resources to decode, x86 need less fetch bandwidth.
The M1 has lots of parallel decoders which are fed by basically the largest L1 ever in a core.
An x86 with half the decoders achieves comparable IPC at the same node.
It doesn’t really matter where you put the complexity; at the end of the day you end up with similar black boxes for either fetch engine, producing the same volume of micro-ops to the Execution Engine.
Again, instruction encoding hasn’t been a principal limiter to performance, especially when the out-of-order structures take an order of magnitude more in terms of resources.
javiercero1,
Not for nothing, but you always pull everything out of thin air and expect that to be enough to convince me, and it’s not… if you want to prove me wrong, data is the best way to do it! Seriously, that is much stronger than arguing your opinion.
The ISA itself is orthogonal, but the CPU’s implementation of it isn’t. There’s a real risk that a complex prefetch unit can carry state between security domains.
http://www.cs.virginia.edu/~av6ds/papers/isca2021a.pdf
Yes, x86 pays a price in complexity. But you’re assuming it makes up for it in instruction density; just because it uses complex variable-length instructions doesn’t mean those instructions are more optimal at representing the logic of typical software. Rather than telling me I’m wrong, show me the data!!
They aren’t on the same node today anyway; regardless, I’m going to ask you to cite your data.
I accept that’s your opinion, but let’s be honest about something: experienced engineers wouldn’t make an architecture like x86 again because the prefetch inefficiency adds an unnecessary cost in terms of complexity/transistors/power/latency. We can agree that the costs are manageable and not outrageous, certainly in the “good enough” territory. But all else being equal it’s a cost that we don’t have to pay with a better optimized ISA.
sukru,
Yes, the decoders being simpler for ARM means they can add more for a given transistor cost. The lower prefetch/decode complexity is a huge advantage for the ARM ISA over x86, IMHO. Now that x86 is getting real competition, I expect it’s going to be harder for Intel to keep up while sticking with x86. They may throw enough resources at it to do it, but it’s always going to use more power.
@Alfman
Go browse through the proceedings of ISCA, MICRO, HPCA, or ASPLOS from the past 25 years. I recommend you take a basic microarchitecture class first so you understand the basics.
Funny thing: x86 was the first architecture with instruction prefetch. The ISA was literally designed to be prefetch-friendly from the 8086.
javiercero1,
Claiming you are right and then not citing evidence or data is exactly what HollyB does over and over again.
I can turn around and tell you that I’m right, and that you need to go browse through the proceedings of ISCA, MICRO, HPCA, or ASPLOS from the past 25 years. I also recommend you take a basic microarchitecture class first so you understand the basics.
See? Your argument here is so generic that it can be used to back the exact opposite of what you are claiming, which is pointless. So I implore you to make a stronger argument, provide the data and sources to make your case. It’s your prerogative not to, but then don’t get upset when I call you out on a lack of data to back your claims. You’ve made several unbacked assertions, go back them up.
@Alfman
Knock it off. I’m reading nothing but men measuring the size of their ding-a-lings in this topic not anything useful and I’m not getting sucked in by your whataboutery or jav’s blatant trolling.
@Alfman
I’m an expert in this field, I’m sharing my hard earned knowledge in good faith.
I don’t have the inclination to waste energy writing a dissertation trying to earn your “validation” on a matter you have close to ZERO mastery of.
Cheers.
javiercero1,
Your “expertise” is meaningless to me as we’re anonymous posters on the internet with no verifiable qualifications. So no, your arguments aren’t above fact checking and the need to provide convincing data. Neither are my arguments for that matter, but the big difference is that I’m keen to use data and cite sources to defend my points of view. This is such a critical step, and when you refuse to do this, it sets off red flags. It’s not intended to be an insult to you, but assertions that aren’t backed by anything but one’s word deserve a greater degree of skepticism.
Credibility is earned, not given. You continually use ad hominem attacks against me, but it should be noted that the only thing I asked you for was data & links to prove your claims. I’m a stickler for data and proofs, which is a completely reasonable thing to ask for. While this really annoys you, that’s neither my problem nor my fault.
Let me take a specific claim and ask you to defend it: “Yes ARM require less resources to decode, x86 need less fetch bandwidth.”
The reason I ask is because, in analyzing hundreds of binaries from Debian repos, I am finding that x86_64 binaries are actually larger than ARM binaries on average.
I’ll list some examples:
/usr/bin/cinnamon x86_64=18,696B arm64=14,520B
/usr/lib/_arch_/firefox-esr x86_64=634,504B arm64=601,656B
/usr/bin/gimp-2.10 x86_64=9,844,888B arm64=9,828,816B
/usr/lib/_arch_/13/bin/postgres x86_64=8,152,976B arm64=8,000,928B
/usr/sbin/postfix x86_64=18,504B arm64=14,344B
/usr/sbin/sendmail x86_64=30,872B arm64=26,696B
Out of 611 object binaries I compared, 579 or ~95% were smaller for ARM64 than x86_64.
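For reference, here’s roughly the kind of comparison in script form (just a sketch; the paths are hypothetical, and comparing .text sections via binutils’ size is a bit more careful than whole-file sizes):

#!/usr/bin/env python3
# Rough sketch of the comparison above (paths are hypothetical): walk two
# unpacked Debian package trees and count how many binaries have a smaller
# .text section on arm64 than on amd64.
import os, subprocess

ROOT_AMD64 = "unpacked/amd64"   # e.g. packages extracted with dpkg -x
ROOT_ARM64 = "unpacked/arm64"

def text_size(path):
    """Return the .text section size in bytes, or None if `size` can't parse it."""
    try:
        out = subprocess.run(["size", "--format=sysv", path],
                             capture_output=True, text=True, check=True).stdout
    except (subprocess.CalledProcessError, OSError):
        return None
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == ".text":
            return int(parts[1])
    return None

smaller_on_arm = total = 0
for dirpath, _, files in os.walk(ROOT_AMD64):
    for name in files:
        amd = os.path.join(dirpath, name)
        arm = amd.replace(ROOT_AMD64, ROOT_ARM64, 1)
        if not os.path.exists(arm):
            continue
        a, b = text_size(amd), text_size(arm)
        if a and b:
            total += 1
            smaller_on_arm += (b < a)

print(f"{smaller_on_arm}/{total} binaries have a smaller .text on arm64")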
In terms of the ISA debate, this would mean that NOT ONLY does “x86 pay the price in complexity and trace caches in the Fetch Engine”, as you indicated, but it ALSO “pays the price in higher instruction bandwidth and larger instruction cache space requirements” at least for a good deal of software out of Debian repos. You asserted x86 would hold the advantage here, but a lot of binaries seem to go against your argument and give ARM the advantage.
I wish Debian repos included RISC-V since I’d love to compare it too. Oh well.
I want you to appreciate the way I’m defending my skepticism of your assertions, I’m not using argument from authority or self-proclaimed expertise. I’m just looking at what the data tells me. I think we could get past this arguing if only we could focus on data instead of egos to defend our points. Wouldn’t you agree?
JFC here we go again:
Nah, saying that you have no clue about this matter is not an “ad hominem,” it’s a statement of fact. You literally have no idea what you are talking about.
Your “skepticism” is just your insecure boomer self reading about concepts that are outside of your pay grade, and doing the normal boomer thing that it must obviously be the “world” that is wrong, because that “hello world” in x86 assembler 30 years ago was the height of achievement in the computing field.
Of course, it would be easier to just expand your understanding by listening to what a person ACTUALLY working on the matter at hand is teaching you for free.
Hopefully, what I write is read by younger visitors interested in actually learning about what is going on in the uArch field.
In any case, your assertions, as usual, are faulty. You can’t extrapolate any of what you’re trying to do based on binary size alone.
For starters ARM and x86 have very different library and linking policies, on top of vastly different memory consistency models.
And the compilers involved are generating vastly different code and optimizations.
A larger binary could actually be a more efficient implementation, since loops could have been unrolled or vectorized (especially given that SIMD engines in x86 are much more capable than ARM’s). Furthermore, x86 compilers for the past decade and a half have implemented a lot of thread-level speculation, which increases code size with speculative helper threads for better cache behavior, and which is not used in ARM-land.
Etc. Etc. Etc.
In any case. I think I said it earlier, but I will keep my word from now on: I have zero interest in any further interaction with lifetime members of the Dunning Kruger Club.
Bye.
javiercero1,
I’m just asking you for data and this is how you respond. Yes, these are ad hominem attacks, and unprompted at that. For someone who is desperate for others to respect your experience, you really have a problem showing respect to the “boomers” who literally have more experience than we do. Also, do you really think I’m the insecure one here? That’s just funny.
Your argument was that x86 code requires less bandwidth to fetch and less cache to store. It sure does put a big dent in your argument if the x86_64 code turns out to be bigger.
Sure, you can argue that the CPU’s execution units are more powerful (although you should know by now that you have to SHOW that with data and not merely SAY it). Nevertheless, you are moving the goalposts. Your initial claim was about the decode stage itself, and I quote: “Yes ARM require less resources to decode, x86 need less fetch bandwidth.” However, given that the x86 binary may be larger, it may need more bandwidth and cache than its ARM counterpart. Ergo, you can see why a rational person could be skeptical of your assertions.
Are you going to keep resorting to ad hominem attacks or address the problem with your assertion?
That’s fine. I’ll end on a note from Data…
https://www.youtube.com/watch?v=HxJA_7Y3DvU
@Alfman
I’m showing you the same respect that you have shown me.
I’m an expert in this field and I work in it for a living. I’m sharing information with you in good faith. I don’t have the inclination to spend a couple of hours digging through conference proceedings just to humor some entitled boomer, with zero competence in the matter at hand.
This is not an academic forum, nor a place for peer review, we’re not even remotely peers.
I’m politely distilling and oversimplifying concepts to relay the information to you. You know very little about this matter, which is why you don’t realize how little you know, to the point that you don’t realize we don’t share, even remotely, the same frame of reference for me to bother with an actual debate with you.
In usual boomer fashion, you think the world revolves around you. And rather than being grateful for new information coming your way, you have to play your idiotic ‘gotcha’ games on stuff you don’t understand.
Binary size has no correlation with my claims regarding X86 having a different resource pressure in the Fetch Engine compared to ARM.
The thing is that modern, aggressively speculative, out-of-order uArchs are a total mindfuck for old farts like you. Because you’re stuck in your glory days when you wrote that “hello world” in 8086 assembler on perhaps a 386 or 486, where the ISA exposed the underlying architecture of the actual chip.
For example, a lot of you guys are talking about x86 having n decoders in parallel vs Apple’s M1 having m decoders, where m > n, thinking that’s where all the “decoding” happens. When in reality those are more akin to pre-decoders operating speculatively under the guidance of the predictor structures in the FE.
The thing is that those “decoders” are not what you think they are. Because instruction decode in a decoupled architecture is a distributed, out-of-order process, where program instructions are first pre-decoded and broken down into µops and dataflows that are passed on to the Execution Engine, where they are further decoded completely outside the view of the programmer.
The point you’re not grasping is that, in the end, both x86 and M1 end up with very similar “decoding” bandwidth, with different allocations of resources across the FE and EE boxes of the architecture. As reflected by the fact that both Zen3 and Firestorm, for example, achieve relatively similar IPCs, with a slight edge to the M1 due to its slightly wider architecture.
And these resources in either case are about <5% of the overall core in terms of area/power/complexity. So it makes very little difference whether the ISA is complex or simple from the perspective of the superscalar out-of-order execution units/resources, which make up most of the core's dynamic logic budget. Which is why it's a debate that died down in the uArch community long ago, ever since ISA and microarchitecture became decoupled concepts, about 20+ years ago.
In the big scheme of things, ISA complexity stopped being a major limiter to performance compared to uArch components like the branch predictor, the size/#banks/#ports of the register file, the reorder buffer and memory model, # of FUs, victim/trace/etc. caches, the memory controller, etc., etc., all of which have a far bigger impact on performance than decode complexity (or lack thereof).
I'm writing this not in hopes you understand any of it (I'm sure you'll find some idiotic "gotcha" bullshit argument), but hopefully it reaches some younger audience on this site who is interested in what's going on in modern CPU architecture.
See, the thing is that "decode" does not mean what you think it means in a modern x86 or ARM part. It's more akin to a pre-decode.
javiercero1,
Except that your posts have been ~90% ad hominem attacks lately. I’d really appreciate it if you could just stick to the point. The remaining 10% you want everyone to treat as unquestionable dogma, but I am within reason to be skeptical when you refuse to provide any data, refuse to provide any sources, and refuse to respond meaningfully to legitimate counterpoints. You make assertions but you’ve repeatedly failed to back them up. All you do is toot your own horn. We are peers, and as much as you resent this, you’re certainly not above the need for data & fact checking. If you put your effort and knowledge into advancing the topic instead of just flaming me, these discussions on osnews would be a whole lot more pleasant and insightful. For real. I really want to keep discussions professional, but I can’t do it alone; throw me a bone. And stop being so eager with the schoolyard insults.
Larger binary code = higher bandwidth requirements on the front end. While you can try and mask this with more caching, you also said x86 needed less cache, so that argument is problematic.
Branch prediction can be done on any ISA. At one point, around the Pentium 4, static branch prediction hints were added to x86, but these haven’t been well received and x86 compilers have mostly done away with them.
https://stackoverflow.com/questions/14332848/intel-x86-0x2e-0x3e-prefix-branch-prediction-actually-used
Modern software doesn’t use static branch hints and dynamic branch prediction is not intrinsic to an ISA like ARM/x86. The point being the x86 ISA doesn’t have an inherent branch predictor advantage over other ISAs.
Actually, I think you missed the point. It’s not just the bandwidth behind the decoder I was referring to, it was the bandwidth in front of it too. The x86 ISA is more difficult to decode, and more decoding needs to be done on average as well.
Just to nitpick, the decoder may represent <5% of die area, but it doesn't follow that it necessarily consumes <5% of the power since that depends on what the CPU is doing.
I’ve said this before: x86 is good enough, but there’s just no reason to pay the higher cost of supporting the x86 ISA over something simpler & more efficient in more modern architectures.
@All I believe it is in this video: https://www.youtube.com/watch?v=AFVDZeg4RVY
Where Dr. Ian Cutress asks Jim Keller about the very subject you are debating, and Keller gives a quick, dense explanation of why he didn’t think variable-length instructions and decode were a big deal. Sadly I just don’t have the time to find the exact clip.
Regardless, it is a great interview and one well worth watching if you are interested in the low-level workings of CPUs.
One thing to remember about all of this is that the minimum memory fetch for DDR is 64 bits, but it could be as much as 256 bits; HBM is 512 bits, IIRC. In theory the L1 icache could be anything that is needed. So the odds of needing multiple fetches (the biggest factor in variable-instruction-word decoding) are almost zero.
So the decode phase can basically have a table of instruction widths and then decide which slice of the cache line represents the instruction.
Everything after that is more in the domain of cache management than the instruction pipeline.
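To make that concrete, here’s a toy model of the length-table idea (not real x86; the opcodes and their lengths are invented):

# Toy length-table decoder: fetch one cache-line-sized chunk, look up each
# opcode's total length, and slice out the instruction bytes sequentially.
LENGTH_TABLE = {0x01: 1, 0x02: 2, 0x03: 3, 0x04: 4}   # hypothetical opcodes -> total bytes

def decode_line(line: bytes):
    """Yield (opcode, operand_bytes) pairs from one fetched line."""
    pc = 0
    while pc < len(line):
        opcode = line[pc]
        length = LENGTH_TABLE.get(opcode)
        if length is None or pc + length > len(line):
            break                      # unknown opcode, or the instruction spans into the next line
        yield opcode, line[pc + 1:pc + length]
        pc += length                   # the next instruction starts right after this one

for op, operands in decode_line(bytes([0x01, 0x03, 0xAA, 0xBB, 0x02, 0xCC])):
    print(hex(op), operands.hex())     # 0x1 "", 0x3 "aabb", 0x2 "cc"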
jockm,
I am a fan of Keller. Thank you for bringing him up! He does mention that they’ve mitigated most of the bottlenecks. I agree it is always possible to throw transistors at the problem, but at the same time he’s not a fan of unnecessary complexity, and I think he would fundamentally agree that the same transistors could be put to much better use given a cleaner ISA and architecture. There’s no reason the double-width decoder that Apple’s M1 has couldn’t be replicated with x86, but it comes at a greater transistor & energy cost for x86 than it does for ARM. x86 is good enough, but it’s still inherently deficient compared to a cleaner design.
This part of Jim Keller’s interview really resonates with me in terms of my feelings about x86 complexity. (My emphasis)
Here’s another video with Keller sharing his views:
http://www.youtube.com/watch?v=SOTFE7sJY-Q
@jockm,
Thank you for the link. Let me add that to my never-ending queue… (Using 1.5x speed on YouTube helps keep up with tech talks; and if a section really needs deeper attention, hitting “<” will decrease the speed.)
@Alfman
https://www.anandtech.com/show/16762/an-anandtech-interview-with-jim-keller-laziest-person-at-tesla
It’s not an issue
jockm,
I already listened to the interview. That’s not quite the full story. He does say the decoder is a small portion of the full CPU die, and also says it’s much bigger than adders. I agree with him on both counts, and we all agree it becomes easier to build decoders when we have more and more transistors at our disposal, but that does not imply it is efficient. There is always an implied opportunity cost, and he touches on this in relation to GPUs, which have simpler decoders and can have way more cores and more parallelism as a benefit. He mentions of course that this is irrelevant for x86 cores tasked with running typical C software, which is notoriously sequential. But nevertheless a simpler ISA does have benefits, like a wider decoder, as in the case of the M1.
He feels x86 has gotten badly bloated, and I completely agree. Ironically, that’s part of the reason x86 decoders have become a small percentage of the total die. I don’t think everyone here realizes that this does NOT mean that the overhead and bottlenecks are gone. Just to make this point absolutely clear, let’s hypothetically add a preposterous new AI extension that adds trillions of transistors to the x86 CPU die, so that the entirety of the pre-existing x86 subset occupies a tiny area of the full CPU die. It does not logically follow that the decoder becomes more efficient. Unless the software actually uses these new features occupying the majority of the CPU die, the decoder bottlenecks are no better off than before; they merely occupy a smaller proportion of the full CPU’s transistors, but those transistors may as well not exist if the software isn’t using them. This is why “die area” is not too meaningful without also looking at load. This is how we need to look at features like MMX/SSE1-4/AES/etc. that get added over time. This added parallelism can give a nice boost for applications that use them, but realistically many workloads don’t. Components that are in the critical path, including the ISA decoder, will be responsible for far more of the latency & energy overhead than those that go unused.
Keller frequently says that only six instructions really matter to the majority of software, and while I don’t know that I’d go as far as him, if we were to reduce a CPU down to those instructions we consider critical, we would be left with a fraction of the CPU die (opening up the path to far more cores for the same die area). Logically, the simpler & faster the instruction execution is, the more pressure there is on the decode unit to provide a rapid stream of new instructions, giving simpler ISAs the advantage. You can always throw more transistors at the problem to compensate, but then you aren’t going to be competitive on energy usage. This may not matter so much for a desktop, but it is a bigger problem for mobile.
@Alfman
Let me turn this around on you: what do you believe is the problem with variable length instructions?
jockm,
x86 instructions can vary from 1 byte to 15 bytes. The first decoded instruction may be straightforward, but all subsequent instructions are dependent on the preceding ones. We can tackle this statistically, but regardless of the approach, the circuitry is inherently less efficient than decoding an ISA with a regular structure. We have thrown lots of transistors at the problem to improve x86 performance and reduce bottlenecking, but it comes at the cost of consuming more energy, and the 80% hit rate can still bottleneck.
Another point is that for all of its complexity, the base instruction set for x86 was very limited. In order to extend it, Intel/AMD made use of one or more instruction prefixes to enable extended operand sizes, more registers, etc. While these prefixes work and have given x86 new life every generation, their encoding is not efficient. And I suspect this is the reason that the x86 binaries I inspected have poorer code density than ARM binaries on average. This means that x86 buys you a less efficient decoder AND less of the program/OS fits in cache. As always, you can compensate for architectural deficiencies with more transistors, but that carries both an energy and an opportunity cost.
Also, despite ARM being denser than x86, its density can still be improved upon. Some ARM instructions have redundant encodings, which is unfortunate, but perhaps in the future we’ll see an ISA that can beat the density of both x86 and ARM when used with typical code.
http://csbio.unc.edu/mcmillan/Comp411F18/Lecture06.pdf
I think modern analytical tools + huge repositories of FOSS software could help us build an optimal ISA for typical software, it would make for very interesting research. Although it would need a very wealthy backer to actually build it and market it.
Also, just so we’re clear I agree that x86 is “good enough” and even necessary given the importance of backwards compatibility. It’s just that if we had the chance to re-engineer it without being tethered to legacy decisions, we wouldn’t want to needlessly repeat x86 complexity. And I think Keller would also agree.
@Alfman
So traditionally the problem with variable-length instructions is that each additional byte is an additional memory fetch, which slows down the operation of the system. RISC was created at a time when this was still largely true, so fixed-width instructions were an attempt to reduce the number of fetches needed in the instruction pipeline.
At this point almost all 32-bit and greater MPUs and CPUs (leaving microcontrollers aside) may look like traditional von Neumann architectures at the ISA level, but they are really Harvard at their core. So they have an instruction cache and a data cache, and code is always executed from the icache (I am sure you know this, but I am just trying to provide a complete narrative). If code isn’t in the icache, execution halts until the code is brought in.
Having a Harvard architecture means you can read from both the icache and the dcache at the same time without a penalty, and you can set up a write at the end in the writeback stage that gets completed by the time the next fetch stage happens. Being Harvard internally eliminates one of the biggest bottlenecks in the design of a CPU.
So as soon as you see the first byte the instruction fetcher basically looks at a state table and figures out how many more bytes are relevant to the instruction and then the decode stage of the pipeline will work on that in the next stage. Because the icache is architected to the needs of the instruction fetch and decode stages there is no penalty to instructions being of variable length.
You could argue that if you need data that isn’t in the icache then this slows down the system, but this is a problem regardless of the instruction word length. The cache controller and branch predictors do their best to prevent this from happening.
There is no overhead to variable-length instructions as a result of this. It doesn’t produce the most compact code, and you are right about inefficiencies; but that isn’t what I am talking about when it comes to variable-length instructions.
So no, there is no performance penalty to variable-length instructions. Yes, it takes extra silicon to do it, but nowhere near as much as you might think. This is what Keller was talking about when he was referring to “baby computers”. If every LAB in your FPGA or every transistor on your die counts, then there is a benefit to fixed width.
However that isn’t the case when you are talking about smartphones and up (just to pick an arbitrary dividing line). There is enough spare capacity that you can implement the needed logic and not push out any other functionality
This is the best explanation I can give you. I have never implemented a CPU for production, but I have implemented enough things that analyze and act on bytestreams with cache-like structures in FPGAs to understand what Keller meant.
Unless you have something specific you can point to, we will have to agree to disagree
jockm,
The 8088 could only read one byte at a time, but even as early as the 286 the CPU was fetching 2 bytes at a time. Anyway, rest assured that I’m not asserting that x86 has to read instructions one byte at a time.
Just to nitpick, reading the first byte of an instruction can’t, in general, tell you the full length of the instruction. Maybe it should, but due to the aforementioned prefixes it’s not necessarily possible. (Unless you count prefixes themselves as instructions, but that would significantly increase latency, as prefixes are fetched before the main instruction can be read.) So I disagree that you can get away with checking only one byte; probably more like three or four, but I’d have to check reference material to be sure.
This doesn’t make it infeasible, but it does mean you have to repeat the decoding logic at every offset in case there’s an instruction there, or it means you have to add latency to wait on the previous instruction to be sure. Or some kind of compromise where you take a guess and invalidate the results if the guess was wrong. Regardless of the approach though, there’s more complexity/transistors compared to a more predictable ISA.
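Here’s the “decode at every offset” idea as a toy sketch (the encoding is invented; real x86 needs several bytes of prefixes/opcode/ModRM before the length is known):

# Toy illustration: compute a (speculative) length at every byte position, as
# parallel hardware could, then walk the chain of lengths to find which offsets
# are real instruction starts. The chaining is the serial part.
def speculative_lengths(window: bytes):
    """Length guess at every offset; hypothetical rule: low two bits encode length-1."""
    return [1 + (b & 0x03) for b in window]

def instruction_starts(window: bytes):
    lengths = speculative_lengths(window)     # all offsets, conceptually in parallel
    starts, pc = [], 0
    while pc < len(window):                   # serial: each start depends on the previous length
        starts.append(pc)
        pc += lengths[pc]
    return starts

print(instruction_starts(bytes([0x03, 0x10, 0x20, 0x30, 0x01, 0x00])))   # [0, 4]

The per-offset guesses can all be computed in parallel, but chaining them to find the actual boundaries is inherently sequential, which is the cost I’m referring to.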
Yes, I understand the interaction of caches. But I pointed out before this is another area where x86 is penalized for poor instruction density. If you compile a program for ARM and x86, the x86 representation may require more bytes to represent the same program. While the difference isn’t huge, x86 may need more cache to keep up, which gives ARM another advantage.
OK, but I still disagree that “there is no overhead”. It could be possible to have no decode overhead in a trivial ISA, but not in a microarchitecture where you need at least a cycle of latency after a memory fetch to decode & look up instructions.
I understand what he was saying, and that’s why I responded with my point about die size. I would argue that “percentage of an arbitrarily large die” is not a meaningful way to defend the complexity of something though. It would be like saying that the IRS tax complexity is trivial and irrelevant because it’s only a tiny fraction of all the government laws on the books. Even if these statements are 100% true, it’s not a good justification for unnecessary complexity in the first place.
That’s interesting; yes, I would agree FPGAs would benefit from simpler encodings too.
Mobile has done away with x86, though. Intel’s fast cores use way too much power for mobile, and their Atom cores offer poor performance. Yes, technically you can do it (I’ve used an x86 tablet), but it doesn’t make much sense given that there’s little need for x86 backwards compatibility and ARM processors have better performance/watt. In fact, I kind of feel that we should all be in agreement that the main reason x86 sticks around is the importance of backwards compatibility with desktop software. If software compatibility were a non-issue, I really think x86 would have been replaced a long time ago with something simpler, and I believe engineers including Keller would be in favor of the change.
Well, I have been pointing to several specific things and I don’t feel we’ve ruled any of them out. But disagreeing is fine with me. Thanks for chatting
When I say no penalty I mean to performance. There is no performance penalty. And it isn’t any faster if you use a fixed instruction word. You pay for it in silicon.
Let me give you a real-world case of something I worked on. The client spec’d that I use a Xilinx Artix-7 FPGA with 23K logic elements. They order so many, it is easier and cheaper for them.
I needed about 15K of them for what I was doing. So I looked at what was there and used some of what was left to get more clever with the caching of the datastream, to make overall performance a little better. That added no performance penalty, and I had the spare capacity for “free” (to me).
However, the amount of silicon needed is trivial compared to the available die area, in a great many cases.
This is directly analogous to what is happening in silicon. The die has to be rectangular, and there is going to be lots of unused space that can be used for free. As Keller pointed out, the amount of silicon you need for variable-width instructions is so relatively tiny, the implication is you get to do that part with no real cost in silicon real estate.
Some notes:
1: You keep bringing up x86, please note that I have not. I am only talking about variable length vs fixed length
2: You keep talking about the inefficiency of x86, but that isn’t a problem of variable width per se, but of how x86 evolved. A modern variable-length ISA could well be designed to achieve better average instruction density than fixed width. The iAPX 432 tried to do this with its Huffman-encoded ISA. Like a lot of the 432, its ideas were too early, and its implementation too old.
3: You also put words into my mouth when you said I said x86 needs more cache. I did not, and please don’t do that. I made no claim about that.
jockm,
I still disagree, for two reasons. With respect to performance, an ISA that doesn’t require a decoder is obviously going to have less decode latency than one that does. So yes, there IS a performance cost. However, given that both ARM and x86 implementations use decoupled microarchitectures, they are both paying the additional fetch-to-decode latency cost, so this one is a wash, at least with regard to ARM versus x86 implementations. There are other compelling reasons to have microcode, such as patching CPU glitches and whatnot, but speaking strictly in terms of maximal performance, a simpler ISA could benefit by eliminating the decoder and its associated latency.
Secondly, even with a decoding cycle, variable-length instructions are inherently dependent on previous instructions, whereas fixed-length instructions are not. With fixed-length instructions you can decode batches of instructions independently, without waiting on the result of the previous instruction. This means that whatever logic complexity x86 can manage to fit into a single cycle, a fixed-length ISA can decode arbitrarily more instructions in that same cycle. It’s both less complex AND faster at scale.
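In the same toy terms as my earlier sketch, the fixed-width case has no chain to resolve (again, just an illustration):

# With a fixed 4-byte encoding, every instruction start is known up front:
# there is no serial chain of lengths to walk before the slices can be decoded.
def fixed_width_starts(window: bytes, width: int = 4):
    usable = len(window) - len(window) % width
    return list(range(0, usable, width))

print(fixed_width_starts(bytes(16)))   # [0, 4, 8, 12] -- computable with no serial step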
I keep hearing this argument as though “percentage of something arbitrarily large” is supposed to be meaningful. It does not follow that just because the die is huge, individual subsystems are efficient. This is what I was trying to convey with the IRS example. People might think “bytes and transistors are cheap, let’s not worry so much about optimization”, but IMHO this kind of thinking is what has driven software (and possibly hardware) bloat. Even “free” transistors take up power and can become bottlenecks. In the case of the FPGA things are a bit different, because the transistors are there whether you use them or not; the architecture is cemented and you’re merely working with it as is. But when you’re designing the chip, you’d have the option to use more efficient designs, assuming you weren’t tethered to legacy/compatibility constraints.
Would it help if I said I think x86 engineers have done a very good job of optimizing CPU design given the legacy constraints they’ve been dealt?
So does this mean it’s possible that you agree with me about the cons of x86 in particular?
There would be a tradeoff to be had. It’s conceivable, but the space savings of variable-length instructions would have to be evaluated in the context of end-to-end latency and energy consumption for the whole stack: decoders, caches, and all. I would propose a granularity of at least 2 bytes, because even just halving the number of instruction offsets can double decoder throughput for the same amount of complexity. I think it would also help improve code density over the use of prefixes. I think 4-byte granularity could be more future-proof, though; you can fit more registers, longer immediates & data pointers, etc., but we need more empirical data to show where the sweet spot is.
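A quick toy illustration of the granularity point (hypothetical numbers, continuing my earlier sketch):

# With 2-byte granularity, candidate instruction starts can only be even
# offsets, so a fetch window needs half as many speculative length guesses.
def candidate_offsets(window_len: int, granularity: int):
    return list(range(0, window_len, granularity))

print(len(candidate_offsets(16, 1)))   # 16 candidates at byte granularity (x86-style)
print(len(candidate_offsets(16, 2)))   # 8 candidates at 2-byte granularity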
Hmm, I feel you’re putting words into my mouth now. I’ve re-read parent post and I don’t think I said that. Could you quote the specific line?
@Alfman
I am going to leave this conversation with a couple of observations:
1. There is no such thing as a CPU without an instruction decoder; one way or another, something is decoded.
2. Decode all happens within the same fixed period of time; complexity only increases the amount of silicon needed, not latency. The only possible exception is clockless designs, which don’t really exist much anymore, and even then decode isn’t where latency happens in what I have seen.
3. This study https://ieeexplore.ieee.org/document/5645851 shows that the average x86 instruction, based on analyzing real-world applications and Windows 7, is 2 bytes long. Given that, it means that most of the time x86 is achieving higher density than a fixed-width 32/64-bit instruction.
4. If you ever have the time, take a look at this: https://www.fpga4student.com/2017/09/vhdl-code-for-mips-processor.html. It is a 16-bit, integer-only, single-cycle, non-pipelined, Harvard-architecture, cacheless RISC CPU. The only thing that bounds its performance is the speed of the RAM and ROM used. The latency of transistor-to-transistor propagation is so small it isn’t a factor in its performance; and you could add so much more complexity to each of the stages without worrying about that that it isn’t even funny.
I like it because it is one of the smallest actually practical soft cores that is easy to understand.
5. Perhaps I misread; I am not sure. I will just apologize regardless.
jockm,
I don’t know why you would think that. It’s possible for an ISA’s instructions to be simple enough for the execution unit to handle directly without a front-end decoder. After all, not every architecture is built around a decoupled microarchitecture.
Transistors have a propagation delay. The more transistors you have in series, the slower a computation HAS TO BE. You cannot use more transistors to overcome sequential propagation delay. And while the propagation is darned fast, every bit of performance is important to satisfy our insatiable appetite for high clock frequencies, so there are only so many sequential operations you can do in a cycle. If our cycle time is optimized for quick operations like adders, then more complex operations have no choice but to bump into additional cycles. Sequential dependencies are bad, and variable-length instructions imply sequential dependencies. Intel manages to decode 4 instructions whereas the M1 decodes 8. Intel could certainly add more transistors to decode 8 in parallel, but those additional instructions may have to be staggered and face a two-cycle latency instead of one.
Unfortunately, I don’t have access to the article. From the summary it’s not clear whether they counted prefixes as separate instructions, which would greatly bring down the average. But in any case the average instruction length is not what’s at issue. Mathematically speaking, an ISA that averages 2 bytes but gets less than half the work done compared to an ISA that averages 4 bytes will still be on the losing end for code density.
For example, we could easily make an ISA where the instruction size is 1 byte (or less) by design. But can we conclude it is denser than x86 or ARM or something better? No. Instructions being small isn’t the only thing that matters; work done per instruction matters as well.
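To put made-up numbers on that:

# Made-up numbers, purely to illustrate the point: what matters is bytes of
# code per unit of work, not bytes per instruction.
avg_len_a, insns_per_task_a = 2.0, 2.3   # hypothetical short-instruction ISA
avg_len_b, insns_per_task_b = 4.0, 1.0   # hypothetical fixed 4-byte ISA
print(avg_len_a * insns_per_task_a)      # 4.6 bytes of code per task
print(avg_len_b * insns_per_task_b)      # 4.0 bytes per task: denser despite longer instructions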
This is similar to what we see with PCIe evolving to use longer encoding schemes to achieve higher information density for the same number of bits transferred.
https://blogs.keysight.com/blogs/inds.entry.html/2020/03/30/pcie_standards_what-drhs.html
So keep in mind that longer payloads can potentially improve density over shorter ones, and in my look at Debian software earlier this did seem to be the case between x86 and ARM.
Indeed, I’d love to learn more about FPGAs and get more hands-on experience with them. I think it’d be awesome for FPGAs to be a primary focal point for our next conversation. I don’t recall FPGAs ever coming up in article form such that they would be on-topic, but maybe you could try submitting an article?
@Alfman
I am about done with OSAlert and doing my best to leave it behind me. The couple of times I submitted articles I didn’t like the experience. I am tired of Thom’s editorializing, his failure to moderate people like HollyB, etc
I am not a huge fan of discord, but it is the one form of contact info I feel comfortable giving out: JockM#5409. Anyone who wants to reach out can start there
jockm,
I hope I didn’t have anything to do with your decision. You’ve been polite and professional. These traits can be hard to come by as I’m sure you’ve witnessed. Despite some of our technical disagreements, not once did you hurl an insult, haha. I for one appreciated your presence and respect your insightfulness.
Yeah, I’m not on there, but I understand that this may not be the best place to find common interests and socialize. I hope you have fun wherever you end up!
@Alfman No you have been fine. Sometimes it feels like you stick with an argument after it has been demonstrated (like this one), and work more on feelings than fact; but who hasn’t been guilty of that sometimes?
Oh I am on reddit as the same.
Please take care.
Sorry not jockm on reddit, Stufflabs
jockm,
Yeah, I’m stubborn, but I do feel like I do an above-average job of backing my opinions with data and logical deduction rather than just feelings. But maybe from your perspective it’s different, and I accept that. It irks me when someone makes assertions and doesn’t bother to back them up; I’m thankful you don’t do that.
Hey, I have posted on reddit; my one and only post there is somewhat anticlimactic.
reddit.com/user/Levernote/
Intel is a semiconductor fabrication company first, and they produce chips to keep the fabs running.
That is a little like saying that adults are what zygotes produce to make more zygotes. You could just as easily say they are a money-making company and they produce chips to make money. It’s true on a basic economic level but doesn’t add anything that helps understand the products or the decisions that go into making those chips.
Both of those are true.
It does. People are under the assumption Intel is wed to x86 because they pioneered it, but they are not. They are a fab company first and foremost.
Their fab advantage has been one of their greatest strengths over the years, and that is one of the things people buy when they buy Intel. The fab tech has faltered the last few years, and we’ve seen Intel’s procs take a hit because of it.
Fabs are also incredibly expensive, so Intel finds ways to keep them running. Intel fabs lots of different chips. They have network chips, FPGAs, Arm at one point, NAND at one point, and probably more I’m forgetting.
In contrast, most semiconductor companies contract the fabrication part out. AMD, Nvidia, Qualcomm, Apple, etc.
“Intel is a fab company” says everything that needs to be said about the situation.
Which people? Can you document that claim?
Intel is an original ARM licensee. They were actually the largest ARM vendor for a while.
They are also in talks to produce RISC-V cores.
Intel have always produced other chips as well as x86
AMD stands ready to manufacture Arm chips if needed…. but they released the A1100 series in 2016, found that nobody wanted them, and then cancelled plans for their K12 desktop parts.
Intel also intends to produce … anything. They’re trying to open up their fabs to third parties; which means they could end up producing ARM or RISC-V or MIPS or PowerPC if that’s what someone else wants to pay them to produce. Note that Intel used to manufacture their own ARM chips (XScale) but sold all that to Marvell Technology Group in 2006.
The rise of non-x86 architectures will be partially fueled by…. assholes trying to lock consumers into their specific ecosystem (devices built with proprietary “ARM or RISC-V derived” cores, proprietary devices, proprietary DRM schemes, proprietary firmware, proprietary software; with vendor specific stores and/or artificially enhanced planned obsolescence and/or privacy violations and spam as a revenue stream). 80×86 (where hardware manufacturers don’t make software and software developers don’t make hardware) isn’t a good foundation for walled gardens (a greater need for standards/documentation to ensure compatibility/interoperability between pieces from multiple different companies makes 80×86 PCs “asshole resistant”).
I think the closest AMD may get to the ARM ecosystem is by licensing their high-performance microarchitecture to vendors like Samsung who need a high-performance core to compete with Apple in the mobile market.
They lack the expertise in the mobile (phone/tablet) market, so I doubt they’ll bother doing a SoC for that space.
AMD scrapped their K12 because there was little value added in using the ARM ISA in the intended application space versus using the equivalent x86 core (which is what led to Zen). For the datacenter, either you’re a vendor like Amazon/Google with enough scale that it makes sense to make your own many-core systems (and just use off-the-shelf ARM cores with a bit of tweaking), or you’re everyone else, and AMD already has a great x86 story with EPYC, so again it makes little sense for them to go ARM.
x86 will stay for a very long time on the desktop/laptop and in the data center. ARM will continue being the choice for the in-between markets. And RISC-V will mostly take the deeply embedded parts, where the licensing cost of the ISA puts more pressure than performance, given they will be IoT products that are super sensitive to pricing and very low margin.