Cloudflare, which operates a content delivery network it also uses to provide DDoS protection services for websites, is in the middle of a push to vastly expand its global data center network. CDNs are usually made up of small-footprint nodes, but those nodes need to be in many places around the world.
As it expands, the company is making a big bet on ARM, the emerging alternative to Intel’s x86 processor architecture, which has dominated the data center market for decades.
The money quote from Cloudflare’s CEO:
“We think we’re now at a point where we can go one hundred percent to ARM. In our analysis, we found that even if Intel gave us the chips for free, it would still make sense to switch to ARM, because the power efficiency is so much better.”
Intel and AMD ought to be worried about the future. Very worried. If I were them, I’d start work on serious ARM processors – because they’re already missing out on mobile, and they’re about to start missing out on desktops and servers, too.
It’s honestly not an ARM vs x86 thing…
Really. Intel has made x86 cores (Atom) that are both performance- and power-competitive with ARM. They mostly failed in the marketplace historically, but most of that failure was in mobile. The Xeon C3000 series is very price-, performance-, and power-competitive with anything coming to ARM-based servers (on a per-core basis).
The real problem isn’t that Intel is sticking with x86, it’s that they still haven’t figured out that they package their cores wrong – or at least haven’t figured out how to package them right.
The latest top-end Atom-based Xeon C series has 16 cores/16 threads and runs at 2.1GHz with a 32W TDP. The ARM-based Cavium ThunderX has 48 cores/48 threads at 2.5GHz with a 120W TDP. So it has 3x the thread count at a bit under 4x the TDP – about 2W vs 2.5W per core – i.e. it’s mostly a wash from a power point of view.
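Quick per-core arithmetic on those two TDP figures (a rough sketch from the spec-sheet numbers above; real draw under load will differ):

    # Per-core power from the quoted TDPs.
    chips = {
        "Atom-based Xeon C": {"cores": 16, "ghz": 2.1, "tdp_w": 32},
        "Cavium ThunderX":   {"cores": 48, "ghz": 2.5, "tdp_w": 120},
    }

    for name, c in chips.items():
        print(f"{name}: {c['tdp_w'] / c['cores']:.2f} W/core at {c['ghz']} GHz")
    # Atom-based Xeon C: 2.00 W/core at 2.1 GHz
    # Cavium ThunderX:   2.50 W/core at 2.5 GHz  -> roughly a wash per core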
So why does no one use these chips, instead flocking to Cavium?
Density.
You can fit 4 dual-socket ThunderX nodes (96 cores each) with a terabyte of RAM apiece into a 2U chassis. That’s 384 cores and 4TB of RAM. Intel has nothing remotely this dense. The whole thing is probably sucking down 1000W fully loaded, but that is significantly better than most 4-node Xeon 2U servers, and you get 296 extra cores… Even if you take hyperthreading into account (which doesn’t help on all workloads), you can still run about 200 more threads on a Cavium box.
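Working backwards from those numbers (the Xeon baseline of dual ~11-core CPUs per node is my guess at what is being compared; a fatter Xeon SKU narrows the gap):

    nodes_per_2u = 4

    thunderx_cores = nodes_per_2u * 2 * 48   # 384 physical cores in 2U
    xeon_cores     = nodes_per_2u * 2 * 11   # 88 physical cores (assumed baseline)
    xeon_threads   = xeon_cores * 2          # 176 threads with hyperthreading

    print(thunderx_cores - xeon_cores)       # 296 extra physical cores
    print(thunderx_cores - xeon_threads)     # 208 -> "about 200 more threads"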
It’s not that ARM is more power efficient, it’s that Intel isn’t servicing the market that Cavium is – the guys who need the maximum number of threads in the minimum amount of space at low power. It doesn’t matter too much that the Cavium machines are slower on a per-thread basis when you get almost double the number of cores per square inch of rack space (at similar power efficiency).
From a technical perspective I see no real reason why Intel couldn’t build similarly dense Atom-based Xeons (and probably at a lower TDP to boot); they just don’t. I haven’t a clue why at this point.
If they can put 24 high-end cores running at 3.4GHz into a single chip, I don’t understand why they can’t put at least double that number of Atom cores into one (or even more).
Until they figure out how to do that, they are going to lose customers to ARM, not because of power efficiency, but because of density.
PS. Cloudflare seems to be going with Qualcomm Centriq-based ARM servers instead of Cavium, but the basic argument is exactly the same (both are 48 cores per CPU).
Perhaps going the ARM path also ensures better competition, rather than depending on the Intel duopoly (x86 and fabs). I think the x86 legacy cost is making us lag behind: whatever good Atom/Xeon implementation you may have, you’ll still depend only on Intel, or perhaps AMD, to deliver performance in a market segment that doesn’t need to rely on Windows, because data center servers can run on almost anything, provided they follow some standards.
> From a technical perspective I see no real reason why Intel couldn’t build similarly dense Atom-based Xeons
They can’t get SMP scale-out on a single die to work well enough. Even AMD’s Ryzen/EPYC line was a game changer for x86 because of how many threads it sticks on one chip. ARM chip vendors don’t have nearly 40 years of IBM PC history weighing them down with extra silicon, so they’re free to build smaller cores in more novel configurations.
Except that, as the article indicates:
“Every request that comes in to Cloudflare is independent of every other request, so what we really need is as many cores per Watt as we can possibly get,”
It is not really SMP, or it is an easy form of it with very little data sharing between cores. Maintaining coherency between tens or hundreds of cores is power-hungry and inefficient: you need buses carrying lots of coherency traffic and large many-ported caches, and coherency adds latency…
Of course, the arguably simpler ARM architecture (compared to x86) and the many cores available (proprietary designs from Apple, Qualcomm and others, or from the ARM catalog) allow for lots of flexibility.
Cloudflare may even one day ask for custom CPUs, with more networking interfaces, minimal floating point performance, some special accelerator for their niche…
Treza,
Obviously shared state is a bottleneck. SMP quickly reaches diminishing returns. NUMA is more scalable, but it is harder for software to use NUMA effectively if the software was designed with SMP in mind.
I think CPU architectures with explicit IO rather than implicit coherency could increase hardware performance, especially with good compiler support, but it would require new software algorithms and break compatibility, so it would be unlikely to succeed in the market.
I think the way forward will be hundreds or thousands of independent cores, like you say, functioning more like a cluster of nodes than like SMP cores with shared memory.
I can see such a system benefiting from a very high speed interconnect which will serve a similar function to ethernet but will offer much faster and more efficient IO between nodes. Since a fully connected mesh becomes less feasible at high core counts, we’ll likely see more software algorithms evolving to support native (high-performance) mesh topologies. Most of these algorithms will be abstracted behind libraries. For example we’ll probably see sharded database servers that expose familiar interfaces but distribute and reconstruct data across the mesh at record speed.
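As a toy illustration of the kind of library abstraction I mean (the names and the hash scheme here are made up, just to show the shape of it):

    import hashlib

    class ShardedStore:
        """Toy sketch of a sharded key-value store: callers see a normal
        get/put interface while keys are spread across a set of nodes."""

        def __init__(self, nodes):
            self.nodes = nodes  # dicts standing in for remote nodes on the mesh

        def _node_for(self, key):
            digest = hashlib.sha1(key.encode()).digest()
            return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

        def put(self, key, value):
            self._node_for(key)[key] = value   # in real life: a message over the mesh

        def get(self, key):
            return self._node_for(key).get(key)

    store = ShardedStore([dict() for _ in range(8)])   # 8 "nodes"
    store.put("user:42", {"name": "alice"})
    print(store.get("user:42"))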
I for one am excited by the prospects of such highly scalable servers!
And let’s call these massively parallel architectures, with huge memory bandwidth, hundreds of cores and multithreading to hide memory latency…
GPGPUs!!!
Treza,
Obviously GPGPUs have their uses, but they target different kinds of problems. For Cloudflare’s example of web hosting, a massively parallel GPGPU isn’t very useful, but a massively parallel cluster is.
In the long term, FPGAs could eventually unify GPUs and CPUs so that we no longer have to consider them different beasts for different workloads. Instead of compiling down to a fixed instruction set architecture, software can be compiled directly into transistor logic.
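A toy way to picture that: an FPGA’s lookup tables are essentially precomputed truth tables, so “compiling to logic” can be imagined as enumerating a function over its input bits (grossly simplified, and the function names here are mine, not any real toolchain’s):

    from itertools import product

    def compile_to_lut(fn, n_inputs):
        """Enumerate a boolean function into a lookup table, roughly the way a
        synthesis tool maps small pieces of logic onto an FPGA's LUT primitives."""
        return {bits: fn(*bits) for bits in product((0, 1), repeat=n_inputs)}

    # A 1-bit full adder "compiled" into two LUTs (sum and carry-out).
    sum_lut   = compile_to_lut(lambda a, b, cin: a ^ b ^ cin, 3)
    carry_lut = compile_to_lut(lambda a, b, cin: (a & b) | (cin & (a ^ b)), 3)

    print(sum_lut[(1, 0, 1)], carry_lut[(1, 0, 1)])   # 0 1  -> 1 + 0 + 1 = 0b10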
I’m not happy with the price of GPUs these days, so I think there may be an opportunity for FPGAs to grow out of niche status and become more of a commodity. However, IMHO, it will be many years before software toolchains are actually ready to target FPGAs. What we have is a sort of chicken-and-egg problem.
FPGA toolchains are so proprietary they make Microsoft look like Richard Stallman. That has to change before they can get any real use in general computation.
tidux,
Yeah, I’m pretty sure this could be addressed by FOSS projects, but obviously we’re not there yet. If the industry wants to fight FOSS, that would be a shame, and it might well hurt access, especially for smaller developers.
1. I wonder if projects like RISC-V can help create a better & more open ecosystem for FPGAs.
2. Aren’t FPGAs far too slow to replace a normal CPU or GPU?
FPGAs are slower than CPUs if you make them emulate full CPUs, but they can accelerate certain things faster than GPUs. Just look at crypto mining. It went CPU -> GPU -> FPGA -> ASIC. FPGAs are significantly cheaper than fabbing your own ASIC for everything.
Lennie,
Not to overgeneralize, but I basically agree with this. GPUs are highly optimized for the vector tasks they encounter in graphics, such as simultaneously performing the exact same operation on every element. But frequently real world algorithms have “if X then Y else Z” logic, in which case the GPU has to process the vector in two or three separate passes to process X, Y and Z. More complex algorithms can result in more GPU inefficiencies. There’s still merit in using a GPU versus a CPU due to the sheer amount of parallelism. However the multiple passes represent an inefficiency compared to an FPGA that can be virtually rewired to handle the logic in one pass.
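A rough NumPy sketch of the “multiple passes” point: on wide vector hardware, if/else typically becomes “compute both branches, then select”, rather than branching per element (an illustration of the idea, not how any particular GPU driver actually schedules it):

    import numpy as np

    x = np.random.rand(1_000_000)

    # Scalar logic: if x > 0.5 then sqrt(x) else x**2.
    # Vector hardware effectively evaluates both branches over the whole
    # array and then selects per element -- two "passes" plus a blend.
    branch_y = np.sqrt(x)                              # pass 1: Y for every element
    branch_z = x ** 2                                  # pass 2: Z for every element
    result   = np.where(x > 0.5, branch_y, branch_z)   # per-element select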
To elaborate on what you were saying, an FPGA that emulates a CPU architecture to run software is not going to perform as well as an ASIC dedicated to running that CPU architecture:
software -> machine code -> ASIC processor = faster
software -> machine code -> FPGA processor = slower
While FPGA potentially gives us some interesting options for building processors at home, the software isn’t taking advantage of the FPGA’s programmable capabilities. In other words, the FPGA is being used in a way that isn’t optimized for the software running on it. Consider how FPGAs are meant to be used:
software -> FPGA logic -> FPGA
Assuming the problem has a lot of parallelism and the compiler is any good, then this should be significantly faster than a traditional processor stepping through sequential machine code algorithms.
An ASIC is always going to win any performance contest:
software -> ASIC logic -> ASIC
…but until we have fab technology that can somehow cheaply manufacture ASICs at home, FPGAs are the more interesting option for software developers.
There is Project IceStorm, which targets Lattice FPGAs (albeit low-end ones):
http://www.clifford.at/icestorm
It’s probably not yet ready for a datacenter but it’s a start.
I am currently playing with it (on an Olimex OpenHW board) and it can transform your Verilog code into an FPGA blinky.
If someone is interested, here are the IceStorm tools packaged for FreeBSD:
https://github.com/thefallenidealist/ports_FreeBSD/tree/master/devel
JohnnyO,
Thank you for linking that info. I’m definitely interested in these things and I’d like to get in touch with you over email. I wish we could exchange contact information without publishing it… that’s something OSNews lacks.
I wonder why ~Amiga operating systems don’t try to integrate FPGAs into their systems – it would be very much in the Amiga spirit/tradition of a custom chipset …but an “on-demand” one! And ~Amiga would lead again instead of badly trailing behind…
PS. Don’t you have a homepage with contact info? (hm, me neither …though I’ve been planning one for a long time)
zima,
Well, I did, but that was eons ago and I never put it back up after switching hosting.
I’ve asked Thom before to give out my info to another user, but I think he ignored my request. I don’t even remember who it was anymore – was it you? Haha.
IIRC it wasn’t me (I think I wouldn’t mind giving out an email address in the comments…); but maybe I also don’t remember.
And W8, aren’t you the one providing hosting? …reminds me of a local ~saying: ~”the shoemaker walks barefoot”
(why does OSAlert seem to change all quoted emojis to “;)”?…)
Not sure if that’s sarcasm or not, but the second link is my personal GitHub page. There should be an email address in my personal repos’ LICENCE files.
Disclaimer: I am not the person who made the OSS FPGA toolchain possible, but I am interested in it (trying to learn FPGAs for fun).
So… what is it good for?
Well, I am far from an FPGA expert, but for me it is good for learning another computation “mindset”, because I am not familiar with HW synthesis. IceStorm also taught me that it is possible to learn something at that level without downloading a multi-GB IDE, using tools I am already familiar with (editor & shell & makefile).
Wild guess: implementing some new crypto/acceleration/offload algorithm without changing your SW stack (too much).
An ARM core != a Xeon core, so counting cores as a metric is rather useless. (In my experience, high-end AArch64 core performance is ~25-30% of a Broadwell / Skylake core, but YMMV.)
Moreover, a Supermicro BigTwin (4 x 2S Xeon Gold 6152) can pack 160 cores / 320 threads and 6TB of RAM into 2U (224 cores / 448 threads and 12TB of RAM if you opt for the far more expensive Xeon Platinum 8176/8180M), and should be ~2-3x faster (again, YMMV) than a Cavium-based machine.
Now, I’ve added the YMMV a couple of times, and for a good reason.
ARM has two advantages (and density is *not* one of them).
1. Price per transaction. Intel Xeon pricing, especially for the high-end parts and the M parts, is unreasonable. AMD might be able to pull another Opteron and force Intel to lower prices, but that remains to be seen.
2. Power per transaction. ARM cores are more efficient. If your application requires a lot of slow threads and you have a limited power budget, ARM is definitely the answer.
– Gilboa
So what “high-end” ARM did you test, and how?
Do you have any experience with Centriq or ThunderX2?
ThunderX was really weak.
We plan to test the ThunderX2 when we have some free time (and when it’s freely available).
Please note that our proprietary application is heavily CPU/cache/memory-bandwidth limited and has zero acceleration potential, so (even) the ThunderX2’s limited inter-core/CPU interconnect bandwidth might be a major performance handicap.
– Gilboa
This is definitely a sign of non-multicore-friendly workload/programming practices.
I was using Atom-based Xeons in my example. Why are you bringing up machines that literally cost 10x-15x as much and use many times as much power? My whole post was about competing with ARM – Atom-based Xeons compete with ARM (or at least try to). High-end Xeons cost way too much, use too much power, etc. – it isn’t the same market at all.
So let me clarify… I thought the context was obvious in my post, but maybe not. Intel has nothing remotely as dense as Cavium/Centriq with competitive power/core and cost/core. My argument is simply that they could, if they wanted to, using Atom cores – they don’t need to switch to ARM to compete…
You talked about density, which usually translates to MIPS per U.
You claimed that Intel has nothing remotely close (your words, not mine) to ARM’s density.
I proved otherwise.
A yet-to-be-released high-end Cavium ThunderX2 based solution can “shove” 2 x 48 x 4 (384 cores) into 2U and requires ~190W per socket.
An already-shipping Intel Xeon Platinum based solution can pack 224 fast cores (448 threads) into 2U and requires ~165W per socket (205W if you go super-high-end).
An already-shipping AMD Epyc based solution can pack 256 cores (512 threads) into 2U and requires ~180W per socket.
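For what it’s worth, the same comparison as quick arithmetic (per-socket wattages taken as the nominal TDPs quoted above, not measured draw):

    # 2U, 4 nodes, 2 sockets per node = 8 sockets in every configuration.
    configs = {
        # name: (cores per socket, nominal watts per socket)
        "Cavium ThunderX2":    (48, 190),
        "Intel Xeon Platinum": (28, 165),   # 205 W for the top bin
        "AMD Epyc":            (32, 180),
    }

    sockets = 8
    for name, (cores, watts) in configs.items():
        total_cores = sockets * cores
        total_watts = sockets * watts
        print(f"{name}: {total_cores} cores/2U, ~{total_watts} W, "
              f"~{total_watts / total_cores:.1f} W/core")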
As this product is still soft-launched, pricing information is not available, but if the ThunderX 1 is any indication, pricing will be ~40-50% of a comparable AMD/Intel based solution (a far cry from your 10-15x claim).
– Gilboa
The Xeon 8180 is an $11k chip. The ThunderX2 is (at most) a $2k chip – pricing info is still hard to find, but it is likely about the same as the ThunderX (which was around $800).
https://www.anandtech.com/show/10353/investigating-cavium-thunderx-4…
That’s $90k vs $12k just on the CPUs alone. Cavium motherboards will obviously be far cheaper (it’s an SoC, so they are far simpler) and cooling/power components will be cheaper as well. The rest of the components are irrelevant, as they are not platform-specific for the most part.
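The arithmetic behind those totals, assuming 8 sockets in the 2U box and reading the ThunderX2 price back from the $12k figure (Cavium hasn’t published list prices, so that per-chip number is a guess):

    sockets = 8                      # 4 dual-socket nodes in 2U

    xeon_8180_list  = 11_000         # approximate list price per chip
    thunderx2_guess = 1_500          # assumed; consistent with the $12k total above

    print(f"Xeon 8180: ${sockets * xeon_8180_list:,}")    # Xeon 8180: $88,000
    print(f"ThunderX2: ${sockets * thunderx2_guess:,}")   # ThunderX2: $12,000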
10x-15x could be a bit of an overstatement, but it’s still at least 5x-10x cheaper to go with Cavium (and far lower power usage on a per-thread basis), and if they really are pricing them the same as the ThunderX (say $1k), the difference really is 10x-15x…
As far as performance goes, I think you’re missing the point. If you’re running a bunch of redis/memcache instances, you don’t want all that performance – it’s a waste of silicon. You just want a shit ton of cores with a bunch of cheap memory hanging off of them, occupying as little rack space as possible and using minimal power… This is exactly the kind of thing ARM/Atom is good for.
Why on earth would anyone buy a Xeon Platinum to do this? I’m not arguing that high-end Xeons are bad (hell, they are awesome!) – I’m arguing that low-end (Atom-based) Xeons are bad. They are simply built the wrong way to compete in the market they would actually be competitive in. It’s not because they are too slow, and it’s not because they are too power hungry – it’s because they are not dense enough for the market they should be targeting…
The market Cavium primarily targets doesn’t care about MIPS/U, it cares about threads/U. Latency is all that matters…
Opteron A isn’t serious?