Some interesting figures from LinkedIn, who benchmarked the compile times of their Swift-based iOS application. You’d think the Mac Pro would deliver the fastest compiles, but as it turns out – that’s not quite true.
As you can see, the 12-core Mac Pro is indeed the slowest machine to build our code with Swift, and going from the default 24 jobs setting down to only 5 threads improves compilation time by 23%. Due to this, even a 2-core Mac Mini ($1,399.00) builds faster than the 12-core Mac Pro ($6,999.00).
As Steven Troughton-Smith notes on Twitter – “People suggested that the Mac Pro is necessary because devs need more cores; maybe we just need better compilers? There’s no point even theorizing about a 24-core iMac Pro if a 4-core MBP or mini will beat it at compiling.”
Thom Holwerda,
You know, I’ve said it before: unless you have a highly localized, CPU-bound process, you’re usually better off with the faster clock rate.
Ironically, the “higher end” systems with more cores often perform worse, because the law of diminishing returns kicks in at a relatively low core count, whereas CPU speed is proportional to clock speed (if you look at the clock speeds on those Mac Pros you’ll see they are lower).
People like the idea that Moore’s law can continue by replacing clock speed gains with core count gains, but generic software does not scale well that way.
I write multi-threaded code for Haiku, and one of the two things I struggle with is getting the inner loops to fit inside the L1 cache of each CPU so the threads run as fast as possible.
The harder problem, and sometimes an unsolvable one, is the data access of the different threads: it is rare that the threads want to work on the same data at the same time, so each thread ends up invalidating the L2 and L3 caches for the others.
There is a way to lock certain data into the cache, but I don’t think it helps with my programs. I just wish I had larger caches in my i7.
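To make the invalidation problem concrete, here is a minimal C++ sketch (purely illustrative, not code from the programs above) of one such case, false sharing: if the two counters shared a 64-byte cache line, every increment on one core would invalidate the line held by the other core; the alignas(64) padding gives each counter its own line and removes the ping-pong.

#include <atomic>
#include <cstdio>
#include <thread>

// Each counter gets its own 64-byte cache line via alignas(64).
// Remove the alignment and both counters land on one line, so every
// increment on one core invalidates the line cached by the other core.
struct PaddedCounter {
    alignas(64) std::atomic<long> value{0};
};

int main() {
    PaddedCounter counters[2];
    auto work = [&](int i) {
        for (long n = 0; n < 50000000; ++n)
            counters[i].value.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(work, 0), b(work, 1);
    a.join();
    b.join();
    std::printf("%ld %ld\n", counters[0].value.load(), counters[1].value.load());
    return 0;
}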
Wouldn’t it be marvelous to have on-chip addressable local memory under the control of the programmer, like the 256 KB local storage in the Cell (PS3)?
The XBONE has 32MB of EDRAM directly on the CPU die, and Intel make chips with 128 MB of EDRAM. I think this is the way things will slowly go as having some memory that’s thousands of times faster than system RAM, but isn’t CPU-controlled cache, is a massive win for software that can use that space intelligently; games are a very good place for that understanding to be born.
Back in the day, RAM was faster than the CPU. This is part of the reason (along with chip complexity) why the 8-bit CPUs had so few registers: there wasn’t really a penalty for going out to RAM.
The ultimate expression of this was the TMS9900 CPU (used in the TI-99/4). It was a 16-bit CPU with 16 registers… that were all stored in RAM. IIRC only the instruction counter and flags, and the address of the register window, were kept internal to the chip.
That’s essentially what you have on the new Knights Landing Xeon Phi CPUs. There’s 16 GB of on-die RAM (MCDRAM). The chip can be configured such that the MCDRAM is transparently used as another cache level, in which case it’s not under control of the programmer but instead managed by the memory controller, or you can configure it as a distinctly separate NUMA domain (or 4 domains), in which case it is directly addressable by the programmer in their application.
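For readers who want to try the flat (directly addressable) mode, a small sketch, assuming the memkind library’s hbwmalloc interface is installed (link with -lmemkind); the allocator calls are from that library, nothing here is Intel-compiler-specific.

#include <hbwmalloc.h>  // memkind's high-bandwidth-memory allocator (assumed available)
#include <cstdio>

int main() {
    const size_t n = 1u << 20;
    // Ask for the buffer in MCDRAM (the high-bandwidth NUMA domain) rather than DDR.
    double *buf = static_cast<double *>(hbw_malloc(n * sizeof(double)));
    if (!buf) {
        std::puts("no high-bandwidth memory available");
        return 1;
    }
    for (size_t i = 0; i < n; ++i)
        buf[i] = static_cast<double>(i);
    std::printf("last element: %f\n", buf[n - 1]);
    hbw_free(buf);
    return 0;
}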
Earl C Pottinger,
Once you leave the local cache, obviously the higher-level cache and memory bandwidth become the primary bottleneck. We can keep adding dozens of cores, but this just exacerbates the contention on the shared bus between them. The overhead will largely negate the benefit of adding the new cores.
Even when threads are “embarrassingly parallel” and do not share the same data and have no fine-grained dependencies on one another, the shared memory bottleneck kills performance for them. So in the long run the only viable way to achieve scalability through hundreds of cores is with NUMA (non-uniform memory access) topologies. This way each independent process can run at full speed without any degradation from neighboring processes.
I think NUMA is great for VMs and running many daemon instances simultaneously, but a lot of modern multithreaded code is designed around the basic assumption that data structures can be efficiently shared by all threads, which is a bad assumption for NUMA systems intending to scale to thousands of cores. Long-term scalability requires us as developers to treat cores less like a shared memory architecture and more like a cluster.
Of course, the real question might be why we would need hundreds of cores anyways, haha. If we had them, we could each have our own personal IBM Watson (which ran on 90 nodes with 720 total physical cores); I’m sure there’s some innovation to be had.
You could probably have a system that writes/directs/orchestrates/acts/renders a Hollywood-quality movie in real time. Instead of “searching” for a movie as one does on Netflix, a user could enter their personal preferences and a movie would be custom generated to their specs in real time! Friends would end up sharing their favorite creations much the way they share Minecraft world seeds.
I wonder if we could make a machine that, to the programmer, appears to give access to all the memory, but in fact divides the memory into blocks so each CPU has its own local memory pool.
When accessing memory that is in its pool the access is very fast; if the access is outside its pool, then a virtual memory system moves that block to the local pool and invalidates/exchanges the original block.
Of course, if memory access tends to be scattered then you would end up with a slow machine, but if each thread tends to work on its own data, the data will migrate to the CPU’s memory block with the fast access.
Darn, I thought about it some more: I see a real speed-up for some programs, but most programs would be slowed down.
While you may be speaking ironically (hard to tell on the Internet), this is how computers do it already.
But IMHO message-passing is the future. That’s actually what the cache coherency protocols already do, just hidden behind memory accesses.
Megol,
I agree with you on message passing; it can work both with multi-core processors and with extremely large clusters. Time will show it to be the most scalable solution.
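A minimal C++ sketch of what message passing between threads looks like in practice (the Channel class here is a toy of my own, not a standard API): the workers never share data structures, they only hand owned messages through a locked queue.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Toy channel: send() hands ownership of a message to a later recv().
template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(T msg) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }
    T recv() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return !q_.empty(); });
        T msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
};

int main() {
    Channel<int> ch;
    std::thread producer([&] {
        for (int i = 0; i < 5; ++i)
            ch.send(i * i);  // the only way the two threads communicate
    });
    for (int i = 0; i < 5; ++i)
        std::printf("got %d\n", ch.recv());
    producer.join();
    return 0;
}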
Earl C Pottinger,
I’ve thought about it before. It seems feasible to measure the access patterns to see exactly what the bottlenecks are, but actually compensating for them without correcting the underlying code seems infeasible to me.
At a fundamental level we need our data structures and logic paths to be more parallel-friendly throughout the program to remove synchronous dependencies, which is something that currently requires a dedicated programmer to do. Maybe some day we’ll have code optimizers that see what the programmer was trying to do and automatically generate new code better suited for distributed NUMA topologies, but it doesn’t seem we’re there yet. The main difficulty is that the optimizer doesn’t know how to differentiate between behaviors the programmer really intended versus those that are merely coincidental.
Let’s say there’s a program that connects to a server, reads a file, transmits chunks of the file to the server in a loop, writes a file, and then exits, in that sequence. A computer has to assume all of these are deliberate and is not free to optimize anything that would change the semantics. Theoretically it might have been possible to get more parallelization by opening/reading/writing the file while the connection was being made, but the computer doesn’t know that, because the sequential ordering is implied by the code even if it isn’t critical. We can’t know if the programmer intended for the file write to be absolutely dependent on/synchronized with the socket IO. Perhaps it’s a deliberate log/audit record, in which case the synchronization is important, or perhaps it’s an unrelated process that just happens to be specified sequentially because that was the easiest way for the programmer to implement it.
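If the programmer states the independence explicitly, no clever optimizer is needed. A rough C++ sketch of the same idea (the file name and the connect step are placeholders): the file is read on another thread while the connection is being set up, and the two only join before the transmit loop.

#include <fstream>
#include <future>
#include <iostream>
#include <iterator>
#include <string>

// Read an entire file into memory (placeholder for the real file handling).
std::string read_whole_file(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

int main() {
    // Kick off the file read while the (slow) connection is being made.
    auto pending = std::async(std::launch::async, read_whole_file, "input.bin");

    // ... connect to the server here (placeholder for real socket code) ...

    std::string data = pending.get();  // join before the transmit loop
    std::cout << "read " << data.size() << " bytes while connecting\n";
    return 0;
}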
A game daemon might send/receive messages with many UDP clients in a specific access pattern. In all likelihood the ordering is just coincidental based on the arbitrary algorithms chosen by the programmer, yet the optimizer is not free to replace the sequential algorithm with a new parallel algorithm with no inter-dependencies because it has no way to know whether the packet dependencies were intentional.
So even if we had a very smart optimizer that’s capable of transforming serial programs to parallel automatically, it’s going to be highly constrained to follow the unintentional semantics of the original programmer’s code that specifies thousands of instances of inadvertent serialization, which is the enemy of scalability.
Now maybe there could be some AI that could use fuzzy analysis to determine if a serial construction was intentional or not (i.e. execute A + B + C in parallel even though the original code said to execute them in order), but if the AI makes a wrong judgement then it opens up the possibility that new subtle bugs will crop up, like an application confirming a transaction while the transaction is still being processed in parallel.
For all these reasons, I think upgrading to massively parallel architectures is not going to be something we can solve just with hardware…we’re going to have to rewrite the software too.
I think a lot of us just felt the trade-off was wrong.
For the longest time, I worked in Python because I was goal-oriented and my experiences with C and C++ had taught me that “writing your own CPU-bound algorithms simply isn’t worth the stress of getting them right”.
When Rust came around, I got really excited about things like the ability to be much more certain I’d handled every error case thanks to monadic error-handling and the ability to encode all sorts of invariants in the type system, thanks to having a compile phase to eliminate the abstraction penalty I was used to.
That led into the focus on zero-cost abstractions, which caused me to start to re-evaluate my conclusion that “reliable + CPU-bound is simply too hard for me, so I’ll take reliable and limit myself to other people’s C extensions”.
I was then introduced, by the Rust community, to things like “code::dive conference 2014 – Scott Meyers: Cpu Caches and Why You Care” and started to get really excited. (Come for the safety, stay for the performance)
By that point, my hopes had been raised enough that I searched around and discovered that Valgrind+KCachegrind can be used to profile CPU cache misses.
I’m getting rather psyched about trying that kind of optimization myself. (I’m also really psyched about how easy cargo-fuzz makes fuzzing Rust programs.)
ssokolow,
I’m quite enthusiastic about it too, and I think devs who’ve moved to managed languages out of frustration with unsafe code should give Rust a look.
Note that depending on the CPU, L2 may be per-core while sharing happens at L3.
In any case, if you can get everything optimized to run from L1 you get that extra speed. However, in many cases (at least the ones I am tackling) you still have the bottleneck of feeding in and writing out the data.
People are just not aware of how fast the processors we use today actually are.
http://www.phoronix.com/scan.php?page=article&item=amd-ryzen-1700&n…
The reality is that the law of diminishing returns is not straightforward, as the Ryzen proves.
People like the idea that Moore’s law can continue by replacing clock speed gains with core count gains, but generic software does not scale well that way.
Some generic software scales in ways that are suitable for more cores. But a lot of software does not. So this really should be “most generic software does not scale well that way”.
A lot of software development for Linux is accelerated by more cores, but that is because the build system is, in a lot of cases, designed to exploit multiple cores. So the OS X development system is not designed to exploit multiple cores to the fullest.
Well, the article says they did benchmark on the two systems, and at one point they found the Mac Pro was better, and it seemed like more cores were better than a higher clock speed.
Apple updated Xcode, and now there are the wonky results detailed in the article.
Ha. Yeah.
Guess what is highly linear? Compiling code.
Guess what is highly linear and easy to cache? Compiling code.
Guess what is highly linear, easy to cache, and has a high possibility of breaking when parallelized? Compiling code.
There are caveats, but in general….
I’ve had this fight with devs before. What most devs really need is a moderate number of really fast cores, because most of the time the changes are small, and if a build system is in use, only the changes will be recompiled.
Most devs are application developers with low code churn. Some developers, like kernel devs, do deal with a code base with lots of churn which makes caching hard, and they can take advantage of lots of cores.
I don’t have a LinkedIn account so I can’t see the results, but my comment based on the summary is this: that is not a $1,399 computer. The Mac Mini hasn’t been updated in 2+ years. The 2012 i7 was faster than the 2014 i7.
So chances are, a sub-$1000 5 year old machine is better than a current Mac Pro for Swift compilation.
The Mac Pro is a 2013-era machine too. It hasn’t been refreshed. The other issue here is that, depending on how their software is set up, the build can’t be parallelized well. If they have most of their code in only a few files, it can’t be compiled in parallel easily.
If you test compiling something huge, like say the Linux kernel or FreeBSD, you can see parallel builds are faster up to the point you’re I/O blocked.
It’s certainly possible the Swift front end of the compiler is not optimized for parallel builds either.
All we’ve proved is that yesteryear Apple tech is slow. If Tim Cook would only refresh the Mac line like he’s supposed to. No one sent him the memo that tech companies have short upgrade cycles.
Agreed, I was just looking down the comments to see if anyone else had noted that the Mac Pro is woefully underpowered. I believe the high-end iMacs have more processor power than the Mac Pro. Marco Arment is forever whining about it on ATP. I think he asserted this week that the current iPhone 7 has a better Geekbench score in single-threaded mode than a Mac Pro.
Interesting, I also don’t have a LinkedIn account, but I can view the results just fine. Maybe geofencing?
It wants me to register, both here and at work. I left LinkedIn years ago after they pulled a stunt that tricked people into providing their email passwords so that LinkedIn could read all the contacts and spam them.
An Apple engineer discusses the problem of running multi-core applications on Macs.
https://www.quora.com/Is-it-the-OS-that-handles-multi-core-use-or-do…
They have a Swift compiler for Linux; I wonder if it demonstrates the same issues. Obviously, you couldn’t compile iOS/macOS applications, but if it’s really Xcode and not the OS, you should be able to reproduce it with a good sample app.
It is an OS X/hardware problem according to the above link. Macs default to a power-saving mode which disables cores.
Hmm, I read the article a few times. They did make that assertion, but I don’t completely understand why it has more of an effect on the Mac Pro than the MacBook Pro. I mean, the Mac Pro really has little reason to aggressively conserve energy, and it has more cores. But then again it’s a bug, not an intended feature.
The problem we have today is that processor speed has more or less peaked; we make it up by going parallel with multiple cores. However, most developers develop software with a single thread in mind, so a computer with 24 cores will perform most tasks about as well as one with 4, because (oversimplified):
Core 1 – OS
Core 2 – IDE
Core 3 – Browser
Core 4 – Compiling
When you get a lot of cores, there will just be a lot of cores sitting idle. Having a lot of cores makes sense for jobs that run on many threads working together or that take a lot of requests at the same time.
However, most programs are based on top-down development: one thing happens, then the next. So the advantage of multiple cores falls off rather fast.
theTSF,
I agree with your points; even though many of us acknowledge the limits of sequential programming, we do it anyway. We’re going to have to make some big changes to take advantage of parallelism. We keep assuming this change is coming down the pipeline, but it never arrives. Even though this is impeding progress, it might not happen… much like IPv6. (Now let’s duck for cover, haha).
This is part of why I enjoy my current Linux setup. Other than Firefox, most of the processes I use are small and only depend on shared network or filesystem features to work together.
* mail sync is separate from mail reading
* music playback is separate from the UI
* most things live in terminals unless they’re innately tied to graphics by their nature (media viewers, GIMP, etc.)
* software building is usually done through gcc, g++, or go which all scale beautifully to the 16 threads I have available
* VMs can have a few cores and a chunk of RAM carved off for themselves without impacting the rest of the system
Most desktop workflows are stupidly linear, and we can’t do much about that because humans. Multithreading vim only gets me so much.
It’s nice to say “developers develop software with a single thread in mind”, but then there is reality. Only so many workloads can be parallelized, and only so much concurrency can be squeezed out of problems. The work doesn’t scale because the data can’t be broken down enough due to interdependence, or the set isn’t big enough. Especially end user stuff.
Servers are a different story because they have a lot more going on. Provided the language isn’t a barrier to parallelization. There are lots of opportunities to use many cores since a lot of the work has low interdependence between requests.
If you had to point at something, I would guess it would be the preprocessor, which is not the compiler. It probably could do a better job of breaking files into components to reduce what needs to be compiled.
There are a lot of steps involved in turning text files into machine code, so blaming the compiler is a little disingenuous. We never really think about them, but build systems do some heroic stuff to detangle code bases to only recompile needed files. Then there is the compiler toolchain of preprocessor, compiler, and linker.
We have tools that paper over a lot of details in the compilation process, but some serious work has been put into those tools.
In fact, I sometimes think for days, even weeks, about how to solve a problem by making it multi-threaded.
Two things have made me do this.
(1) I wrote a digital/analog scope program for Haiku where each function/control ended up having a separate thread; before I knew it I had a multi-threaded program that was fast and smooth to use. I now want all my programs to be like that.
(2) While writing a program to search a large hash space I ended up with a single-threaded program that took an entire weekend (54 hours) to run. That meant every time I tried to improve the hashing parameters I had to wait two-plus days to get my test data. Finally I looked at how I handled the search and the tests: I broke the search into multiple range-search threads and then had those pass hashes that passed the first tests to a final thread that did the rest of the tests. Running as 16 threads plus the final testing thread, the runs took less than 2 hours each.
However, note: to get that speed-up I had to come up with a whole new way to process the data.
Multi-threading is hard not because it is hard to do, but because you have to forget the old methods of solving a problem and think of brand new methods that have very little to do with how the old methods worked.
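A rough C++ sketch of that structure (the two test functions are placeholders, not the real hash criteria): each worker scans its own slice of the search space with the cheap test and forwards the survivors, and the expensive tests run in one final pass.

#include <bitset>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

bool cheap_test(uint32_t v)     { return (v & 0xFF) == 0x5A; }               // placeholder
bool expensive_test(uint32_t v) { return std::bitset<32>(v).count() >= 12; } // placeholder

int main() {
    const unsigned workers = 16;
    const uint64_t range = 1ull << 24;   // the full 32-bit space in the real run
    std::vector<uint32_t> candidates;
    std::mutex m;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const uint64_t begin = range * w / workers;
            const uint64_t end   = range * (w + 1) / workers;
            std::vector<uint32_t> local;  // collect locally to avoid lock contention
            for (uint64_t v = begin; v < end; ++v)
                if (cheap_test(static_cast<uint32_t>(v)))
                    local.push_back(static_cast<uint32_t>(v));
            std::lock_guard<std::mutex> lock(m);
            candidates.insert(candidates.end(), local.begin(), local.end());
        });
    }
    for (auto &t : pool)
        t.join();

    // Final pass plays the role of the "final testing thread".
    size_t accepted = 0;
    for (uint32_t v : candidates)
        if (expensive_test(v))
            ++accepted;
    std::printf("accepted %zu of %zu candidates\n", accepted, candidates.size());
    return 0;
}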
Earl C Pottinger,
For the sake of discussion, do you mind sharing the problem and requirements you were trying to solve? The thing about using hashes for search is that it’s usually very fast (i.e. O(1)). A binary search is O(log(N)) and even that should converge fairly quickly. What exactly were you searching (assuming you don’t mind my asking)? Do you mean iterating a hash over values like Bitcoin?
Hello, I don’t think you will find my use case useful.
What I needed was a set of hash values to fill an S-box such that, when any two values in the S-box are XORed together, the result is not also in the S-box. Add on the factor that the values were also doing a bit rotation through a 32-64 bit hash value, and I had to do a lot of testing for each value.
It did not end there: I also wanted each hash value to flip a minimum number of bits, so I also needed to do bit counting for each tested value.
I hope that is clear.
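For what it’s worth, a small C++ sketch of how I read that property test (the 12-bit threshold and the scan range are made-up numbers): a candidate is kept only if it flips enough bits and its XOR with every value already chosen never lands back in the box, including the case where the candidate equals the XOR of a pair already chosen.

#include <bitset>
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    const size_t min_bits = 12;               // assumed minimum number of set bits
    std::vector<uint32_t> sbox;
    std::unordered_set<uint32_t> in_box;      // fast membership test
    std::unordered_set<uint32_t> pair_xors;   // x ^ y for every pair already chosen

    for (uint32_t v = 1; sbox.size() < 256 && v < (1u << 24); ++v) {
        if (std::bitset<32>(v).count() < min_bits)
            continue;                         // must flip a minimum number of bits
        if (pair_xors.count(v))
            continue;                         // v equals the XOR of two chosen values
        bool ok = true;
        for (uint32_t x : sbox)
            if (in_box.count(v ^ x)) { ok = false; break; }  // v ^ x already in the box
        if (!ok)
            continue;
        for (uint32_t x : sbox)
            pair_xors.insert(v ^ x);
        sbox.push_back(v);
        in_box.insert(v);
    }
    std::printf("kept %zu values\n", sbox.size());
    return 0;
}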
Earl C Pottinger,
So you were effectively creating your own hash and iterating over it to test some properties, then? That makes more sense; it’s not really what I was picturing, haha.
Sounds fun. I assume you were searching a 32-bit space then? Brute-forcing a 64-bit hash is considerably harder; 128 bits is probably impossible.
I’ve implemented standard crypto hash algorithms but I’ve never designed my own. I’ve designed some random number generators, which are similar, but that was to initialize random motion vectors, not really anything dependent on strong hashing requirements.
For the 32-bit version I tested all possible values; I did not do a true 64-bit version, I just used 64-bit variables.
I was playing around with 48-bit hashes, and testing random() numbers. It did work, and the larger search space found valid hashes faster. But the original 32-bit hashes worked fine, as I found over two million valid numbers and I only needed 256*4096 values (enough to hash an entire disk data block).
What was fun was when I realized how to do it in a multi-threaded way by throwing out my old assumptions.