“Version 4.6 of GCC was released over the weekend with a multitude of improvements and version 2.9 of the Low-Level Virtual Machine is due out in early April with its share of improvements. How though do these two leading open-source compilers compare? In this article we are providing benchmarks of GCC 4.5.2, GCC 4.6.0, DragonEgg with LLVM 2.9, and Clang with LLVM 2.9 across five distinct AMD/Intel systems to see how the compiler performance compares.”
I've always wanted to try Clang, but there are no Windows binaries. Are there any plans to fully support Windows in the future?
There have been Windows binaries since 1.8, provided for MinGW. And if you had typed “llvm windows” into a search engine, the very first result would have been a howto on using LLVM with MSVC.
No. There are LLVM binaries, but LLVM and Clang are two different beasts: Clang uses LLVM, but LLVM can also be used with GCC, and those binaries are LLVM-GCC binaries, not Clang. There are no Clang binaries for Windows on the official site, and further research has revealed that there is basically no support. Nobody on the dev team uses Windows, or even just builds and tests on it. There are Windows makefiles which may or may not work for any given version, and things are basically expected to crash and burn on that platform because there is zero testing. Not really surprising, given that this is an Apple-sponsored project.
I knew about that site, just as I knew about the LLVM-GCC binaries. I asked for Clang binaries and real Windows support.
There aren’t any binaries that I could see, but Clang apparently does work on Windows. You’d just have to build it yourself, using either Visual Studio (don’t know if the Windows SDK or VS Express versions work, but I’d assume so) or GCC.
It can generate Windows binaries and supports some Windows-isms (the same ones GCC supports). They’re apparently working on supporting more MSVC extensions, so it can compile against Microsoft’s headers instead of the MinGW ones, and on a compiler driver compatible with cl.exe.
It looks like they have enough developer interest to make Clang work well on Windows, so it should get better, even if it’s not there yet.
(I’ve not actually tried it though. For all I know, it doesn’t work yet – I’m just going by their web site).
What were the flags, data sets used, etc, etc, etc?
While the data sets shouldn’t matter in any way, I can only agree that the lack of information concerning the compiler flags makes this something of a black-box benchmark.
From what I’ve gathered, these tests are done with whatever flags were set upstream (whether those are the ones in the original source packages or ones chosen by some repository package maintainer, I have no idea), but those flags may very well specify very low levels of optimization and/or contain debugging flags that impact performance.
Other packages, like x264, enable tons of hand-written assembly for x86/x86-64 by default, which pretty much renders the tests worthless as a comparison between compilers, and AFAIK Phoronix’s tests do not disable this assembly code (x264’s configure script has a --disable-asm switch for exactly this) when doing their benchmarks.
A proper test (imo) would be to compile all packages at at least -O3 (and possibly also -O2) and compare the corresponding results.
As it is now, an upstream package may ship with compiler settings that are either intentionally tailored to a specific compiler or that by chance suit one compiler better than another, which may not reflect the performance of each compiler when told to generate its fastest code (usually -O3).
Not comparing the compilers at a specific optimization level (or several specific levels, preferably the highest if only one is to be used) means that the test results may often be a poor reflection of what the compilers can actually do.
I can see that Phoronix may shy away from testing a large set of optimization levels, but then they should at least settle on -O3, which is the level that, from the compiler’s standpoint, *should* generate the fastest code. As it is now, the packages they benchmark may use -Os or -O2 for all we know, and there’s really no fair way of measuring the performance of ‘middle’ settings: how can you decide whether -O2 on one compiler corresponds to -O2 on another? Maybe -O2 is closer to -O3 in compiler A, while -O2 is closer to -O1 in compiler B. So the only fair way is to either compare several optimization levels against each other, or compare only the highest.
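To be concrete, here’s a sketch of what I mean (the compiler binary names and the toy workload are placeholders of mine, not anything Phoronix uses): give every compiler the same source at the same pinned level, and time the result, rather than trusting whatever CFLAGS the upstream tarball happens to ship.

/* fair_bench.c -- trivial stand-in for a real workload; the point is
 * the build lines, not the loop:
 *
 *   gcc-4.6 -O3 -o bench-gcc   fair_bench.c
 *   clang   -O3 -o bench-clang fair_bench.c
 *   time ./bench-gcc
 *   time ./bench-clang
 *
 * Both compilers get identical source and an identical, pinned -O level,
 * so you compare compiler against compiler rather than CFLAGS against
 * CFLAGS. Repeat at -O2, and with -march=native, for more data points. */
#include <stdio.h>

int main(void)
{
    double acc = 0.0;
    for (long i = 1; i < 200000000L; i++)
        acc += 1.0 / (double)i;   /* simple FP-heavy loop */
    printf("%f\n", acc);          /* use the result so it isn't optimized away */
    return 0;
}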
As it is now, I find Phoronix test results interesting but I do take them with a large grain of salt. I will continue to rely on my own tests and tests where all relevant data is presented for easy verification.
I do applaud Phoronix for doing these benchmarks, I just wish they weren’t done in (again imo) such a sloppy manner.
Indeed, data sets and their characteristics do not matter; it’s not as if a computer program’s main function is to process data or anything.
For example, different schedulings can have vastly diverging performance behaviors, especially given the cache and load/store queue architectures of modern out-of-order multicore processors, and the characteristics of the data set are fundamental to properly understanding the behavior being observed and benchmarked. Similarly, it is fundamental to understand the type of data distribution, for example to tell whether the compiler is scheduling efficiently enough to keep the multiple functional units busy. Etc, etc.
So yes, understanding the data sets, as well as the instruction mixes, is fundamental to properly benchmarking different compilers and their performance.
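A canonical illustration (my own toy code, nothing from the article): the exact same binary can run several times faster on sorted input than on shuffled input, purely because of branch prediction. Without knowing what kind of data a benchmark feeds a program, a raw “seconds elapsed” number says very little about the compiler. (Caveat: at high optimization levels a compiler may replace the branch with a conditional move or vectorize the loop, which shrinks the effect; that is itself a compiler difference worth knowing about.)

/* branchy.c -- compile with e.g. gcc -std=c99 -O2 branchy.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    /* Pass 0: random data -- the branch below is unpredictable.
     * Pass 1: sorted data -- same instructions, far fewer mispredictions. */
    for (int pass = 0; pass < 2; pass++) {
        if (pass == 1)
            qsort(data, N, sizeof *data, cmp_int);

        clock_t t0 = clock();
        long sum = 0;
        for (int rep = 0; rep < 100; rep++)
            for (int i = 0; i < N; i++)
                if (data[i] >= 128)        /* data-dependent branch */
                    sum += data[i];
        printf("%s: %.2fs (sum=%ld)\n",
               pass ? "sorted" : "random",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
    }
    free(data);
    return 0;
}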
I don’t consider studies done with such huge omissions to be useful at all; a waste of time, if anything. Although I understand it is a relatively easy way of filling up 8+ pages of content with graphs.
What on earth are you talking about? They are testing compilers: that means feeding them some source code and inspecting the resulting machine code in some specific respect, and here they’ve chosen to look at either the speed of the compiled code or the time taken by each compiler to compile it.
What do you expect as “data sets”? The source code of each program compiled? Or, for instance, the data used as input to the compiled programs, like the video used for the x264 encoding, or the files used for the 7-zip compression?
Note that -O2 is used in GCC because it often (usually?) produces faster code than -O3. For some discussion on the topic, see Gentoo’s documentation page on optimization flags: http://www.gentoo.org/doc/en/gcc-optimization.xml . Basically, it sounds like the GCC optimization levels are separated by the amount of work the compiler has to do to optimize the code, and the extra work done by the -O3 optimizations tends to increase code size (and therefore hurt caching), so it often slows down programs.
That said, testing compilers at multiple optimization levels would likely be more informative about how good their optimizations actually are.
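If you want to see this for yourself, GCC can report exactly which passes each level enables, and the binutils ‘size’ tool shows the code-size effect (the file name hot.c below is made up; the commands are in the comment):

/* Illustrative commands:
 *
 *   gcc -Q --help=optimizers -O2 > o2.txt
 *   gcc -Q --help=optimizers -O3 > o3.txt
 *   diff o2.txt o3.txt               # lists the extra passes -O3 enables
 *
 *   gcc -O2 -c hot.c && size hot.o   # note the 'text' column
 *   gcc -O3 -c hot.c && size hot.o   # typically larger at -O3
 *
 * A loop like this is the classic case: -O3 may unroll and vectorize it,
 * which wins in a tight benchmark but costs code size, and that extra
 * size is what can hurt instruction-cache behavior elsewhere. */
void scale(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}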
Well, the fact that -O2 does beat -O3 sometimes is why I wrote *should*, but in my experience -O3 usually beats -O2 on both GCC and LLVM. Which is as it should be, since -O0 is no optimization, -O1 is slight optimization, -Os favours code size over speed, -O2 tries to strike a balance between code size and speed, and -O3 opts for maximum speed at the cost of code size.
The reason -O2 sometimes beats -O3 is most likely flawed heuristics in some of the more advanced optimizations enabled by -O3, resulting in cache misses, branch mispredictions, etc. Cache optimization is sensitive to the particulars of the CPU, so using ‘-march=native’ is a good choice if you want code to perform as well as possible on your own machine.
It’s interesting, though, that while I’ve found -O2 to beat -O3 on certain tests using GCC and LLVM, -O3 has always performed much better than -O2 when I’ve tried Open64. So in a -O2-only test between GCC, LLVM and Open64, Open64 would likely be at a disadvantage, which is why I think it’s apt to go for the option that is *meant* to generate the fastest code (-O3), OR to benchmark compilers across several optimization levels.
Also note that what was once faster with -O2 may not be faster with the next iteration of that compiler, given that heuristics improve (sadly, they also sometimes regress). This is a very difficult part of compiler technology, which is why optimizations such as PGO (profile-guided optimization) are so effective. It is also why programs like the Linux kernel make use of C extensions like __builtin_expect and __builtin_prefetch to guide the compiler when optimizing for branch prediction and cache prefetching.
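For those who haven’t seen these hints, here’s roughly what they look like (the likely/unlikely macros are how the kernel wraps __builtin_expect; the surrounding function is just a made-up example of mine):

#include <stddef.h>

/* How the Linux kernel wraps __builtin_expect: */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

long sum_nonnull(long **items, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Start pulling a later element into cache while we work on the
         * current one: second arg 0 = read, third arg 1 = low temporal
         * locality. */
        if (i + 8 < n)
            __builtin_prefetch(items[i + 8], 0, 1);

        /* Tell the compiler the NULL case is rare, so the common path is
         * laid out as straight-line fall-through code. */
        if (unlikely(items[i] == NULL))
            continue;
        sum += *items[i];
    }
    return sum;
}

With PGO the compiler gets the same kind of information from actual profiling runs instead of from the programmer’s guesses.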