After years of delivering faster and faster chips that can easily boost the performance of most desktop software, Intel says the free ride is over. Already, chipmakers like Intel and AMD are delivering processors that have multiple brains, or cores, rather than single brains that run ever faster. The challenge is that most of today’s software isn’t built to handle that kind of advance. “The software has to also start following Moore’s law,” Intel fellow Shekhar Borkar said, referring to the notion that chips offer roughly double the performance every 18 months to two years. “Software has to double the amount of parallelism that it can support every two years.”
If CPU companies want their multicore CPUs to be better supported by software, they must start by making that as easy as possible. Maybe rivals like AMD and Intel will have to join forces on this.
What about stronger GCC support? That could be a great start. I hope AMD collaborates more in open source, like Intel does now.
Please give a big welcome to the world of altivec!
R&B
Stronger GCC support?
You can pretty much forget about that unless you’re thinking of patch releases from Intel/AMD that will never appear in the core release, because of the GCC policy of not accepting platform-specific stuff.
Intel have their compiler and performance libraries already and AMD have only performance libraries. That’s good enough for now really.
You can pretty much forget about that unless you’re thinking of patch releases from Intel/AMD that will never appear in the core release, because of the GCC policy of not accepting platform-specific stuff.
Could you expand on this? Every back-end to GCC is platform-specific. GCC vector extensions are heavily dependent on the target architecture. Heck, the entire -march=foo flag is used to enable platform-specific stuff. In what way is the GCC team unwilling to accept things that are platform-specific?
The compiler can’t do a lot for you to help with multicore support. Generally speaking, you need to change the fundamental design of the software to do that.
About all I could really see Intel / AMD being able to do would be to provide some nice multithreaded matrix libraries for scientific use. That’s one of the few things SGI has going for them to retain customers. They did a great job of it. They have libraries you can just pass a standard matrix to, and they will use as many parallel threads as you want to solve it.
Outside of that, there isn’t much that’s generic enough that Intel / AMD could do a lot for people.
GCC 4.2.0 (just out) is the first release to support OpenMP, something you want to learn about if you’re interested in writing parallel programs.
http://www.openmp.org/
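For anyone who wants to see what that looks like in practice, here is a minimal sketch (the array names and sizes are invented for illustration). With GCC 4.2 you build it with -fopenmp; without that flag the pragma is simply ignored and the loop stays serial, which is part of OpenMP’s appeal:

    // Build with: g++ -fopenmp saxpy.cpp   (GCC 4.2 or later)
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;                  // made-up problem size
        std::vector<double> x(n, 1.0), y(n, 2.0);
        const double a = 3.0;

        // Ask the compiler/runtime to split the iterations across however
        // many threads it decides to use (one per core by default).
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %f\n", y[0]);
        return 0;
    }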
I agree with the first post. The architecture of current mainstream processors is not conducive to parallelism or efficient software. Meanwhile Compaq/HP killed Alpha and Apple switched from PPC. Agh. Frustrating.
Neither PPC nor Alpha are any more conducive to thread-level parallelism than x86. Indeed, in many respects (its sane memory ordering model), x86 is much more conducive to shared-memory parallel software.
Old McDonald had multi-threaded PPC code, eieio.
It stands for Enforce In-order Execution of I/O.
Maintaining the view that loads/stores execute in-order makes writing multithreaded code easier, not harder. You don’t have to clutter your code with fence instructions in x86 as you do on some other architectures.
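To make that concrete, here is a sketch of the classic publish/consume pattern, written with the portable atomics that the standardization effort eventually settled on (so treat the exact spelling as illustrative). On x86 the hardware ordering gives you most of this for free; on weaker memory models the release/acquire pair is what actually emits the barrier instructions:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready(false);

    void producer() {
        payload = 42;                                   // ordinary store
        ready.store(true, std::memory_order_release);   // publish: stores above may not sink below this
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // nothing below may hoist above this
            ;                                           // spin; fine for a toy example
        assert(payload == 42);                          // guaranteed visible after the acquire
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }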
Oh, I agree that x86 is right and PPC is wrong in this respect. While Core 2 does speculative reordering of loads ahead of stores, it maintains the semantic intent of the instruction stream it receives, paying a penalty of a pipeline flush if it mis-guesses the uniqueness of memory addresses. The programmer shouldn’t have to police the processor’s reorder unit wherever the instruction order is critical to the correct operation of the program. The processor should always assume that the instructions are ordered in a given way for a good reason. If it can prove that an alternative ordering is semantically identical, then optimize away. But if it can’t, then it must assume that order matters.
One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance. It’s interesting because the Core 2 is a wide, aggressively OOO core topping out at 2.93 GHz, whereas POWER6 is a relatively narrow, mostly in-order core at a blistering 4.7 GHz. This says a lot about the trade-off between issue-width, instruction reordering, and clock frequency. One would assume that the POWER6 team hit their transistor budget and frequency target by thinning and simplifying the pipelines. Not only did this help them reach their performance/watt target for multi-threaded server workloads, but they also won the one-to-one pipeline comparison.
Armed with new evidence that clock frequency is inescapably central to single-threaded performance, above all other forms of architectural enhancements, and combined with the brutal relationship between frequency, voltage, and power, we find ourselves in a bit of a pickle. We know how to increase performance/watt on the server, where multi-threading is a natural extension of the workload. But how do we improve performance/watt in client workloads characterized by I/O-bound, latency-sensitive applications that are difficult to parallelize?
Actually, I think that the POWER6 team might have stumbled onto part of the solution. Put the execution units on a strict diet. Keep the load/store reordering, issue everything else in-order. Offset frequency increases by cutting transistors and aggressively pursuing process shrinks to allow lower voltages. Power is linear with frequency and transistor count, while it’s quadratic with voltage. Invest the extra die space in caches, which help immensely in reducing noticeable latencies on client systems, while being fairly easy to optimize for power efficiency. Processors spend huge amounts of time waiting on loads. Transistors are much better spent implementing caches than implementing wide and aggressive pipelines.
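For reference, the back-of-the-envelope relation behind that claim is the usual dynamic-power approximation (leakage ignored):

    P_dynamic ≈ α · C · V² · f

where α is the activity factor, C the switched capacitance (which roughly tracks transistor count), V the supply voltage and f the clock frequency. Cut C and V and you can buy back some f inside the same power envelope, which is exactly the trade-off described above.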
I think this Intel fellow is way off the mark, quite frankly. Each x86 vendor has started with an architecture from opposing ends of the spectrum and worked toward the center. Intel started with a mobile design and adapted it to multi-core, wide-issue design. AMD started with a server architecture and adapted it to power efficiency. In the center lies this low-volume, high-margin, prestige market segment populated by gamers, enthusiasts, and multimedia professionals. This is the least power-conscious segment of the entire market. While Intel remains better positioned to provide strong performance/watt in the high-volume mobile sector, they have clearly made compromises in their design in order to win performance benchmarks and the associated enthusiast mindshare.
So look no further than your own choices, Intel. Software has its own problems to deal with without having to worry about thread-parallelism that may not be realistic in many cases. Software is unreliable, insecure, expensive to maintain, and threatened by a broken intellectual property system in some nations. There’s way more complexity to software development than winning benchmark contests, and in this day and age, that holds true for CPU vendors as well.
Oh, I agree that x86 is right and PPC is wrong in this respect.
I see that. I didn’t get the eieio joke…
One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance.
I’m not convinced that this is really the whole of it. If you look at the SPECint_base for both chips, they’re roughly the same (Power6 – 17.8, Core 2 17.5). Apparently the Power6 gets a HUGE boost from profile-guided optimization. I can see why profile-guided optimization would provide such a boost in performance for an in-order chip, but I’m not convinced you’re going to see that in real-world code. SPEC is something of an optimal case for PGO, since you basically run the same data set that the profile info was collected for.
That said, I definitely think you’re right that the Power6 folks are on to something. Achieving even performance-parity on integer code with the Core 2 is no mean feat. Whatever they’re doing in the memory pipeline is clearly a promising technique, and likely to get better as it becomes more mature. There are still a lot of unknowns with regard to Power6 (how do you schedule for it? can the chip cover cache latency or do you have to take that into account? how sensitive is it to scheduling?) but it’s clearly something to keep a watch on.
I was going after their push to multi-core when there are plenty of gains to be had on single-core systems. There’s an undertone that all they have left is adding cores.
There is a limit to how much one can squeeze out of a single processor before the actual returns start to diminish to almost nothing. Sure, Intel could push the clock upwards again, but the issue of diminishing returns would come back to haunt them.
What there needs to be is a greater emphasis at university on teaching youngsters how to code properly the first time; spending time teaching people how to actually design their application before they write it, rather than simply throwing code at a problem and hoping that it actually works correctly.
Anything is possible once good discipline enters the equation; the problem is, far too many programmers don’t have it. Code it now and let some sucker sort out the issues in 20 years time.
“Neither PPC nor Alpha are any more conducive to thread-level parallelism than x86.”
There were other great architectures which had parallelism in hardware, i. e. real CPUs, not just CPU cores. These systems had a better “throughput”, but this was many years ago.
– Intel Pentium 4 @ 1700 MHz, 575 int, 593 fp, 0.687 per MHz
– AMD Athlon @ 1333 MHz, 482 int, 414 fp, 0.672 per MHz
– DEC Alpha 21264A @ 833 MHz, 518 int, 590 fp, 1.330 per MHz
– HP PA 8700 @ 750 MHz, 569 int, 526 fp, 1.460 per MHz
– MIPS R14000 @ 500 MHz, 410 int, 436 fp, 1.692 per MHz
(SPEC 2000 INT / FP BASE values)
Of course, this professional equipment was not designed to take any market share in the home computing and entertainment market.
None of the architectures you mentioned have anything special with regards to thread-level parallelism. Some may or may not be more conducive to instruction-level parallelism than x86, but that’s a different bag of cats.
“None of the architectures you mentioned have anything special with regards to thread-level parallelism.”
You’re right. Sun’s and MIPS’s processors, especially, did parallelism in reality, i.e. using several processors with their own channels, their own I/O, even their own memory.
“Some may or may not be more conducive to instruction-level parallelism than x86, but that’s a different bag of cats.”
They’re quite old; they come from a time when threaded parallelism seemed to be of no interest, and maybe that’s why they don’t compete very well, even with their much better throughput. But that throughput matters for iteration speed, where they beat Intel’s and AMD’s x86 products. As has been mentioned before, threading and parallelism isn’t applicable everywhere.
They’re quite old; they come from a time when threaded parallelism seemed to be of no interest, and maybe that’s why they don’t compete very well, even with their much better throughput. But that throughput matters for iteration speed, where they beat Intel’s and AMD’s x86 products. As has been mentioned before, threading and parallelism isn’t applicable everywhere.
They do come from such a time, but the lack of interest stems from that time coming after a period in which there had been a lot of interest.
Everything mentioned in this topic, and a lot more, was tried in ECL long ago, and modern architectures are pretty much a weeding-out of all those ideas.
All that’s happening in BiCMOS is that more transistors are available so the edges of the spaces are being explored a little farther. Still going to hit the same boundaries.
Physics pretty much determines the limits.
I don’t agree with your topic. The “80’s” architecture has been rehashed a million times over and I think it’s been clearly demonstrated that the x86 we have today is not the same x86 that was in the 80s.
“[…] I think it’s been clearly demonstrated that the x86 we have today is not the same x86 that was in the 80s.”
Not the same, sure, but it’s a successor of the original architecture whose inner parts you can still find. Otherwise, backwards compatibility would not exist the way it does.
Erm, what about the A20, does it still exist? C:/WINA20.386, anyone?
http://en.wikipedia.org/wiki/A20_line
http://www.win.tue.nl/~aeb/linux/kbd/A20.html
this?
hell, reading that last one, one really starts to wonder what kind of house of cards modern X86 based computers are, even before the os comes into play…
“this?”
No, the A20 is a highway in Mecklenburg-Vorpommern in northern Germany.
Yes, of course, this is what I meant.
As far as I remember, the strange story was this: the 8086 could address at most 1 MB of RAM running in real mode. For addressing purposes, 16-bit registers were used together with the segment:offset scheme, ranging from 0000:0000 up to ffff:ffff. If you look at that range, you’ll notice it covers slightly more than 1 MB. Physical 1 MB RAM ends at ffff:000f, so if you address ffff:0010, the processor “wraps around” and starts at 0000:0000 again. Then the 80286 appeared. Because it needed to be 100% compatible, it emulated that behavior exactly, even if 16 MB of RAM (its maximum addressable amount) was installed. To address more than 1 MB of RAM, the wrap-around had to be switched off, and that is what the A20 line did.
“hell, reading that last one, one really starts to wonder what kind of house of cards modern X86 based computers are, even before the os comes into play… “
Thank you for the links, very impressive… and funny… from a non-x86 point of view.
A lot of the really old things (like A20) are emulated in the chipset these days. On an EFI system (Apple’s machines), the old BIOS-related cruft isn’t even there anymore. AMD64 gets rid of a lot of the weirder aspects of x86 (segmentation, TSS, ASCII instructions). Call gates and that crap have been deprecated in favor of sysenter/sysexit. No modern OS uses them, though they exist in micro-coded form for compatibility purposes.
x87 is unfortunately still there (though basically deprecated in AMD64). Of the instructions that matter, MUL, DIV, and IDIV still have an implicit accumulator (though oddly enough, IMUL has a 2-operand form). Lots of instructions still have accumulator forms, though you can mostly ignore them. These are really the bits of fundamental cruft that you can’t just emulate or otherwise bury in microcode.
Leaving it to developers to restructure everything to take better advantage of all those cores is doable, but it’s often not truly feasible, or at least not economically sound (premature optimization being the root of all evil, and what could be more evil, in many cases, than taking something you understand only serially and trying to make it massively SMP-friendly?). Beyond that, there’s the problem of the data: even if you somehow have many tasks to run on your desktop system, investigate carefully and you’ll find that your CPUs (however many cores each has) are idle a very large percentage of the time, and NOT always because there’s nothing scheduled for them to work on: they’re waiting for data, or for some other event.
Even if you write your applications to use as many threads, and therefore as many cores, as a developer can conceive how to employ (lock-free/wait-free techniques can help a lot with that: look them up on Wikipedia), if there’s not enough I/O bandwidth, all those cores in all those CPUs will wait at the same speed as a slower or faster single core, and the system will be utilized with low efficiency.
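(As an aside, here is roughly what the lock-free flavour looks like in practice: a shared counter bumped with a compare-and-swap retry loop instead of a mutex. A toy sketch, written with the portable atomics that were later standardized, so the exact spelling is illustrative rather than gospel.)

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<long> hits(0);   // shared state, no mutex anywhere

    // Lock-free increment: read, compute, try to publish, retry if another
    // thread got there first. No thread ever blocks holding a lock, so a
    // stalled thread cannot stall the others.
    void record_hit() {
        long old = hits.load(std::memory_order_relaxed);
        while (!hits.compare_exchange_weak(old, old + 1,
                                           std::memory_order_relaxed))
            ;   // on failure, 'old' is refreshed with the current value
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([] {
                for (int j = 0; j < 100000; ++j) record_hit();
            });
        for (auto& t : workers) t.join();
        std::printf("%ld\n", hits.load());   // prints 400000
    }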
Now, back to the economic side of things: the reality is that it can be very hard to create a correct serial application that does what was intended, within any given budget of manpower and time. Making things parallel, in practice, requires more up-front planning of the system’s structure and exact requirements than a serial system does. The nature of software development is often that the more detail you put into a plan before you start, the more you end up throwing away, because real-life requirements are rarely well understood at the beginning of a project, are often quite wrong, and are overridden by whatever is needed NOW. Depending on the application, a certain complex structure may need to exist before it can be made more parallel: building that can add considerable time and expense, and in some cases the task- and data-dependency graph isn’t understood well enough to do it ahead of time. If you guess wrong, you’ll need to do some major rewriting, because the result may be something that deadlocks, or has race conditions, or other problems of that sort, all too easily.
Next, someone may say, “Get more people working on something!” and that works, somewhat, up to a point: if you want a great example where that may not work as fast as needed for the amount of manpower, look at Windows Vista: when a system has that many things that depend on each other, even the most careful coders are going to have a heck of a time synchronizing with each other and where they’re at, and there will be a lot of communication overhead.
The problem with Windows Vista has nothing to do with a lack of, or too much, manpower. The issue with crap software always goes right back to the source: crap system analysis and design. Instead of spending 6-12 months generating piles of documentation, specifications and so forth, you have programmers working in isolation, throwing code at problems and hoping that something will stick. That is not a viable long-term solution, especially when projects start to become large and complex.
Windows has become a mess because they simply didn’t follow basic programming design, something that a programmer will learn in the first year of IT: you don’t hit the keyboard until a tonne of paperwork is done, and even then, it isn’t just a matter of jumping in and coding.
But this goes beyond Microsoft; look how many open-source projects (or parts of them) need to be rewritten every 6-12 months because of inadequate analysis and design beforehand: a lack of definitive milestones, schedules, and the foresight to allow for expansion without breaking compatibility.
Back to Windows: the fact is, with poor methodology, coupled with bad resource management by Microsoft managers, coupled with the lack of a road map and the constant tangle of spaghetti code that ends up causing all sorts of problems (change something somewhere, and it breaks something somewhere else), is it any surprise that this occurs?
Microsoft isn’t the only company guilty of this; heck, look at the likes of Adobe, for example, who have multiple code bases for products that do pretty much the same thing at the most basic level. Why not have a ‘core’ backend and simply build interfaces and tools upon that core for each product required: a DTP front end for InDesign, a graphics front end for Photoshop? When enhancements are made to the core, all of the products inherit the enhancements.
A problem with software is that the operating system presents a barrier to progress. It’s one thing for the software houses to improve parallelism, but if Microsoft, Apple & Linux &c. are not pulling their weight first, progress will be stalled. There need to be the right tools and APIs to handle parallelism.
Generally I think this is beginning to be tackled. On the Apple front I’m aware of NSProcess and the new multi-threaded openGL, I don’t know how the other OSes are doing things at the moment, but I’m certain it’s all there.
This means that speed increases are not going to be as ‘free’ as before, and OS releases may have to be more frequent to utilize new hardware appropriately. For users of proprietary OSes, this is only going to mean a rising cost in the future.
On the other hand it might buck up the ideas of some vendors about their shoddy software practices. (*cough*Adobe*cough*Microsoft)
Operating systems already have the proper APIs, and their internals are already multiprocessing-efficient.
Operating systems have no business in how userspace wastes its time. And it’s userspace where the main problem is: there’s no known way to divide that time between several processors. There’s nothing operating systems can do to fix this. You have multithreading, but that’s not the full solution.
if Microsoft, Apple & Linux &c. are not pulling their weight first, progress will be stalled
I once read something called Gates’ Law: Processors double in capacity every twenty-four months, but programs halve in speed every eighteen.
Don’t know if it’s really Bill Gates’s law, but it’s an unfortunate truism.
I don’t know how an OS would deduce and enforce shared memory dependencies between threads in a process. The application developer has to explicitly declare how access to data should be serialized and/or synchronized. It could be as easy as declaring that objects of a certain class can only be operated upon by one thread at a time. Or that a thread cannot pass a certain point until all other threads have reached this point the same number of times.
There’s only a handful of commonly used schemes for implementing good multi-threaded code. Some languages have primitives built-in. Others don’t. But these concepts are never hard to implement, for example, by embedding a lock in a struct or class. More often, the problem is that the application developers simply don’t understand their code enough to know what code needs to be a critical section or what data needs to be protected by a lock. The OS can’t help here.
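To illustrate the embed-a-lock-in-the-class idea, here is a minimal sketch (the class and method names are invented, and it uses the portable mutex types that standardization later provided; pthreads would look much the same):

    #include <mutex>
    #include <string>
    #include <vector>

    // All mutation goes through methods that take the object's own lock, so
    // callers never juggle the mutex themselves and the "one thread at a time
    // per object" rule is enforced in exactly one place.
    class JobQueue {
    public:
        void push(const std::string& job) {
            std::lock_guard<std::mutex> hold(lock_);   // critical section begins
            jobs_.push_back(job);
        }                                              // ...and ends here, even on exceptions

        bool pop(std::string& out) {
            std::lock_guard<std::mutex> hold(lock_);
            if (jobs_.empty()) return false;
            out = jobs_.back();
            jobs_.pop_back();
            return true;
        }

    private:
        std::mutex lock_;                // the lock lives inside the data it protects
        std::vector<std::string> jobs_;
    };

    int main() {
        JobQueue q;
        q.push("parse");
        std::string job;
        return q.pop(job) ? 0 : 1;
    }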
The OS also can’t help application developers figure out what parts of their application could be parallelized. Most desktop applications just don’t have much potential for thread parallelism. Consider a word processor. Most of its time, it’s waiting for user input. The most common user complaint regarding the performance of word processors is that they take too long to start up. Well, application startup and initialization is quite I/O-bound and usually very serial in nature.
I think what people often miss about the direction of client system architecture is that multi-core matters as much for graphics and multimedia processing as much as it does for general-purpose computation and logic. In just a few years, each of these kinds of execution units will begin to coexist as modular cores sharing a common bus architecture. Eventually, they could share decoders, dispatchers, and L3 cache. A typical client probably won’t have more than 4-8 general-purpose cores, and many will have only 2. But a high-end client might have 32 graphics cores or more. Graphics and multimedia code has high potential for parallelization, whereas general-purpose code is highly serial on the client.
The server is a whole different story. Servers are evolving to have more and more general-purpose cores, and there’s no end in sight. Their general-purpose workloads are highly parallel. Serving many requests concurrently is the classical strength of multi-threading, and most server applications are quite good at it. The integration of vector and general-purpose cores is big in the HPC market. Server applications run great on big multi-socket systems.
As long as concurrency is not a part of most languages by design (except Ada, Java and a very few others), just using whatever the OS provides isn’t enough, never will be, and won’t be portable to other OSes.
We had this figured out 20 years ago with Occam, based on sound mathematical principles and CSP (Communicating Sequential Processes), which models processes in a hardware way of thinking. While Occam was pretty weak by today’s standards in data structures and objects, it was designed for a world of programming where many threads or processes ran on one processor (the Transputer) or on many processors, with almost no change to the code. Most of the applications tended to be DSP crunching, where the math could easily be distributed and synchronized.
Such languages are actually very easy to use to construct large scale concurrent applications, if the opportunity is there in the algorithm, it is usually easy to express it in CSP based languages. Not surprisingly, Hardware Description Languages are close cousins of Occam but usually lack data structures.
I would like to see concurrency in software languages as a first-class feature, as it is in HDLs; that would probably lead to a common language for both hardware and software design, although not all features would be used on both sides.
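For what it’s worth, here is a very rough sketch of the channel idea expressed in a mainstream language: a toy one-slot channel built from a mutex and a condition variable. Occam gives you this as a language primitive; here it has to be built by hand, which is rather the point:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    // A CSP-style channel holding at most one value: the sender blocks until
    // the previous value has been taken, the receiver blocks until a value
    // is present.
    template <typename T>
    class Channel {
    public:
        void send(const T& value) {
            std::unique_lock<std::mutex> hold(m_);
            cv_.wait(hold, [this] { return !full_; });
            slot_ = value;
            full_ = true;
            cv_.notify_all();
        }
        T receive() {
            std::unique_lock<std::mutex> hold(m_);
            cv_.wait(hold, [this] { return full_; });
            full_ = false;
            cv_.notify_all();
            return slot_;   // copied while the lock is still held
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        T slot_;
        bool full_ = false;
    };

    int main() {
        Channel<int> ch;
        std::thread producer([&] { for (int i = 0; i < 5; ++i) ch.send(i); });
        std::thread consumer([&] { for (int i = 0; i < 5; ++i) std::printf("%d\n", ch.receive()); });
        producer.join();
        consumer.join();
    }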
As if people didn’t know this. This is not new. Intel also said in the past that software and compilers needed to evolve to use all the power of Itanium, and look how that turned out.
Now let’s get serious: there’s NO known way to achieve what Intel is asking for. Do they think that just saying it will make people do it? The thing is, software people WISH they could do it, but they can’t. Does Intel know how to make multiprocessing easy enough? Software people would be REALLY happy to hear it. But since Intel hasn’t said anything about it, it looks like they don’t know either.
Which is weird, because Intel and AMD would not risk their existence building hardware that is useless because its power can’t be used, would they? I mean, they wouldn’t go multi-core without knowing first how to make software use those cores, would they? I wouldn’t put my money in a company like that.
Let me quote Robert O’Callahan from Mozilla: http://weblogs.mozillazine.org/roc/archives/2007/05/status_2.html
“There was a great deal of discussion of parallel programming, now that we have entered the multicore world. More than one person opined that multicore is a mistake we will live to regret — “we don’t know what the right way is to use the transistors, but multicore is definitely wrong”. There was general agreement that the state of parallel programming models, languages and tools remains pathetic for general-purpose single-user programs and no breakthrough should be expected. My position is that for regular desktop software to scale to 32 cores by 2011 (as roadmaps predict) we’d have to rewrite everything above the kernel, starting today, using some parallel programming model that doesn’t suck. Since that model doesn’t exist, it’s already too late. Probably we will scale out to a handful of cores, with some opportunistic task or data parallelism, and then hit Amdahl’s law, hard.”
This is no different from when languages started migrating from procedural to object oriented.
Lots of programmers (ones I worked with even) were unable to properly transition their thinking to OO and ended up writing terrible software.
I think this move to multicore is going to have a similar effect. There will have to be a general paradigm shift, although I don’t know if it will have the same impact the OO shift had.
If anything, I’m very glad multi-core machines are at desktop pricing levels. In March, the very small company I work for picked up a Dell 8-core Clovertown for $2500. It’s been great to have. It even showed how very *badly* old threaded code from 5 years ago can behave: it can actually dramatically lower performance when applied to multi-core systems.
These days I like to talk about “instruction level” vs. “task level” parallelism. Compilers and libraries are critical for the first, while engineering is needed for the second.
Are we forgetting that not all tasks can be run in parallel?
If they can’t make the CPU faster, they should just admit it and not hide behind a “solution” which can only be applied to so much software.
That doesn’t necessarily matter.
A lot of serial code consists of blocks { A, B, C, D } etc. run in serial fashion, but each of those blocks may be self-contained, doing whatever. Such code blocks can often be run in any order, so instead we should say par { A, B, C, D } and let A, B, C, D each complete before continuing after the group.
The CSP languages assume that much of serialization is artificial and allows the cpu or OS process scheduler to run them in any order until all complete. The various branches can communicate through channels or messages to control the flow of data between them.
If we really, really want A, B, C, D to run in sequence, then we nest them in a seq { ... } block. Seq is worth using when the cost of the operations is too small to justify the overhead of the par statement, but right now all we have is seq and whatever the OS provides.
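Roughly what that could look like with the tools we do have today, using futures to fake the par block (a sketch only; A, B, C and D stand in for whatever self-contained work you have):

    #include <cstdio>
    #include <future>

    // Four self-contained blocks; stand-ins for real work.
    void A() { std::puts("A done"); }
    void B() { std::puts("B done"); }
    void C() { std::puts("C done"); }
    void D() { std::puts("D done"); }

    // par { A; B; C; D; } -- launch all four, then wait for the slowest
    // before continuing past the group.
    void par_group() {
        auto fa = std::async(std::launch::async, A);
        auto fb = std::async(std::launch::async, B);
        auto fc = std::async(std::launch::async, C);
        auto fd = std::async(std::launch::async, D);
        fa.get(); fb.get(); fc.get(); fd.get();
    }

    // seq { A; B; C; D; } -- what we write by default today.
    void seq_group() { A(); B(); C(); D(); }

    int main() { par_group(); seq_group(); }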
And that’s my point. How much of that is actually useful in turning into parallel? Or efficient?
I think there’s not much software that can really benefit from threading, and the biggest benefit of running multi-core is just being able to give each software package its own cpu core to run on.
A lot of serial code consists of blocks { A, B, C, D } etc. run in serial fashion, but each of those blocks may be self-contained, doing whatever. Such code blocks can often be run in any order, so instead we should say par { A, B, C, D } and let A, B, C, D each complete before continuing after the group.
You’re still waiting for the slowest to complete though, and any performance gain you’re going to get is going to be extremely negligible for the effort you’re putting in.
The various branches can communicate through channels or messages to control the flow of data between them.
Again, it’s added complexity for no appreciable gain.
In the mid-90s, Monica Lam, one of Hennessy’s students, and others at Stanford pretty conclusively demonstrated that the sort of ‘accidental’ parallelism that you speak of is very rare in application programs.
If it were more common, then Itanic would do a much better job, since Josh Fisher, et al, did a good job on the original fine-grained parallel compiler for it at HP. That was, after all, the whole point of VLIW machines.
indeed, some tasks can’t run in parallel. But you generally aren’t running only one program at a time.
And here I was, hoping that we’d see software perform faster with less resources as compared to the systems they run on. You know, like Mac OS X and KDE and Gnome, only moreso.
“And here I was, hoping that we’d see software perform faster with less resources as compared to the systems they run on. You know, like Mac OS X and KDE and Gnome, only moreso.”
People are doing nearly the same things on current CPUs as they did in the 80s: treating their computer as a better typewriter. The quotient SPEED = HARDWARE / SOFTWARE seems to stay the same because HARDWARE++ offerings are cancelled out by SOFTWARE++ requirements…
Just to mention: KDE 4 runs as fast on recent modern hardware as Geoworks did on a 386.
but you have to admit, KDE4 looks oh so much better
“but you have to admit, KDE4 looks oh so much better “
In fact, it does, but using its configuration utility it can be made to look much worse…
So if I have this right, the processor companies have run out of things to sell us and parallelism is the new Moore’s Law? Brilliant.
Well, the processor companies are going to have to seriously help developers out here, because it’s difficult enough to create serial applications, on time, on budget and to requirements, let alone a parallel one. Most serial applications work very well, thank you very much.
Creating multi-threaded applications is a big undertaking, basically because it’s a userspace problem. There’s no universal way to turn a serial application into a parallel one, or to just arbitrarily create a parallel one. Logically, there might be certain things in the application that can genuinely run in parallel, while other things simply cannot because they depend on other threads, and you will simply have to wait for the slowest one to finish, rendering the whole exercise useless. There may also be certain things that you genuinely believe you can run in parallel when it turns out you can’t, for some other reason, and very bad things happen. This opens a whole can of worms on the bugs front.
The only benefit to multiple cores is in the area of processes, and being able to run each process, and application, on one core.
Yeah, it had its problems, but the basic concept was good, and having the OS multiprocessor-aware and multithreaded from the get-go was great. I’d love to have seen it run on the newer multicore chips at the clock speeds available now; it would have seemed like it was responding to you before you even knew what you were asking it.
This really doesn’t surprise me in the least.
Today there is too much of an “add more RAM and a larger CPU” approach to getting more performance out of software, rather than actually trying to write efficient, lean code.
If Intel were to freeze the progression on processing power, then maybe, just maybe we’ll start to write better code and perhaps get a performance gain that is far beyond what we ever expected possible through software alone.
Personally, I think writing fat, inefficient code is just being lazy, from both a technical perspective and a business perspective.
This is a really poor argument. There is a trade-off between stability/robustness, performance, speed of development, ease of maintenance, and features. More of one necessarily means less of another.
Some people would like to think that if we increased the relative importance of performance, by holding CPU performance to a specific level, we’d get software that was as functional as today’s software, only faster, but that’s a fantasy. We’d just get buggier, less-complete software that took longer to develop and progressed more slowly. Some people look back fondly on the early 1990s, to software they perceived as “lean” and “focused”. But by and large, most of it was crap compared to the stuff we have today. I mean, find any serious user of Aperture and tell them they’d be perfectly happy with Paint Shop Pro 1.0!
For most purposes, “efficient, lean” code is not very useful. The goal of any software is to meet target requirements (performance and features) while controlling costs. If fast processors mean that you can write good code*, then why not take advantage of that? It makes a lot more sense to buy more (cheap) processors than to hire more (expensive) programmers, or to waste your users’ time on software that isn’t as robust or complete as they need.
*) By and large, highly-optimized code is poor code. Well-abstracted, elegant code is usually slower, but it’s much, much easier to maintain in the long run.
I still disagree and for the record I don’t believe it’s a poor argument. To my mind, justifying sloppy, unprofessional and inefficient code the way you have is a horrible example of why we have such bloated software these days.
Backward compatibility has a place here too I believe, but to say that lean and efficient code is not very useful is narrow minded, short sighted and more a means to justify one not being up to the task.
To my mind, it’s wiser to write smaller, fast & efficient reusable modules.
I’m not justifying sloppy, unprofessional, inefficient code. I’m justifying clean, elegant, maintainable, inefficient code.
You’re lumping in all the attributes of good code (small, reusable, professional), into the “efficient” box, and lumping in all the attributes of poor code into the “inefficient” box. But that’s a highly illogical claim. Less-optimized code is at least as small, simple, and reusable as optimized code, and usually much more so. An unoptimized program must only be concerned with solving the problem at hand. An optimized program must both solve the problem and run quickly, creating an additional constraint on the solution. The solution set under the unoptimized condition is a superset of the solution set under the optimized condition. Therefore, the former set contains any “small, elegant, reusable” solutions that the latter set does, and very likely many “smaller, more elegant, and more reusable” solutions that the latter doesn’t.
More concretely, “optimization” is the process of taking advantage of special cases in more general algorithms, thus realizing a performance gain. Recognizing and utilizing these special cases almost always makes the code larger, and inherently makes the code less reusable (by making it rely on additional assumptions). It also almost always makes the code less elegant, by cluttering up the logic of the algorithm with said special cases. There is no getting around it: it’s just simpler when code only deals with solving the problem, rather than solving the problem and doing it quickly.
As for the “bloated” software of today; go use NT 4.0 for awhile and tell me how awesome software was back in the day. Yes, it got things done quite nicely in 64MB of RAM. But it’s a relic best left in the past!
Some good points and clear reasoning, still a little narrow minded IMHO.
(1) It seems flanque talks about target code while rayiner means source code. Good “optimizations” may or may not make the source larger; however, they very often make the target code smaller (and therefore faster).
(2) A significant performance gain may certainly be worth the cost of some added development time and, perhaps, a slight reduction in generality, especially when this can be isolated to a module in a well-structured system (OO or not). As we are all forced to balance simplicity against efficiency (goals that are not even always conflicting, if we invest some thought), why not draw the line a little nearer to efficiency when possible, and in situations where the gain is great? Often only small parts of a given application demand real effort and/or give large rewards. Ignoring such opportunities is, IMHO, quite unprofessional (despite being considered “cool” in some circles).
(3) What about if hardware designers had the same attitude towards their work?
(4) More fine grained linking (static and/or run-time) could probably solve some of the problems concerning so called “bloated” code, though this would often need additional support from the languages used (many of which are very “spartan” today).
I still disagree and for the record I don’t believe it’s a poor argument. To my mind, justifying sloppy, unprofessional and inefficient code the way you have is a horrible example of why we have such bloated software these days.
I disagree. Code has to be made reusable in order to reduce code complexity while increasing features. In order to do that you have to refactor a lot of code into reusable libraries. When you do that, it takes a lot more resources to use those libraries than it does to use smaller, optimized routines within the program itself. It is inevitable. If we want software to progress, we’re going to have to deal with this reality. That isn’t to say that there isn’t bloated software out there, but I hate it when people lump everything useful into the bloated category while suggesting that fast, featureless code is much better just because it is fast. To me, bloated usually means “backward compatible”.
When you do that, it takes a lot more resources to use those libraries than it does to use smaller, optimized routines within the program itself. It is inevitable.
I don’t think it’s some law of nature; it’s more a matter of badly designed languages and linkers in combination with somewhat ignorant programmers. If you disagree, please explain why it’s inevitable that it takes a lot more resources (preferably at the machine level).
I don’t think it’s some law of nature; it’s more a matter of badly designed languages and linkers in combination with somewhat ignorant programmers. If you disagree, please explain why it’s inevitable that it takes a lot more resources (preferably at the machine level).
The reason is that those libraries need to have generalized solutions that can be used by many different applications. They have to assume as little as possible and handle every possible edge case, no matter how unlikely it is to happen in the real world.
As a completely unreasonable and made up example, take this. I need to store a list of numbers, all of which are going to be a power of 2. It’s unlikely I’m going to use a data structure designed specifically to only hold powers of 2, instead I’m going to use one that keeps generic integers of any type. If I wrote something particularly for my program, it could be a lot more efficient because it knows that all the numbers have something in common. Nevertheless, I’m probably not going to bother since the more generic solution works well enough, won’t break and force me to recode if I ever change my invariant, and I’m not likely to find a specific solution to my problem already done for me. It would take a lot more time and effort, for little gain.
Another example: collections. Designing collections APIs always involves a trade-off between performance and reusability/generality. Say you need to search a tree for the largest value. The fastest interface would be to have a function find_largest_value that operated directly on the internal node data structure of the tree. But in practice, you don’t want to expose the guts of the implementation like that. In the pursuit of good, reusable code, you create a general iteration or enumeration API, and implement find_largest_value in terms of that.
In very specific cases, smart compilers can optimize away the overhead of the interface. For example, if some code uses iterators to sum the elements of a vector, the compiler may turn it into a loop that operates on the raw vector. In the same situation, a Lisp compiler may turn MAP over a vector into the same loop. But compilers are not smart enough to optimize away the overhead of the interface in more general cases. To go back to the tree example, I’ve found that a function like find_largest_value is about half as fast using STL trees as using a hand-rolled tree implementation and manipulating the guts directly. That said, in practice it’s usually better to use the highly-tuned STL tree code than to roll your own inferior version, or worse yet use a simpler, less-suitable data structure entirely!
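To put some (entirely made-up) code behind the tree example: the generic route is written against the public iteration interface only, while the fast route assumes knowledge of the node layout and is welded to it:

    #include <algorithm>
    #include <cstdio>
    #include <set>

    // Generic version: uses only the iteration interface, works for any
    // container, knows nothing about nodes.
    template <typename Container>
    typename Container::value_type find_largest_value(const Container& c) {
        return *std::max_element(c.begin(), c.end());
    }

    // Hand-rolled version: exploits the fact that in a binary search tree the
    // largest value is the rightmost node. Faster, but tied to this particular
    // (hypothetical) node layout.
    struct Node {
        int value;
        Node* left;
        Node* right;
    };

    int find_largest_value(const Node* root) {
        while (root->right) root = root->right;   // assumes a non-empty tree
        return root->value;
    }

    int main() {
        std::set<int> s;
        s.insert(3); s.insert(1); s.insert(4); s.insert(5);
        std::printf("%d\n", find_largest_value(s));   // 5, via the generic route
    }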
Ok, I see your point (I suppose you mean a set of numbers; otherwise I have problems following your example).
In cases where a special-case rewrite gives little gain (as you put it), then it’s fine as it is, with libraries containing fixed, native machine code. However, as there are sometimes other cases, it would be very useful, and natural, to have a language system that instead stored some kind of intermediate code, with type info from the source code retained, in “half-compiled” libraries. Special cases could then be exploited within the library code (using invariant info, for instance) to yield faster/smaller instead of slower/larger code, because the compiler has the possibility of doing a late (possibly JIT) and/or global optimization.
I admit this has many problems (including reverse engineering of code); however, I think it’s a natural way to go forward, as it seems most of the “low-hanging fruit” has already been picked when it comes to CPU speed.
For this to happen the languages used must be able to specify detailed type (and invariant) information in spots where (time or size) efficiency is crucial (but not necessarily everywhere else…!).
If you disagree, please explain why it’s inevitable that it takes a lot more resources (preferably at the machine level).
Dynamically linked libraries bring their own performance penalties. Statically linked libraries still force you to load large binaries into memory. It’s the same with programming languages: if you want to be able to create complex applications in a reasonable time frame, you are better off with higher-level languages than low-level languages, and that comes with a performance hit, usually because of VM or interpreter overhead. As complexity increases, we are going to have to depend on higher-level languages and libraries just to keep things sane.
“To me bloated usually means “backward compatible”.”
To people with large investments of time and money in legacy in-house apps, “backward compatible” means good “return on investment”. It’s all about your perspective.
To people with large investments of time and money in legacy in-house apps, “backward compatible” means good “return on investment”. It’s all about your perspective.
Then hire programmers to keep the software up-to-date or accept the fact that you have to use bloated, inefficient software.
Then hire programmers to keep the software up-to-date or accept the fact that you have to use bloated, inefficient software.
That’s what most companies currently do!
I think some of these chipmakers need to slow down and allow the demand for their products and software development to catch up. I personally do not run software that will benefit greatly from multi-processing. I think chipmakers are finding that humans and software development are not subject to Moore’s law.
True, for a long time now the driver of home desktop computer sales has been games.
Sure, there is stuff like audio, video and image work, but those combined do not get to the level of gaming, IMO.
And now that gaming consoles have gone online, even to the degree of supporting VoIP while in or out of game and more, one has to ask oneself whether the home computer isn’t starting to die off.
However, there is the laptop and its recently appearing smaller brothers. But outside of the 20″ behemoths that are more like a squashed desktop than a true laptop, they are not up there for the type of gamer that drives CPU sales.
All in all, it seems that what is needed now is more data transport, not more CPU power. To dust off the old car comparison (and it will be flawed, I know): even the biggest engine can’t run at peak if the fuel flow can’t supply it.
As in, the motherboard, the storage media and all that have to come together as a whole. You can’t stuff a high-end CPU into a system with slow drives and narrow data buses and expect things to fly; the CPU will just spend most of its expensive time waiting for data.
How does he come to that conclusion? If you write software that operates using 2 cores, for example, and it meets expectations, why is it necessary to enable it to perform on 4 or 8 cores? Engineers have written asynchronous functions in software forever… why not just have the OS take those functions or subprocesses and break them out among the cores, managing the cache/memory for them by earmarking pages and maintaining a shared list (and a method of intercore communications) of them or some such thing as that?
There’s a bigger flaw: software is either written for one core or n-cores. The thread model is such that any thread can be scheduled on as little or as many cores as available.
Give me the language features and I’ll thread everything I can get my hands on. Transactional memory, parallel-ly executing blocks, anything so that the idea of threading isn’t just a “programming paradigm.”
Actually, check out where C++0x is going in this respect for some inspiring ideas:
http://en.wikipedia.org/wiki/C%2B%2B0x#Multitasking_utiliti…
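The drafts are still moving targets, so treat the exact spelling as illustrative, but the thread/future part is shaping up along these lines: a sketch of splitting a reduction across however many hardware threads the machine reports, with no OS-specific API in sight.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <future>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Sum a big vector by handing contiguous chunks to separate threads and
    // collecting the partial sums through futures.
    long parallel_sum(const std::vector<int>& data) {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::size_t chunk = data.size() / n + 1;
        std::vector<std::future<long>> parts;
        for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
            std::size_t end = std::min(begin + chunk, data.size());
            parts.push_back(std::async(std::launch::async, [&data, begin, end] {
                return std::accumulate(data.begin() + begin, data.begin() + end, 0L);
            }));
        }
        long total = 0;
        for (auto& p : parts) total += p.get();
        return total;
    }

    int main() {
        std::vector<int> data(1000000, 1);
        std::printf("%ld\n", parallel_sum(data));   // prints 1000000
    }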
This seems like a copout from Intel. The only way for them to cram more transistors into smaller and smaller areas and increase performance is to add more cores. Now they want software companies who have made applications for single-CPU machines for years to just redesign every application in their product line to support multi-core CPUs optimally. This is no small task and will take years. IBM’s Power6 is debuting at 4.7 GHz. How many years will it be before we see those kinds of numbers on x86 CPUs? Four years ago I would have thought that x86 would be at 4-4.5 GHz by now.
Power6 makes a whole lot of compromises to hit that frequency target. It’s basically an in-order machine (allegedly with some clever speculative execution during memory stalls). IBM also doesn’t have to deal with the constraints Intel does. Power6 works within a 160W TDP. Core 2 has to scale down to 10W, and get a significant fraction of full performance at the 35W mark. Power6 has several times the die area to work with, so they can make a different trade-off between core complexity and cache size. Power6 is a very impressive high-frequency design, but it also has the luxury of a much looser set of design constraints than Core 2 does (not to mention the fact that it has the benefit of going into production a full year later).
Point taken. (same goes to PlatformAgnostic). Still I can’t help but think Intel is responsible for the expectations they set years ago.
Intel is now into power-efficient computing. Their target market is the burgeoning laptop space. There’s no way to increase the frequency without burning significantly more power, so Intel’s now going for lower frequency, higher-IPC designs.
IBM hit that high 4.7 GHz mark by using some pretty advanced processes (those CPUs cost a whole lot more than Intel’s x86 market can bear) and by entirely killing the out-of-order unit except for floating point. Perhaps that doesn’t work so hot for the x86 code that we have out there right now?
In ’85, I plotted the performance curve over time for ECL logic, which at that point was peaking at 250 MHz, and concluded that the peak per-processor performance for BiCMOS would come in 2005 at 1 GHz.
It turns out that I missed by one generation of Moore’s Law and the peak per-processor performance came in at 2 GHz. Guess I didn’t sufficiently account for the effect of feature size shrinkage.
There’s nothing particularly striking about Intel warning that the per-processor peak has been hit. The only real difference between the history of ECL’s performance as feature size shrank and BiCMOS’s performance is that BiCMOS was waiting in the wings as ECL peaked, while there’s nothing waiting in the wings for BiCMOS.
By the way, anyone pining for the days of ‘leaner’ programming should go to the library and look up the ‘software crisis’ of the late ’70s in old issues of Datamation. You really don’t want to return to those days.
… we won’t have to buy Intel’s new expensive products and can live with our old systems! Ohnoz..
Ada ‘tasks’ have been around for 20+ years.
Learn it. Love it. Profit!
As for the ‘it’s dead, Jim’ crowd…
I say “787 baby!”
BeOS
BeOS has nothing to offer on this issue. All recent general-purpose OSes handle SMP at least as well as BeOS did (which was not the case when BeOS was making its buzz), if not better, and they can do nothing to make software use parallelism “automatically” (not that other OSes can anyway). Dealing with SMP in an OS is done and relatively “easy”, because it is a problem OS theory and implementations have had to face since forever, which is not the case for general software.
The problem with multi-core and apps is that many software problems are hard to tackle with parallelism in mind using current tools.
Not true. While most OSes can support multi-threading, in BeOS when you created a window it ran in its own thread. From that point, each extension tended to be easier to add as a new thread.
One program I wrote started off as your standard two-thread model. By the time I finished the prototype I had 7 threads running independently of each other. This is not something I planned; it just naturally grew out of the way you find yourself programming a BeOS machine once you stop thinking in a single-threaded manner.
In defense of those who point out that most code is limited in how parallel it can operate: only three of the threads tended to be busy all the time, and the others spent most of their time waiting for input. However, when they were needed they responded instantly.
There are two totally different issues here: the OS’s capability for threads, and the API exposed at the graphic toolkit level. When you talk about GUIs, you are already talking about a really specific kind of application. Hence, this does not concern the majority of the problems in parallelism (e.g. all of the scientific computing world, database management, etc.).
In most Unices, the tradition is not to enforce any toolkit. Separating the GUI code from the backend code in different threads is a model which is used a lot in most of those toolkits. Again, this may not have been common in the mid-nineties, but now any non-trivial GUI has several threads, and the BeOS model does not have a lot to offer compared to modern toolkits in that respect. For example, Qt offers a lot of portable primitives for multi-threading, most classes are reentrant, etc.
http://trolltech.com/products/qt/indepth/threading
Having several threads and keeping the GUI part in one thread is the design most apps are using nowadays anyway. Doing it with Qt is not really more difficult than with BeOS, and it could be much better with higher-level languages than C++.
This is really a marginal issue in the grand scheme of parallelism. Also, there are a lot of problems with the way multi-threading is enforced in BeOS (one thread per window); see for example the discussion there from JBQ, who used to work at Be and is much more entitled to explain it than me:
http://www.osnews.com/story.php/66/Interview-With-The-AtheOS-Creato…
I love this quote in this context: “In this sense functional style programs are “future proof” (as much as I hate buzzwords, I’ll indulge this time). Hardware manufacturers can no longer make CPUs run any faster. Instead they increase the number of cores and attribute quadruple speed increases to concurrency. Of course they conveniently forget to mention that we get our money’s worth only on software that deals with parallelizable problems. This is a very small fraction of imperative software but 100% of functional software because functional programs are all parallelizable out of the box.” (source: Functional Programming For The Rest of Us, http://www.defmacro.org/ramblings/fp.html )
If Intel wants to increase the value of their processors, I think it would be very profitable for them to sponsor development on software written in functional programming languages like Erlang, Lisp, Haskell, amongst others.
I’m sorry but I have to burst that bubble. Functional programming languages can only parallelize problems that are inherently parallel. Problems that are sequential in nature cannot be parallelized, and you will recognize this fact when you hit the limitations of functional languages.
For most Lisp dialects, the problem is obvious in the language: They allow side-effects. Depending on the actual dialect, this can either mean that the language may not be parallelized at all, or that the programmer has to serialize execution explicitly using a (begin…) expression (Scheme example) when the problem is serial in nature, making it just the same as an imperative language.
Purely functional Lisp dialects, as well as Haskell, choke even harder on this problem. When the problem is serial in nature, they have to resort to programming constructs such as monads. Not only are these cumbersome and extremely difficult for most programmers, they also defeat any attempt of parallelization. Only in combination with lazy evaluation can you achieve some parallelism again – just as much as the problem is parallel in nature. This also ignores the fact that no destructive update of data structures can be done, which can be quite costly.
AFAIK, Erlang evades the whole discussion quite nicely. It simply uses explicit parallelization and is thus no easier to apply to parallelism than any imperative language.
All right:
give me 5 non-trivial problems that can not be parallelized.
(no user interaction, just plain processing)
Server software serving a high number of concurrent users should scale gracefully over the available cores.
“Heavy” desktop software (graphical processing, processing of video data, 3D rendering, compilation, encoding data such as music or video, encryption/decryption of large amounts of data, mathematical software, …) should also be able to make use of multi-core CPU power. All the input data is available, and not all of it depends on other parts of the data (if it does, why not change the data format to facilitate multi-core CPUs?).
Multi-core CPUs are here, and they are not going away. Better make software able to deal with this.
Probably this requires a different mindset; this isn’t the first time in IT history, is it?
give me 5 non-trivial problems that can not be parallelized.
Any computation where the calculation of the Nth term depends on the result of calculating the (N-1)th term.
A lot of these problems come up in signal processing, which is why systolic arrays were once thought to be a promising architecture for such processing.
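To make that concrete, here is a toy example of such a recurrence, a first-order IIR filter (my own sketch, not from the parent post): every output needs the previous output, so the loop cannot simply be split across cores.

// y[n] = a*y[n-1] + x[n]; the loop-carried dependency on y[n-1]
// means iteration n cannot start until iteration n-1 has finished.
#include <cstddef>
#include <vector>

std::vector<double> iir(const std::vector<double>& x, double a) {
    std::vector<double> y(x.size());
    double prev = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        y[n] = a * prev + x[n];   // needs the result of the previous iteration
        prev = y[n];
    }
    return y;
}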
But the real problem isn’t “can it be parallelized”, but rather, “at what speed up, with what overhead, and to how many threads.”
Sooner or later, Amdahl’s law makes every parallelization stop scaling; the question is which ones sooner and which ones later.
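For reference, Amdahl’s law in its usual form: if a fraction p of the work can be parallelized over N cores, the best speedup is

S(N) = \frac{1}{(1 - p) + p/N}

so even infinitely many cores cap you at 1/(1 - p); with p = 0.9 the ceiling is a 10x speedup, no matter how much silicon you throw at it.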
Exactly. Parallelizing things is easy; making them faster by parallelizing is already much more difficult. As you said, all the signal processing algorithms that are recursive are really difficult to parallelize (they are already difficult to vectorize… I don’t have any background in theoretical computer science, but I would not be surprised if this is fundamentally the same problem).
Yes, well, as you say you can tack on parallelisation as Intel is doing, but exploiting the hardware efficiently is the hard part, and it will always be hard whilst we program in languages that require us to represent the problem in a serial (i.e. the opposite of parallel) manner.
I do have a background in computer science, and vectorizing and parallelising are two words for, sort of, the same thing. Parallelisation is the general concept; vectorization is one way to exploit parallelism (a typical vector processor would be a GPU).
I’m not sure about signal processing specifically, and I’m a bit dubious when you say they are ‘recursive’ in general, but yes, recursion does lead to a serial algorithm: one operation has to complete before the next begins because the latter requires results from the former, and therefore the algorithm is fundamentally serial in nature.
Also don’t forget that parallel software is very different from *efficient* parallel software.
Sure, in pure functional software there is more parallelism available, but does that mean it is easy to make efficient use of it?
No! In fact it’s very hard to use this parallelism efficiently.
Only about 5% of software needs to actually use the cores as such. If we don’t count games (they’ll benefit all right), only databases and highly sophisticated server software or math/physics simulators really should use more cores. Compilers too, perhaps. But ordinary “applications” and general programs shouldn’t… at least not yet. Why? Because we run more than one program at a time. With multiple cores we can now run two programs where each uses 100% of one core. Of course core <> CPU, but the general situation is almost the same if the memory/bus can handle it. I don’t see why standard applications or programs that aren’t speed-critical should be forced into complex parallelism. It’d just eat more resources.
Who is to blame for the decline in performance growth?
It seems to me that software can be blamed for that only if there were a single programming method for writing software; and because we have a huge amount of software out there, it is really hard to blame developers for not trying different methods to help performance keep growing.
On the contrary, hardware manufacturers haven’t tried anything other than basically moving their vehicles (the electrons) through copper conductors on a piece of silicon that can cope with the heat these vehicles produce; even worse, they have thinned these structures so much that the electrons start to leak from their intended paths (the conductors).
I know that evolution is just evolution, unlike revolutions that fundamentally change everything around us. Think of flying and how it changed our experience of travel: nothing on the ground, with its friction, can match the speed aircraft achieve in nearly frictionless air.
Current manufacturers (IBM, Intel, others) have tried parallelism to solve the problem, and in the lab they got poor performance on 64 cores: not more than 20x the fastest CPU available right now. I say poor performance because efficiency was low, not even close to 30% at best.
So, what is the solution?! This sounds like the imminent fuel depletion the world is facing; the answer would be to prepare for a revolution, and the most familiar thing OEMs can play with is light.
Light, or photons over fiber optics, is currently employed in storage networking; so why not experiment with fiber-optic motherboards and fiber-optic CPUs?
The speed of light is enormous, 300,000 km/s, compared with electrons, and as it travels it produces far less heat if you don’t intensify the beams (to produce a laser). And light has more colors than you can imagine, so you could carry more data than with the electron’s two-state physical property (on/off), e.g. red, orange, blue, green, purple, etc. So you could send the entire MAC address header of your NIC’s packet with a blink of light!!
Light, or photons over fiber optics, is currently employed in storage networking; so why not experiment with fiber-optic motherboards and fiber-optic CPUs?
There is active research in this area, but it’s many years away from commercialization. The basic problem is coming up with the fundamental logic gates (AND, OR, etc) within an optical system. It’s a really hard problem, and there are not yet any solutions like those based on the good old transistor.
*Sigh* Why don’t you research the topic more before saying such things? First, electrical signals travel at nearly the speed of light even though the electrons themselves are slow, so the gains are not that big.
Then, if you still want to use light, either:
- you use the light only to transport data, but this means you have electricity-to-light conversion, which uses power, so you cannot do it too often (it could still be very useful for off-chip communication);
- or you use the light to do computation, but photons do not interact with each other, so you have to use some material, and those interactions are ‘second-order effects’, so you need high light intensities: not efficient, and a lot of heat is released.
People have done a lot of research on these topics, and it will probably be useful for off-chip communications, but that will enable *more* parallelism, not less!
“First, electrical signals travel at nearly the speed of light”
Unfortunately, that’s wrong; electricity has a speed of 60 Hz, i.e. 60 oscillations per second, if you are talking about AC; and if you are talking about DC (direct current), then electrons have to travel in conductors that have resistance, which is measured in ohms.
The AC (alternating current) that we use is generated by oscillating electrons in high-voltage wires that run between cities, and the main force used to generate it is gasoline that rotates a giant magnet, because electricity and magnetism are interchangeable.
“you use the light to do computation, but photons do not interact with each other, so you have to use some material”
For that I would suggest using prisms and other light-emitting/reflecting/polarizing materials, which would allow all sorts of computations.
I don’t know if you’re joking or not, but if you’re not, I advise you to take some physics/electricity lessons, because there are so many things wrong in your post that I cannot correct them all…
The “speed of electricity” is not really a particularly precise concept. Electrons flowing in a wire do not move in an organized way along the wire. They move randomly at high speeds (~0.01c) within the metal, but because of the randomness of their motion, only make very slow net progress down the wire (on the order of meters per hour).
Neither of these are particularly closely related to the speed of a signal propagating in a wire. If you’re sending a signal down a wire by modulating an electromagnetic wave, then the speed of signal propagation can be very fast, on the order of the speed of light.
If you want to know more about these phenomena, Google for “drift velocity” and “group velocity”.
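For a rough sense of scale (my own back-of-the-envelope figures using textbook values for copper, not numbers from the parent post): for a current of 1 A through a 1 mm² copper wire,

v_d = \frac{I}{n e A} \approx \frac{1\,\mathrm{A}}{(8.5\times 10^{28}\,\mathrm{m^{-3}})(1.6\times 10^{-19}\,\mathrm{C})(10^{-6}\,\mathrm{m^{2}})} \approx 7\times 10^{-5}\,\mathrm{m/s}

which works out to roughly a quarter of a meter per hour, hence “meters per hour” above.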
Now, as for optical logic gates, I suggest you try building one with the materials you suggested. Use one polarization for “1”, another for “0”, and build an AND or OR gate. Do this without significantly reducing the intensity of the input light beams (remember, in a microprocessor, a signal might have to travel through thousands of these gates!). This means that polarizing filters are right out, since each time the signal passes through one of those, you’ve lost 50% of your intensity, and after a few of them, you have nothing left to work with! Once you’ve done that, you’re well on your way to a Nobel Prize!
It’s really not as easy as you might think. I know somebody who worked on this problem as a student project. He didn’t even try to attack the light intensity issue, just to build a single logic gate. After months of work, he still didn’t have one working.
Some additional notes:
1) You’re getting in the ballpark with your reference to AC current, but realize that the frequency of the carrier wave has nothing to do with the speed of signal propagation in a wire. Indeed, a simple 60 Hz carrier wave conveys no information, and thus the signal speed is zero! To carry information down a wire, you must somehow (using one of many techniques) modulate a carrier wave (which may be of any frequency, not just 60 Hz). Using the frequency of the carrier wave and the frequency of the modulation you can then apply some basic math to compute the speed of signal propagation.
2) Electricity and magnetism aren’t interchangeable. They’re different expressions of the same underlying phenomenon, but they refer to different aspects of it, have different governing equations, etc.
Maybe people should try to use programming languages that make parallelism easy or give it to you for free, like the functional programming languages. If you insist on using C or C++, it’s no wonder your programs will not take advantage of multiple cores without a lot of work (which you don’t want to do).
The tasks that really need the power are things like video and audio processing, scientific calculations, and similar things. Unfortunately, those three languages (possible exception for lisp) are so ridiculously slow at these kinds of calculations that going multicore is completely useless.
I don’t agree that the power of multiple cores isn’t needed for most desktop applications. OK, maybe you will not need to use all cores at full speed, but the advantage could be that, instead of using one core at 1 GHz, you use 10 cores at 100 MHz, with lower power consumption and less heat production. The result could be that the cooler can perhaps be turned off without problems, or just removed from the CPU!! And we all love silent, passively cooled computers that cannot be interrupted by cooler issues, simply because there is no cooler. B-)
PS: don’t forget that mobile phones have neither the room nor the power to cool fast CPUs. Combine this with the fact that both Google and Apple have started to focus on mobile phones, and you know multi-core CPUs are important.
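Roughly why that trade can pay off (a first-order model, not numbers for any real chip): dynamic power scales as

P_{\mathrm{dyn}} \approx \alpha C V^{2} f

and running at a lower frequency usually lets you drop the supply voltage too, so ten slow cores can in principle match one fast core’s throughput at a fraction of the power, provided the work actually parallelizes.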
Being able to dynamically turn cores on and off as the workload increases and decreases would be interesting.
But there is also the data flow to be taken into account.
Most desktop jobs are either data-intensive or interaction-intensive. The hard drive is just as much a limiter as the inability of the software to take advantage of all the cores.
As in, if you have 10 programs all asking for data from the hard drive at the same time, things just grind to a standstill.
Maybe SSDs will help here, maybe not.
It’s interesting that most storage media in use today share basic working principles with the first sound-recording systems: mechanically moving media that store the information.
And it’s that mechanical movement of the media that can be a limiter. As data is needed faster, the media must move faster; at some point the forces acting on the media will rip it apart.
I wonder how the development of magnetically based “flash” storage is going…
You don’t seem to know much about the performance of these languages. Erlang is used all over the telecom industry, because it scales extremely well. An interesting case is yaws, a webserver written in Erlang. There are some benchmarks around which will show you why having parallelism built in is a nice thing.
Erlang is good for telecom, sure. But for solving massive equation systems? Encoding video? Don’t think so. Its floating point performance is terrible.
Haskell is just.. slow. I’ve never seen a fast implementation, and the only multithreaded implementation out there is HORRIBLY inefficient. Not suitable for production use.
As for lisp, I’m not aware of any production-ready multithreaded implementations.
Allegro generates pretty fast code, and has a mature multithreading implementation. SBCL generates numeric code very close in performance to C code, but its multithreading system is not fully mature and the GC apparently has some scalability issues.
That said, Lisp (Common Lisp) is an imperative programming language. It’s not inherently any more suited to writing concurrent code than C.
Well then, my point still stands: for high performance multithreaded code in production, functional languages are almost completely useless TODAY. This may change, but I’m not seeing many positive signs…
And when you don’t need performance, you generally don’t need multithreading at all with the fast cores we have today, except as a convenient abstraction as in Erlang. Thus, functional languages are fine there, but are not a general solution to the problem of multithreading.
Well, if you look at the computer language benchmarks, Haskell sometimes performs badly (while still being good compared to other languages) and at other times it performs better than C.
http://shootout.alioth.debian.org/gp4/
I don’t know how inefficient it is, but considering that it has multithreading built in, maybe the inefficiencies are repaid in coding simplicity (and time).
GHC (the haskell compiler) clearly loses most of those benchmarks pretty badly to C/C++.
And no, GHC does not have automatic multithreading built in.
I believed that functional programming was concurrent by design; looks like I was wrong. I guess I read the section on concurrency too quickly.
But it says so here: http://www.defmacro.org/ramblings/fp.html
I think that’s an advantage.
I guess how important execution speed is depends on what kind of software you develop.
Sure, it’s an advantage in theory.
Translating that into production-ready practice with speeds anywhere close to C/C++ has not been done, however, and probably will not happen.
Most implementations of Erlang, and some of the few multithreaded Lisp implementations, are threaded in name only; that is, they implement user-mode threads in the language’s virtual machine. That’s all fine and dandy until you actually want to use multiple processors… user-mode threads inherently must share a processor.
“Most implementations of Erlang, and some of the few multithreaded Lisp implementations, are threaded in name only; that is, they implement user-mode threads in the language’s virtual machine”
I don’t know what implementation of Erlang you’re thinking of, but the standard Erlang (BEAM) emulator implements a M:N threading model. Erlang runs one OS thread for each CPU core, then schedules Erlang processes inside of those.
But it goes beyond that. Thanks to the shared nothing architecture Erlang processes can be serialized and sent over a network. You can easily use Erlang to program massive clusters of computers, coordinating everything through its database Mnesia.
Erlang is by far the most exciting language in terms of addressing the concurrency problem.
Erlang is interesting, but it has lots of problems. Here’s what Robert O’Callahan had to say:
This is related to another big problem, composability. Suppose I have a big hash table that I want to provide parallel access to. Ignoring resizing for a moment, I can have one lock on each bucket, or equivalently, I can have one process for each bucket in a message passing system. Now suppose that as a user of that library I want to build on top of it a “swap” operation that’s atomic (and performs well), without modifying the library. In a sequential world adding an operation like this is trivial, in a parallel world it’s impossible in both the locking and message passing models.
…
Just pointing out that Erlang (and message passing in general) is not the last word in concurrent programming.
The problem with saying “we’ll just change the library!” is that if you have to change the library whenever an application wants to build something new on top of it, it’s going to be very hard to build big complex systems, you’re not going to get much code reuse, and systems will be brittle and hard to maintain.
And yes, of course the problem is very easy in sequential languages. That’s the point. We’re miles away from making parallel programming as “easy” as sequential programming, and if we can’t do that, we won’t be able to write parallel programs that are as capable as the sequential programs we have today.
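To make the composability point concrete, here is a rough C++ sketch (mine, not O’Callahan’s, and the names are invented): a lock-per-bucket table where an atomic swap has to reach inside and take two bucket locks in a deadlock-free order, which is exactly what you cannot bolt on from the outside if the library does not expose its locks.

// Lock-per-bucket map; swap_values(a, b) must lock BOTH buckets atomically.
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

class ShardedMap {
    static constexpr std::size_t kShards = 16;
    struct Shard {
        std::mutex m;
        std::unordered_map<std::string, int> data;
    };
    std::array<Shard, kShards> shards_;

    Shard& shard_for(const std::string& key) {
        return shards_[std::hash<std::string>{}(key) % kShards];
    }

public:
    void put(const std::string& key, int value) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.m);
        s.data[key] = value;
    }

    // Only code that can see the internal locks can take both of them
    // safely; a client layered on top of the public put() cannot do this.
    void swap_values(const std::string& a, const std::string& b) {
        Shard& sa = shard_for(a);
        Shard& sb = shard_for(b);
        if (&sa == &sb) {
            std::lock_guard<std::mutex> guard(sa.m);
            std::swap(sa.data[a], sb.data[b]);
        } else {
            std::lock(sa.m, sb.m);   // acquires both without deadlocking
            std::lock_guard<std::mutex> ga(sa.m, std::adopt_lock);
            std::lock_guard<std::mutex> gb(sb.m, std::adopt_lock);
            std::swap(sa.data[a], sb.data[b]);
        }
    }
};

In a message-passing design the same problem shows up as needing a new message type that the bucket processes understand, which again means changing the library.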
Suppose I have a big hash table that I want to provide parallel access to.
Use Mnesia. I guess part of the difficulty with Erlang is getting used to its way of doing things, rather than thinking in traditional procedural approaches.
The point wasn’t that it couldn’t be done (of course it can) – the point was that adding a simple extra ability to the code would require modifying the libraries underneath, which means the reusability of code is much lower. That’s a problem when comparing it to the traditional “linear” approach.
What if Mnesia didn’t support it? In C++ I could code the support above it into my application. With Erlang I’d need to either go into Mnesia myself and modify the library or I’d have to look for another one that worked better. Or I’d have to completely write my own. That isn’t optimal.
Not that Erlang is useless or anything. It’s actually pretty cool. It just isn’t some holy grail that will make everything work perfectly.
This is so infuriating. That’s not the problem with software. The nastiest problem in the computer industry is not speed but software unreliability. Unreliability imposes an upper limit on the complexity of our systems and keeps development costs high. We could all be riding in self-driving vehicles (and prevent over 40,000 fatal accidents every year in the US alone) but concerns over safety, reliability and costs will not allow it. The old ways of doing things don’t work so well anymore. We have been using the same approach to software/hardware construction for over 160 years, ever since Lady Ada Lovelace wrote the first algorithm for Babbage’s Analytical Engine. The industry is ripe for a revolution. It’s time for a non-algorithmic, synchronous approach. That’s what Project COSA is about.
Intel would not be complaining about software not being up to par with their soon-to-be obsolete CPUs (ahahaha…) if they would only get off their asses and revolutionize the way we write software and provide revolutionary new CPUs for the new paradigm. Maybe AMD will get the message.
Sorry, but your example about self-driving vehicles is totally wrong: we cannot ride in those vehicles, not because of the quality of the software implementation, but because we don’t know how to make such software, which would require something close to strong AI.
Sorry, but your example about self-driving vehicles is totally wrong: we cannot ride in those vehicles, not because of the quality of the software implementation, but because we don’t know how to make such software, which would require something close to strong AI.
No, I don’t mean a fully intelligent car, although that, too, is coming. I mean a semi-automated, GPS-enabled system with RFID devices embedded in the road or on the curb and RFID sensors in the vehicles. I believe such a system is viable and could fully automate driving in big cities. In addition to automating driving, it could substantially decrease the number of vehicles in the streets. Most cars are idle most of the time anyway. Big cities could ban private vehicles altogether and give everybody a pager that they can use to summon transportation when needed. The nearest parked automated car would then drive itself to the passenger’s location and take them to their destination.
1) Your system is fragile: any obstacle on the road would create an accident.
2) A totally separated road network would be necessary because pedestrians don’t come with RFID tags.
On the problem of a car designed to drive itself: did anybody else see the Nova show about the last Grand Challenge?
DARPA held a contest to get a vehicle that could drive itself about 130 miles through the Mojave desert over all sorts of terrain, with obstacles in the path, following a trail of GPS waypoints.
The first year, all vehicles failed by mile 7, and many failed in the same way: the camera ended up pointing at the sky on a slight incline, so all image tracking was lost.
The second year, five vehicles drove themselves all 130 miles to completion.
The most interesting team, from Stanford, was an all-software team that took a standard Volkswagen already fitted by the company with remote drive controls and added their own software. IIRC they had about six months to do this with a small team, and most of the code was written by one guy. They won by using software smarts to replace complex mechanically gimbal-mounted cameras. A lot of the software looks like the kind you find in a digital camera for motion compensation, plus the extra task of pulling the smoothest possible path to the horizon out of laser rangefinders and stabilized camera pictures in combination. This was their first showing in the contest.
The better-funded teams with defense backing had far more resources and used a lot of hardware complexity plus software to do this on big trucks, and they still had many people fine-tuning the path their vehicle took, so the spline was hand-tuned to maximize vehicle speed (a big cheat, I think). They had learned from the previous year’s mistake about unexpected inclines.
The Stanford car clearly handled one obstacle very smartly when it drove around a truck while passing it; it made all its own decisions on exact path, speed and obstacle avoidance.
Still, I don’t think we will see this in production cars, but the US Army wants to put this automatic driving into production ASAP, for obvious reasons.
From other science shows it seems most auto companies no longer invest in the self-driving idea and now concentrate on safety smarts to help drivers drive better.
The technology is getting there, but it’s still probably decades from effective commercialization. One of my friends was on the Caltech team for this project for several years. Despite several years of work, backing from JPL, etc., the result was still squarely in the “first steps” category. Their van, which performed well in trial runs, decided to try to scale a knee-high barrier in the main race. Not something you want your SUV doing on the road!
In order to effectively commercialize self-driving vehicles, the “random f–kup” rate has to go way down, and the versatility (being able to drive in terrain you’ve never seen before) has to go way up. To do that, synthetic vision systems have to get orders of magnitude better. Pathfinding systems have to get much better. Hazard detection and avoidance algorithms have to get much better.
It’ll get there eventually, and the Grand Challenge is a great step in that direction, but I’m not convinced that even in the next 20 years we’ll have computers that can drive as well as your average 16-year old girl…
“…as well as your average 16-year old girl…”
That’s setting the bar pretty low; maybe set it even lower, like Lindsay Lohan driving her Merc into a tree.
Still, the show was pretty interesting to watch. I agree that we probably won’t see commercial self-driven personal transport for safety reasons, and it probably isn’t even necessary except for military use in hostile conditions.
This is so infuriating. That’s not the problem with software. The nastiest problem in the computer industry is not speed but software unreliability. Unreliability imposes an upper limit on the complexity of our systems and keeps development costs high.
To a point I agree with you, but I don’t believe it’s because the current model is flawed; it’s more that the way in which many of us work with the model can be flawed. The complexity and unreliability in the PC realm we largely created ourselves; not all software is unreliable.
Look at something like the Apache attack helicopter: a very complex system, but its avionics control systems are very reliable.
We could all be riding in self-driving vehicles (and prevent over 40,000 fatal accidents every year in the US alone) but concerns over safety, reliability and costs will not allow it. The old ways of doing things don’t work so well anymore.
I do not think we have the computing power, or systems designs to make a self-driving car. The AI is just not there yet.
The industry is ripe for a revolution. It’s time for a non-algorithmic, synchronous approach. That’s what Project COSA is about.
I agree, but I’ve been to the website, and a forum with a few rambling posts isn’t going to be the revolution, IMHO.
Hehe!
I don’t want to troll, but lowering power consumption to keep the core cool isn’t exactly new.
The PowerPC has been about this for many years. It was also way ahead of x86 when it comes to multicore.
Freescale’s desktop-suitable offering is the 8641D, which derives from the G4 74xx series. The D is for dual-core, and with just a few tens of watts you get 1.5 GHz per core. IBM went the more powerful route with dual-core G5 CPUs (the 970 series); Apple used these CPUs (the 970MP) in the last PowerMacs.
AMCC just announced their Titan SoC, running at 2 GHz and consuming just 2.5 watts per core. P.A. Semi is working on its PWRficient series: dual-core, 2 GHz, with a maximum consumption of 25 W.
http://investor.amcc.com/releasedetail.cfm?ReleaseID=244596
http://bbrv.blogspot.com/2006/11/watts-happening.html
Sadly, the PowerPC just proves false the idea that the best product will eventually win.
That is, unless x86 suddenly rolls over and dies.
Hmm, that happening would be quite the IT cataclysm…
The Titan SoC isn’t in the same league as any of the other designs we’re talking about (it’s a dual-issue embedded chip based on the IBM 440). The 8641D post-dates the first dual-core x86s (the Athlon X2) by a long time. So does the 970MP. The only PPC to beat x86 to the dual-core game was the POWER4.
We are seeing multicore CPUs only because Intel/AMD have FAILED to scale clock rates much further without degrading performance per clock and producing excessive waste heat.
Two CPUs in a computer have the merit of making everything run more smoothly, but 8+ cores in an ordinary desktop? I’m not going to buy that.
It’d be better to use the chip’s space for dedicated coprocessors for audio/video tasks, or a bigger cache, IMHO.
And isn’t that what AMD has been talking about lately?
What was that project name again for the CPU with an on-board GPU?
Hell, it’s that idea (dedicated co-processors) that has been powering the data-server models from IBM and similar, right?
As in, each storage controller having its own processor so that the CPU can just hand off a data transfer to one of them and go back to doing other things.
That’s why a PC crawls to a near halt at times when there are heavy data transfers to be done alongside a heavy CPU load: the CPU is also managing the data transfer to some degree.
Great input, Intel; it seems everyone here agrees this is hard enough.
Well then, here’s a new mission for you!
How about aiming for the title “most calculations per watt”? I know Sun offers that with their CPUs… now wouldn’t that be something for you to bite into?
That’d also mean I could lower my electricity bill both for the computer AND for the cooling system in my house!
This is just more marketing from the CPU manufacturers. They need more software to depend on multicore in order to sell multicore. Simple as that; it’s the old golden “make them depend on us” rule. Most programs don’t really NEED multicore (as I said in another post).
Altivec doesn’t make it any easier to use multiple cores, but thanks for playing.
Oh, but the mere fact that it’s associated with PowerPC magically rubs off on and improves all of its other features, no matter how unrelated they are to the ISA!
But multithreading isn’t platform-specific. For example, GCC just added OpenMP support which should benefit all multiprocessor platforms.
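For anyone who has not seen it, a minimal OpenMP sketch (assuming GCC 4.2 or later, built with g++ -fopenmp; without the flag the pragma is ignored and you get the ordinary serial loop):

// The pragma splits the loop iterations across the available cores and
// combines the per-thread partial sums via the reduction clause.
#include <cstdio>

int main() {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 10000000; ++i)
        sum += 1.0 / i;              // iterations are independent, so this parallelizes
    std::printf("harmonic sum = %f\n", sum);
}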
I think some of these chipmakers need to slow down and allow the demand for their products and software development to catch up.
Their stock will tank if they do that. Their job is not to satisfy the computing demand that exists, but to create demand for products that they know how to build.
I find it interesting that Intel declares that software engineers must start increasing the levels of parallelism they exploit when the architecture Intel is offering is inherently serial. Doesn’t this seem a bit hypocritical to anyone?
On another note, this is an extremely complex problem to solve in a scalable way.
For a start, people need to learn how to program in ways that aren’t fundamentally serial (i.e. unlike every imperative language out there).
Then there need to be architectures, e.g. dataflow, which can exploit these new methods.
For Intel to start tacking parallelism onto a 20-year-old (or older?) ISA and say “software engineers, the ball is in your court, start exploiting the parallelism we’re offering” … is never, ever going to work.
It’s a shame that the one OS that pretty much FORCED you to follow all of the above went the way of the dodo, being either semi-illegally forked (YellowTab) or still in an uber-beta rewrite (Haiku).
It’s called pervasive multithreading: BeOS FORCED every application to have multiple threads, and the bottom line is that MOST application programmers couldn’t wrap their brains around it and learn to use it… It’s the leading cause of BeOS being an application wasteland even at its peak, AND why the handful of ports like Firefox run like crap (since they end up in a wrapper that tries to fake the multiple threads into behaving like a single-threaded app).
The only people who DO use it for desktop programming, or try to use it on other platforms, and who at least have programs where it makes sense, are game developers. They’ve been running ‘world calculations’, ‘input’ and ‘rendering’ separately from each other for a while; even if they don’t actually run them in separate threads, they do task them internally (hence the reason frame rates can drop but the speed at which the game plays doesn’t).
The Actor model is the solution: each object is an actor, it has its own thread, it accepts messages, and the object’s thread wakes up and executes the message. Messages are buffered, and the thread keeps going as long as there are messages in the buffer. Return values contain a wait condition which is signaled when the result is ready; the caller waits on this wait condition to continue the computation.
With this model, whatever parallelism is in the program surfaces automatically. This model favors lots of cores, but performance gain could be linear with the number of cores.
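Here is roughly what I mean as a bare-bones C++ sketch (my own simplification: one OS thread and one queue per actor, and no futures for return values; a real implementation would pool threads):

// Bare-bones actor: a private thread drains a message queue; callers post
// messages with send() and never touch the actor's state directly.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class Actor {
    std::queue<std::function<void()>> inbox_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;

    void run() {                                     // the actor's own thread
        for (;;) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [&] { return done_ || !inbox_.empty(); });
                if (inbox_.empty()) return;          // drained and shutting down
                msg = std::move(inbox_.front());
                inbox_.pop();
            }
            msg();                                   // handle one message
        }
    }

public:
    Actor() : worker_([this] { run(); }) {}
    ~Actor() {
        { std::lock_guard<std::mutex> guard(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void send(std::function<void()> msg) {           // enqueue a message
        { std::lock_guard<std::mutex> guard(m_); inbox_.push(std::move(msg)); }
        cv_.notify_one();
    }
};

A caller just does actor.send([&]{ /* handle the message */ }); getting a result back then needs a reply message or a future, which is where the composability issues discussed earlier come back in.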
This is essentially the same thing as in hardware simulation: if you think of processes with ports as the object model, then the event time wheel (or scheduler) moves data from the outputs of finished processes to the inputs of waiting processes. It can also be spread over multiple processors in high-end versions, but most hardware simulators tend to run on one processor.
In hardware languages like Verilog, these processes are called modules, and the ports are in, out and inout. They tend to have very limited data types, i.e. unsigned integers of absolutely any width up to 16M bits wide, but mostly small fields. These ports are connected through wires. The internals of modules are mostly continuous assign statements with arbitrary C-like expressions that change outputs as their inputs change, which reflects logic structures pretty well. What HDL languages add is a great level of timing detail to make the simulation follow real logic timing more closely, although this has become moot in cycle-based design with synthesis doing most of the logic design.
To speed up the simulator, some constructs can be used, such as sequential expressions that have no direct hardware equivalent but can model the continuous assign at much lower cost; these then have to be synthesized into equivalent logic.
It is also possible to use C as a hardware language, but such simulation environments tend to be hard to use except in fully synchronous systems.
I could imagine an extended C++ class as a process with ports added, driven by the same sort of event wheel. Such a language would be superficially similar to Verilog and could be used to synthesize hardware processes (modules, cells, blocks, etc.) and could also be used to build concurrent apps.
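Something like this toy event wheel is what I have in mind (my own sketch, loosely modeled on the delta-cycle scheduling a Verilog simulator does; all the names are invented): a process declares which ports it is sensitive to, and writing a port schedules every process on that port’s sensitivity list.

// Toy event wheel: writing a port wakes the processes sensitive to it,
// and run() drains events until the system is quiescent.
#include <functional>
#include <queue>
#include <set>
#include <vector>

struct Process;                              // forward declaration

struct Port {
    int value = 0;
    std::vector<Process*> sensitive;         // processes to wake when the value changes
};

struct Process {
    std::function<void()> body;              // the process's behaviour
};

class Wheel {
    std::queue<Process*> runnable_;
    std::set<Process*> scheduled_;           // avoid scheduling a process twice
public:
    void write(Port& port, int v) {
        if (port.value == v) return;         // no change, no event
        port.value = v;
        for (Process* proc : port.sensitive)
            if (scheduled_.insert(proc).second)
                runnable_.push(proc);
    }
    void run() {
        while (!runnable_.empty()) {
            Process* proc = runnable_.front();
            runnable_.pop();
            scheduled_.erase(proc);
            proc->body();                    // may call write() and schedule more work
        }
    }
};

A process body can capture the wheel by reference and call write() itself, which is how a chain of continuous assigns would ripple through; a BLooper waiting on BMessages maps onto much the same shape.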
On BeOS, I have come to think of the BLooper object as the same thing as the process model above, waiting on events.
If you think of a BWindow as a piece of hardware to send graphics commands to, then the rest of the application can be thought of as communicating blocks. Since many programs follow the Model-View-Controller scheme, it’s natural to use a BLooper or process for each of those as a way to describe how the thing works. Each of those can be further broken down into smaller nested processes for parts of a window.
I think that there is a lot of reinventing of the wheel here (the event wheel, that is), and I see similarities in Actors, HDL simulation, BLoopers, and possibly in Erlang (if it can be used to describe telecom apps), as well as functional and dataflow languages. However, I really need to keep up with these other schemes in more detail.
Sorry, it’s a bit late, but Wirth’s famous article
“A Plea for Lean Software”:
http://cr.yp.to/bib/1995/wirth.pdf
(written over 12 years ago!)
would be a good introduction to this interesting discussion. Maybe, before we start analysing the complicated problem of parallelism, we should ask why software is so bloated.
Mark
Or what about E. W. Dijkstra’s complaint about the state of software engineering some 30 years ago?
It seems that Intel believes that every process is naturally parallel. While that might be true for some domains of software, it is not true for all. Also, it must make economic sense for potentially parallel software to be converted from serial to parallel; the cost in both time and money is still too high for a lot of applications. Finally, is parallelism the individual application’s job? The OS’s job? Both?