After having an interesting discussion with Brendan on the topic of deadlocks in threaded and asynchronous event handling systems (see the comments on this blog post), I just had something to ask the developers on OSAlert: could you live without blocking API calls? Could you work with APIs where lengthy tasks like writing to a file, sending a signal, or doing network I/O are done in a nonblocking fashion, with only callbacks as a mechanism to return results and notify your software when an operation is done?
I have already worked with some systems implemented that way. At first they seem weird and take some time to get used to, but afterwards they start to make sense. One advantage is that it seems to be easier to identify performance bottlenecks (you end up having queues of pending operations, so you can easily monitor those queues and apply all the math from queueing theory classes).
The other day I stumbled upon ZeroMQ, which uses a kind of mixed approach. All writes (TX) are async (queued and handled by a background thread), and reads (RX) may be sync or async, depending on configuration.
I’ve been researching just the same thing. Such a system would allow for transparent inter-process communication across networks, making way for a new era of distributed computing.
If anyone’s interested, look into the E programming language.
Particularly in the case of a semaphore or mutex.
One function of blocking is to give up the CPU when you can’t do anything, and to queue for a resource.
It would turn simple code into something like the following, but the pastabilities are endless:
gotit = 0;
waitforit() { gotit = 1; }
func() {
…
getit(resource, waitforit);
while (!gotit) sched_yield(); // heat up the CPU, drain the battery, etc.
}
Of course there might not be callbacks, but check-if-done functions:
while (!areWeThereYet(destination)) burn();
Of course if it is truly asynchronous, the callback function will have a parameter specifying the callback which needs to be called when the first callback is completed.
I could live without blocking, just as I could live in a post-apocalyptic world, but it wouldn’t be any easier, simpler, or more pleasant.
I think you’re missing how async is done in modern languages. For instance, the entire IO API in Silverlight is implemented via async calls.
So instead of actually warming the CPU in a loop, you provide a callback where execution will continue. Semaphores and mutexes could be coded in the same way (using the C# () => lambda syntax):
DoStuff();
mutex.Acquire(() => {
    // callback runs once the lock is held
    DoMoreStuff();
});
In C# 5 there are extensions for doing this in a simpler manner (something like this):
DoStuff();
await mutex.Acquire();   // suspends here instead of blocking the thread
DoMoreStuff();
Basically this is just syntactic sugar around the first example. I don’t think any serious async system uses loops; they instead hand off the async callback handling to other VM or OS functions.
One approach to reduce the number of semaphores/mutexes is to use one thread for each group of resources you may need to lock. Then, since you know that just a single thread modifies those resources, you don’t need mutexes. You can adjust the scalability for multiple cores by changing the granularity of the resource groups (smaller groups need more threads).
Each of these threads has a queue of work to be done, and you’ll have to model your “work” as structs/objects that can be stored in a queue.
So, instead of having:
Component blah:
…do some_stuff with blah;
…wait(lock_for_foobar);
…do X with foobar;
…release(lock_for_foobar);
…do more_stuff with blah;
You will end up with:
Blah.some_stuff:
…do some_stuff with blah;
…enqueue “do X and then notify Blah” to Foobar
Foobar.X:
…do X with foobar;
…enqueue “X done” to Blah
Blah.X_done:
…do more_stuff with blah;
This way you can get rid of mutexes and use only async calls.
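To make the pattern concrete, here is a minimal C sketch of one way it could look (the names work_item, foobar_pipe, foobar_thread and enqueue_to_foobar are invented for illustration): each resource group is owned by a single thread that drains a pipe of work items, so the pipe serves as the queue and no mutex is needed around the resources themselves.

#include <pthread.h>
#include <unistd.h>

typedef struct {
    void (*fn)(void *);   /* operation to run on the owned resources */
    void *arg;            /* request data, including where to send the reply */
} work_item;

static int foobar_pipe[2];   /* write end = "enqueue to Foobar" */

/* only this thread ever touches the Foobar resources */
static void *foobar_thread(void *unused) {
    work_item item;
    while (read(foobar_pipe[0], &item, sizeof item) == (ssize_t)sizeof item)
        item.fn(item.arg);   /* e.g. do X with foobar, then enqueue "X done" to Blah */
    return NULL;
}

/* called from any other thread; small pipe writes are atomic, so no user mutex */
static void enqueue_to_foobar(void (*fn)(void *), void *arg) {
    work_item item = { fn, arg };
    write(foobar_pipe[1], &item, sizeof item);
}

/* setup: create the queue and its owning thread */
static void foobar_start(void) {
    pthread_t t;
    pipe(foobar_pipe);
    pthread_create(&t, NULL, foobar_thread, NULL);
}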
Empirically, this approach seems to have higher latency but better throughput. Has anybody seen performance comparisons among these kinds of programming models?
Performance really depends on the exact implementation.
In my (private) LoonCAFE project I use an almost fully async API (with optional blocking calls).
Performance is actually significantly better with callbacks and signals overall, and the applications feel much better – especially when things really get heavy.
It would seem that latency would increase, but I have not experienced that at all. The overhead is extremely minimal for callbacks and signals, being no more than the overhead of calling a function – or a simple loop in the case of a signal (signals really serve a different purpose, though).
Which is to say, not much in my implementation. An extra ‘call’ ain’t very expensive.
–The loon
Yes, of course.
Porting things would be more annoying, though.
Fair point, but in the case of this OS project, I mostly aim at showcasing a new, (hopefully) better way to do things, not so much at competing with existing OSs, if that is even possible.
So if I manage to build something that works fantastically well and that could, given some love, become a powerful alternative to existing OSs, then I consider that I’ve succeeded, even if in the end the project never gets that love because it lacks drivers or because porting software is too hard.
We’ve been doing this in Erlang for years. All processes are entirely separate and communicate asynchronously. Synchronized communication can also be achieved by waiting for a response and monitoring the process. All this is already available in the standard library and you just have to write functions like you would in a sequential language, despite everything executing in parallel.
+1 for Erlang!
+11 for Erlang!
Kochise
Been doing it in Silverlight for years. Blocking calls are pretty ancient by any modern developer’s standards.
Is there some added complexity? Yeah. Do you lose some finer control over execution flow? Probably. But it’s really nothing a few signals can’t fix.
You gain a lot more in scalability though.
Blocking calls are still quite common, if only because on the very next line you know you have either the data or an error/exception. Most blocking APIs are really non-blocking APIs wrapped and synched to the response.
That’s the only way things ever work properly in the real world, asynchronous event-based message-passing systems. I don’t ever use blocking calls. I always opt for non-blocking calls. If anything, it forces me to write robust code.
I’ve noticed that the wild assumptions programmers make when writing code based on blocking calls are monumental. The obvious dangerous assumption is that the call will complete successfully, or complete at all! That’s where the cards begin to fall, because if that call does not complete, for 10 million and 1 possible reasons, your program just hangs indefinitely or crashes.
I learned the hard way. Anyone, who has done any kind of desktop application development probably has an endless list of war stories about the perils of blocking calls.
However, I know many developers (usually with a background in web development) who hate callback based programming. They claim it’s hard to understand.
Actually most people working on ‘frontend development’ (javascript, html, css and so on in the browser) are already wired to doing it the event-driven way and they like it.
It is how it was intended to be used in the browser, so it’s probably better not to fight it. There is already enough fighting with browsers.
Like Lennie said, that’s a weird thing to say. Web developers, of all people, are most accustomed to dealing with async behaviour. Javascript is a fundamentally single-threaded event-driven language, and it’s been Javascript developers (via frameworks like Node.js) that have been pushing the async model into server side code in recent years…
It’s still a novel idea in the web arena. Async event-based programming has been the way to program desktops apps for about 2 decades. Web app development is still predominantly synchronous. Async event-based web frameworks are the exceptions, not the norm, in web programming. Remember Javascript only started being abused a few years ago with the hype that “Ajax” spurred.
It is actually pretty hard to live without it; even something like node.js has (I think it was) 250 threads to handle any blocking calls.
There are still some things in the operating systems which hinder doing this everywhere.
Yes, people wrap code around it to mask it in the API/language, but it would be better if it could be done properly.
Actually this is a very old topic. John Ousterhout gave an excellent talk about multithreading versus event handling in 1996: John Ousterhout, Why Threads Are A Bad Idea (for most purposes), invited talk at the USENIX Winter Technical Conference, January 1996, http://www.csd.uoc.gr/~hy527/papers/threads-ousterhout.pdf.
Another interesting paper is this one: Rob von Behren, Jeremy Condit, Eric Brewer, Why Events Are a Bad Idea (for High-Concurrency Servers), 9th Workshop on Hot Topics in Operating Systems (HotOS IX), 2003, http://capriccio.cs.berkeley.edu/pubs/threads-hotos-2003.pdf.
“Actually this is a very old topic.”
Yes, thanks for the links. Interesting.
In the second paper, I give them credit for seriously addressing the criticisms facing MT programming. However, I feel they failed to make a compelling case.
“Criticism: Many attempts to use threads for high concurrency have not performed well. We don’t dispute this criticism; rather, we believe it is an artifact of poor thread implementations…”
Instead of OS threads, they suggest using user-space threads, but it’s a bit of a cop out since it ignores the distinction between blocking IO and heavy CPU utilization.
“Criticism: Thread synchronization mechanisms are too heavyweight. Event systems often claim as an advantage that cooperative multitasking gives them synchronization ‘for free,’ since the runtime system does not need to provide mutexes, handle wait queues, and so on. However, Adya et al. show that this advantage is really due to cooperative multitasking (i.e., no preemption), not events themselves; thus, cooperative thread systems can reap the same benefits”
First, I disagree with the statement that the async model is not responsible for eliminating synchronization; without a doubt it is, because async callbacks are inherently atomic. However, I agree that non-preemptive threads will eliminate the need for synchronization in certain threads, since that code would also be executed atomically on a single CPU.
However I don’t think either of the assumptions will generally hold true with modern MT programming.
“Criticism: Thread stacks are an ineffective way to manage live state. Threaded systems typically face a tradeoff between risking stack overflow and wasting virtual address space on large stacks. Since event systems typically use few threads and unwind the thread stack after each event handler, they avoid this problem. To solve this problem in threaded servers, we propose a mechanism that will enable dynamic stack growth…”
Ok, I’m thinking maybe they want to garbage collect unused stack pages, especially on 64bit systems where virtual memory addresses are abundant…
“Using a compiler analysis, we can provide an upper bound on the amount of stack space needed when calling each function; furthermore, we can determine which call sites may require stack growth. Recursive functions and function pointers produce additional challenges, but these problems can be addressed with further analyses.”
Are these guys serious? In the async model, this problem is naturally avoided. If this is the best argument these guys have, then async clearly wins on the stack argument.
They don’t specifically talk about the race conditions that come about with MT, except to say that compilers should be improved to help with thread safety.
“This result shows that a well-designed thread package can achieve the same scaling behavior as a well-designed event system”
In the end, they can only demonstrate that their MT scales the same as the async code. However…
“The performance degradation for both Knot-A and Haboob is due to the poor scalability of poll(). Using the newer sys_epoll system call with Knot avoids this problem and achieves excellent scalability. However, we have used the poll() result for comparison, since sys_epoll is incompatible with Haboob’s socket library.”
From my reading, it sounds like they did not use the more scalable epoll mechanism for either the threaded or async tests. Obviously this will have a greater negative impact on the async benchmark since poll passes in thousands of sockets per call. This would give their MT version a huge unfair advantage.
Does anyone else read it that way? The epoll version is notably absent from the graph.
Is it possible they deliberately left out an async epoll version because it outperformed their MT version?
This is the whole point of the Node.js programming environment, which has been very successful.
I often prefer to use blocking calls on a background thread. For example, if I have a lot of operations to perform on a slow resource like disk or network. It’s easier to spin a new thread that pulls these requests out of a queue while other threads enqueue them. For a file or a socket you can only write from a single thread at a time, and order matters. Using a blocking call on a background thread makes it easy to ensure only one write is occurring at a time.
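For what it’s worth, here is a rough C sketch of that pattern, assuming POSIX threads (the queue layout and the names enqueue_write and writer_thread are invented): any thread may enqueue a buffer, and one background thread performs the blocking write()s in order, so only one write is ever in flight.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct node { void *buf; size_t len; struct node *next; } node;

static node *head, *tail;
static pthread_mutex_t q_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_ready = PTHREAD_COND_INITIALIZER;
static int out_fd = 1;   /* the file or socket being written to */

/* called by any thread: copy the buffer and append it to the queue */
void enqueue_write(const void *buf, size_t len) {
    node *n = malloc(sizeof *n);
    n->buf = malloc(len);
    memcpy(n->buf, buf, len);
    n->len = len;
    n->next = NULL;
    pthread_mutex_lock(&q_lock);
    if (tail) tail->next = n; else head = n;
    tail = n;
    pthread_cond_signal(&q_ready);
    pthread_mutex_unlock(&q_lock);
}

/* the only thread that writes out_fd; blocking here is harmless */
void *writer_thread(void *unused) {
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!head)
            pthread_cond_wait(&q_ready, &q_lock);
        node *n = head;
        head = n->next;
        if (!head) tail = NULL;
        pthread_mutex_unlock(&q_lock);
        write(out_fd, n->buf, n->len);   /* order preserved, one write at a time */
        free(n->buf);
        free(n);
    }
    return NULL;
}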
I don’t really want to repeat what has already been discussed on the very recent threads about interrupts.
The question doesn’t specifically address the deadlock concern in the blog post, however, which is acquiring a lock before some call and then holding it until after the call completes.
Call stack:
driver A (locks mutex a)
– driver B (locks mutex b)
— driver C
— driver D
—- driver A (deadlocks on mutex a)
We could detect the circular deadlock/livelock condition and attempt to recover, but this is not really trivial. If the OS could infer which threads owned which mutexes, then it could theoretically raise an exception on the call which is responsible for the deadlock condition (so that the second acquire of mutex a could fail, and the second instance of A could return an error).
In reality though, the OS does not know which thread holds a lock, since the application may legitimately have passed it to another thread, making it possible for another non-blocked thread to come in and unlock the mutex – it is program-specific knowledge.
We could try various things like timing out a mutex acquire (in the 2nd instance of A), but this results in unexpected delays and/or failures for the user.
Solutions:
1. We accept that A and B cannot be called recursively, and leave it to the individual drivers to enforce.
2. We add some kind of program specific recursion detection mechanism and return an explicit error.
3. Make A and B re-entrant. Modify the algorithms in A and B to work without a mutex (or at least no mutex across calls to other drivers).
4. We convert the calls to A and B to become queued requests + callbacks (async).
If it’s on another thread then that is not in the same call stack, is it? Otherwise that problem has been successfully mitigated using reentrant mutexes, such as used in Java.
But the problem is really bad where you use sub-threads that access a shared resource.
JAlexoid,
“If it’s on another thread then that is not in the same call stack, is it?”
I think the problem was brought up in terms of one stack, but the situation could happen with more.
“Otherwise that problem has been successfully mitigated using reentrant mutexes, such as used in Java.”
I had never heard of this terminology.
A “reentrant mutex” allows the same thread to acquire a mutex multiple times without blocking.
However I don’t think it solves the problem.
Presumably mutex a is being used to protect a structure which is in an incomplete transitional state (which would make it detrimental for a 2nd instance of A to continue without blocking).
On the other hand, if we know that A is reentrant and it is safe to let other drivers call A again, then the solution would be to just release the mutex before calling the other drivers.
“But the problem is really bad where you use sub-threads that access a shared resource.”
For illustration…
Thread 1:
Driver A (acquire reentrant mutex a)
– Driver B
— Spawn thread 2
— Misc work
— Join thread 2
Thread 2:
Driver B
– Driver A (deadlock on reentrant mutex a)
– exit thread
This scenario isn’t all that unreasonable. I’m not certain where a reentrant mutex would be beneficial. Certainly it could be emulated using thread local storage.
Well you then know not much about how Java and C# synchronisation mechanisms work. If you’re into Java I suggest watching some of Cliff Click’s tech talks on concurrency and memory model.
Reentrant mutex or lock works by allowing the same thread to acquire it more than one time. Also requires it to be released the number of times it was acquired. It essentially has the owning thread + counter.
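As a concrete illustration (a minimal sketch in C with POSIX threads rather than Java): a recursive mutex lets the owning thread lock again without deadlocking, and it must be unlocked as many times as it was locked.

#include <pthread.h>
#include <stdio.h>

int main(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);    /* first acquisition by this thread            */
    pthread_mutex_lock(&m);    /* same thread: succeeds, internal counter = 2 */
    printf("locked twice by the same thread\n");
    pthread_mutex_unlock(&m);  /* counter = 1, still owned                    */
    pthread_mutex_unlock(&m);  /* counter = 0, released                       */

    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

Note that, as discussed above, this only helps when the re-acquisition happens on the owning thread; the two-thread example earlier in the thread would still deadlock.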
JAlexoid,
“Well you then know not much about how Java and C# synchronisation mechanisms work. If you’re into Java I suggest watching some of Cliff Click’s tech talks on concurrency and memory model.”
Well, that’s just not fair; you’re trying to make fun of me and I’m trying to have a serious discussion… I won’t go to your level.
“Reentrant mutex or lock works by allowing the same thread to acquire it more than one time. Also requires it to be released the number of times it was acquired. It essentially has the owning thread + counter.”
You’re echoing what I had said… in any case, it wasn’t applicable to the problem being discussed.
And I take it you don’t plan to address that gap in your knowledge?
JAlexoid,
“And I take it you don’t plan to address that gap in your knowledge?”
What gap in my knowledge? I didn’t know a certain term, so I looked it up before responding. I learned something, so I thank you for having brought it up.
But you are a vicious troll.
Damn. Here I was expecting a nice long discussion on call blocking services, mobile apps for blocking numbers, and whether or not they actually serve a purpose.
Yeah, expecting an operating system article on OSAlert… and yet you have a point.
Heh, when I read the headline, I thought this article was about something entirely different.
Meh – Amiga programmers dealt with that more than 20 years ago. It’s no big deal to an Amiga programmer. Now DOS/Windows or Mac? Forget about it.
Yes, absolutely, and I think non-blocking API calls don’t have to add complexity to software if done right. In fact, they can make things much simpler.
Also, I love the concept of blocks and Grand Central Dispatch that Apple has in Mac OS X and now in iOS as well.
GCD and async I/O go well together.
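For anyone curious what that looks like at the C level, here is a minimal sketch assuming Apple’s libdispatch and the blocks language extension (process_data() and the file path are made up for illustration): the slow work runs on a background queue, and a completion callback is posted back to the main queue.

#include <dispatch/dispatch.h>
#include <stdio.h>

static void process_data(const char *path) {
    /* stand-in for a slow, blocking read */
    printf("reading %s on a background queue\n", path);
}

int main(void) {
    dispatch_queue_t bg = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_async(bg, ^{
        process_data("/tmp/example.dat");                /* off the main thread */
        dispatch_async(dispatch_get_main_queue(), ^{
            printf("done, back on the main queue\n");    /* completion callback */
        });
    });

    dispatch_main();   /* park the main thread and service the main queue */
}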
So could every kernel/embedded programmer, because that’s how those environments have worked just about forever and the closer you get to the hardware the more true that becomes. The problem is that you don’t always control what APIs you need to use, and even in the systems world many people still provide blocking APIs. You can either reinvent whatever complex thing they’re doing, rip it apart to add non-blocking calls, or spin off a thread that can afford to block. It’s ugly, but no uglier than reinventing the wheel – especially when that wheel is doing something specific to a proprietary black box somewhere or is a proprietary black box itself. Any decent framework should encourage async styles of programming (which isn’t to say lame single-dispatcher-thread crap) but allow for sync where it’s necessary.
This type of asynchronicity becomes easy to handle with the concept of the “future” primitive.
http://braddock.com/~braddock/future/
braddock,
“This type of asynchronicity becomes easy to handle with the concept of the ‘future’ primitive.”
That is very interesting; it’d be worth having its own article.
We leave variables undefined in algebra all the time; however, the concept of not-yet-defined variables in a language like C++ is intriguing…
I have to ask, is this functionality different from lazy evaluation in lisp-like languages?
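For reference, here is a rough sketch of how a future could be built in C with POSIX threads; future_t, future_set and future_get are invented names, not the API of the library linked above. The caller only blocks if it asks for the value before the producer has delivered it.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled;
    int             ready;
    void           *value;
} future_t;

void future_init(future_t *f) {
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->filled, NULL);
    f->ready = 0;
    f->value = NULL;
}

/* producer side: publish the result and wake any waiters */
void future_set(future_t *f, void *value) {
    pthread_mutex_lock(&f->lock);
    f->value = value;
    f->ready = 1;
    pthread_cond_broadcast(&f->filled);
    pthread_mutex_unlock(&f->lock);
}

/* consumer side: the only place that may block */
void *future_get(future_t *f) {
    pthread_mutex_lock(&f->lock);
    while (!f->ready)
        pthread_cond_wait(&f->filled, &f->lock);
    void *v = f->value;
    pthread_mutex_unlock(&f->lock);
    return v;
}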
Hi,
My preference is threads (with potentially large “Thread Local Storage” areas), where each thread is responsible for a specific set or sets of data, where only that thread is able to modify its data, and all threads communicate via asynchronous/non-blocking messages.
For a simplified “phone book” example, imagine you’ve got a list of people’s names and phone numbers. You spawn one thread per CPU, and divide the list into sub-lists – e.g. for 4 CPUs you might have one thread for “A to E”, one for “F to M”, one for “N to R” and one for “S to Z” (in practice you’d use hashes but I’m trying to keep it simple). If you want to find a person’s phone number you ask the thread responsible for that part of the sub-list; and if you’re adding a new entry you tell the thread responsible for that sub-list to do it. If you want to search the list (maybe find all people whose names end with “e”) you end up doing a parallel/multi-threaded search (and then combining the results from each thread). The main idea here is to balance the majority of the load across available CPUs.
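A tiny sketch of the routing part of that idea (all names here, owner_of, enqueue_lookup and the worker count, are invented): a stable hash of the name picks the worker that owns the entry, and the lookup becomes a message to that worker rather than a lock.

#include <stddef.h>
#include <stdio.h>

#define NUM_WORKERS 4   /* e.g. one worker thread per CPU */

/* any stable hash of the name will do (djb2-style here) */
static size_t hash_name(const char *name) {
    size_t h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}

/* the worker that owns this entry's partition */
static int owner_of(const char *name) {
    return (int)(hash_name(name) % NUM_WORKERS);
}

/* placeholder: in the real system this would push a message onto the
   owning worker's queue; the reply would come back asynchronously */
static void enqueue_lookup(int worker, const char *name) {
    printf("lookup of \"%s\" routed to worker %d\n", name, worker);
}

int main(void) {
    enqueue_lookup(owner_of("Brendan"), "Brendan");
    enqueue_lookup(owner_of("Alfman"), "Alfman");
    return 0;
}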
Once you’ve got this happening for one set of data, you could reuse the same threads for completely unrelated sets of data. For example, one thread might handle phonebook entries for “A to E”, plus keep track of vehicles whose registration plates end with the digits “0 to 3”, plus handle all bookings where “bookingID % 4 == 0”. The main idea here is to avoid thread switches (after the majority of the load is balanced across CPUs); so that (under heavy load) you end up with one thread per CPU going flat out (with no task switches, no blocking, no waiting to acquire locks, etc).
Because none of the data is shared (other than “read only” data) there’s no locking needed (which can make it a lot simpler to use), it easily adapts itself to any number of CPUs, and it tends to be “naturally good” for data locality issues (tends to be efficient for CPU caches, etc). For all these reasons, it’s relatively easy to get extremely good scalability out of it.
The problem with this sort of approach is that it requires special design – it’s not “procedural” and it’s not “OOP”. This means you can’t port software to this type of system, the normal design tools most people are familiar with (e.g. UML class diagrams, flowcharts, etc) aren’t well suited to describing it, and the number of programmers familiar with it is a lot smaller.
What makes asynchronous/non-blocking messages more complex is the restrictions and the guarantees the system makes. How much data can be sent in a single message (4 dwords, 4 KiB, 4 MiB)? Are sent messages guaranteed to be received (common example of “no” would be network failures in distributed systems)? What guarantees are there for the order messages are received (none, some)?
– Brendan
Brendan,
“What makes asynchronous/non-blocking messages more complex is the restrictions and the guarantees the system makes. How much data can be sent in a single message (4 dwords, 4 KiB, 4 MiB)?”
Why need there be any limit at all, short of memory constraints or policy?
“Are sent messages guaranteed to be received (common example of “no” would be network failures in distributed systems)?”
Locally delivered messages can be guaranteed; network messages would depend on the underlying protocols – but how is this any different between threads/async?
“What guarantees are there for the order messages are received (none, some)?”
A queued async model doesn’t get out-of-sequence messages, so I’m not sure why you’re asking this question.
Hi,
There doesn’t need to be any hard limit.
In practice, if the limit is too small you end up splitting data into many messages and combining the pieces at the receiving end (and possibly pounding the daylights out of the kernel); and if the message size is too big (or unspecified/unlimited) it’s hard for tasks to reserve space for receiving messages, and you risk running out of RAM (e.g. all your free RAM gets tied up in messages in queues).
There’s also compatibility issues. For example, imagine a 64-bit kernel capable of running 64-bit processes and 32-bit processes. With no limit, what happens when a 64-bit process sends a 12 GiB message to a 32-bit process?
Locally delivered messages may not be guaranteed to be received. You send a message to a task, it’s placed in that task’s queue, then that task terminates or crashes before it gets your message from its queue.
It’s using asynchronous messaging and multiple threads; so I’m not sure if that’s different to whatever you think “threads/async” is…
Imagine 3 tasks. Task A sends a message to task B and then sends a message to task C. When task C receives the message from Task A, it sends a message to task B. Can task B receive the message from task C before it receives the message from task A?
In some cases the answer is “yes” – mostly when there’s some sort of intermediate buffering/queuing happening before the message reaches the final task’s queue (e.g. in networking layers).
– Brendan
“if the message size is too big (or unspecified/unlimited) it’s hard for tasks to reserve space for receiving messages”
In my own async library, read calls return a buffer of sufficient size, so the recipient doesn’t need to make any assumptions. But you’re right, it is a problem otherwise.
Your previous post indicated that the message size is more problematic for async compared with threads; I still don’t understand why.
“allocating buffers belongs to, and you risk running out of RAM”
This is possible, yes, but then the OS should be throttling IO as needed.
“With no limit, what happens when a 64-bit process sends a 12 GiB message to a 32-bit process? ”
Fair point, however I’d say this fits under the “limited by memory” constraint. Either way, the same problem exists whether async or threaded.
“Locally delivered messages may not be guaranteed to be received. You send a message to a task, it’s placed in that task’s queue, then that task terminates or crashes before it gets your message from its queue.”
It could be guaranteed in the same sense as a postmaster guarantees delivery of certified mail to someone at the house. If the recipient subsequently dies before reading the mail, that’s not the postmaster’s problem. If this answer is problematic, then it shouldn’t be a problem to implement an in-band response.
Anyways, my response was to your post about failed network delivery.
“Can task B receive the message from task C before it receives the message from task A?”
Yes, but this is an application-level detail. I know of no protocol stack (threaded or async) which guarantees the ordering of messages between three parties in the way you’ve described.
However, in theory, it would be possible for an async IO application/library to defer delivery of C->B until A->B is delivered.
Also, your mention of “tasks” gives me the impression that you’re envisioning some kind of CPU-bound computation, which I do advocate using threads for.
Hi,
No – I only said that (bad) choice of message size can make async more complicated (than async with a different/better choice of message size).
Ok – it fits under the “limited by memory” constraint; but “limited by memory in the all supported systems” rather than “limited by memory in the current system”. For example, if the OS supports systems with as little as 16 MiB of RAM, then you’d probably want to set the maximum message size to 8 MiB or less, even for large systems with 123 GiB of RAM, to make sure software designed for the OS works as designed on all supported systems.
I use “tasks” as a generic name for “processes, threads, fibres or whatever else might make sense for the OS or for the context being discussed”.
Someone posted links to a few papers that explore “threads vs. events” (or “threads vs. messages”). Don’t let them mislead you – those papers are stupid, and sane people would want both (not one or the other).
– Brendan
When implementing a callback-based API, be careful not to fragment the callbacks too much.
Say a user starts an operation on some large data structure and requests notification when that data structure changes. Perhaps the callback function does some expensive tasks, like updating the GUI, etc. The operation itself requires a lot of atomic operations to be performed on the data structure, so it generates a lot of notification events.
If you implemented this naively and updated the user after every change to the data structure (1), the user function would needlessly be called multiple times, consuming a lot of CPU time.
If you notified the user only at the successful completion of the whole task (2), then the user wouldn’t be at all aware of the progress of the operation he started.
What you probably want to do instead is to collect the notifications as in (2) and emit user-code notifications periodically, plus immediately after the whole operation has finished. This probably requires some concurrency in your code, but it can be done cooperatively if your main loop supports timers and idle events.
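As a small sketch of that coalescing (the names are invented, and it assumes an event loop with a periodic timer): changes only set a flag, and the expensive user callback runs at most once per timer tick plus once at completion.

#include <stdbool.h>

static bool pending_changes = false;

/* called for every atomic change to the data structure (cheap) */
void model_changed(void) {
    pending_changes = true;
}

/* called from the periodic timer, and once more when the operation finishes */
void flush_notifications(void (*user_callback)(void)) {
    if (pending_changes) {
        pending_changes = false;
        user_callback();   /* e.g. update the GUI */
    }
}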
I don’t understand: why would the code require a notification each time a data structure changes after having started a big operation on it, instead of asking for a notification once the big operation is over? More importantly, why would they put a very intensive callback function on something as fine-grained as changing a data structure?
If you can prove that this is a valid use case and not poor programming practices at work, a feature that gives notifications a “cooldown time” could indeed be the right answer.
Well, that’s a simple MVC pattern at work, which is also a kind of an asynchronous framework. I wouldn’t call it a bad programming practice, quite the opposite. It does what it is designed for – separating concerns. The idea is that you do what you want to do on the Model and leave the task of updating the View to the framework.
And it works well too, if only you remember not to use the naive implementation for anything but the simplest models.
Point taken
(Partly because of your explanation, and partly because I’ve also thought that preventing notifications from being fired at fast rates prevents local DoS attacks on things which don’t need extreme reactivity)
@ Original Article,
There are actually alternatives to just plain old threaded / async / semi-async (you defined threaded + threshold == semi-async). Non-blocking algorithms of a more radical type exist. I have read that there are people who have successfully implemented system-level stuff using unconventional non-blocking mechanisms, most notably compare-and-swap: http://en.wikipedia.org/wiki/Compare-and-swap
This is actually very useful, especially in conjunction with a client-server model. It would live somewhat more like the async-threaded rather than the threaded-async approach you are looking at.
Let’s use a client-server model analogous to the X.org model. The owner of the resource, the graphics card, is the server. Within this server, there is only one point of I/O: the server’s queue number. Each client wanting to perform an operation first formulates an atomic transaction. Then it looks at the queue number and attaches it to the transaction. It then tries to compare-and-swap the queue number with the server’s: if the number has changed in the meantime, the server has just accepted a transaction from another client, and the client needs to try the queue number again. Only if it manages to update the queue number will it send the transaction details to the server for the server to process at its leisure.
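A minimal sketch of that “grab a queue number” step, using GCC’s __sync compare-and-swap builtin (mentioned later in this thread); the transaction itself and the server side are omitted.

/* the server's published queue number */
static volatile unsigned long queue_number = 0;

/* claim the next slot; retry if another client got there first */
unsigned long claim_ticket(void) {
    for (;;) {
        unsigned long seen = queue_number;
        /* attach 'seen' to the prepared transaction here, then try to
           advance the server's queue number atomically */
        if (__sync_bool_compare_and_swap(&queue_number, seen, seen + 1))
            return seen;   /* we own this slot; now send the transaction */
    }
}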
The original paper (it’s been years since I last saw this, and I cannot find it anymore, I guess) says that, even without the compare-and-swap instruction, this method of setting things up tends to be very simple yet gives a lot better performance than it looks. Collisions are rare even in busy environments. Also, because the resource has its own monolithic server, it can look ahead in the queue and apply sensible algorithms, especially against deadlocking-style problems. Due to its atomicity, it is actually very useful for databases too.
One thing that this really requires is that the malloc and memory-cleaning implementation be amazing. This is because the compare and swap action is most naturally implemented as a linked list variant updating mechanism.
There is no reason whatsoever that, once the client obtains the queue number properly and securely, the server cannot establish the callback to notify the client of progress. It should be done too. The server should also implement its own internal consistent history of transactions — should it decide to reject transactions, something must have the unarguably correct state of the system.
The above manages to implement an async model (sequential handling, even) that deals well with threaded requests. From the clients’ perspective it is as if it is threaded, but it actually implements centralised database-style handling. For really busy systems, this server can then be threaded later. It is rather difficult to avoid a choke-point for this server, but given multi-tasking, the choke-point is actually a plus because we can apply load-balancing algorithms there.
Most CPUs have the ability to do either CAS or LL/SC, with LL/SC much better than CAS, although when push comes to shove, CAS with a fixed algorithm to do it can be good enough (just without the security). Damn legacy x86 with its idiosyncrasies.
xiaokj,
“There are actually alternatives to just plain old threaded / async / semi-async (you defined threaded + threshold == semi-async). Non-blocking algorithms of a more radical type exist. I have read that there are people who have successfully implemented system-level stuff using unconventional non-blocking mechanisms, most notably compare-and-swap”
I don’t know why you call these unconventional; atomics are pretty common for implementing thread-safe algorithms which don’t require OS-level synchronization. Many compilers (GCC) have support for them.
It’s an extremely fine-grained serialization (at the system bus level). I use them often in MT code; however, they’re still expensive since they cannot run at CPU speeds.
I’m not disagreeing with your post, however I’d like to add that atomic CPU operations are only necessary in MT code; an async or other single-threaded model doesn’t need to sync because the possibility for overlap with other threads doesn’t exist in the first place.
Alfman,
I would first like to declare that I am speaking out of my area of expertise. It should be obvious that I am out of touch, so if it is not unconventional, then it should be my error.
However, I find it funny how you say that async model does not need to care. If you have multiple cores, then even if you dedicate one core to the server, then if, in the rare case, two other cores decide to request something, then there will be a need to care, no?
On the other hand, these kinds of atomics, the compare-and-swap and the LL/SC, are hardware accelerated to be (a) non-blocking and (b) interrupt-enabled and (c) runs in one cycle. Why do you claim that they are slower than CPU speed?
Nonetheless, if you combine atomicity and MT, I cannot foresee why a good implementation will not outperform simple threaded and/or async as described. It would be capable of doing all of those, and be well-balanced at the same time.
xiaokj,
“I would first like to declare that I am speaking out of my area of expertise.”
I don’t mind in the least.
“If you have multiple cores, then even if you dedicate one core to the server, then if, in the rare case, two other cores decide to request something, then there will be a need to care, no?”
Consider a process like a web server using async IO mechanisms. From one thread, it waits for connections and IO (with a single blocking call to the OS, like epoll_wait). From there, as requests come in, it dispatches further IO requests on behalf of clients, but since the async thread only schedules IO and doesn’t block, it returns immediately to waiting for more IO. From the userspace perspective, there are no threads or mutexes needed to implement this model.
One simple way to achieve more parallelism is to run two or more instances of this async application, and there still would be no need for a userspace mutex since they’d be running in different processes.
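A bare-bones sketch of that single-threaded loop, assuming Linux epoll (accept and read handling are reduced to comments):

#include <sys/epoll.h>

void event_loop(int listen_fd) {
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        /* the only blocking call: wait for any socket to become ready */
        int n = epoll_wait(ep, ready, 64, -1);
        for (int i = 0; i < n; i++) {
            if (ready[i].data.fd == listen_fd) {
                /* accept the new connection and EPOLL_CTL_ADD it, non-blocking */
            } else {
                /* do the non-blocking read/write on ready[i].data.fd and
                   schedule any further IO; never block here */
            }
        }
    }
}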
“On the other hand, these kinds of atomics, the compare-and-swap and the LL/SC, are hardware accelerated to be (a) non-blocking and (b) interrupt-enabled and (c) runs in one cycle.”
Firstly, my assembly knowledge drops off significantly beyond the x86, so I can’t generalize this to be true for other architectures.
(a) non-blocking is true in the threaded sense, but false at the hardware level.
(b) I don’t know what you mean, but atomics don’t interfere with interrupts.
(c) Runs in one cycle? Where do you get that from?
“Why do you claim that they are slower than CPU speed?”
The CPU exerts a lock signal which prevents other CPUs from using the bus, like a tiny mutex. Since this takes place at the bus level, it also means atomic operations operate at bus speed. And from my hasty measurements, it even takes more than one bus cycle.
But I won’t expect you to take my word for it…here’s a microbenchmark on my laptop. I used gcc -O2 (gcc optimized away the loop into a mul, which is why I used two adds):
int x=0, y=0, i;
for (i=0; i<100000000; i++) {
    x++;
    //y += x;                    // normal add = 0.1s
    __sync_add_and_fetch(&y, x); // lock add = 2.8s
}
printf("%d\n", y);
Anyways, using “y+=x”, the loop ran in .1s and the underlying opcode was just “add”.
Using the gcc atomic function, the compiler emitted “lock add” and the loop executed in 2.8s.
The first can execute in parallel on SMP, the second becomes serialized.
Ideally I’d write a more precise assembly language test, but I think this example is sufficient to demonstrate my claim.
“Nonetheless, if you combine atomicity and MT, I cannot foresee why a good implementation will not outperform simple threaded and/or async as described.”
Here are a few reasons:
Threads are best suited for longer-running tasks, where MT overhead is minimal compared to the other processing. However, IO operations are frequently followed by more IO operations (read/write loops). Very little time is spent in the threads doing real work, and the CPU state context-switching overhead becomes enormous.
The async model can handle the same IO from one thread in a queue with no context switching at all. Using an AIO interface, it’s possible to batch many requests into a single syscall (a rough sketch follows below).
MT is inherently penalized by the need for synchronization overhead; async is not.
Excessive use of MT in certain designs results in unscalable stack utilization (even with enough RAM there’ll be CPU cache performance issues).
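As a rough illustration of that batching (a sketch only, assuming Linux with libaio and linking with -laio; the file, buffer sizes and offsets are made up), two reads are submitted with a single io_submit() call and both completions are collected together:

#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

int main(void) {
    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0)                    /* create the AIO context     */
        return 1;

    int fd = open("/tmp/example.dat", O_RDONLY);
    char *buf0 = malloc(4096), *buf1 = malloc(4096);
    struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };

    io_prep_pread(&cb[0], fd, buf0, 4096, 0);     /* bytes 0..4095              */
    io_prep_pread(&cb[1], fd, buf1, 4096, 4096);  /* bytes 4096..8191           */

    io_submit(ctx, 2, cbs);                       /* both requests, one syscall */

    struct io_event events[2];
    io_getevents(ctx, 2, 2, events, NULL);        /* wait for both completions  */

    io_destroy(ctx);
    return 0;
}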
Alfman,
I kinda get what you mean. I was talking about non-blocking schemes, not blocking schemes, so I think we are a bit out of sync ourselves.
I am basing myself off wikipedia as of right now, and I do not see it saying that CAS blocks at the hardware level. In fact, it does say that CAS is unsafe — it does not even try to block the content being overwritten twice! (ABA problem). As such, I doubt your atomics is exactly the same as my CAS solution.
If my reading is not failing me, the CMPXCHG instruction should be hardware-non-blocking and yet atomic, which is kind of heavenly. The LL/SC is even better — it does that and avoids ABA, at the slight cost of speed. Definitely nowhere near as slow as the “lock add” you are referring to.
The (b) about interrupts is because atomics are usually done on single cores by simply disabling interrupts for the transaction. The CAS is not much faster than that case, but scales a lot better for the many-cores case.
Can you please clarify? I am still of the impression that this CAS runs in one instruction cycle, not at bus speed due to a tiny mutex. Or else it would be kind of defeating the point of such an invention.
xiaokj,
“I am basing myself off wikipedia as of right now, and I do not see it saying that CAS blocks at the hardware level. In fact, it does say that CAS is unsafe — it does not even try to block the content being overwritten twice!”
cmpxchg — no lock
lock cmpxchg — explicit lock
xchg — implied lock
cmpxchg (without lock) is only multithread safe on a single processor (or perhaps a process with affinity set).
You had said:
“…On the other hand, these kinds of atomics, the compare-and-swap and the LL/SC…”
On x86, the cmpxchg instruction must have LOCK asserted in order to be atomic, and I think the wikipedia article agrees.
There are no LL/SC opcodes on x86, and I cannot comment on them.
“The LL/SC is even better — it does that and avoids ABA, at the slight cost of speed.”
The ABA problem can be solved by using a 64-bit cmpxchg on 32-bit pointers, such that the cmp can make sure that the changes we intend to apply are still valid at the point of execution for cmpxchg.
I’d need to look up code to remember the details.
“The (b) about interrupts is because atomics are usually done on single cores by simply disabling interrupts for the transaction. The CAS is not much faster than that case, but scales a lot better for the many-cores case.”
Ok, I see. On a preemptive single processor system, cli can provide atomicity (assuming cpu ring-0).
“Can you please clarify? I am still of the impression that this CAS runs in one instruction cycle, not at bus speed due to a tiny mutex.”
This is true without the lock prefix. Lock is what causes it to stall.
The x86 cache coherency model is complicated, but I believe this is what could happen without a lock.
CPU A – read mem X into cache
CPU B – read mem X into cache
mem X = 10
CPU A cache X = 10/clean
CPU B cache X = 10/clean
CPU A – cmpxchg (cmp = 10 ok, set X=13)
CPU B – cmpxchg (cmp = 10 ok, set X=15)
mem X = 13 or 15? (which got there last?)
CPU A cache X = 13/dirty or 15/clean
CPU B cache X = 15/dirty or 13/clean
Since the cache is operating behind the scenes independently from the processing circuitry, it provides no guarantees that the initial value read is still current at the time of the write.
Therefore, without a lock, the atomicity of one of the cmpxchg instructions is broken.
“Or else it would be kind of defeating the point of such an invention.”
I’d argue that a slow lock is still better than no lock at all. And “slow” is relative, accessing a shared resource is inherently a serial operation which cannot be parallelized – I don’t know how reasonable it is to expect an SMP system to do it for free.
Alfman,
Thanks for the much needed clarification!
Clearly, the atomicity without locking can only happen if the process has processor affinity. Nevertheless, this may be a worthwhile handicap given the multitude of cores we have these days!
There have been operating systems that implement asynchronous I/O in existence for years.
I first programmed one in 1975/6. It was called RSX11-D
The basic IO call was QIO ( Queue !!!!! I/O )
OOTB it was Asynchronous.
If you wanted it synchronous you used the QIOW version
(Queue I/O & Wait).
You would issue a QIO call and some time later either test an event flag or an AST (Asynchronous System Trap) would be fired.
This was carried over into VMS and is still there today.
There are many Async queueing/messaging products around. However many people (more used to WebServices & Unix) find it difficult to accept ‘Fire & Forget’ operations. I use one of these every day. It is called WebSphere MQ or good old MQSeries. You can turn it into a Request/Reply model if you want but with reliable delivery why not drop a message onto a queue and then get on with doing something else.
I studied Queueing Theory as an undergrad in 1975 and wrote a Software Queueing system in 1981 at DEC. It was async in operation. It was embedded in another product.
For me doing things synchronously is … well just wrong. We live in an async world. Why is so much software written using the Synchronous model? Laziness I suppose.
shotsman,
“There have been operating systems that implement asynchronous I/O in existence for years.”
All the posts saying this is “old news” are completely missing the point – nobody’s claiming that it is “new”. If articles were restricted to “news”, then we’d be left with legal articles and product announcements with very little technical merit at all.
All of us know that many models have been exploited in computing history. It’s still a good question to ask which model best fits today, and the fact that there is a debate on these topics is a good thing in my opinion.
That said, thank you for your recollections!
“I first programmed one in 1975/6. It was called RSX11-D”
Was this a parallel computer?
“There are many Async queueing/messaging products around. However many people (more used to WebServices & Unix) find it difficult to accept ‘Fire & Forget’ operations. I use one of these every day. It is called WebSphere MQ or good old MQSeries.”
IBM mainframe computers use this mechanism almost exclusively. BizTalk does it too. I think it’s a powerful and scalable concept which is lost in modern CS courses.
As MQSeries originated on the IBM mainframe, I’d say we are in agreement.
As for 1975/76. I was using a DEC PDP-11.
Both the RSX-11 and RT-11 operating systems were async-I/O based. Writing device drivers for them, and also for VMS, was a really enjoyable experience.
The concept of an AST (Asynchronous System Trap) is really nice. This is a bit of code that gets executed when the I/O you have requested completes.
All of this non-blocking IO being mentioned is great in theory; however, I have seen countless cases of software re-written to be non-blocking and ending up slower.
A high profile example is the NIO connector for Apache Tomcat 6. Using the “new IO” libraries in recent JVM releases, it promised to be faster than JIO (standard Java IO) and APR (Apache httpd IO). In every production environment I have ever benchmarked (and in the O’Reilly Tomcat book), JIO is found to be fastest.
The Tomcat programmers are clearly experienced developers who know what they are doing, yet they failed to make their more complicated non-blocking connector faster…
Most async implementations imply more syscalls. My thought: if reimplementing software to use async IO, benchmark it! Just because your model/algorithm is faster in theory does not mean anything.
Can I live without blocking calls? Sure! But often it is simpler to be blocking and does not impact performance.
voidlogic,
“All of this non-blocking IO being mentioned is great in theory; however, I have seen countless cases of software re-written to be non-blocking and ending up slower.”
I’ve seen it go both ways, so I think some detective work is warranted.
“A high profile example is the NIO connector for Apache Tomcat 6….In every production environment I have ever benchmarked JIO is found to be fastest.”
I can accept that this is true.
“The Tomcat programmers are clearly experienced developers who know what they are doing, yet they failed to make their more complicated non-blocking connector faster… ”
As far as I know, Tomcat is still MT and isn’t using an async model. They merely converted a blocking threaded version to a non-blocking threaded version, and the results got worse. My guess is that they dynamically create threads in the NIO version, which is more expensive than blocking existing threads.
This is just pure speculation on my part, but it could explain why they didn’t get any speedup.
“Most async implementations imply more syscalls.”
Could you elaborate? I don’t think it’s true. In fact the opposite is sometimes true using epoll/libaio interfaces since we can handle several sockets asynchronously in a single syscall.
Linux support for AIO is still immature though.
“My thought: if reimplementing software to use async IO, benchmark it! Just because your model/algorithm is faster in theory does not mean anything.”
Agreed.