Having read the feedback on my previous post on interrupts (itself resulting from an earlier OSAlert Asks item on the subject), I've had a look at the way interrupts work on PowerPC v2.02, SPARC v9, Alpha and IA-64 (Itanium), and I'm contributing this back to anyone who's interested (or willing to report any blatant flaw found in my posts). I've also tried to rework my interrupt handling model a bit, to make it significantly clearer and have it look more like a design doc and less like a code draft.
You forgot 6502 you insensitive clod
(Yes, that was the last time I coded interrupts)
Hmmm…
…
Unfortunately, I'll require kernel/user mode separation and paging capabilities among the minimal CPU feature requirements for my OS, and I don't think the 6502 provided those, so this submission is invalid ^^
(More seriously, I totally could have a look later for the sake of extreme completeness)
http://theosperiment.wordpress.com/2011/05/20/interrupts-on-the-mos…
Happy ? ^^
I am not sure what you mean by PowerPC v2, but there is more than one interrupt handling scheme on PowerPC.
Some (like the e200 cores) use hardware vectoring but can be switched back to SW mode.
The e300/e500 cores do only SW interrupt handling.
IMHO, HW vectoring only pays off if one does not use any kind of (RT)OS.
Anyway, besides the peripheral interrupts, there are always the exceptions.
Ah, and there are sometimes two vectors for peripheral interrupts.
I’ll check once I’m at home and have my stack of manuals at hand, but I think there was something like “PowerPC v2.1” written on the first page of the manual, so I assumed that this architecture had, like ARM, seen several revisions.
Actually the last public Power (now w/o PC) ISA is 2.06.
But it does not describe interrupt handling.
Much of it depends on the implementation, meaning IBM does it differently from Freescale, and Freescale even has different interrupt handling depending on the core.
My first book says "PowerPC User Instruction Set Architecture, Book I, Version 2.02", so I don't have the latest version. This may explain some things.
In Book III, called "Operating Environment Architecture", there's a whole chapter (chapter 5) dedicated to interrupt and exception handling. It notably mentions an "external" interrupt that seems to centralize all the external interrupts managed by implementation-specific hardware.
2.06 distinguishes between the -S(erver) and -E(mbedded) versions of Book III.
Both have different exception handling.
What they have in common is that peripheral interrupts are seen as "external" to the core.
On Book III-E CPUs (Core + interrupt controller), interrupts often can be routed to either the traditional “External” or to the new “Critical” interrupt.
In an OS environment, the critical interrupt can be used to bypass the OS completely, thus running without jitter.
BTW: on ARM cores other than Cortex-M, the behavior is alike. You have the "normal" interrupt and fast interrupts.
Cortex-M is special, as the interrupt controller is part of the core _and_ the core stacks the registers defined as volatile by the ABI (so, ARM argues, you no longer need any assembler).
Alright, I'll mention that. Can you tell me where I can download this book ? I've had some issues finding the handbook on IBM Developer and on Power.org's website, which is probably why I've ended up with an old version.
(Otherwise, I’ll have to ask you some questions about it, like: is there any difference with the old version save for the new “FIQ” interrupt ?)
Here:
http://www.power.org/resources/downloads/
Updated. Thanks for the book !
I am concerned that you are still claiming that running threads is more scalable than a non-threaded approach. It is false. More flexible, yes, more scalable (or better performing), no.
As per usual, I’ll note that threads are best suited for CPU bound tasks with relatively rare synchronization. Popup/Micro-threads doing trivial tasks are typically short-lived with high overhead.
Thread creation and accounting cost can be reduced to the bare minimum, but synchronization primitives will always be an inherent bottleneck since they cannot operate at CPU speeds and must be serialized over the bus. As the number of micro-threads increases, the bus itself becomes a point of contention and thus the entire system loses performance.
Again, I'll refer to the asynchronous model as having solved these problems. By handling events asynchronously without threads, all of the overhead disappears since no blocking/synchronization is necessary. Since only one stack is needed per CPU, it should fit entirely within the CPU's L1 cache.
Tasks can even consume many more CPU cycles and still beat the micro-threaded model.
This doesn’t even get into the prevalence of multi-threaded bugs and the difficulty in debugging them.
If a long running thread is really needed, then the asynchronous code can spawn one off deliberately (and still without blocking).
If you are using threads to achieve some type of kernel protection between stacks, then I wish you’d just say that. Please don’t claim that micro threads are used for scalability when other alternatives are technically more scalable.
I know this is just a rehash of our previous discussions, but it is still a relevant criticism.
Anyways, since you don’t mention it, how do you handle the acknowledgement of interrupts? Is this done implicitly by the global interrupt handler? Or is this done within the thread that handles the interrupt?
The reason I ask is if you ack the int right away, then the hardware could fire off another interrupt and have you create a new thread for the same device before the first thread is serviced.
Do you protect from this somehow or do you leave it up to individual drivers?
If things which can run in parallel run in parallel, you get better scalability than with an all-sequential model like async. That’s why we have threads at all. They are cheap (or more exactly can be made so). Why not use them ?
If the task is not CPU-bound, and we consider IO waiting times that are typically orders of magnitude larger than instruction execution times, the overhead of creating a thread, which is pretty small on modern hardware, will be irrelevant, since creating a thread is essentially a CPU/memory-bound operation.
If the task is CPU-bound, threads offer significantly better performance.
So you either lose nothing or win a lot. I don’t know what’s the problem with that.
If so much synchronization is needed that it has a significant impact on interrupt processing speed, it is indeed better to use async. That’s another reason (apart from the big developer laziness one) why I support both models.
That’s why I talk about ease of programming. Multi-threaded programming is harder. You have to get in the right mindset to get it right. But once you get some basic coding patterns that prevent deadlocks and races, I don’t think there are so many things that can go wrong.
Which implies going back to the kernel, creating a thread, and waiting until the scheduler dares to run it. Why not have the kernel just do it right away ?
Again, how can processing events sequentially be any more scalable than processing them in parallel ?
But yeah, you're right, separate threads have other benefits too, like a stack/CPU state cleanup each time a new event is processed, or a way to introduce "timeout" mechanisms so that if an event takes too much time to be processed, the driver simply reverts the changes and jumps to the next event (effectively reducing the impact of hung drivers).
Yeah, I implicitly included it in "enabling interrupts again". Drivers shouldn't be trusted for that, in my opinion.
Yup, that’s the whole point of using a threaded model at all. If you don’t want this behaviour, you use the async model and run the pop-up threads sequentially.
Drivers have to explicitly make a choice between using a threaded or an async model. If they use a threaded model for an interrupt where races can occur, they know what they're doing and it's up to them to implement the relevant synchronization mechanisms.
“If things which can run in parallel run in parallel, you get better scalability than with an all-sequential model like async.”
If those are your assumptions, then I can understand your conclusions, but your assumptions are pretty weak. Hypothetically I could perform nearly any computation in parallel by dividing it into threads. But doing so does not imply better performance. That depends on the ratio of CPU work versus synchronization overhead.
If I create a MT ray tracer on an 8 core processor, which will perform the best?
A) a new thread for each pixel
B) a new thread for each line
C) 8 threads, processing individual lines from a dispatch queue.
D) 8 threads, processing 1/8th of the image at a time
The answer here is obviously not A or B. For performance it never makes any sense to have more threads than cores.
C and D are probably close, but C could win since some cores might finish their work before others, leaving them idle.
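To make option C concrete, the dispatch queue can be as simple as a shared scanline counter. A minimal pthreads sketch, with a dummy trace_pixel standing in for the real shading work:

#include <pthread.h>
#include <stdio.h>

#define WIDTH  640
#define HEIGHT 480
#define NCORES 8

static float image[HEIGHT][WIDTH];
static int next_line = 0;                           /* the "dispatch queue" is just a counter */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static float trace_pixel(int x, int y) { return (float)(x ^ y); }  /* dummy work */

static void *worker(void *arg) {
    for (;;) {
        pthread_mutex_lock(&queue_lock);            /* the only synchronization point */
        int y = next_line++;
        pthread_mutex_unlock(&queue_lock);
        if (y >= HEIGHT)
            return NULL;
        for (int x = 0; x < WIDTH; x++)             /* a whole line of independent work */
            image[y][x] = trace_pixel(x, y);
    }
}

int main(void) {
    pthread_t t[NCORES];
    for (int i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    printf("done, pixel(0,0) = %f\n", image[0][0]);
    return 0;
}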
“They are cheap (or more exactly can be made so). Why not use them ?”
The point is that it doesn’t make sense to use them just because you can, it makes sense to use them when the tradeoffs point in that direction.
“If the task is not CPU-bound, and we consider IO waiting times that are typically orders of magnitude larger than instruction execution times…”
Fine, then your report should say that: the overhead of threads is outweighed by I/O bound factors.
“If the task is CPU-bound, threads offer significantly better performance.”
This isn't automatically true. It depends on the work/overhead ratio; small workloads especially will suffer greatly when multithreaded, compared to async with no overhead.
It’s additionally possible to run a separate async handler on each core using cpu affinity such that multiple async requests can run in parallel with no synchronization overhead at all.
“If so much synchronization is needed that it has a significant impact on interrupt processing speed, it is indeed better to use async.”
“Significant” is a judgment call I leave to you. I just object to your claim that threaded is more scalable.
“Which implies going back to the kernel, creating a thread, and waiting until the scheduler dares to run it. Why not have the kernel just do it right away?”
Firstly, most I/O operations will not need threads in the first place, async is already sufficient, why go through the overhead when it’s not needed. Secondly if you do have a CPU bound operation, then the cost of a syscall should be fairly negligible. Thirdly, the cost of starting a microthread should be no worse in this scenario than when you start it by default (although I realize this is not strictly true for you since you’re using a microkernel which is subject to additional IPC overhead).
“Again, how can processing events sequentially be any more scalable than processing them in parallel?”
You can’t just put it in a thread and expect it to run faster.
The overhead for many small threads adds up compared to bigger threads which do more work.
If it costs 3000 cycles to send a thread to another CPU and another 3000 cycles to retrieve the results (thread creation+data transfer), then computations under 12000 cycles are probably better off being run serially on one cpu. Not only is it faster in real time, but it frees the bus for useful work on the other cpu.
You may shave off cycles here and there, but have you ever asked the question “do my tasks actually need threads?”
“Drivers have to explicitly make a choice between using a threaded or an async model.”
Ok, but will you get the full async performance benefits if you're trying to emulate it inside a threaded model? And it still bothers me that you characterized the async model as worse for performance.
“Yup, that’s the whole point of using a threaded model at all.”
Forgive me if I’m pointing out something you already considered, but my point was the fact that you are pre-emptively acknowledging the interrupt means that the drivers have to protect themselves from re-entrant (or double threaded in your model) IRQ behavior which would not normally be possible otherwise.
In other words, the implicit ack requires drivers to have more synchronization than would otherwise be necessary if they explicitly re-enabled their interrupt when ready. It may not be a big deal, but it’s still several hundred/thousand cpu cycles wasted every IRQ.
The thing that baffles me about your model for drivers, is that it emphasizes threads which will rarely, if ever, be used without a mutex to serialize them again. If one driver forgets a mutex and accesses a device simultaneously from multiple CPUs, it is extremely likely to be a bug rather than a deliberate action.
Good point, I should try to avoid having more running threads than there are available CPUs. I like this raytracer example a lot, except that I'd argue that in this case the synchronization overhead is close to zero since IIRC raytracers render individual pixels independently. Threading overhead here will be mostly caused by the cost of context switches and thread creation/annihilation. But that's totally nitpicking.
No, what leaves me wondering here is, how I/O calls can be managed in this model.
That's because, you see, if you have 16 running threads, an 8-core processor, and suddenly 4 threads have to do I/O, then they can simply use blocking calls without worrying, because 4 of the 8 threads that weren't running will take their place.
Now, if you have 8 running threads and 8 tasks waiting in the dispatch queue, and suddenly 4 threads have to do I/O, then for optimal performance you need to replace these 4 threads with 4 new number-crunching threads from the dispatch queues’ task.
In this case, we have some overhead :
1) Threads must synchronize with each other so that they don't start 4 new threads with the same pending task, in a textbook example of a race.
2) 4 new threads must be created, which implies some context switching overhead that could have been avoided if all threads were created at a time when the kernel was running.
3) After a while, the I/O task is completed, and our 4 old threads are ready to run again. We are now dealing with 12 threads, so at this point 4 threads must be silenced for optimal performance. How do we choose them ?
To solve problems 1) and 2), having lots of threads created from the start sounds like the most sensible option, if threads are sufficiently cheap. Well, I have to define "cheap", don't I ? Let's do it : threads that are not running cost CPU time when they are created, and take some room in RAM because of OS management structures (a few bytes, negligible on modern computers) but also because of their stack. The CPU time cost can't be avoided, so creating lots of threads is only interesting for tasks which run for a sufficiently long time, as you point out. But the RAM cost can totally be avoided for threads that have never run, by having the stack allocated at run time in a COW-like mechanism.
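As a rough illustration of what "allocated at run time" could mean on a Linux-like system (strictly speaking this is demand paging rather than COW, but the effect on RAM usage is the same), address space for a pop-up thread's stack can be reserved without committing any physical pages:

#include <sys/mman.h>
#include <stdio.h>

#define STACK_SIZE (256 * 1024)

/* Reserve address space for a pop-up thread's stack without committing RAM:
 * physical pages only get allocated when the thread actually touches them,
 * so a thread that has never run costs (almost) no memory. */
static void *reserve_stack(void) {
    void *stk = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return stk == MAP_FAILED ? NULL : stk;
}

int main(void) {
    void *stk = reserve_stack();
    printf("stack reserved at %p, no pages committed yet\n", stk);
    return 0;
}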
An interesting experiment to do would be a mix of my two pop-up thread models, where among the threads created, only <insert number of cores here> are allowed to run from the start, and when a thread blocks the scheduler automatically picks another one from the queue.
3) is more interesting. I think the older threads should run first, because otherwise the number of threads which have run (and, as such, have a stack) is going to grow, wasting RAM.
The thing is, not all interrupts are equal from that point of view. As an example, processing keyboard or mouse interrupts is nearly pure I/O. But processing a stream of multimedia data can require significantly more number crunching, to the point where the amount of I/O is negligible, especially if DMA is being used.
Sorry, did not understand this one. Can you please elaborate ?
Indeed, for pure I/O operations, it makes more sense to use pure async. As an example, an HDD driver is totally something which should be written in an async fashion, with a message queue.
I have to do more design work on this side.
Very fair point.
There is a difference in design philosophy, though : who carries the burden of creating and dispatching threads ? In one case, it’s a third-party developer, in the other case it’s the operating system. I’d spontaneously believe that the second solution would result in more people considering the threaded option, since it costs them less.
I went a bit the other way around : threads sounded like the cleanest multicore-friendly way to implement the model of processes providing services to each other which I aim at, and then I've tried to apply this model to interrupt processing and drivers.
If by performance benefits you mean the fact that async has only one task running at a time, and as such doesn't have to care about synchronization and keeps the cost of pending tasks minimal, then yes, this model may provide that. If you wonder whether the total cost of starting an interrupt processing job is better or worse than in a pure async model where you just send a message to the dispatcher's entry pipe, then I can't answer for sure, but I think that both models have advantages and drawbacks.
Having an external dispatcher in the driver process implies a slight kernel-driver communication protocol overhead which should make emulated async more powerful when the driver is initially idle. On the other hand, when the driver is initially busy, sending a message in a pipe is faster than creating an inactive pop-up thread.
Indeed, that part was bad. I wrote it too late in the evening, and didn't take the time to write an in-depth comparison of both. I wasn't talking about performance, though, but about scalability.
The assumption behind this is that in many situations, the order in which things are processed only matters in the end, when the processing results are sent to higher-level layers. Like rendering a GUI : you can render individual controls in parallel, and blit them in parallel as long as they don't overlap; it's only at the time of the final copy to the framebuffer that synchronization is needed.
I don’t know how much it holds true for interrupt processing, though.
I’m suffering from exhaustion at this point, but I’ll try to respond quickly:
As to the number of threads versus cores, you could distinguish between “scheduled threads” and “active threads”. To minimize thrashing, a scheduled thread should not start until there is a free core. Unless of course there is a deadline/priority scheduler with a pending deadline on a higher priority thread.
Hmm, I think you’ve got it figured out already.
“The thing is, not all interrupts are equal from that point of view. As an example, processing keyboard or mouse interrupts is nearly pure I/O.”
I don’t think the keyboard will be a problem.
“But processing a stream of multimedia data can require significantly more number crunching, to the point where the amount of I/O is negligible, especially if DMA is being used.”
A surprising amount of IO can be done without much CPU effort: copying files between disks, sending files over the network (with adapters supporting multi-address DMA and checksumming). The CPU can get away with communicating memory addresses instead of actual bytes.
However, in cases where the driver has to do more with the data (1000s of cycles), then there is no doubt a thread is necessary to prevent latency of other pending requests.
Interestingly the x86 interrupt hardware already has a form of prioritization without threads. Low IRQs can interrupt even when higher IRQs haven’t been acked yet, the inverse is not true. So even long running IRQs at low priority are theoretically ok since they’ll never interfere with high priority IRQs.
…but the problem is that the driver code is no longer atomic, and therefore must be re-entrant. It becomes impossible to use a shared critical resource from two handlers. The second interrupt obviously cannot block, since doing so would block the first interrupt holding the critical resource as well. So, while this has its place in extremely low latency applications, I doubt it's helpful for you.
“Sorry, did not understand this one. Can you please elaborate ?”
Consider an async model on a single processor. Now multiply that by X processors running X async handlers.
Now, you can map each device to raise interrupts on a specific CPU, also give the driver affinity to that CPU. Now when an interrupt for the device occurs, the driver can safely assume an async model for all internal resources without any synchronization.
Assuming the devices are well distributed (statically) between CPUs, then they’ll be serviced in parallel.
The requirement to route the IRQs to a specific CPU may not be strictly necessary, but it simplifies the example and reduces latency caused by transferring the request. This could even be true for a threaded model?
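On Linux, pinning one handler per core boils down to something like this (just a sketch; run_async_loop stands in for the per-core event loop, which is not shown):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Placeholder for the per-core event loop: it would drain that core's own
 * queue of device events, so it never needs a lock on its internal state. */
static void *run_async_loop(void *arg) {
    printf("async handler running, pinned to CPU %d\n", (int)(long)arg);
    return NULL;
}

static void start_handler_on_cpu(int cpu) {
    pthread_t t;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_create(&t, NULL, run_async_loop, (void *)(long)cpu);
    /* Pin the handler to its core; devices whose IRQs are routed to this
     * core are then serviced without touching any other handler's state. */
    pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(void) {
    for (int cpu = 0; cpu < 4; cpu++)   /* one handler per core */
        start_handler_on_cpu(cpu);
    pthread_exit(NULL);                 /* keep the process alive for the handlers */
}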
“who carries the burden of creating and dispatching threads ? In one case, it’s a third-party developer, in the other case it’s the operating system. I’d spontaneously believe that the second solution would result in more people considering the threaded option, since it costs them less.”
Well, there’s no denying that last point. Unless you allow userspace threads which don’t require a syscall?
“I went a bit the other way around : threads sounded like the cleanest multicore-friendly way to implement the model of processes providing services to each other which I aim at, and then I've tried to apply this model to interrupt processing and drivers.”
I think a clean & consistent design is a good reason to use light threads. However, part of the difficulty in providing an async userspace interface under linux (which is still a mess) was the fact that the kernel needed threads internally to handle blocking calls, even though there were no associated userspace threads to block. It's doable but not clean.
You hit some good points further down, but I’m so tired…I’d better not tackle them today.
Yup, I can’t see a reason why processing IRQs on different processors couldn’t just as well be done with a threaded model. The problem will be to balance the IRQ load fairly across all processors, I think. Some IRQs fire very frequently (clock), some never happen (MachineCheck), but many interrupts happen frequently at a moment and then stop happening for minutes or hours. As a silly example, sound card interrupts (buffer is full/empty) only occur when you play or record sound.
Well, it's not as if I could forbid them ^^ But looking at Tanenbaum's book (my OS development bible), they seem to have several issues. Like, you know, the fact that you cannot use clock interrupts to distribute time fairly across them, or the fact that if a userspace thread blocks for I/O, all other userspace threads are blocked for I/O at the same time, since it's the same thread from the kernel's point of view.
I'd be happy to read this. In the meantime, I've done my homework too : here's a seriously written, in-depth explanation of why I use a microkernel, RPC as the main IPC method, etc., ending with a discussion of the threaded model and asynchronous model based on our points in these comments, if you want to have a look…
http://theosperiment.wordpress.com/2011/05/21/clarifications-on-pro…
Neolander,
“If by performance benefits you mean the fact that async has only one task running at a time, and as such doesn't have to care about synchronization and keeps the cost of pending tasks minimal, then yes, this model may provide that.”
The thing is, you are going to have async and threaded drivers running side by side. This implies that even the async drivers will need to use synchronization when accessing shared resources.
I'll throw this out as an example:
Two userspace apps read from two different files. In the threaded model, this results in two user+kernel threads blocking in the file system driver, which itself has blocked in the block device driver.
Ignoring all the synchronization needed in the FS driver (and writes), the block driver (or the cache handler) must/should create a mutex around the blocks being read so that other threads requesting the same blocks are blocked until it is read. I think such a structure would require at least two mutexes, one for the specific blocks being read, and another for the structure itself.
Therefore a thread reading a cached block would only need to synchronize against one mutex, find its data, and return immediately.
A thread reading an uncached block would synchronize against the structure mutex, latch onto a new or existing block read mutex, queue the disk request (if new) and release the structure mutex.
This way the structure mutex is only ever held momentarily, and the read mutex is held until the disk reads a block. After which all blocked read threads can resume.
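A rough sketch of that scheme, with the cache lookup/insert and the actual disk read left as placeholder helpers:

#include <pthread.h>

typedef struct cache_block {
    unsigned long   lba;          /* block number */
    char            data[4096];
    pthread_mutex_t read_lock;    /* held while the disk read is in flight */
    struct cache_block *next;
} cache_block;

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;   /* the "structure" mutex */

/* Placeholders for the parts that don't matter here. */
cache_block *cache_lookup(unsigned long lba);    /* NULL if not cached */
cache_block *cache_insert(unsigned long lba);    /* adds an empty entry */
void         disk_read(unsigned long lba, char *buf);

cache_block *read_block(unsigned long lba) {
    pthread_mutex_lock(&cache_lock);              /* held only momentarily */
    cache_block *b = cache_lookup(lba);
    if (b) {
        pthread_mutex_unlock(&cache_lock);
        pthread_mutex_lock(&b->read_lock);        /* waits only if a read is still in flight */
        pthread_mutex_unlock(&b->read_lock);
        return b;                                 /* cached: one lock round-trip and done */
    }
    b = cache_insert(lba);
    pthread_mutex_lock(&b->read_lock);            /* latch the block before exposing it */
    pthread_mutex_unlock(&cache_lock);
    disk_read(lba, b->data);                      /* only this thread queues the disk request */
    pthread_mutex_unlock(&b->read_lock);          /* wakes everyone blocked on this block */
    return b;
}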
I expect that this is more or less what you’ll be writing?
Now, my point about async drivers is that zero synchronization is needed. It’s true, that requests to this driver will be forced to be serialized.
However:
1) Many drivers, like the one described above, need a mutex to serialize requests anyways. (This could be mitigated by dividing the disk/structure into 4 regions so that concurrent threads are less likely to bump into each other, however then you need another mutex to serialize the requests to disk since IO to one device cannot be executed from two CPUs simultaneously).
2) If the driver’s primary role is scheduling DMA IO and twiddling memory structures with very little computation, then these tasks are not really suitable for parallelization since the overhead of synchronization exceeds the cost of just doing the work serially.
“The assumption behind this is that in many situations, the order in which things are processed only matters in the end, when the processing results are sent to higher-level layers. Like rendering a GUI…”
Yes, I won't deny that some layers will benefit from parallelism. However, propagating that parallelism into drivers which are fundamentally serial in nature will make those drivers more complex and could even slow them down. These drivers will require thread synchronizations when an async model could handle its state without synchronizations (more on this later for the other poster).
I’d like to emphasize that I’m not arguing against the threaded model, particularly when they make or break the design of the paradigm. I’m just trying to suggest that sometimes there are cases where the best MT implementation performs worse than the best ST implementation, and device drivers may be one of those cases.
Sorry for the delay, I have been too sick for complex thinking during the last few days.
Actually, you have this problem even if you have several distinct async drivers running side by side. As soon as there are shared resources, there is a synchronization overhead.
But I'm a bit curious about why separate drivers would need to share many resources with each other. It seems that you provide an example below, though, so I'm reading a bit further.
Aggggghhhhhh… Disk I/O in drivers ! u_u Banish these impure thoughts from your head before becoming a fiendish servant of Satan ! You can still see the light !
Joking aside, I think that anything that requires some reactivity (and most drivers are included) should never, ever, depend on blocking file I/O. Nonblocking file I/O (like writing things in a log without caring when it's actually written to disk) is totally okay, on the other hand, but in this case we wouldn't have so many synchronization problems.
I've already admitted before that for something as I/O-centric as a disk/block device driver, where there is only a little processing involved, queuing events in an async model is probably best.
You see, this is actually something I’m wondering about. Are all drivers fundamentally serial in nature and doing little processing ?
Some time ago, I was wondering about an efficient way to do an FTIR touchscreen driver. In case you've not heard about those, their sensitive layer uses infrared light trapped within a piece of glass through total internal reflection, which can only be scattered outside of the glass when something (like, say, a finger) comes in contact with it. A webcam captures the infrared image; everything from that point must be done in software.
So we have to do a webcam driver, a blob detection/tracking system, and some kind of real time video output.
Now, I don’t know what kind of data comes out of a webcam, but I guess it varies from one webcam to another. Maybe some will output raw bitmap pictures, some will output jpeg frames, and some will output MPEG-ish videos with i-frames and p-frames. Most peripherals send an interrupt when their output buffer is full, so we do not know how many frames will have to be processed at once (especially for variable-sized frames like in MPEG video).
Suppose that our blob detection/tracking algorithm works with bitmap data. Then there is some amount of image conversion involved. In case the frames are sufficiently independent from each other (not guaranteed for MPEG, but certainly the case for JPEG), it's better to do the decoding in parallel, because it's essentially a CPU-intensive operation, and nothing scales better on multiple cores than independent operations.
Blob detection and tracking might work first by locating blobs in the initial pictures the brute-force way (thresholding + noise removal + locating sufficiently large packs of pixels in the whole picture) and then by only looking for the blobs in a region around their former positions, based on their estimated speed. Since no writes are involved, no synchronization is needed, and looking for blobs in these slices of pictures can be done on a "one-thread-per-slice" basis, with communication between threads only needed when such slices overlap each other and it's difficult to know to which slice a blob belongs.
“Actually, you have this problem even if you have several distinct async drivers running side by side. As soon as there are shared resources, there is a synchronization overhead.”
I wasn’t careful with my wording, we can run many instances of the async model within multiple threads and/or processes, which can communicate to each other using lock free message queues. So although they can share certain resources this way, it’s not a shared resource in the traditional “I hold the mutex” sense.
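For that kind of channel, a single-producer/single-consumer ring buffer is enough; a minimal sketch using C11 atomics (not my actual code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 256   /* must be a power of two */

typedef struct {
    void *slots[QCAP];
    _Atomic size_t head;   /* only advanced by the consumer */
    _Atomic size_t tail;   /* only advanced by the producer */
} spsc_queue;

/* Producer side: returns false if the queue is full. */
static bool spsc_push(spsc_queue *q, void *msg) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QCAP)
        return false;
    q->slots[tail % QCAP] = msg;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL if the queue is empty. */
static void *spsc_pop(spsc_queue *q) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return NULL;
    void *msg = q->slots[head % QCAP];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return msg;
}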
“Joking aside, I think that anything that requires some reactivity (and most drivers are included) should never, ever, depend on blocking file I/O.”
I agree with you, but you do realize that this is how nearly all IO is handled in linux?
“Are all drivers fundamentally serial in nature and doing little processing ?”
I was responding to your comment about using MT graphics rendering, which I agree could benefit. However many devices are inherently serial.
“Some time ago, I was wondering about an efficient way to do an FTIR touchscreen driver.”
Hey I built one myself! Drilled LED holes into a picture frame, except I used visible light.
“Blob detection and tracking might work first by locating blobs in the initial pictures the brute-force way”
That’s a delightful project, I built something very similar in college, but they took down the project page. This is exactly the kind of thing I wanted to do with my degree!
“Interesting problem, actually. I think once one thread starts to hog the memory bus like this, we’re doomed anyway”.
Yes, but the point of my example was that IO bound processes don’t benefit by running in parallel since they can displace CPU bound processes which do run in parallel. We already seem to agree on this point.
I hope I’m not annoying you too much, that’s not my intent.
If you can demonstrate that threads make your paradigm better, then go for it.
My main gripe is that, after all our talks, you still used blanket statements like the following:
“The former [threaded operation] is obviously significantly better for scalability…”
No, no. Feedback like yours is extremely precious, because I'm one of those stubborn guys who will only listen to criticism with more than one ear when it includes argumentation and claim backing.
As said above, it initially appeared to me that pop-up threads were an extremely good option in the context of processes providing independent services to each other. What I'm trying to do here is to keep the design complexity low by creating a universal yet simple paradigm that can be used to solve an insane lot of problems at once, kind of like Unix systems make everything revolve around text files and pipes.
Apologies for that one. As said above, I have to rewrite this part anyway, as it lacks completeness. This pop-up thread thing should have a full dedicated article anyway, considering how extensively I want to use it, not just a two-paragraph description in an article about interrupt handling.
This site has some interesting benchmarks for a couple low level synchronization primitives.
http://www.spiral.net/software/barrier.html
The obvious locking mechanism on x86 is to use a "lock bts" instruction. According to the above, this consumes 530 cycles on a modern dual core processor, and 1130 cycles on a 4 core multiprocessor system.
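For reference, a "lock bts" based spinlock is only a few lines of GCC inline assembly (x86 only, and just an illustration, not the code they benchmarked):

/* Minimal x86 test-and-set spinlock built on "lock bts" (sketch only). */
static inline int try_lock(volatile unsigned int *lock) {
    unsigned char was_set;
    __asm__ __volatile__(
        "lock btsl $0, %0\n\t"   /* atomically set bit 0, old value goes to CF */
        "setc %1"                /* was_set = CF */
        : "+m"(*lock), "=q"(was_set)
        :
        : "memory", "cc");
    return !was_set;             /* we got the lock iff the bit was previously clear */
}

static inline void spin_lock(volatile unsigned int *lock) {
    while (!try_lock(lock))
        __asm__ __volatile__("pause");       /* be polite to the other hardware thread */
}

static inline void spin_unlock(volatile unsigned int *lock) {
    __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
    *lock = 0;                               /* a plain store is enough to release on x86 */
}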
They’ve proposed another algorithm, which is incidentally the same algorithm sometimes used to synchronize multiple clients over a SMB file system.
For example, Quicken Pro maintains its database on a file share without actually running a "server".
Anyways, I am surprised they were able to achieve performance gains this way. 220 cycles on a dual core and 1130 cycles on a 4 core multiprocessor system.
It relies on the cache coherency characteristics behind the x86, which I have no idea how portable it is across architectures.
You may want to study the approach, however I don’t think it will be useful because the algorithm is dependent upon a fixed number of threads (and presumably it gets worse as that fixed number is increased).
Anyways, I still argue that a lock free asynchronous model without any threads will easily outperform the “popup thread” model, particularly for simple tasks which consume 1-2K cycles.
I don’t care which way you go, it probably won’t matter either way, but please stop clinging to the view that threads are more scalable.
This is a very interesting discussion thread, but I’m wondering what kind of model you envision when you talk of a lock-free, asynchronous, thread-less model. Can you point me to a description or more elaborate discussion of such a model?
jal_,
“This is a very interesting discussion thread, but I’m wondering what kind of model you envision when you talk of a lock-free, asynchronous, thread-less model. Can you point me to a description or more elaborate discussion of such a model? ”
More than a vision… I’ve actually implemented it in userspace on linux, and working on a windows port now, although I haven’t made the project public.
Technically, my vision is to convert all calls which either block or could take a long time into asynchronous callback mechanisms.
So basically these function calls below are from my API.
When any of the tasks complete, the registered callback functions are invoked, all within a single event-oriented thread.
async_file_create(&file, &file_cb);
async_file_open(&file, "async.test", ASYNC_FILE_READ | ASYNC_FILE_WRITE | ASYNC_FILE_CREATE | ASYNC_FILE_TRUNCATE);
async_process_create(&process, &process_cb);
async_process_setup_stdout(&process, &process_out_cb);
async_process_setup_stdin(&process, &process_in_cb);
async_process_execute(&process, "ls -l /");
async_timer_create(&timer1, &timer_callback);
async_timer_schedule(&timer1, 500);
async_resolve_create(&resolve, &resolve_cb);
async_resolve_lookup_host(&resolve, "av.com", ASYNC_ADDRESS_IP4);
async_address_create(&remote_address);
async_address_set(&remote_address, "127.0.0.1", 5555, ASYNC_ADDRESS_AUTO);
async_tcplisten_create(&lsock, &lsock_cb);
async_tcplisten_listen(&lsock, &remote_address);
async_address_destroy(&remote_address);
So, above, all those actions are performed simultaneously, and they callback when complete.
One callback could look something like this (sorry about OS News formatting bugs):
void resolve_cb(ASYNC_RESOLVE *resolve) {
    char buffer[128];
    ASYNC_ADDRESS address;
    async_address_create(&address);
    while(async_resolve_get_address(resolve, &address)) {
        printf("Address %s\n", async_address_get_ip(&address, buffer, sizeof(buffer)));
    }
    async_address_destroy(&address);
    async_resolve_destroy(resolve);
}
The thing which I find so cool about this model is that two completely independent processes (or applications for that matter) can be run together inside this model without interfering with each other. Nothing (apart from the main handler) ever blocks.
Threads are supported for long running tasks, and when the threads complete their work, they invoke a callback into the async handler much like any other blocking call.
Unfortunately, under linux, there are many blocking calls lacking an async interface and are implemented using threads in the kernel. It forces async userspace apps to wrap threads around certain kernel calls for no good reason other than the kernel requiring it. I consider it shortsighted, but many people would bash my head in if I say that too loudly.
Instead of burning through threads quickly, my implementation caches minimal proxy threads dedicated for this purpose.
Still, in this model proxy threads are unnecessary overhead which serve absolutely no purpose other than getting around blocking calls.
I think this model would work very well in the kernel itself. Instead of blocking a thread, a request could just register a callback and be done until the callback occurs.
The async model has one extremely important property which is inherently difficult for threads: well defined cancellation behavior. I can easily cancel an async IO request, canceling a blocked thread turns out to be extremely problematic.
In fact, pthreads has special locks programmers should use to ensure the cancellation doesn’t cause corruption. Unfortunately for my model, this means thread cancellation itself can block and there’s nothing I can safely do about it.
Even to this day, this is the cause of many aggravating linux kernel CIFS lockups where one cannot kill a hung connection, but has to wait for kernel threads to timeout.
Ah well, I can’t have everything.
One clarification…
My post may have given the impression that a thread is used for all syscalls, which is not the case, only for those which may block.
For example, linux sockets support non-blocking IO, which I make use of.
Linux file IO blocks, forcing me to use threads for those syscalls if I want to continue handling other requests asynchronously.
Here’s an interesting link describing an attack against apache caused by the fact that it uses a thread per connection. IIS is not subject to the same attack because it’s asynchronous.
It may not be a completely fair comparison, but then it does suggest something about the scalability of each approach.
http://lwn.net/Articles/338407/
Actually, I still don't quite understand how this works, but it seems to me that the problem is that there is a limitation on the number of connections which Apache can simultaneously keep open (or on the number of running threads, whichever is smaller). So for the comparison to be fair, IIS should have a similar limitation on the number of pending tasks.
And even then, the Apache guys obviously don't have a good metric for detecting whether a connection is active or not here. If someone just keeps a connection open and does nothing with it, I'll simply ban that someone as soon as new incoming requests start coming. To follow the shop analogy : a cashier will simply be irritated by someone coming and paying with pennies if no one else is there, but if there are lots of people waiting in the queue, he'll use that excuse to either have the person pay in a faster way or kick them out without their articles.
Alfman, thanks for the elaborate answer. What I'm wondering about, though, is how you sync all these asynchronous calls that have a linear dependency. E.g. when opening a file, then reading, then closing it asynchronously: since the asynchronous nature does not say anything about order of execution, you could have an attempt to read an unopened file, or a file that has been closed before it has been read. So you'd need some synchronization primitives, which typically means blocking (or spinning, which may be as bad).
Also, in your resolve_cb, there's a while-loop that tests for the result of an async call. Since it's async, it cannot actually wait for the result of the call, so all it can do is check the result of the invocation of the call, right? Or do you have some mechanism that blocks the while-loop thread until an answer arrives? I guess not, as that more or less defeats the purpose?
jal_,
“What I'm wondering about, though, is how you sync all these asynchronous calls that have a linear dependency. E.g. when opening a file, then reading, then closing it asynchronously: since the asynchronous nature does not say anything about order of execution”
That’s a good question, although somewhat complex to describe abstractly.
Remember that async functions can be called synchronously at any time, since they never block by design.
If there is a dependency between the start of an async request and the result of another, then the second async request should be executed within the callback of the first instead of simultaneously.
All async objects have state within them indicating whether they are busy or ready if that is needed. However since the callbacks only trigger on transition from busy to ready, it’s not typically necessary for an explicit check.
“you could have an attempt to read an unopened file”
In this case, the file open callback tells you that the file is open (or that there was an error). It’s not until this point that it is valid to read the file.
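Schematically (exact function names aside, since the API isn't published), chaining the read onto the open looks like this:

/* Schematic only: async_file_read, async_file_error and async_file_destroy
   stand in for whatever the real calls are named. */
static char read_buffer[4096];

void file_cb(ASYNC_FILE *file) {
    /* We only get here once the open has completed (or failed), so it is
       now, and only now, valid to queue the dependent read. */
    if (async_file_error(file)) {
        async_file_destroy(file);
        return;
    }
    async_file_read(file, read_buffer, sizeof(read_buffer));
    /* file_cb fires again when the read completes; nothing ever blocks. */
}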
“or a file that has been closed before it has been read.”
While I actually do have an event for file closed by the OS, this is actually not supported by *nix or windows. What happens is a read/write failure instead. The file is always closed by the user code, therefor it should never happen at an unanticipated time.
“So you’d need some synchronization primitives, which typically means blocking (or spinning, which may be as bad).”
When an IO event occurs, we handle it immediately in our callback. If that means initiating another async IO request, then it is done without blocking. If it means initiating a CPU bound task, then we fire off a thread. Then our async callback function returns without ever having blocked. So we’re always ready to handle more IO. Since I use eventfd and pollfd to retrieve IO status from the kernel, I can even retrieve several IO events with a single kernel syscall.
I don’t know if I addressed your concern or not?
There are no thread synchronization primitives necessary, since every async callback can safely assume that no other async callbacks are running in parallel. The state of the objects it sees will never change behind its back, so there's no need for a mutex to protect anything.
It’s often referred to as a giant state machine, which sounds complex, and it could be if it were designed like a kitchen sink win32 event loop. My implementations encapsulates all that complexity such that tasks don’t need to know or care about other tasks, only handling the IO they are concerned with. So the code is the same if it’s handling one client or 10K.
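Stripped of all the encapsulation, the dispatch loop at the heart of it has roughly this shape (a sketch around poll(), not my actual implementation):

#include <poll.h>
#include <stdio.h>

/* Placeholder: would look up the descriptor's registered callback and run it. */
static void handle_ready(int fd) { printf("fd %d ready\n", fd); }

static void event_loop(struct pollfd *fds, int nfds) {
    for (;;) {
        int n = poll(fds, nfds, -1);       /* one syscall can report many events */
        if (n <= 0)
            continue;
        for (int i = 0; i < nfds && n > 0; i++) {
            if (fds[i].revents) {
                handle_ready(fds[i].fd);   /* callback runs to completion, never blocks */
                fds[i].revents = 0;
                n--;
            }
        }
    }
}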
“Also, in your resolve_cb, there’s a while-loop that tests for the result of an async call. Since it’s async, it cannot actually wait for the result of the call, so all it can do is check the result of the invocation of the call, right?”
Right and wrong.
The async_resolve calls are implemented on top of the blocking “gethostbyname2_r” implemented in libc. So when the resolve call is initiated, the async library initializes a proxy thread and executes gethostbyname2_r in the background (this is completely under the hood and does not concern the async users. In theory a native asynchronous gethostbyname function could be written for libc…but realistically it may never happen).
In the meantime, the async program can continue handling events, once gethostbyname2_r is done, the user’s callback is executed, in this case resolve_cb.
async_resolve_get_address is a helper function which retrieves all the IPs which were fetched by the libc call; like other async functions, it doesn't block since the IPs are already fetched.
“Or do you have some mechanism that blocks the while-loop thread until an answer arrives? I guess not, as that more or less defeats the purpose?”
No, it doesn’t block, resolve_cb (callback) isn’t executed in the first place until data is available.
EDIT : Nevermind
Thanks for this even more elaborate answer. I think all my questions have been answered, though it would be difficult to envision a programming model that, with all the callbacks, would make it clear how the program flow goes. Some 4GL maybe… Food for thought!
“though it would be difficult to envision a programming model that, with all the callbacks, would make it clear how the program flow goes.”
It certainly is different if you’re not accustomed to it.
For threads, a linear sequence of events can be visualized easily. (the complexity arises from interaction with other threads).
On the other hand, in the async model, I can look at any single event callback, look at the state of the system and determine exactly where to go from there to get to the next state. Debugging individual callbacks becomes almost trivial. It is far easier to prove correctness under the async model than the thread model.
Here is a short program which tests a multithreaded checksum.
Nothing impressive here. All that’s interesting are the benchmarks below. (Dual-Core Intel(R) Xeon(R) CPU E3110 @ 3.00GHz)
// twiddle.c -- Compare performance of single threaded and multithreaded code.
// gcc -O3 -o twiddle twiddle.c -lpthread
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

typedef struct {
    int *mem;
    int len;
    int ret;
} WORK;

double Now() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + (double)tv.tv_usec / 1000000;
}

void *DoWork(void *work) {
    // Do some arbitrary nonsense work
    WORK *w = (WORK*)work;
    int *mem = w->mem;
    int *stop = mem + w->len;
    int sum = 0;
    while (mem < stop) {
        sum += *mem;
        mem++;
    }
    w->ret = sum;
    return NULL;
}

int main(int argc, char *args[]) {
    int len = 500000000; // 500M ints, or 2GB
    int *mem = (int*)malloc(len * sizeof(int));
    WORK work[2];
    double elapse;
    pthread_t thread[2];
    int run;
    memset(mem, 1, len * sizeof(int));
    printf("Array Len = %d\n", len);
    for (run = 0; run < 3; run++) {
        // Method 1 - Do work in one shot
        elapse = Now();
        work[0].mem = mem;
        work[0].len = len;
        DoWork(work);
        elapse = Now() - elapse;
        printf("Run %d Method 1 - One shot = %d (%0.2fs)\n", run, work[0].ret, elapse);
        // Method 2 - Split work in half
        elapse = Now();
        work[0].mem = mem;
        work[0].len = len/2;
        work[1].mem = mem + len/2;
        work[1].len = len/2;
        DoWork(work+0);
        DoWork(work+1);
        elapse = Now() - elapse;
        printf("Run %d Method 2 - split single thread = %d (%0.2fs)\n", run, work[0].ret + work[1].ret, elapse);
        // Method 3 - Split work in half, threaded
        elapse = Now();
        work[0].mem = mem;
        work[0].len = len/2;
        work[1].mem = mem + len/2;
        work[1].len = len/2;
        pthread_create(thread+0, NULL, &DoWork, work+0);
        pthread_create(thread+1, NULL, &DoWork, work+1);
        pthread_join(thread[0], NULL);
        pthread_join(thread[1], NULL);
        elapse = Now() - elapse;
        printf("Run %d Method 3 - split dual thread = %d (%0.2fs)\n", run, work[0].ret + work[1].ret, elapse);
    } // run
}
Output:
Run 0 Method 1 - One shot = 1345479936 (0.36s)
Run 0 Method 2 - split single thread = 1345479936 (0.36s)
Run 0 Method 3 - split dual thread = 1345479936 (0.31s)
Run 1 Method 1 - One shot = 1345479936 (0.36s)
Run 1 Method 2 - split single thread = 1345479936 (0.37s)
Run 1 Method 3 - split dual thread = 1345479936 (0.33s)
Run 2 Method 1 - One shot = 1345479936 (0.37s)
Run 2 Method 2 - split single thread = 1345479936 (0.37s)
Run 2 Method 3 - split dual thread = 1345479936 (0.31s)
Now, did this achieve the speedup we’d like? Since there is virtually no synchronization between threads to cause overhead, one might have expected 2 threads to cut the time in half, so why didn’t it?
The answer is that the CPU is way under-utilized; the algorithm is actually IO bound, and the CPU spends most of its time waiting for data.
Now which is better/more scalable?
Taking up one CPU for .37s, or two CPUs for 0.31s?
If this is the only process, then admittedly yes .31s is better. However, if there are any other CPU bound processes waiting, then the .06s tradeoff is starting to look pretty expensive since I could have been running a CPU bound process simultaneously for .37s instead of tacking it on top of .31s.
Keep in mind we haven’t even introduced the need for multithreaded synchronization, which is extremely expensive.
For these reasons, it’s typically better to exploit parallelism at the macro-level instead of parallelizing individual steps. This is usually a good thing because it’s easier to do with much less synchronization.
It also lends credence to my cherished view that async IO interfaces usually can perform better than blocking threaded ones.
I was going to argue that CPUs don’t have to do nothing while a process is blocking for I/O (one just has to switch to the next process in the scheduler’s queue), but then I had a look at your code and noticed that it was actually essentially about accessing lots of RAM, in which case it’s the CPU core that has no choice but to block, since MOVs are blocking by design.
Interesting problem, actually. I think once one thread starts to hog the memory bus like this, we’re doomed anyway, since all other threads on all other cores are going to fight for access to the memory bus and get suboptimal performance. This problem won’t fully be solved until the memory bus goes faster than all CPU cores together, which itself is probably not going to happen until we reach the limits of CPU speed. But this is open to discussion.
As said in the blog post mentioned earlier, if you need so much synchronization that your algorithm performs no better in parallel than in a serial version, then it’s indeed better to do it the async way and not bother with the synchronization’s coding complications. What I consider is tasks where most of the work can be done in parallel, and synchronization/communication is only needed at some specific points of the execution.
And there comes a trade-off between fine-grained modularity and macro-level parallelism. I think this one can be solved by not pushing function isolation through a process too far. If I take the FTIR touchscreen driver example above, I think that putting a boundary between "large" tasks like decoding pictures and detecting blobs is okay, but that putting a boundary between two FFTs in the JPEG decoding process is not. Where exactly the limit is can only be decided through measurements on the real-world system, and will probably evolve as CPUs themselves evolve.
Again, if it's pure I/O and (almost) no computation, I admit that it's totally possible, and even frequent.
When I saw your operating system articles, I thought they were the norm for OSAlert, and I was intrigued by it. But honestly, the non-technical media hypefest going on here seems to be as bad as other sites. The technical articles like this don’t even generate much interest (no offense). It’s over people’s heads here.
jal_, you’re ok in my book, after all, you knew about “intel-mnemonic”!
I’m just looking for some place to fit in better. I have the same problem in my offline life – I don’t know people with my interests. I can’t even find a relevant job to apply my skills in. Employers seem to be judging my past employers and connections (who are not impressive) instead of my personal abilities (which I think are well above the curve).
Do either of you feel totally under-appreciated?
Edit: Well, my topic seems a bit ironic now, but I’ll leave it!
Ah, these articles… I still have the architecture & implementation parts on my to-do list, you know, though I've been a bit too busy to do that recently and am only now starting to get some spare conscious time again.
Well, let’s put it this way : I see OSAlert as not being strictly about the gory details of OS internals. Some people just come here to read about the evolution of existing OSs and the ecosystems around them.
It’s like car magazines : some people who read them are actual car mechanics who could put a car together from spare parts if they had to, but most people read them simply to learn about new cars and things which car mechanics can do for them.
If you want something more OS development-specific, I highly recommend OSdev's forum ( http://forum.osdev.org ). The bulk of the traffic is about people having implementation issues, in the slightly misnamed "OS development" subforum, but there are also regularly more interesting theoretical discussions in the "OS Design&Theory" and "General Programming" forums.
Like me, you may prefer the polished, easy-on-the-eyes articles of blogs to forums for daily reading, but apart from OS designers' blogs like mine, which tend to have low and irregular publishing rates, I don't know of any blog specifically targeting this subject.
Well, looking at how many hits my blog gets each time I publish something about it on OSAlert (see http://imageshack.us/photo/my-images/708/statsio.png/ )… No, I don't feel like articles on the subject are under-appreciated.
I just wish I had a more regular subscriber audience, but I guess this is not going to happen, at least until I have something more tasty for normal people than kernel internals to write about (like, you know, this resolution-independent and resizing-friendly tiling GUI which I care so much about).
The current version of OSAlert doesn’t allow you to change it anyway =p