November 13, 2009
"Look Ma, No Busywaits!"
When the CPU needs to do something which depends on a result the GPU is still working on, it has to wait for the GPU to catch up. One of the biggest problems with the current architecture of xf86-video-glamo, in both its DRM and non-DRM versions, is that it does this waiting by spinning in a tight loop, checking the GPU's status on each pass until it has caught up. This isn't great for a few reasons. It makes no use of the parallelism between the CPU and the GPU, so precious CPU time is wasted that could be spent doing something more useful. And if there's nothing else to do, the CPU could be sleeping instead, reducing power consumption.
Most GPUs, including Glamo, have a mechanism for being a little smarter. The kernel can ask the chip to trigger an interrupt when a certain point in the command queue has been reached. When a process needs to wait, the kernel can send it to sleep and watch out for the interrupt. When the interrupt arrives, the process is woken back up and gets back to work with very little latency.
This week, I've been implementing this kind of thing for the Glamo DRM driver. It goes a bit like this:
- Process submits some rendering commands via one of the command submission ioctls.
- Kernel driver places rendering commands on Glamo's command queue.
- Process needs to wait for the GPU to catch up, so calls the wait ioctl.
- Kernel driver puts an extra sequence of commands, called a fence, onto the command queue. A unique number is associated with the fence. The number is recorded by the kernel.
- When the GPU processes the fence, it raises the interrupt and places the fence's number into a status register.
- The interrupt handler checks this number, and wakes up the corresponding process.
Things aren't always so great. When the command sequence to be executed is very short, the overheads of fencing and scheduling become significant, and the overall rate drops. However, it shouldn't be too difficult to design some kind of heuristic to use busywaits as a low-latency strategy in such cases.
There are still a few problems to iron out. The fence mechanism seems to be able to fall out of sync with things, leading to processes waiting for too long (or even forever). But when it works, some things do seem to feel a little faster in general use.
Geeks may be interested in the actual code.
Nice work! Thanks for working on glamo. :)
Could this improvement be brought to qtmoko (qtextended?), given that it uses the framebuffer rather than an X server?
Same question for the other work here:
https://www.bitwiz.org.uk/s/2009/11/internal-memory-bottlenecks-and-their-removal.html
The waitqueue mechanism (this post) can be used by anything which uses DRM to submit its rendering commands. I don't know if Qt uses any acceleration at all - if not, this won't make any difference. I certainly don't think anything other than xf86-video-glamo (our X.org driver) and Mesa, plus a few demo programs, are currently using DRM. But they could make use of it if they wanted.
The FIFO tuning (the other post) makes anything which uses acceleration faster, regardless of how the commands are submitted. But again, it only helps programs/frameworks which actually make use of that acceleration rather than just taking a dumb framebuffer.
Thanks for your answer. :)
I was one of those asking for comments on your website... just to thank you for your work (hope this helps in feeling some interest in what you're doing).
Cool! Thanks for the insight.