X performance is extremely difficult to quantify, and there is a large amount of work to be done in this space. The following are some collected notes on what needs doing.
Discussion and questions about work in these areas should be held on the xorg mailing list.
X is a fairly compact protocol, with the exclusion of image transport which is done uncompressed. The dominating factor in many aspects of X performance is transport latency - how long it takes for a response to make a round trip. This is exacerbated by the design of Xlib, which is effectively synchronous for most of its operation.
XCB is an effort to rearchitect the network layer of X to hide latency by providing an asynchronous interface to the protocol. The effort is stabilising rapidly, and needs both wider testing and for toolkits to be ported to it.
While XCB addresses the latency requirements of X, there is no great solution to the bandwidth requirements. Several good (and not so good) solutions do exist.
- The SSH suite allows for tunnelling X sessions over a secure channel, which can also be compressed. This is adequate for technical users, but it is suboptimal for terminal servers and novice users.
- The LBX protocol is generally inadequate for modern X usage. See the linked paper for details. LBX is no longer built by default in 7.1.
- Xpra acts as compositing window manager to forward individual windows from a virtual server to the client that connects to it (over sockets, tcp or ssh).
- NX is a software compression suite based on the earlier DXPC protocol. It also leverages ssh to provide security services. There are open issues regarding integrating it into the base suite - licensing being among them, as much of NX is LGPL, version 4 is now closed-source.
Several commercial X servers use shared memory transports on the local machine to improve performance. Rik Faith researched shared memory transports for the DRI project several years ago. The conclusion then was that it would improve performance for those operations where the time to render the request was dominated by the transport latency, and then by less than 10%.
This may not be true anymore. The balance of the typical machine's memory architecture has shifted, and many operating systems provide advanced high performance synchronization primitives (like futexes on Linux) that may address some of the sync overhead he experienced. This would be an excellent research project.
Most of the core X protocol's rendering routines simply do not get used very often. This is not so much because they are slow, but because they aren't useful on the modern desktop. Empirically, better than 90% of the drawing operations that X sees today are solid fills, blits, and Render operations. Attempting to accelerate 2D operation outside this set is very likely a waste of effort.
XAA is largely inadequate for accelerating modern desktop usage.
- XAA goes to great lengths to accelerate operations like patterned fills and Bresenham lines, which are rarely used.
- XAA's support for accelerating the Render extension is poor, because the design of XAA's memory manager only allows for offscreen pixmaps that are exactly the format of the displayed screen.
- Render acceleration in XAA only works when the source image is in host memory and the destination is in card memory; to be truly performant it needs to be able to handle the case where both source and destination are in card memory. This is the reason xcompmgr is slow; it uses Render heavily, and XAA simply can not make Render go fast.
XAA is really not worth fixing. The better approach is to start from the lessons learned from KAA, the kdrive acceleration architecture, and port drivers over to that (leaving XAA in place for old or unmaintained drivers).
As of about Xorg 126.96.36.199, there is a new acceleration architecture called EXA that achieves this. EXA is derived from the KAA code, but has been ported to the loadable server design and includes some additional features for improved performance. UXA is a variation on the EXA theme that assumes a unified memory architecture and kernel memory management support.
The ExaStatus page contains the current driver support status.
EXA continues to be tuned for performance. In particular, the pixmap scoring and migration algorithm is still fairly naive. The lessons learned from tuning EXA will apply to Xgl servers as well in the future.
Most modern graphics cards can be run in either linear or tiled framebuffer modes. Linear modes are simple, you start in the top-left corner and move to the bottom-right, all the way across a single row before changing rows. In tiled modes the framebuffer is broken up into a series of small tiles, usually 8x8 or so, and memory is laid out such that the first 64 pixels belong to the first tile, then the next 64 to the second tile, etc. You can think of linear framebuffer being a tiled framebuffer where each tile is 1x1.
Tiled framebuffers have a performance benefit because they better model the layout of objects on the screen. They give better locality of reference because each tile is packed tightly in memory, where in a linear framebuffer you might have to skip a thousand pixels ahead to get to the same horizontal offset one line down. Since your spatial locality is better with a tiled framebuffer, your working set fits in your cache better.
Despite this, X's framebuffer cores use linear modes, even if the framebuffer appears to be tiled from the GPU's perspective. There may be a performance benefit to making the system framebuffer shadow match the GPU's tile layout. The wfb software renderer is designed to allow this, but no (open) driver is seriously using it at the moment.
In general, framebuffer reads absolutely kill performance. Any XAA replacement should do as much work as possible in the write direction only. For the cases where framebuffer reads are unavoidable, the new acceleration architecture should make it possible to use DMA to transfer data out of the framebuffer. EXA has hooks for DMA support.
Even when DMA is unavailable, it is usually more performant to tranfer large blocks of data in and out of framebuffer memory rather than operating on single pixels at a time.
Thrashing can occur when mixing operations that the hardware can accelerate with ops it can't. It remains an open question as to how to best deal with this. EXA/UXA take the attitude that the card can accelerate pretty much anything you throw at it, which seems pretty reasonable.
EXA's Render acceleration is adequate, but lacks support for a few things. Source-only pictures (solids and gradients) are currently not accelerated in hardware. Source IN Mask OP Dest combination could be implemented in multiple passes for cards that only have one texture image unit. External alpha is basically unaccelerated.
Trapezoid rasterisation in Render is not hardware accelerated. It's not even clear that it can be. The software implementation has been reasonably well tuned, but could certainly be better.
TODO: Fill me in.
The observation about tiling for 2D also applies to Mesa's software rasteriser.
Because the X server is single-threaded, any operation in the server that takes a significant amount of time to complete will make the server feel laggy. This is common for the Mesa software renderer and the software Render code, but any part of the server could trigger this in theory. While we should work to maintain fast execution of all code paths, there may be significant benefit to reworking the server to be multithreaded.
One of the worst performance issues X has is making opaque resizes fast. Since the window manager is in a separate process from the application, there are two round trip cycles involved, which makes the latency issues described above worse. There are several possibilities for working around this. One is to move responsibility for window decorations into the client. Another would be to load some portion of the window manager in-process with the X server.
Most X drivers do not synchronize their drawing to the vertical retrace signal from the monitor. (To be fair, very few windowing systems do this consistently, even MacOS X.) This leads to a tearing appearance on some drawing operations, which looks slow. If the vertical retrace signal could be exposed through the SYNC extension, applications could defer their rendering slightly and reduce or eliminate tearing. This requires extending each driver to support this, as well as adding a little support code to the server itself.
The un-Composited model of X operation requires many round trip operations to redraw areas when they are exposed (window move, etc.). It is important that X be able to make Composited operation fast in the future.
On x86 hardware, MTRRs are used to specify the memory access policy for ranges of memory. Setting the framebuffer to write-combining has a significant performance benefit. However, on PCIE systems, there are more memory ranges than MTRRs. PAT stands for Page Attribute Table, which allows you to specify the access policy for individual pages at a time.
Other platforms may have similar memory access mechanisms.
As mentioned above, OS-specific synchronization primitives could have significant performance benefit for shared memory transports. These include futexes on Linux and possibly doors on Solaris.
High performance graphics increasingly requires some kernel support for synchronization and security reasons. The DRM provides this support for Linux and BSD systems, but it could reasonably be ported to other suitable platforms like Darwin and Solaris.