Development/Documentation/Performance

X performance can be difficult to quantify, and there is a large amount of work to be done in this space. This page attempts to give some context for where we are, what to look for, and how to measure. Numbers will be given for various examples for the sake of illustration, but please consider the target environment you're actually facing instead of using these numbers directly.

Discussion and questions about work in these areas should be held on the devel mailing list.

Contents

  1. Transport Performance
    1. Latency
    2. Throughput
    3. Over the Network
    4. On the Local Machine
  2. Driver Performance
    1. 2D Rendering
      1. The Important Bits
      2. XAA
      3. EXA/UXA
      4. Framebuffer Layout
      5. Framebuffer Access
      6. Algorithmic Issues
    2. 3D Rendering
      1. DRI Drivers
      2. Mesa Core
  3. Interactive Performance
  4. Perceptual Performance
  5. Platform and Operating System Support

Transport Performance

Latency

X is a fairly compact protocol. The dominating factor in many aspects of X performance is transport latency - how long it takes a request and its reply to make a round trip. x11perf includes a test for the QueryPointer request, which gives a rough measure of peak round-trip performance:

ergine:~% DISPLAY=:0 x11perf -pointer
x11perf - X11 performance program, version 1.2
Fedora Project server version 11301000 on :0
from ergine
Wed Feb  6 17:00:19 2013

Sync time adjustment is 0.0251 msecs.

 300000 reps @   0.0174 msec ( 57600.0/sec): QueryPointer
 300000 reps @   0.0144 msec ( 69600.0/sec): QueryPointer
 300000 reps @   0.0175 msec ( 57200.0/sec): QueryPointer
 300000 reps @   0.0178 msec ( 56200.0/sec): QueryPointer
 300000 reps @   0.0147 msec ( 68200.0/sec): QueryPointer
1500000 trep @   0.0163 msec ( 61200.0/sec): QueryPointer

On this machine, roughly 60 synchronous request/reply pairs complete per millisecond. One frame interval is approximately 16ms (for a 60Hz display), so another way to look at it is that you get about 960 round trips per frame. The XSync library call in libX11 is implemented with a GetInputFocus round trip, a similarly tiny request with a tiny, cheap reply, so you can also think of this as an upper bound on the number of times you can sync with the server per frame.

To mitigate this, not all requests generate replies. One trick that enables this is that resource ID numbers (called XIDs) are handed out to clients in ranges, one range per client, so XID allocation happens entirely on the client side. In a request like CreatePixmap, the client simply picks the next unused XID in its range and effectively tells the server "please make me a pixmap with these properties and give it this XID for a name; remember that, so I don't have to tell you again". No reply is generated if the request succeeds, so the client is free to immediately enqueue another request (say, to paint the pixmap with some color).
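
As a concrete sketch of this pattern using XCB (error handling omitted): xcb_generate_id() hands out the next XID from the client's range without talking to the server, and xcb_create_pixmap() merely queues the request.

#include <xcb/xcb.h>

/* Create a pixmap without any round trip.  xcb_generate_id() picks the
 * next unused XID from this client's range entirely on the client side;
 * CreatePixmap is queued and generates no reply on success. */
xcb_pixmap_t create_pixmap(xcb_connection_t *conn, xcb_screen_t *screen,
                           uint16_t width, uint16_t height)
{
    xcb_pixmap_t pixmap = xcb_generate_id(conn);
    xcb_create_pixmap(conn, screen->root_depth, pixmap,
                      screen->root, width, height);
    /* No reply to wait for; more requests against the new pixmap can
     * be queued immediately. */
    return pixmap;
}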

Though the protocol is fundamentally asynchronous, the classic client library, libX11, presents a more-or-less synchronous API. In general, for API calls that directly map to reply-free requests, the request is queued in the library and periodically flushed to the server when the queue fills up. For API calls that map to requests with replies, the request is enqueued, but then (since the protocol is in-order) the buffer is flushed and the library waits for the reply. This can present a hidden performance trap if one of the requests already enqueued is computationally expensive for the server.
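
A sketch of the trap, using standard Xlib calls:

#include <X11/Xlib.h>

void example(Display *dpy, Window win, GC gc)
{
    /* Reply-free: this just appends a PolyFillRectangle request to
     * libX11's output buffer and returns immediately. */
    XFillRectangle(dpy, win, gc, 10, 10, 100, 100);

    /* Has a reply: libX11 must flush the buffer - including everything
     * queued before this call - and block until the server answers.
     * Any expensive request already in the queue is paid for here. */
    Window root;
    int x, y;
    unsigned int w, h, border, depth;
    XGetGeometry(dpy, win, &root, &x, &y, &w, &h, &border, &depth);
}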

Another potential hidden trap for long-lived applications centers on XID reuse. The client does not keep a record of every XID it has in use, since that would be a rather large bitmap, particularly on the single-digit-megabyte machines common in the early days of X. Instead it simply keeps track of the last one allocated. After the highest-numbered XID has been allocated, libX11 internally generates a request that queries the server's resource database for that client and replies with a range of unused XIDs. This request is part of the XC-MISC extension; before that extension existed, long-lived clients would simply abort when they ran out of IDs. The request can be very expensive if many of the client's IDs are already in use, and - since it's a round trip - can manifest as stalls in requests that normally don't block.

XCB is an effort to rearchitect the network layer of X to hide latency by providing an asynchronous interface to the protocol. It is essentially stable, and libX11 is now internally implemented in terms of XCB, so with some care both can be used within the same application. It can provide some dramatic speedups for request patterns that would be very round-trippy if done in libX11; compare, for example, the pre- and post-XCB versions of xlsclients and xlsatoms.
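
The basic XCB pattern looks like this (a minimal sketch, error handling kept to the bare minimum): issue all the requests up front, collecting cookies, then fetch the replies, so interning N atoms costs roughly one round trip instead of N.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <xcb/xcb.h>

/* Intern several atoms while paying only one round trip's worth of
 * latency: fire off all the requests first, then collect the replies. */
void intern_atoms(xcb_connection_t *conn, const char **names, int count)
{
    xcb_intern_atom_cookie_t *cookies = malloc(count * sizeof(*cookies));

    for (int i = 0; i < count; i++)
        cookies[i] = xcb_intern_atom(conn, 0, strlen(names[i]), names[i]);

    for (int i = 0; i < count; i++) {
        xcb_intern_atom_reply_t *reply =
            xcb_intern_atom_reply(conn, cookies[i], NULL);
        if (reply) {
            printf("%s = %u\n", names[i], reply->atom);
            free(reply);
        }
    }
    free(cookies);
}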

Throughput

Not all X requests are pleasantly small. Image transfer operations like PutImage and glyph upload are done uncompressed, for example. The (no longer implemented) XIE extension did expose some compressed image transfer formats like JPEG, though it did not see wide usage in apps and toolkits. It may or may not be worth reviving compressed image transfers; the bandwidth savings must be measured against moving the CPU cost into the X server, which is a contended resource since it is single-threaded.

Over the Network

There are many tools available that can improve, or at least alter, the performance of X over a TCP connection.

All network graphics protocols have bandwidth as a constraining factor. Beyond this, X11 is somewhat constrained by the guarantees of the protocol itself. Requests must be processed in-order, which means it's not possible to drop intermediate frames to "catch up" with a slow or lossy network. Likewise there's only so much that a proxy like LBX or DXPC can do to bend the rules on event delivery.

It's also important to remember that X isn't just a display system, it's a full IPC suite. Part of the reason xpra can outperform a normal networked X session is that what it sends over the network is not X at all; like VNC, it streams pixels in one direction and input events in the other, and is free to drop frames if needed. The downside is that the forwarded application can't interact with other applications on the display where it's being viewed (drag and drop, cut and paste, etc.) without additional knowledge being built into the forwarding protocol.

On the Local Machine

Several commercial X servers have used shared memory transports on the local machine to improve performance. Rik Faith researched shared memory transports for the DRI project several years ago. The conclusion then was that it would improve performance for those operations where the time to render the request was dominated by the transport latency, and then by less than 10%.

This may not be true anymore. The balance of the typical machine's memory architecture has shifted, and many operating systems provide advanced high performance synchronization primitives (like futexes on Linux) that may address some of the sync overhead he experienced. This would be an excellent research project.

Driver Performance

2D Rendering

The Important Bits

Most of the core X protocol's rendering routines simply do not get used very often. This is not so much because they are slow, but because they aren't useful on the modern desktop. Empirically, better than 90% of the drawing operations that X sees today are solid fills, blits, and Render operations. Attempting to accelerate 2D operations outside this set is very likely a waste of effort.

XAA

XAA is largely inadequate for accelerating modern desktop usage.

EXA/UXA

XAA is really not worth fixing. The better approach is to start from the lessons learned from KAA, the kdrive acceleration architecture, and port drivers over to that (leaving XAA in place for old or unmaintained drivers).

As of about Xorg 6.8.99.14, there is a new acceleration architecture called EXA that achieves this. EXA is derived from the KAA code, but has been ported to the loadable server design and includes some additional features for improved performance. UXA is a variation on the EXA theme that assumes a unified memory architecture and kernel memory management support.

The ExaStatus page contains the current driver support status.

EXA continues to be tuned for performance. In particular, the pixmap scoring and migration algorithm is still fairly naive. The lessons learned from tuning EXA will apply to Xgl servers as well in the future.

Framebuffer Layout

Most modern graphics cards can be run in either linear or tiled framebuffer modes. Linear modes are simple: you start in the top-left corner and move to the bottom-right, going all the way across a single row before changing rows. In tiled modes the framebuffer is broken up into a series of small tiles, usually 8x8 or so, and memory is laid out such that the first 64 pixels belong to the first tile, the next 64 to the second tile, and so on. You can think of a linear framebuffer as a tiled framebuffer where each tile is 1x1.
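
For concreteness, the address arithmetic for the two layouts might look like this (a sketch assuming 8x8 tiles, one byte per pixel, and a width that is a multiple of 8):

/* Byte offset of pixel (x, y) in a linear framebuffer. */
unsigned linear_offset(unsigned x, unsigned y, unsigned stride)
{
    return y * stride + x;
}

/* Byte offset of pixel (x, y) with 8x8 tiles: whole tiles are stored
 * contiguously, so locate the tile first, then the pixel within it. */
unsigned tiled_offset(unsigned x, unsigned y, unsigned width)
{
    unsigned tiles_per_row = width / 8;
    unsigned tile   = (y / 8) * tiles_per_row + (x / 8);
    unsigned within = (y % 8) * 8 + (x % 8);
    return tile * 64 + within;
}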

Tiled framebuffers have a performance benefit because they better model the layout of objects on the screen. They give better locality of reference because each tile is packed tightly in memory, whereas in a linear framebuffer you might have to skip a thousand pixels ahead to get to the same horizontal offset one line down. Since your spatial locality is better with a tiled framebuffer, your working set fits in your cache better.

Despite this, X's framebuffer cores use linear modes, even if the framebuffer appears to be tiled from the GPU's perspective. There may be a performance benefit to making the system framebuffer shadow match the GPU's tile layout. The wfb software renderer is designed to allow this, but no (open) driver is seriously using it at the moment.

Framebuffer Access

In general, framebuffer reads absolutely kill performance. Any XAA replacement should do as much work as possible in the write direction only. For the cases where framebuffer reads are unavoidable, the new acceleration architecture should make it possible to use DMA to transfer data out of the framebuffer. EXA has hooks for DMA support.

Even when DMA is unavailable, it is usually faster to transfer large blocks of data in and out of framebuffer memory than to operate on single pixels at a time.
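
A sketch of that pattern (a hypothetical helper, not actual driver code): read the affected region out of the framebuffer once, do the work in cached system memory, and write it back once, rather than doing a read-modify-write cycle per pixel.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Darken 'count' pixels starting at 'offset' in a mapped framebuffer.
 * One block read into system memory, the computation, and one block
 * write back is usually far faster than 'count' individual
 * read-modify-write cycles against uncached video memory. */
void darken_span(uint32_t *fb, size_t offset, size_t count)
{
    uint32_t *tmp = malloc(count * sizeof(*tmp));
    memcpy(tmp, fb + offset, count * sizeof(*tmp));    /* one block read */
    for (size_t i = 0; i < count; i++)
        tmp[i] = (tmp[i] >> 1) & 0x7f7f7f7f;           /* halve each channel */
    memcpy(fb + offset, tmp, count * sizeof(*tmp));    /* one block write */
    free(tmp);
}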

Thrashing can occur when mixing operations that the hardware can accelerate with ops it can't. It remains an open question as to how to best deal with this. EXA/UXA take the attitude that the card can accelerate pretty much anything you throw at it, which seems pretty reasonable.

Algorithmic Issues

EXA's Render acceleration is adequate, but lacks support for a few things. Source-only pictures (solids and gradients) are currently not accelerated in hardware. The Source IN Mask OP Dest combination could be implemented in multiple passes for cards that have only one texture image unit. External alpha is basically unaccelerated.

Trapezoid rasterisation in Render is not hardware accelerated. It's not even clear that it can be. The software implementation has been reasonably well tuned, but could certainly be better.

3D Rendering

DRI Drivers

TODO: Fill me in.

Mesa Core

The observation about tiling for 2D also applies to Mesa's software rasteriser.

Interactive Performance

Because the X server is single-threaded, any operation in the server that takes a significant amount of time to complete will make the server feel laggy. This is common for the Mesa software renderer and the software Render code, but any part of the server could trigger this in theory. While we should work to maintain fast execution of all code paths, there may be significant benefit to reworking the server to be multithreaded.

One of the worst performance issues X has is opaque window resizing. Since the window manager is in a separate process from the application, there are two round-trip cycles involved, which compounds the latency issues described above. There are several possibilities for working around this. One is to move responsibility for window decorations into the client. Another would be to load some portion of the window manager in-process with the X server.

Perceptual Performance

Most X drivers do not synchronize their drawing to the vertical retrace signal from the monitor. (To be fair, very few windowing systems do this consistently, even MacOS X.) This leads to a tearing appearance on some drawing operations, which looks slow. If the vertical retrace signal could be exposed through the SYNC extension, applications could defer their rendering slightly and reduce or eliminate tearing. This requires extending each driver to support this, as well as adding a little support code to the server itself.

The un-Composited model of X operation requires many round trip operations to redraw areas when they are exposed (window move, etc.). It is important that X be able to make Composited operation fast in the future.

Platform and Operating System Support

As mentioned above, OS-specific synchronization primitives could have significant performance benefit for shared memory transports. These include futexes on Linux and possibly doors on Solaris.

High performance graphics increasingly requires some kernel support for synchronization and security reasons. The DRM provides this support for Linux and BSD systems, but it could reasonably be ported to other suitable platforms like Darwin and Solaris.