X Window System Concepts

Alan Coopersmith

This chapter introduces the basic X Window System concepts and terminology you will need to understand. Once you understand these concepts, you will be ready to dive deeper into specific topics in later chapters.

X Is Client / Server

The X Window System was designed to allow multiple programs to share access to a common set of hardware. This hardware includes both input devices, such as mice and keyboards, and output devices: video adapters and the monitors connected to them. A single process is designated as the controller of the hardware and multiplexes access to it among the applications. This controller process is called the X server, as it provides the services of the hardware devices to the client applications. In essence, the service the X server provides is access, through the keyboard, mouse and display, to the X user.

Like many client/server systems, the X server typically provides its service to many simultaneous clients. The X server runs longer than most of the clients do, and listens for incoming connections from new clients.

Many users will only ever use X on a standalone laptop or desktop system. In this setting, the X clients run on the same computer as the X server. However, X defines a stream protocol for client/server communication. This protocol can be exposed over a network to allow clients to connect to a server on a different machine. Unfortunately, in this model, the client/server labeling can be confusing. You may have an X server running on the laptop in front of you, displaying graphics generated by an X client running on a powerful machine in a remote machine room. For most other protocols, the laptop would be a client of file sharing, HTTP or similar services on the powerful remote machine. In such cases, it is important to remind yourself that keyboards and mice connect to the X server. The X server is also the one endpoint to which all the clients (terminal windows, web browsers, document editors) connect.

X In Practice

This section describes some of the fundamental pieces of X and how they work. This is one of those places where everything wants to be presented at once, so the section is something of a mish-mash. Recommended reading practice is to skim it all once, and then go back and read it all again.


As mentioned earlier, the X server primarily handles two kinds of hardware: input devices and output devices. Surprisingly, the input handling tends to be the more difficult and complicated of the two. Input is multi-source, concurrent, and highly dependent on complex user preferences.

Input via Keyboard

One of the tasks the X server performs is handling typing on keyboards and sending the corresponding key events to the appropriate client applications. In a simple X configuration, one client at a time has the "input focus" and most key events will go to that client. Depending on window manager configuration, focus may be moved to another window by simply moving the mouse to another window, clicking the mouse, using a hotkey, or by manipulating a panel showing available clients. The client with focus is usually highlighted in some way, so that the user can know where their input will go. Clients may use "grabs" (described later in this chapter) to override the default delivery of key events to the focused client.

There is a wide variety of keyboards in the world. This is due to differing language requirements, to differing national standards, and to hardware vendors trying to differentiate their products. This variety makes the mapping of key events from hardware "key codes" into text input a challenging and complex process. The X server reports a simple 8-bit keycode in key press and release events. The server also provides a keyboard mapping from those keycodes to "KeySyms" representing symbolic labels on keys ("A", "Enter", "Shift", etc.). Keycodes have no inherent meaning outside a given session; the same key may generate different code values on different keyboards, servers, configurations, or operating systems. KeySym values are globally assigned constants, and are thus what most applications should be concerned with. The X Keyboard (XKB) extension provides complex configuration and layout handling, as well as additional key handling functionality that was missing in the original protocol. Xlib and toolkits also provide input methods for higher-level input functions, such as compose key handling or mapping key sequences to complex characters (for example, Asian language input).
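The split between keycodes and KeySyms can be sketched with a small lookup table. This is only an illustration, not how Xlib or XKB is implemented: the two keyboard mappings and the function name keycode_to_keysym are hypothetical, and the keycode numbers are invented. The KeySym values are the standard constants for "a", "A", Return and the left Shift key.

```python
# KeySym values are globally assigned constants from the X keysym tables.
XK_a = 0x0061
XK_A = 0x0041
XK_Return = 0xFF0D
XK_Shift_L = 0xFFE1

# Hypothetical per-server keyboard mappings: keycode -> [unshifted, shifted].
# The keycode numbers are invented for illustration only.
SERVER_ONE = {38: [XK_a, XK_A], 36: [XK_Return], 50: [XK_Shift_L]}
SERVER_TWO = {8: [XK_a, XK_A], 44: [XK_Return], 64: [XK_Shift_L]}


def keycode_to_keysym(mapping, keycode, shifted=False):
    """Look up the KeySym for a keycode in one server's keyboard mapping."""
    syms = mapping.get(keycode)
    if not syms:
        return None  # no symbol is bound to this keycode in this sketch
    # Use the shifted entry when one exists, otherwise the unshifted one.
    index = 1 if shifted and len(syms) > 1 else 0
    return syms[index]
```

The same physical key arrives as keycode 38 on one hypothetical server and as keycode 8 on the other, but both lookups yield XK_a. That is why an application that compares KeySyms keeps working when the keyboard or server changes, while one that compares raw keycodes does not.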

Input via Mouse

The X protocol defines an input "pointer" (no relation to the programming concept). The pointer is represented on screen by a cursor; it is usually controlled by a mouse or similar input device. Applications can control the cursor image. The core protocol contains simple 2-color cursor image support. The Render extension provides alpha-blended 32-bit color cursor support; this support is normally accessed through libXcursor.

Pointer devices report motion events and button press and release events to clients. The default configuration of the Xorg server has a single pointer. This pointer aggregates motion and button events from all pointer-type devices attached to the server: for example, a laptop's touchpad and external USB mouse. Users can use the MultiPointer X (MPX) functionality in Xinput extension 2.0 to enable multiple cursors and assign devices to each one. With MPX, each pointer has its own input focus. Each pointer is paired with keyboards that provide input to the client that has the input focus for that pointer.

Input via Touchpad

For basic input, a touchpad appears to clients as just another device for moving the pointer and generating button events. Clients that want to go beyond mouse emulation can use the Xinput extension version 2.2 (shipped with Xorg 1.12) or later to enable support for multitouch event reporting.

Input via Touchscreen

[XXX write me --po8]

Advanced Input Devices and Techniques

[Make whot write this? or steal from http://who-t.blogspot.com? --alanc]

GetImage: Reading From the Display

The X server does not keep track of what it has drawn on the display. Once bits are rendered to the frame buffer, the server's responsibility for them has ended. If bits need to be re-rendered (for example, because they were temporarily obscured), the X server asks a client (usually either a compositing manager or the application that originally drew them) to draw them again.

In some situations, most notably when taking "screenshots", a client needs to read back the contents of the frame buffer directly. The X protocol provides a GetImage request for this case.

GetImage has a number of drawbacks, and should be avoided unless it is absolutely necessary. GetImage is typically extremely slow, since the hardware and software paths in modern graphics are optimized for the case of outputting pixels at the expense of reading them back. GetImage is also hard to use properly. Here, more than anywhere else in the X protocol, the underlying hardware is exposed to clients. The requested frame buffer contents are presented to the client with the frame buffer's alignment, padding and byte ordering. Generic library code is available in Xlib and XCB to deal with the complexity of translating the received frame buffer into something useful. However, using this code further slows processing.
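To give a feel for the padding and byte-order issues, here is a minimal sketch of locating one pixel in a raw image buffer. It assumes a packed, one-value-per-pixel layout whose pixels occupy whole bytes, and it takes the bits per pixel, scanline padding and byte order as parameters. The function names bytes_per_line and pixel_at are hypothetical; they are not Xlib or XCB calls, and real code has more cases to handle.

```python
def bytes_per_line(width, bits_per_pixel, scanline_pad):
    """Bytes in one scanline, rounded up to a multiple of scanline_pad bits."""
    bits = width * bits_per_pixel
    padded_bits = (bits + scanline_pad - 1) // scanline_pad * scanline_pad
    return padded_bits // 8


def pixel_at(data, x, y, width, bits_per_pixel=32, scanline_pad=32,
             byte_order="little"):
    """Extract one pixel value from a raw image buffer.

    This sketch only handles pixel sizes that are a whole number of bytes.
    """
    if bits_per_pixel % 8 != 0:
        raise ValueError("sketch only handles whole-byte pixels")
    stride = bytes_per_line(width, bits_per_pixel, scanline_pad)
    size = bits_per_pixel // 8
    offset = y * stride + x * size
    return int.from_bytes(data[offset:offset + size], byte_order)
```

With a 3-pixel-wide image at 24 bits per pixel and 32-bit scanline padding, each row occupies 12 bytes rather than 9, so code that assumes rows are tightly packed reads the wrong bytes from the second row onward.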


Rendering / Rasterization

The X protocol originally defined a core set of primitive rendering operations, such as line drawing, polygon filling, and copying of image buffers. These operations did not evolve as graphics hardware advanced and the needs of modern applications changed, so they are now used mainly by legacy applications.

Modern applications use a variety of client-side rendering libraries, such as Cairo for rendering 2D images or OpenGL for 3D rendering. These libraries may then push images to the X server for display, or use DRI to bypass the X server and interact directly with local video hardware, taking advantage of GPU acceleration and other hardware features.

Polygon Rendering Model

Displays and Screens

X divides the resources of a machine into Displays and Screens. A Display is typically all the devices connected to a single X server, presenting a single session for a single user. Systems may have multiple Displays, such as multi-seat setups, or even multiple virtual terminals on a system console. Each Display has a set of input devices, and one or more Screens associated with it. A Screen is a subset of the Display across which windows can be displayed or moved, but windows cannot span multiple Screens or move from one Screen to another. Input devices can interact with windows on all Screens of an X server; for example, the mouse cursor can move from one Screen to another. Originally each Screen was a single display adaptor with a single monitor attached, but modern technologies allow multiple devices to be combined into one logical Screen, or a single device to be split into several.

When connecting a client to an X server, you must specify which display to connect to, either via the $DISPLAY environment variable or an application option such as -display or --display. The full DISPLAY syntax is documented in the X(7) man page, but a typical display specification has the form:

    hostname:display.screen

The "hostname" may be omitted for local connections, and ".screen" may also be left off to use the default screen, leaving the minimal display specification of :display, such as ":0" for the normal default X server on a machine.
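The typical hostname:display.screen form can be taken apart with a few lines of code. The sketch below is deliberately simplified and does not cover every form documented in X(7). The function name parse_display is hypothetical, and returning screen 0 when ".screen" is omitted is an assumption made here for illustration.

```python
def parse_display(spec):
    """Split 'hostname:display.screen' into (hostname, display, screen)."""
    host, sep, rest = spec.rpartition(":")
    if not sep:
        raise ValueError("missing ':' in display specification: %r" % spec)
    display_text, _, screen_text = rest.partition(".")
    if not display_text.isdecimal():
        raise ValueError("display number must be numeric: %r" % spec)
    if screen_text and not screen_text.isdecimal():
        raise ValueError("screen number must be numeric: %r" % spec)
    # Assumption for this sketch: an omitted screen means screen 0.
    screen = int(screen_text) if screen_text else 0
    return host, int(display_text), screen
```

For example, parse_display(":0") yields an empty hostname (a local connection), display 0 and screen 0, while parse_display("remote.example.org:1.2") yields display 1 and screen 2 on the named host.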

Graphics contexts

A graphics context (GC) is a structure that stores shared state and common values for X drawing operations, so that clients do not have to resend the same parameters with each request. A client that needs several different sets of values can allocate a separate GC for each set, and then simply specify the appropriate GC for each operation.

Colors (really?) and Visuals

X is so old that when it was designed most users had monochrome displays, with just black and white pixels to choose from, and even then hardware manufacturers couldn't agree which was 0 and which was 1. Those who spent an extra thousand dollars would have 4- or 8-bit color, allowing pixels to be chosen from a palette of up to 256 colors. But now it's 2012, and anyone without 32 bits of color data per pixel is a Luddite. Still, a lot of complexity remains here that someone should explain...

Syncing and Flushing connections

As described in the Communication chapter, the X protocol tries to avoid latency by doing as much asynchronously as possible. This is especially noticed by new programmers who call rendering functions and then wonder why they got no errors but did not see the expected output appear. Since drawing operations do not require waiting for a response from the X server, they are just placed in the client's outgoing request buffer and not sent to the X server until something causes the buffer to be flushed. The buffer will be automatically flushed when filled, but it takes a lot of commands to fill the default 32 KB buffer in Xlib. Xlib and XCB will also flush the buffer when a function is called that blocks waiting for a response from the server (though which functions those are differs between the two due to their different design models; see the Xlib and XCB chapter for details). Lastly, clients can explicitly call XFlush() in Xlib or xcb_flush() in XCB to send all the queued requests from the buffer to the server. To both flush the buffer and wait for the X server to finish processing all the requests in the buffer, clients can call XSync() in Xlib or xcb_aux_sync() in XCB.
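The buffering behavior can be modeled in a few lines. The following is a toy model only, not Xlib or XCB code: the class ToyConnection and its methods are hypothetical, request sizes are supplied by the caller, and the default buffer size simply reuses the 32 KB figure quoted above for Xlib.

```python
class ToyConnection:
    """Toy model of a client-side output buffer; not real Xlib or XCB code."""

    def __init__(self, buffer_size=32 * 1024):
        self.buffer_size = buffer_size
        self.pending = []        # requests queued on the client side
        self.pending_bytes = 0
        self.sent = []           # requests that have reached the "server"

    def flush(self):
        """Send everything queued, in the spirit of XFlush() or xcb_flush()."""
        self.sent.extend(self.pending)
        self.pending = []
        self.pending_bytes = 0

    def send_request(self, name, length):
        """Queue a request that needs no reply, such as a drawing call."""
        if self.pending_bytes + length > self.buffer_size:
            self.flush()         # no room left, so flush before queueing
        self.pending.append(name)
        self.pending_bytes += length

    def round_trip(self, name, length):
        """Queue a request that needs a reply, then flush and 'wait' for it."""
        self.send_request(name, length)
        self.flush()
        return "reply to " + name
```

In this model, a handful of small drawing requests sit in pending and never reach the server until the buffer fills, flush() is called, or a request that needs a reply forces a round trip. That is the situation a new programmer sees as "no errors, but nothing drawn".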

Window System Objects

A variety of objects are used by X.


Windows

In X, a window is simply a region of the screen into which drawing can occur. Windows are placed in a tree hierarchy, with the root window being a server-created window that covers the entire screen surface and lives for the life of the server. All other windows are children of either the root window or another window. The UI elements that most users think of as windows are just one level of the window hierarchy.

At each level of the hierarchy, windows have a stacking order, which controls which portions of windows can be seen when sibling windows overlap each other. Clients can register for Visibility notifications to get an event whenever a window becomes more or less visible than it previously was, and may use these events to draw only the visible portions of the window.

Clients running in traditional X environments will also receive Expose events when a portion of their window is uncovered and needs to be drawn, because the X server does not know what contents were there. When the Composite extension is active, clients will normally not receive Expose events, since Composite puts the contents of each window in a separate, non-overlapped offscreen buffer, and then combines the visible segments of each window onscreen for display. Since clients cannot control whether they will be used in a composited or a legacy environment, they must still be prepared to handle Expose events on windows when they occur.


Pixmaps

A pixmap, like a window, is a region into which drawing can occur. Unlike windows, pixmaps are not part of a hierarchy and are not displayed on screen directly. Pixmap contents may be copied to windows for display, either directly via requests such as CopyArea, or automatically by setting a window's background to be a given pixmap. Pixmaps may be stored in system memory, video memory on a graphics adaptor, or shared memory accessible by both client and server. A given pixmap may be moved back and forth between system and video memory as needed to maintain a good cache of recently accessed pixmaps in faster video RAM. Using the MIT-SHM extension to store a pixmap in shared memory may allow the client to push updates faster, by operating directly on the shared memory region instead of having to copy the data through a socket to the server. However, it may also prevent the server from moving the pixmap into the cache in video RAM, making copies to a window on the screen slower.


Widgets

Applications need more than windows and pixmaps to provide a user interface: users expect to see menus, buttons, text fields, etc. in their windows. These user interface elements are collectively called widgets in most environments. X does not actually provide any widgets in the core protocol or libraries, only the building blocks, such as rendering methods and input events, with which widgets can be built. Toolkits such as Qt and GTK+ provide a common set of widgets for applications to build with, and a rich set of functionality to provide good support for a wide range of uses and users, including those who read different languages or need accessibility technology in order to use your application. Some toolkits have used all the infrastructure X provides around window stacking and positioning by making each widget a separate window, but most modern toolkits now do this management on the client side instead of pushing it to the X server.


XIDs

Many resources managed by the server are assigned a 32-bit identification number, called an XID, from a server-wide namespace. Each client is assigned a range of identifiers when it first connects to the X server. Whenever the client sends a request to create a new Window, Pixmap, Cursor or other XID-labeled resource, the client (usually transparently in the Xlib or XCB libraries) picks an unused XID from its range and includes it in the request to identify the object being created. This allows further requests operating on the new resource to be sent to the server without having to wait for it to process the creation request and return an identifier assignment. Since the namespace is global to the X server, clients can reference XIDs from other clients in some contexts, such as moving a window belonging to another client.
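Client-side XID allocation can be sketched as follows. The sketch assumes that the range a client receives is described by a base value and a bit mask, with the client filling in the masked bits itself; the class XidAllocator and the numeric values in the usage note are hypothetical, and the real libraries do more (for example, handling an exhausted range).

```python
class XidAllocator:
    """Client-side XID allocation from a base and mask; a simplified sketch."""

    def __init__(self, base, mask):
        self.base = base
        self.mask = mask
        self.step = mask & -mask   # lowest set bit of the mask
        self.last = 0

    def next_id(self):
        """Return an unused XID without contacting the server."""
        if self.last == self.mask:
            raise RuntimeError("client XID range exhausted")
        self.last += self.step
        return self.base | self.last
```

Two clients given the hypothetical bases 0x00400000 and 0x00600000 with the same mask 0x001FFFFF would generate 0x00400001, 0x00400002, ... and 0x00600001, 0x00600002, ... respectively. The ranges never collide, so neither client needs a round trip to the server before naming a new resource.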


Atoms

In order to reduce the retransmission of common strings in the X protocol, a simple lookup table mechanism is used. Entries in this table are known as Atoms. Each Atom has an integer key, which is passed in most protocol operations requiring an Atom, and a text string that can be retrieved as needed. The InternAtom operation finds the Atom id number for a given string, and can optionally add the string to the table and return a new id if it is not already present. The GetAtomName operation returns the string for a given Atom id number. Atoms are used in a wide variety of requests and events, but share a single namespace across all operations and clients of a given X server.
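The atom table behaves much like a two-way dictionary. Here is a toy model of it; the class AtomTable and its method names are hypothetical stand-ins for the InternAtom and GetAtomName operations, and the only_if_exists flag models the optional "add the string if missing" behavior described above. A real server also starts with a set of predefined atoms, which this sketch omits.

```python
class AtomTable:
    """Toy model of a server's atom table; not a real X server structure."""

    NONE = 0  # value returned when no atom exists for the string

    def __init__(self):
        self._by_name = {}
        self._by_id = {}
        self._next_id = 1

    def intern_atom(self, name, only_if_exists=False):
        """Return the atom id for name, optionally creating it if missing."""
        atom = self._by_name.get(name)
        if atom is not None:
            return atom
        if only_if_exists:
            return self.NONE
        atom = self._next_id
        self._next_id += 1
        self._by_name[name] = atom
        self._by_id[atom] = name
        return atom

    def get_atom_name(self, atom):
        """Return the string for an atom id; raises KeyError if unknown."""
        return self._by_id[atom]
```

Once a client has interned a string such as "_NET_WM_NAME", it can pass the small integer in every later request instead of the full string, and any other client that interns the same string on the same server gets the same integer back.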


Properties

A common design pattern in X for providing extensible metadata is the Property mechanism. A property is a key/value pair, where the key is a text string, represented as an X atom, and the value is a typed value, which may itself be an atom, an integer, or some other type. The core protocol provides properties on windows and fonts. The Xinput extension adds properties to input devices, while the Xrandr extension adds properties to output devices.

X itself does not assign any meaning or purpose to window properties. However, conventions have been established for many window properties to provide metadata that is useful for window and session management. The initial set of properties is defined in the X Inter-Client Communication Conventions Manual (ICCCM), which may be found at http://www.x.org/releases/current/doc/. This initial set was later extended by groups at freedesktop.org working on common functionality for modern desktop environments; their work became the Extended Window Manager Hints (EWMH) specification, found at http://www.freedesktop.org/wiki/Specifications/wm-spec.


Grabs

Grabs in X provide locking and reservation capabilities. "Active grabs" take exclusive control of a given resource immediately and lock out all other clients until the grab is released. "Passive grabs" place a reservation on a resource, causing an active grab to be triggered at a later time when an event, such as a keypress, occurs. Passive grabs can be used, for instance, to provide a hotkey that goes to a certain application regardless of which application currently has input focus.

One of the available grabs is the Server Grab. A client that grabs the server locks out all other clients, preventing any other application from updating the display or interacting with the user until the server grab is released. A server grab should be released as soon as possible. Besides annoying users when they can't switch to another program, it may also cause security problems, since the screen lock is just another client and will be locked out with the rest.

The other primary form of grab is on an input device or event. Clients can actively grab the keyboard or mouse to force getting all input from a device, even if the cursor moves outside the application's window. Passive grabs can be placed on specific input events, such as a particular keypress event or mouse button event, causing an active grab to automatically occur for that client when the event happens.

More information can be found in http://who-t.blogspot.com/2010/11/high-level-overview-of-grabs.html.
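The relationship between input focus, passive grabs and active grabs on the keyboard can be illustrated with a toy event router. This is a rough model under simplifying assumptions, not the server's actual algorithm: the class ToyInputRouter and its methods are hypothetical, keys are identified by plain strings, and the active grab is assumed to end when the key that triggered it is released.

```python
class ToyInputRouter:
    """Toy model of key event delivery with focus and grabs."""

    def __init__(self):
        self.focus = None          # client that currently has input focus
        self.passive_grabs = {}    # key -> client holding a passive grab
        self.active_grab = None    # (client, triggering key) or None

    def grab_key(self, key, client):
        """Place a passive grab: reserve this key for the given client."""
        self.passive_grabs[key] = client

    def key_press(self, key):
        """Return the client that receives this key press."""
        if self.active_grab is None and key in self.passive_grabs:
            # The reserved key was pressed: the passive grab becomes active.
            self.active_grab = (self.passive_grabs[key], key)
        if self.active_grab is not None:
            return self.active_grab[0]
        return self.focus

    def key_release(self, key):
        """Return the client that receives this key release."""
        if self.active_grab is not None:
            client, grab_key = self.active_grab
            if key == grab_key:
                self.active_grab = None   # releasing the key ends the grab
            return client
        return self.focus
```

With focus set to a hypothetical "editor" client and a passive grab on "F12" held by a "launcher" client, ordinary keys go to the editor, while pressing F12 delivers that key, and anything else typed before F12 is released, to the launcher.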

Selections, Cut-Copy-Paste

[copy-and-paste from http://keithp.com/~keithp/talks/selection.ps and other docs on http://www.x.org/wiki/CutAndPaste ? ]