2010-04-07

CSS Selectors And XPath Expressions

In this article I will try to show XPath expression counterparts for CSS selectors and explain some of the major differences. Most web developers are familiar with CSS selectors from creating web pages or using JavaScript libraries like jQuery.

"Selectors are patterns that match against elements in a tree, and as such form one of several technologies that can be used to select nodes in an XML document. Selectors have been optimized for use with HTML and XML, and are designed to be usable in performance-critical code." - The W3C about CSS 3 selectors:

There are two good reasons for using CSS selectors rather than XPath expressions in the browser: you already know them from writing CSS files, and XPath does not work in all browsers. IE supports it only on documents returned from an XHR or generated by XSLT in the browser.

On the server side the situation is different. In PHP, XPath 1.0 is already implemented in the DOM extension, so you don't need an additional library to use it. XPath is more specific and flexible. It is not only for selecting elements, but for all kinds of nodes and values. So how can you convert a CSS selector into an XPath expression?

A little sample HTML:

<html>
  <head></head>
  <body>
    <ul>
      <li class="first">Item One</li>
      <li>Item Two
        <ol>
          <li class="first last">SubItem One</li>
        </ol>
      </li>
      <li class="last">Item Three</li>
    </ul>
  </body>
</html>

Specific ancestor relationships

Let's say we need all li inside ul (but not ol). In CSS this would be "ul>li". The > is important: it defines that the li has to be a child of the ul. A simple "ul li" would match the "SubItem One", too.

The XPath would be "//ul/li". If the first character of an XPath expression is a slash, the context of the expression is always the document. "/html" would be the root element in the example. The double slash means that any number of elements are allowed between the context and the match. The third slash defines the parent-child relationship between ul and li. The expression "//ul//li" would match the "SubItem One", too.
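The difference between the child and the descendant step can be checked with a few lines of code. The following is only a minimal sketch in Python, assuming the sample HTML can be parsed as XML. It uses xml.etree.ElementTree from the standard library, which supports just a subset of XPath and evaluates paths relative to an element, so the expressions need a leading dot (".//ul/li"). The helper labels() is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample document from above, parsed as XML.
SAMPLE = """<html>
  <head></head>
  <body>
    <ul>
      <li class="first">Item One</li>
      <li>Item Two
        <ol>
          <li class="first last">SubItem One</li>
        </ol>
      </li>
      <li class="last">Item Three</li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)


def labels(elements):
    """Return the stripped leading text of each element (hypothetical helper)."""
    return [element.text.strip() for element in elements]


# "ul>li" in CSS, "//ul/li" in XPath: direct children of the ul only.
children = labels(root.findall(".//ul/li"))

# "ul li" in CSS, "//ul//li" in XPath: li elements at any depth below the ul.
descendants = labels(root.findall(".//ul//li"))

print(children)
print(descendants)
```

The first list contains the three top-level items, the second one additionally contains "SubItem One".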

Multiple selectors/expressions

Both CSS selectors and XPath allow multiple selectors/expressions. You can use "," in CSS and "|" in XPath. Let's say we want to match all lists. In CSS selectors this would be "ul,ol". In XPath it would be "//ul|//ol". XPath always returns unique nodes in document order; with CSS selectors this depends on the implementation.
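For groups of plain tag names the translation is purely textual. As a rough sketch in Python (the function name group_to_xpath is invented for this example, and it assumes the group contains nothing but tag names):

```python
def group_to_xpath(css_group):
    """Translate a CSS group of plain tag names into an XPath union.

    Assumption: every comma separated part is a bare tag name,
    without classes, attributes or combinators.
    """
    tags = [part.strip() for part in css_group.split(",")]
    return "|".join("//" + tag for tag in tags if tag)


print(group_to_xpath("ul,ol"))  # //ul|//ol
```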

Context

In CSS files context is not important, all selectors are defined for the document. But the context handling is one of the differences between CSS selectors and XPath.

CSS selectors match the current element as well as its descendants. XPath has a specific handling for the current element. A dot can represent the current element in an XPath expression. In many cases the dot is optional. The expression "./li" does the same as "li" and matches only children of the current element. The CSS selector "li" would match the current element if it is a li, and any li that has the current element as an ancestor. Translated to XPath this would be "self::*[name()='li']|.//li". The XPath function name() makes it possible to compare the tag name with a string; a bare "name()='li'" is only a boolean test, so it has to sit inside a predicate before it can be combined with "|".
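The effect of the context can be tried out with a small fragment. This is a sketch in Python using xml.etree.ElementTree from the standard library (which implements only part of XPath); the fragment is a shortened version of the second list item from the sample. Element.iter() includes the element itself, so it behaves like the CSS matching described above.

```python
import xml.etree.ElementTree as ET

# Context element: a li that contains a nested list.
item = ET.fromstring("<li>Item Two<ol><li>SubItem One</li></ol></li>")

# XPath "li" and "./li": only children of the context element.
# The only child here is the ol, so both find nothing.
children = [el.text for el in item.findall("li")]
children_with_dot = [el.text for el in item.findall("./li")]

# XPath ".//li": li elements below the context element, at any depth.
descendants = [el.text for el in item.findall(".//li")]

# CSS-like "li": the context element itself plus its descendants.
css_like = [el.text for el in item.iter("li")]

print(children, descendants, css_like)
```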

Matching classes

CSS selectors have a special syntax for classes and token lists. A selector for a class is a dot followed by the class name. ".first" would match two li elements from the sample HTML. The attribute selector "[class~=first]" would do the same. Unfortunately XPath does not have a special syntax for token lists, but it can be emulated using XPath functions.

  1. Normalize the whitespace in the class attribute to single spaces: normalize-space(@class)
  2. Add single spaces to the beginning and end of the normalized attribute: concat(' ', normalize-space(@class), ' ')
  3. Check if the result contains the class name: contains(concat(' ', normalize-space(@class), ' '), ' first ')
  4. Select all nodes matching the condition: //*[contains(concat(' ', normalize-space(@class), ' '), ' first ')]
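
To see why the padding with spaces matters, the steps above can be replayed with plain string operations. This is a sketch in Python; both function names are invented for this example, str.split() only approximates normalize-space(), and the class name is assumed to contain no quotes.

```python
def has_class(class_attribute, class_name):
    """Emulate steps 1 to 3 on a class attribute value."""
    normalized = " ".join(class_attribute.split())  # step 1: normalize-space()
    padded = " " + normalized + " "                 # step 2: concat()
    return " " + class_name + " " in padded         # step 3: contains()


def class_xpath(class_name):
    """Build the expression from step 4 for an arbitrary class name."""
    return (
        "//*[contains(concat(' ', normalize-space(@class), ' '), "
        "' " + class_name + " ')]"
    )


print(has_class("first last", "first"))  # True
print(has_class("firstborn", "first"))   # False
print(class_xpath("last"))
```

Without the surrounding spaces a simple contains(@class, 'first') would also match a class like "firstborn".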

Namespaces

XPath has no default namespace: each name test without a namespace prefix matches only elements that have no namespace. Namespace prefix and tag name are separated with a colon ("html:div"). The * matches any element node in any namespace, but "*:div" is not possible in XPath 1.0.

CSS selectors have a universal selector, which can be an empty string or an asterisk. Namespace and tag name are separated using a pipe ("html|div"). "*|div" will match div tags in any namespace.

Overview: Short variants for namespace and element matches

Namespace  Element  XPath                    CSS
any        any      *                        empty string
none       "div"    div                      none
"html"     "div"    html:div                 html|div
any        "div"    *[local-name()='div']    div
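
The namespace behaviour of unprefixed and prefixed names can be observed in a small experiment. This is a sketch in Python with xml.etree.ElementTree from the standard library, which resolves prefixes through a mapping passed to findall(). The two-element document and the helper local_name() are made up for this example; the helper plays the role of the XPath function local-name().

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical document: one div without a namespace, one XHTML div.
doc = ET.fromstring(
    '<root xmlns:html="' + XHTML + '">'
    "<div>plain</div>"
    "<html:div>namespaced</html:div>"
    "</root>"
)

# "div": matches only the element that has no namespace.
plain = [el.text for el in doc.findall("div")]

# "html:div": the prefix is resolved through the mapping.
prefixed = [el.text for el in doc.findall("html:div", {"html": XHTML})]


def local_name(element):
    """Strip the "{namespace}" part that ElementTree puts before the tag."""
    return element.tag.rsplit("}", 1)[-1]


# Like *[local-name()='div']: any namespace, local name "div".
any_namespace = [el.text for el in doc if local_name(el) == "div"]

print(plain, prefixed, any_namespace)
```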

Axes

Axes are a feature of XPath and not available in CSS selectors. Most selectors work like the "descendant-or-self" axis, which contains the current element, all its children, the children of those children and so on. The only exceptions are the sibling combinators.

The default axis in XPath is "child", which contains all children of the current element. Several axes have a short syntax: "." is the "self" axis and ".." the "parent" axis. The axis defines which group of nodes is considered for a match.

To match all li in all namespaces the CSS selector would be "li". In XPath we can use the "descendant-or-self" axis to simulate this. Because of the namespace the result would be:

descendant-or-self::*[local-name() = 'li']

This can be combined with the class name check:

descendant-or-self::*[local-name() = 'li' and contains(concat(' ', normalize-space(@class), ' '), ' first ')]
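
Expressions of this shape are mechanical enough to be generated. The following is a minimal sketch in Python, not a complete converter: the function name is invented for this example, and it assumes a selector of the form "tag", ".class" or "tag.class" with at most one class and no quotes in the names.

```python
def simple_selector_to_xpath(selector):
    """Convert "tag", ".class" or "tag.class" into an XPath expression."""
    tag, _, class_name = selector.partition(".")
    conditions = []
    if tag and tag != "*":
        # Compare the local name only, so any namespace matches.
        conditions.append("local-name() = '" + tag + "'")
    if class_name:
        # Token list check on the class attribute.
        conditions.append(
            "contains(concat(' ', normalize-space(@class), ' '), ' "
            + class_name + " ')"
        )
    expression = "descendant-or-self::*"
    if conditions:
        expression += "[" + " and ".join(conditions) + "]"
    return expression


print(simple_selector_to_xpath("li"))
print(simple_selector_to_xpath("li.first"))
```

The two calls print the two expressions shown above.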

Now I've shown that the CSS selector "li.first" is a lot easier than its XPath expression counterpart :-). But this is the one-to-one translation of the CSS selector. It's something an automatic converter should generate. The namespace problem is usually not relevant: if you load HTML, PHP ignores namespaces, and if you load XML, you want to use them. An element in one namespace is not the same as an element in another - even if they have the same local name.

The token list matching (for classes) is something I need from time to time. But XML formats rarely have class or similar attributes.

I think any CSS selector can be converted into an XPath expression. The result might be large and/or complex, but it should work.

Why use XPath?

Actually you can do a lot of things with XPath expressions that are not possible with CSS selectors and would otherwise require additional application logic. You can select text nodes, attributes or aggregate values directly.

But this article is already long enough so I will write more about that in another one.