2010-04-10

Using PHP DOM With XPath

Often I hear people say "We use SimpleXML, because DOM is so noisy and complex". Well, I don't think so. This article explains how you can parse an XML document (an Atom feed) using the PHP DOM extension. No other libraries are involved.

Load the feed

To load the feed, create a new DOMDocument and call its load() method. This works with the PHP stream wrappers, so you can load local files or URLs. DOMDocument has dedicated methods for XML strings and for HTML files and strings, too.

$feed = new DOMDocument();
$feed->load('http://www.a-basketful-of-papayas.net/feeds/posts/default');
...
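
The dedicated method for XML strings is loadXML(). Here is a minimal sketch, assuming a small inline Atom document; the feed content is made up for illustration and would normally come from a cache or an HTTP client.

```php
<?php
// Hypothetical inline feed used only for this illustration.
$xml = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example Feed</title></feed>';

$feed = new DOMDocument();
// loadXML() parses a string instead of reading from a file or URL.
$loaded = $feed->loadXML($xml);

echo $loaded ? 'Loaded: ' : 'Failed: ';
echo $feed->documentElement->localName, "\n";
```

Everything that follows in this article works the same way, no matter which of the load methods filled the document.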

If there is any problem with the resource, PHP will output error messages. You can use libxml_use_internal_errors() to suppress them. libxml_get_errors() returns the collected errors, so you could implement your own error handling, and libxml_clear_errors() clears the internal error list. Just ignore them for now:

$errorSetting = libxml_use_internal_errors(TRUE);
$feed = new DOMDocument();
$feed->load('http://www.a-basketful-of-papayas.net/feeds/posts/default');
libxml_clear_errors();
libxml_use_internal_errors($errorSetting);
...
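
If you do want to look at the errors instead of ignoring them, libxml_get_errors() hands them over as LibXMLError objects. This is only a sketch of one possible error handling, assuming a deliberately broken inline document; the message format is my own choice.

```php
<?php
// Collect libxml errors instead of letting PHP print warnings.
$errorSetting = libxml_use_internal_errors(true);

$feed = new DOMDocument();
// Deliberately broken XML (made up for this example): <title> is never closed.
$loaded = $feed->loadXML('<feed><title>Broken</feed>');

$messages = [];
foreach (libxml_get_errors() as $error) {
  // Each entry is a LibXMLError object with line, column and message.
  $messages[] = sprintf('Line %d: %s', $error->line, trim($error->message));
}

// Clean up and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($errorSetting);

if (!$loaded) {
  echo "Could not parse the feed:\n", implode("\n", $messages), "\n";
}
```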

In the next step you should check whether you got some content. I use the documentElement property for this. If it is not set, the feed has to be invalid, because any XML document needs at least one element node.

if (isset($feed->documentElement)) {
  ...
} else {
  echo 'Invalid feed.';
}

Initialize XPath

Now an XPath object is needed to execute expressions. Atom feeds make use of namespaces, often declaring the Atom namespace as the default. But XPath has no default namespace, so you need to register the namespace with an arbitrary prefix. It does not have to be the same prefix used in the XML file. For the default namespace it obviously can't be, because that namespace has no prefix in the XML file.

...
$xpath = new DOMXPath($feed);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
...
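
To see that the prefix really is arbitrary, here is a small sketch with a made-up inline feed. It registers the Atom namespace as "a" and compares an expression without a prefix to one with the prefix.

```php
<?php
$feed = new DOMDocument();
// Hypothetical feed that declares Atom as the default namespace.
$feed->loadXML('<feed xmlns="http://www.w3.org/2005/Atom"><title>Example Feed</title></feed>');

$xpath = new DOMXPath($feed);
// The prefix "a" is arbitrary; only the namespace URI has to match.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Without a prefix the expression looks for elements in no namespace.
$withoutPrefix = $xpath->evaluate('count(/feed)');
$withPrefix = $xpath->evaluate('count(/a:feed)');

echo 'Without prefix: ', $withoutPrefix, "\n";
echo 'With prefix: ', $withPrefix, "\n";
```

The expression without a prefix matches nothing, although the XML file itself does not use a prefix either.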

If you load HTML into the DOMDocument using the special methods, all namespaces are ignored. You can skip the registration in this case.

Executing XPath Expressions

The DOMXPath object has two methods for executing XPath expressions. One is query(), which can only return a DOMNodeList. You should use the second one: evaluate(). It returns DOMNodeList objects by default, but depending on the expression it can return other types (strings, numbers or booleans), too. With evaluate() you have direct access to the title text; it will return an empty string if the feed has no title.

The code selects the element nodes in the registered namespace and casts them to string.

...
echo $xpath->evaluate('string(/atom:feed/atom:title)'), "\n";
echo $xpath->evaluate('string(/atom:feed/atom:subtitle)'), "\n";
...
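
The difference between the two methods is easiest to see by looking at the return types. The following sketch assumes a tiny made-up feed with a title, no subtitle and two empty entries.

```php
<?php
$feed = new DOMDocument();
// Hypothetical feed used only for this illustration.
$feed->loadXML(
  '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example Feed</title><entry/><entry/></feed>'
);
$xpath = new DOMXPath($feed);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

// A location path yields a node list.
$nodes = $xpath->query('/atom:feed/atom:title');
echo get_class($nodes), ' with ', $nodes->length, " node(s)\n";

// Typed expressions need evaluate().
$title = $xpath->evaluate('string(/atom:feed/atom:title)');            // string
$entries = $xpath->evaluate('count(/atom:feed/atom:entry)');           // float
$hasSubtitle = $xpath->evaluate('boolean(/atom:feed/atom:subtitle)');  // bool
$subtitle = $xpath->evaluate('string(/atom:feed/atom:subtitle)');      // empty string, no match

var_dump($title, $entries, $hasSubtitle, $subtitle);
```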

Next we will loop over all entries. A DOMNodeList works with foreach. The expression returns an empty list if it does not match, so no additional checking is needed. Inside the loop the entry node is passed to evaluate() as the context argument.

...
foreach ($xpath->evaluate('//atom:entry') as $entryNode) {
  echo $xpath->evaluate('string(atom:title)', $entryNode), "\n";
  echo $xpath->evaluate(
    'string(atom:link[@rel="alternate" and @type="text/html"][1]/@href)',
    $entryNode
  ), "\n";
  echo "\n";
}
...
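
The context argument only matters for relative expressions. Here is a short sketch with a made-up feed that shows both cases: a relative path is resolved against the entry node, an absolute path still starts at the document.

```php
<?php
$feed = new DOMDocument();
// Hypothetical feed used only for this illustration.
$feed->loadXML(
  '<feed xmlns="http://www.w3.org/2005/Atom">'.
  '<title>Feed</title>'.
  '<entry><title>First</title></entry>'.
  '<entry><title>Second</title></entry>'.
  '</feed>'
);
$xpath = new DOMXPath($feed);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$titles = [];
foreach ($xpath->evaluate('//atom:entry') as $entryNode) {
  // Relative path: resolved against the current entry.
  $titles[] = $xpath->evaluate('string(atom:title)', $entryNode);
}
echo implode(', ', $titles), "\n";

// An absolute path ignores the context node and starts at the document root.
$secondEntry = $xpath->evaluate('//atom:entry')->item(1);
$feedTitle = $xpath->evaluate('string(/atom:feed/atom:title)', $secondEntry);
echo $feedTitle, "\n";
```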

Conditions

XPath expressions can be conditions. This can be used to check whether an entry has categories (tags). The return value of the following expression is a boolean value.

...
if ($xpath->evaluate('count(atom:category) > 0', $entryNode)) {
  ...
}
...
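
Here is a self-contained sketch of such a condition, assuming a made-up feed with one tagged and one untagged entry. It collects the boolean result for each entry.

```php
<?php
$feed = new DOMDocument();
// Hypothetical feed used only for this illustration.
$feed->loadXML(
  '<feed xmlns="http://www.w3.org/2005/Atom">'.
  '<entry><title>Tagged</title><category term="php"/></entry>'.
  '<entry><title>Untagged</title></entry>'.
  '</feed>'
);
$xpath = new DOMXPath($feed);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$results = [];
foreach ($xpath->evaluate('//atom:entry') as $entryNode) {
  $title = $xpath->evaluate('string(atom:title)', $entryNode);
  // The comparison makes evaluate() return a PHP boolean.
  $results[$title] = $xpath->evaluate('count(atom:category) > 0', $entryNode);
}
var_dump($results);
```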

Loop over attributes

Each entry can have several categories. The title of the category is in its "term" attribute. You can select these attributes directly into a list.

echo 'Categories: ';
foreach ($xpath->evaluate('atom:category/@term', $entryNode) as $index => $categoryAttribute) {
  if ($index > 0) {
    echo ', ';
  }
  echo $categoryAttribute->value;
}
echo "\n";
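
If you prefer to avoid the index check, you could collect the attribute values in an array and join them with implode(). This is just an alternative sketch, using a made-up entry with two categories.

```php
<?php
$feed = new DOMDocument();
// Hypothetical feed used only for this illustration.
$feed->loadXML(
  '<feed xmlns="http://www.w3.org/2005/Atom">'.
  '<entry><title>Tagged</title><category term="php"/><category term="xml"/></entry>'.
  '</feed>'
);
$xpath = new DOMXPath($feed);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$entryNode = $xpath->evaluate('//atom:entry')->item(0);

$terms = [];
foreach ($xpath->evaluate('atom:category/@term', $entryNode) as $categoryAttribute) {
  // Each node in the list is a DOMAttr, its text is in the value property.
  $terms[] = $categoryAttribute->value;
}
echo 'Categories: ', implode(', ', $terms), "\n";
```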

Complete Example

Here is the full script. Be aware that it outputs text. If you execute it using a webserver (and not the command line), you should add a header('Content-Type: text/plain') to the top.

<?php
$errorSetting = libxml_use_internal_errors(TRUE);
$feed = new DOMDocument();
$feed->load('http://www.a-basketful-of-papayas.net/feeds/posts/default');
libxml_clear_errors();
libxml_use_internal_errors($errorSetting);

if (isset($feed->documentElement)) {
  $xpath = new DOMXPath($feed);
  $xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
  echo $xpath->evaluate('string(/atom:feed/atom:title)'), "\n";
  echo $xpath->evaluate('string(/atom:feed/atom:subtitle)'), "\n";
  echo str_repeat('*', 72), "\n\n";
  foreach ($xpath->evaluate('//atom:entry') as $entryNode) {
    echo $xpath->evaluate('string(atom:title)', $entryNode), "\n";
    if ($xpath->evaluate('count(atom:category) > 0', $entryNode)) {
      echo 'Categories: ';
      foreach ($xpath->evaluate('atom:category/@term', $entryNode) as $index => $categoryAttribute) {
        if ($index > 0) {
          echo ', ';
        }
        echo $categoryAttribute->value;
      }
      echo "\n";
    }
    echo $xpath->evaluate(
      'string(atom:link[@rel="alternate" and @type="text/html"][1]/@href)',
      $entryNode
    ), "\n";
    echo "\n";
  }
} else {
  echo 'Invalid feed.';
}
?>

I hope I could show you that DOM is really comfortable if you're using XPath. If you want it even easier, try FluentDOM. It combines the power and comfort of XPath with the jQuery fluent interface.

5 comments:

  1. XQuery can be a really cool way to do XML processing with PHP: http://www.ibm.com/developerworks/xml/library/x-zorba/index.html

  2. I was using the validation functions of DOM a short time ago and I was amazed how complete it was. Maybe I smoked too much JavaScript over the last months, but the result was pretty clean and it felt right to do it all in DOM.
    After some tries I even found ways not to end up with a piece of code only a fast refactoring surgery could save ;))) DOM is not what you want to do all day long, but it's mature enough for 99% of the XML solutions I needed to create ;)

    Sebs

  3. Another method is to use QueryPath. One line to load and simple to iterate over. (http://querypath.org/)

    I respect doing it with DOM objects but I prefer to write less code.

  4. Thomas,

    Great post. I too find that for most tasks, I like using DOM over SimpleXML. For me, I prefer to use one interface versus switching between DOM and SimpleXML. DOM can do everything that SimpleXML can do and more. For instance, I can't use SimpleXML to validate against an XML schema document.

    BTW, I may have to check out your FluentDOM. Looks interesting.
