2014-08-17

FluentDOM + HTML5

HTML5 is not directly supported by PHP's DOM extension, so FluentDOM cannot parse it either. But there is a solution: HTML5-PHP is a library that can parse HTML5 into a DOM document.

Both libraries use Composer:
"require": {
  "fluentdom/fluentdom": "5.*",
  "masterminds/html5": "2.*"
}

Read HTML5 into FluentDOM:
$html5 = new Masterminds\HTML5();
$fd = FluentDOM($html5->loadHTML($html));

Or write it:
echo $html5->saveHTML($fd->document);

HTML5-PHP puts the elements into the XHTML namespace. To use XPath expressions, you will need to register a prefix for it:
$html5 = new Masterminds\HTML5();
$fd = FluentDOM($html5->loadHTML($html));
$fd->registerNamespace(
  'xhtml', 'http://www.w3.org/1999/xhtml'
);
echo $fd->find('//xhtml:p')->text();

2014-08-09

Xpath 1.0 - Quoting Strings

Strings in Xpath 1.0 can be enclosed in single or double quotes. The following expressions are equivalent.

//div[@id = 'foo']
//div[@id = "foo"]


This is nice because you can use the variant that requires less escaping, or none at all. I prefer single quotes for PHP strings because they need less escaping (only the single quote and the backslash). I usually end up with something like this:

$xpath->evaluate('//div[@id = "foo"]');

However, a problem comes up if the value is dynamic:

$xpath->evaluate('//div[@id = "'.$_GET['foo'].'"]');

If $_GET['foo'] contains a double quote, it will break the expression. This is comparable to an SQL injection and should be avoided, don't you think?
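
A small sketch makes the risk concrete. It assumes a made-up document with two div elements and a hard-coded $value standing in for $_GET['foo']. A crafted value does not merely break the expression, it changes which nodes match:

```php
<?php
// Illustrative document, not taken from a real application
$document = new DOMDocument();
$document->loadXML('<root><div id="foo"/><div id="bar"/></root>');
$xpath = new DOMXPath($document);

// Stand-in for an untrusted request parameter
$value = '" or "1" = "1';

// Naive string concatenation, as in the snippet above
$expression = '//div[@id = "'.$value.'"]';
$nodes = $xpath->evaluate($expression);

echo $expression, "\n";    // //div[@id = "" or "1" = "1"]
echo $nodes->length, "\n"; // 2
```

Both div elements match, although neither of them has the requested id.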

The Xpath 1.0 specification for a literal is:

Literal   ::=   '"' [^"]* '"'
              | "'" [^']* "'"

It disallows the use of the enclosing quote in the literal itself, and there is no way to escape it.

Hint: This is different in Xpath 2.0, where you can double the enclosing quote to escape it.

Deciding Which Quote To Use

The first and obvious step is to check the value for quotes and use the quote character that it does not contain:

function quote($value) {
  $char = strpos($value, '"') === FALSE ? '"' : "'";
  return $char.$value.$char;
}


But a value could contain both kinds of quotes. This would still break the expression.

Divide And Conquer

If it is not possible to quote the whole value because it contains both kinds of quotes, you need to divide it into parts that can be quoted. You can then use the Xpath function concat() to rebuild the original value:

//div[. = concat("Single Quote: '", 'Double Quote: "')]

Matching text structures is the domain of regular expressions. So let's use them:

preg_match_all('("[^\']*|[^"]+)', 'Double Quote ", Single Quote \'', $matches);
var_dump($matches);


Output:

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(13) "Double Quote "
    [1]=>
    string(16) "", Single Quote "
    [2]=>
    string(1) "'"
  }
}

The pattern matches either a string that starts with a double quote and contains no single quote, or a string that does not contain any double quote.

All that is left is quoting the parts and joining them back together:

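$result = '';
foreach ($matches[0] as $part) {
  // A part starting with a double quote contains no single quote,
  // all other parts contain no double quote.
  $quoteChar = (substr($part, 0, 1) == '"') ? "'" : '"';
  $result .= ", ".$quoteChar.$part.$quoteChar;
}
// Strip the leading ", " before wrapping the arguments
return 'concat('.substr($result, 2).')';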

Put Together

It does not make sense to create the function call for a single argument, and concat() in Xpath 1.0 requires at least two arguments anyway. So the check is still needed:

  1. If the value contains no single quote, use single quotes
  2. If the value contains no double quote, use double quotes
  3. Otherwise divide the string and use concat()

A complete implementation can be found in FluentDOM\Xpath::quote().
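
The three steps can be sketched as one function. This is only an illustration assembled from the snippets above, not the code of FluentDOM\Xpath::quote(); the name quoteXpathLiteral() is hypothetical:

```php
<?php
/**
 * Hypothetical helper: quote a string as an Xpath 1.0 literal.
 */
function quoteXpathLiteral(string $value): string {
  // 1. No single quote in the value: enclose it in single quotes
  if (strpos($value, "'") === false) {
    return "'".$value."'";
  }
  // 2. No double quote in the value: enclose it in double quotes
  if (strpos($value, '"') === false) {
    return '"'.$value.'"';
  }
  // 3. Both kinds of quotes: split into quotable parts and use concat()
  preg_match_all('("[^\']*|[^"]+)', $value, $matches);
  $parts = [];
  foreach ($matches[0] as $part) {
    $quoteChar = (substr($part, 0, 1) === '"') ? "'" : '"';
    $parts[] = $quoteChar.$part.$quoteChar;
  }
  return 'concat('.implode(', ', $parts).')';
}

echo quoteXpathLiteral('Double Quote ", Single Quote \''), "\n";
// concat("Double Quote ", '", Single Quote ', "'")
```

A value that contains both kinds of quotes always splits into at least two parts, so step 3 never produces a concat() call with a single argument.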

2014-08-06

FluentDOM 5 + XML Namespaces

FluentDOM 5 allows you to register namespaces on the DOM document class. These are not the namespaces of a loaded document, but a definition of namespaces for your programming logic.

Namespaces In An XML File

People often mistake the namespace prefixes for the namespaces themselves. The namespace definitions are the xmlns:* attributes. Let's take a really simple example:

<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"/>

This is a feed element in the Atom namespace. The actual namespace is "http://www.w3.org/2005/Atom". Because this would be difficult to read and write, the prefix/alias 'atom' is defined for the namespace.

You can read the element name as '{http://www.w3.org/2005/Atom}:feed'

It is possible to define a default namespace for elements in an XML document.

<feed xmlns="http://www.w3.org/2005/Atom"/> 

This should still be interpreted as '{http://www.w3.org/2005/Atom}:feed'. The namespace prefix in the source document is not relevant. You need a way to match the namespace itself and not the alias.
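
PHP's DOM extension exposes this view directly. The following sketch loads both variants (the variable names are only illustrative) and prints the namespace and the local name of each document element:

```php
<?php
$prefixed = new DOMDocument();
$prefixed->loadXML('<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"/>');

$default = new DOMDocument();
$default->loadXML('<feed xmlns="http://www.w3.org/2005/Atom"/>');

foreach ([$prefixed, $default] as $document) {
  $node = $document->documentElement;
  // nodeName includes the prefix from the source, localName does not
  printf(
    "%s => {%s}:%s\n", $node->nodeName, $node->namespaceURI, $node->localName
  );
}
// atom:feed => {http://www.w3.org/2005/Atom}:feed
// feed => {http://www.w3.org/2005/Atom}:feed
```

Only the node name differs; the namespace and the local name are identical for both documents.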

Hint 1: Namespace prefixes can be redefined on any element node in the document.

Hint 2: Attributes always need a prefix to use a namespace. Any attribute without a prefix is in the "none/empty" namespace.
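
Hint 2 can be checked with the standard DOM methods. In this sketch the title and xlink:href attributes are invented for the example; they are not part of the Atom format:

```php
<?php
$document = new DOMDocument();
$document->loadXML(
  '<feed xmlns="http://www.w3.org/2005/Atom"'.
  ' xmlns:xlink="http://www.w3.org/1999/xlink"'.
  ' title="example" xlink:href="http://example.com/"/>'
);
$feed = $document->documentElement;

// The default namespace applies to the element ...
var_dump($feed->namespaceURI);
// ... but not to the attribute without a prefix, which has no namespace.
var_dump($feed->getAttributeNode('title')->namespaceURI);
// The unprefixed attribute is not found in the Atom namespace.
var_dump($feed->hasAttributeNS('http://www.w3.org/2005/Atom', 'title'));
// Only the prefixed attribute is in a namespace.
var_dump($feed->hasAttributeNS('http://www.w3.org/1999/xlink', 'href'));
```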

Namespaces in XPath

An XPath expression that could match the Atom feed element needs to use a prefix and needs a way to resolve that prefix into the actual namespace.

A PHP Example:

$xpath = new DOMXpath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$nodes = $xpath->evaluate('/a:feed');


In PHP the DOMXpath class provides the evaluate() method and the namespace resolver. You register your own namespace prefixes. '/a:feed' gets resolved to '/{http://www.w3.org/2005/Atom}:feed'. The prefixes in the document and the expression can be different, but the resolved namespace has to be the same.
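
The prefix in the expression really is independent of the document. As a sketch, take the variant of the feed that uses a default namespace and no prefix at all; the expression still needs the registered prefix to match:

```php
<?php
$feedDocument = new DOMDocument();
$feedDocument->loadXML('<feed xmlns="http://www.w3.org/2005/Atom"/>');

$feedXpath = new DOMXPath($feedDocument);
$feedXpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Without a prefix the expression looks for a feed element in no namespace
$withoutPrefix = $feedXpath->evaluate('count(/feed)');
// With the registered prefix it matches the Atom feed element
$withPrefix = $feedXpath->evaluate('count(/a:feed)');

var_dump($withoutPrefix); // float(0)
var_dump($withPrefix);    // float(1)
```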

A JavaScript example:

var xmlns = {
  namespaces : {
    'a' : 'http://www.w3.org/2005/Atom'
  },
  lookupNamespaceURI : function(prefix) {
    return this.namespaces[prefix] || null;
  }
};
var nodes = xmlDocument.evaluate('/a:feed', xmlDocument, xmlns, XPathResult.ANY_TYPE, null); 


In JavaScript the document object provides the evaluate() method and the namespace resolver is an argument for it. This will work in most of the current browsers, but not IE. 

Namespaces in FluentDOM

In FluentDOM 5 the document and the element classes both have an evaluate() method. To define your namespace mapping you register the namespace on the document object:

$document->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$nodes = $document->evaluate('/a:feed');


Now, because the document has a way to resolve namespaces, it can do this for other methods, too. Basically this applies to all methods that have a namespace-aware variant, like 'createElement()' and 'createElementNS()' or 'setAttribute()' and 'setAttributeNS()'. If FluentDOM can resolve the prefix of a tag name, it will call the namespace-aware variant of the method.

// Standard DOM
$feed = $document->createElementNS(
  'http://www.w3.org/2005/Atom', 'atom:feed'
);
// FluentDOM
$document->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
$feed = $document->createElement('atom:feed');

Default Namespace 

FluentDOM adds the concept of a default element namespace, too.

// Standard DOM
$feed = $document->createElementNS(
  'http://www.w3.org/2005/Atom', 'feed'
);
// FluentDOM
$document->registerNamespace('#default', 'http://www.w3.org/2005/Atom');
$feed = $document->createElement('feed');

appendElement()


FluentDOM\Document and FluentDOM\Element implement an appendElement() method. It is a shortcut for several methods: it creates the element, adds attributes and a text node, and appends it to the parent node. Together with the namespace handling, creating an XML document becomes a lot simpler. You can find an example in the wiki.